NodeNotReady
This document provides a standard operating procedure (SOP) to identify, diagnose, and resolve NodeNotReady issues in E2E Kubernetes clusters.
Overview
In Kubernetes, a node is marked NotReady when the control plane stops receiving healthy status signals from the node.
When a node goes into NodeNotReady state:
- Pods stop scheduling on that node
- Existing pods may get evicted
- Cluster capacity shrinks silently (especially dangerous in production)
Why It Breaks (Common Causes)
In real production incidents, roughly 90% of cases come down to one of the following:
- kubelet is down or stuck
- Node lost network connectivity (to control plane)
- DiskPressure / MemoryPressure
- Container runtime failure (Docker / containerd)
- Node clock skew (NTP issues)
Step 1: Identify Affected Nodes
kubectl get nodes
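To surface only the unhealthy nodes at a glance, the listing can be filtered. This is a sketch; the awk column positions assume the default `kubectl get nodes` output format:

```shell
# Show only nodes whose STATUS column is not exactly "Ready"
# (NotReady, Ready,SchedulingDisabled, etc. all surface here)
kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1, $2}'
```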
Step 2: Describe the Node (Primary Diagnostic)
kubectl describe node <node-name>
Focus on Conditions:
| Condition | Meaning |
|---|---|
| KubeletNotReady | kubelet unhealthy or not reporting |
| NetworkUnavailable | CNI / network connectivity issue |
| DiskPressure | Disk space exhausted |
| MemoryPressure | Node out of memory |
| PIDPressure | Process limit reached |
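To jump straight to these conditions without scrolling through the full describe output, the Conditions section can be isolated. A sketch; the sed range assumes the standard `kubectl describe node` layout, where `Addresses:` follows `Conditions:`:

```shell
NODE="<node-name>"  # replace with the affected node
# Print just the Conditions section of the describe output
kubectl describe node "$NODE" | sed -n '/^Conditions:/,/^Addresses:/p'
```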
Common Causes & Resolutions
1. DiskPressure (Disk Space Exhaustion)
How It Happens
- Excessive application logs
- No log rotation
- emptyDir volumes consuming space
- Image cache growth
How to Identify
kubectl describe node <node-name>
Look for:
DiskPressure=True
Check ephemeral-storage under Capacity / Allocatable.
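Both checks can be combined into one grep over the describe output. A sketch; `<node-name>` is a placeholder you substitute:

```shell
NODE="<node-name>"  # replace with the affected node
# Confirm the DiskPressure condition and inspect ephemeral-storage figures
kubectl describe node "$NODE" | grep -E 'DiskPressure|ephemeral-storage'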
2. kubelet Not Running or Stuck
Symptoms
- Ready=False (reason: KubeletNotReady)
Identification
kubectl describe node <node-name>
Resolution
- In E2E-managed clusters, restart is handled by the platform
- If persistent, raise a node-level support request
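One quick way to judge whether the kubelet is alive is the heartbeat on the Ready condition; a stale timestamp usually means the kubelet is dead or wedged. A sketch using kubectl's JSONPath filter syntax; `<node-name>` is a placeholder:

```shell
NODE="<node-name>"  # replace with the affected node
# Status, reason, and last heartbeat of the Ready condition
kubectl get node "$NODE" -o jsonpath='{range .status.conditions[?(@.type=="Ready")]}{.status}{" "}{.reason}{" "}{.lastHeartbeatTime}{"\n"}{end}'
```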
3. Network Connectivity Loss
Symptoms
- NetworkUnavailable=True
- Multiple nodes NotReady simultaneously
Common Causes
- Firewall or security group change
- Control plane connectivity loss
- CNI failure
Resolution
- Verify control-plane reachability
- Roll back recent network changes
- Escalate to network team if needed
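A quick way to distinguish a single sick node from a cluster-wide network problem is to count how many nodes are NotReady at once. A sketch; the awk column assumes default `kubectl get nodes` output:

```shell
# If many nodes flip NotReady at once, suspect network/CNI rather than one kubelet
kubectl get nodes --no-headers | awk '$2 == "NotReady"' | wc -l
# Confirm the API server itself is reachable from where you are running kubectl
kubectl cluster-info
```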
4. MemoryPressure
Symptoms
- Pods evicted
- Node marked NotReady
Identification
kubectl describe node <node-name>
Look for:
MemoryPressure=True
Resolution
- Scale down memory-heavy workloads
- Fix memory leaks
- Add memory requests and limits
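Evicted pods are the usual footprint of MemoryPressure, and listing them points at the workloads involved. A sketch; the awk column positions assume the default `kubectl get pods -A` output format:

```shell
# List pods evicted anywhere in the cluster as namespace/name
kubectl get pods -A --no-headers | awk '$4 == "Evicted" {print $1 "/" $2}'
```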
5. Container Runtime Failure
Symptoms
- Pods failing across the node
- Runtime errors in events
Common Causes
- containerd crash
- Image filesystem corruption
Resolution
- Platform-managed restart
- Node replacement if persistent
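The runtime in use (and its version) is reported in the node's status, which helps when filing a support request. A sketch; `<node-name>` is a placeholder:

```shell
NODE="<node-name>"  # replace with the affected node
# Which runtime the node reports, e.g. containerd://1.7.x
kubectl get node "$NODE" -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}{"\n"}'
```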
Step 3: Check Events (Evidence)
kubectl get events -A --sort-by=.lastTimestamp
Useful events to look for:
- NodeHasDiskPressure
- EvictionThresholdMet
- NodeNotReady
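The event stream can be filtered down to just these reasons. A sketch; the pattern list matches the event names above:

```shell
# Keep only the node-health events of interest, newest last
kubectl get events -A --sort-by=.lastTimestamp | grep -E 'NodeHasDiskPressure|EvictionThresholdMet|NodeNotReady'
```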
Step 4: Recovery Validation
kubectl get nodes
Expected:
<node-name> Ready
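Rather than polling by hand, `kubectl wait` can block until the node reports Ready again. A sketch; `<node-name>` is a placeholder and the timeout is illustrative:

```shell
NODE="<node-name>"  # replace with the recovering node
# Block until the node's Ready condition is True, or give up after 5 minutes
kubectl wait --for=condition=Ready "node/$NODE" --timeout=300s
```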
Common Mistakes
- Restarting kubelet blindly
- Ignoring DiskPressure
- Waiting for SSH access instead of diagnosing with kubectl
- Treating NodeNotReady as the root cause instead of a symptom
E2E Best Practices
- Monitor node conditions continuously
- Set resource requests & limits
- Configure log rotation at app level
- Use centralized logging
- Alert on NodeNotReady and DiskPressure
Final Note
NodeNotReady is a signal, not a failure. Fix the cause, and the node will recover automatically.