# NodeNotReady This document provides a standard operating procedure (SOP) to identify, diagnose, and resolve **NodeNotReady** issues in E2E Kubernetes clusters. ## Overview In Kubernetes, a node is marked **NotReady** when the control plane stops receiving healthy status signals from the node. When a node goes into `NodeNotReady` state: - Pods stop scheduling on that node - Existing pods may get evicted - Cluster capacity reduces silently (very dangerous in production) ### Why It Breaks (Common Causes) From real production incidents, **90%** of the time it's due to: - **kubelet is down or stuck** - **Node lost network connectivity** (to control plane) - **DiskPressure / MemoryPressure** - **Container runtime failure** (Docker / containerd) - **Node clock skew** (NTP issues) --- ## Step 1: Identify Affected Nodes ```bash kubectl get nodes ``` --- ## Step 2: Describe the Node (Primary Diagnostic) ```bash kubectl describe node ``` Focus on **Conditions**: | Condition | Meaning | |---|---| | KubeletNotReady | kubelet unhealthy or not reporting | | NetworkUnavailable | CNI / network connectivity issue | | DiskPressure | Disk space exhausted | | MemoryPressure | Node out of memory | | PIDPressure | Process limit reached | --- ## Common Causes & Resolutions ### 1. DiskPressure (Disk Space Exhaustion) #### How It Happens - Excessive application logs - No log rotation - emptyDir volumes consuming space - Image cache growth #### How to Identify ```bash kubectl describe node ``` Look for: ``` DiskPressure=True ``` Check **ephemeral-storage** under Capacity / Allocatable. --- ### 2. kubelet Not Running or Stuck #### Symptoms - `Ready=False` - `KubeletNotReady` #### Identification ```bash kubectl describe node ``` #### Resolution - In E2E-managed clusters, restart is handled by the platform - If persistent, raise a **node-level support request** --- ### 3. Network Connectivity Loss #### Symptoms - `NetworkUnavailable=True` - Multiple nodes NotReady simultaneously #### Common Causes - Firewall or security group change - Control plane connectivity loss - CNI failure #### Resolution - Verify control-plane reachability - Roll back recent network changes - Escalate to network team if needed --- ### 4. MemoryPressure #### Symptoms - Pods evicted - Node marked NotReady #### Identification ```bash kubectl describe node ``` Look for: ``` MemoryPressure=True ``` #### Resolution - Scale down memory-heavy workloads - Fix memory leaks - Add memory requests and limits --- ### 5. Container Runtime Failure #### Symptoms - Pods failing across the node - Runtime errors in events #### Common Causes - `containerd` crash - Image filesystem corruption #### Resolution - Platform-managed restart - Node replacement if persistent --- ## Step 3: Check Events (Evidence) ```bash kubectl get events -A --sort-by=.lastTimestamp ``` Useful events to look for: - `NodeHasDiskPressure` - `EvictionThresholdMet` - `NodeNotReady` --- ## Step 4: Recovery Validation ```bash kubectl get nodes ``` Expected: ``` Ready ``` --- ## Common Mistakes - Restarting kubelet blindly - Ignoring DiskPressure - Waiting for SSH access - Treating NodeNotReady as the root issue --- ## E2E Best Practices - Monitor node conditions continuously - Set resource requests & limits - Configure log rotation at app level - Use centralized logging - Alert on `NodeNotReady` and `DiskPressure` --- ## Final Note `NodeNotReady` is a **signal**, not a failure. Fix the **cause**, and the node will recover automatically. ---