
NodeNotReady

This document provides a standard operating procedure (SOP) to identify, diagnose, and resolve NodeNotReady issues in E2E Kubernetes clusters.

Overview

In Kubernetes, a node is marked NotReady when the control plane stops receiving healthy status updates (kubelet heartbeats) from the node.

When a node goes into NodeNotReady state:

  • New pods are no longer scheduled on that node
  • Existing pods may be evicted
  • Cluster capacity shrinks silently (especially dangerous in production)

Why It Breaks (Common Causes)

In real production incidents, the cause is almost always one of the following:

  • kubelet is down or stuck
  • Node lost network connectivity (to control plane)
  • DiskPressure / MemoryPressure
  • Container runtime failure (Docker / containerd)
  • Node clock skew (NTP issues)

Step 1: Identify Affected Nodes

kubectl get nodes
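On a large cluster it helps to filter the output down to unhealthy nodes. A minimal sketch — the sample output below is illustrative, not from a real cluster, so the filter can be shown without cluster access:

```shell
# Against a live cluster you would pipe the real command instead:
#   kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1, $2}'
# Illustrative sample of `kubectl get nodes --no-headers` output:
kubectl_get_nodes_sample() {
cat <<'EOF'
node-a   Ready      worker   12d   v1.28.4
node-b   NotReady   worker   12d   v1.28.4
node-c   Ready      worker   12d   v1.28.4
EOF
}
# Keep only rows whose STATUS column is not "Ready".
kubectl_get_nodes_sample | awk '$2 != "Ready" {print $1, $2}'
# → node-b NotReady
```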

Step 2: Describe the Node (Primary Diagnostic)

kubectl describe node <node-name>

Focus on Conditions:

Condition                       Meaning
Ready=False (KubeletNotReady)   kubelet unhealthy or not reporting
NetworkUnavailable              CNI / network connectivity issue
DiskPressure                    Disk space exhausted
MemoryPressure                  Node out of memory
PIDPressure                     Process limit reached
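Conditions can also be pulled out programmatically with jsonpath. The sketch below reproduces that output from a hard-coded sample (values are illustrative) so the abnormal-condition filter runs without a cluster:

```shell
# Live command printing "type<TAB>status" per condition:
#   kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
# Illustrative sample of that output:
conditions_sample() {
  printf 'MemoryPressure\tFalse\nDiskPressure\tTrue\nPIDPressure\tFalse\nReady\tFalse\n'
}
# Abnormal = a pressure condition that is True, or Ready that is not True.
conditions_sample | awk -F'\t' '($1=="Ready" && $2!="True") || ($1!="Ready" && $2=="True")'
```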

Common Causes & Resolutions

1. DiskPressure (Disk Space Exhaustion)

How It Happens

  • Excessive application logs
  • No log rotation
  • emptyDir volumes consuming space
  • Image cache growth

How to Identify

kubectl describe node <node-name>

Look for:

DiskPressure=True

Check ephemeral-storage under Capacity / Allocatable.
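If you have shell access to the node, a plain df check confirms what the condition is reporting. A sketch — the 85% threshold here is an illustrative default, not kubelet's actual eviction threshold (kubelet's default evicts when nodefs free space drops below 10%):

```shell
# Read the root filesystem usage percentage (column 5 of `df -P`, "%" stripped).
usage=$(df -P / | awk 'NR==2 {gsub("%","",$5); print $5}')
if [ "$usage" -ge 85 ]; then
  echo "WARN: root filesystem ${usage}% full"
else
  echo "OK: root filesystem ${usage}% full"
fi
```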


2. kubelet Not Running or Stuck

Symptoms

  • Ready=False
  • KubeletNotReady

Identification

kubectl describe node <node-name>

Resolution

  • In E2E-managed clusters, restart is handled by the platform
  • If persistent, raise a node-level support request

3. Network Connectivity Loss

Symptoms

  • NetworkUnavailable=True
  • Multiple nodes NotReady simultaneously

Common Causes

  • Firewall or security group change
  • Control plane connectivity loss
  • CNI failure

Resolution

  • Verify control-plane reachability
  • Roll back recent network changes
  • Escalate to network team if needed
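A quick TCP probe from the node verifies control-plane reachability before escalating. A sketch, assuming a bash environment; the host and port are placeholders — substitute your cluster's API server endpoint (typically port 6443). The example deliberately probes a closed local port so it runs anywhere:

```shell
# Probe a host:port with a 2-second timeout using bash's /dev/tcp.
check_apiserver() {
  local host="$1" port="$2"
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}
# Example against a local port that is almost certainly closed:
check_apiserver 127.0.0.1 1
```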

4. MemoryPressure

Symptoms

  • Pods evicted
  • Node marked NotReady

Identification

kubectl describe node <node-name>

Look for:

MemoryPressure=True

Resolution

  • Scale down memory-heavy workloads
  • Fix memory leaks
  • Add memory requests and limits
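The last point looks like this in a pod spec (names, image, and values below are illustrative, not prescriptive):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                               # illustrative name
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
      resources:
        requests:
          memory: "256Mi"   # scheduler reserves this much on the node
        limits:
          memory: "512Mi"   # container is OOM-killed if it exceeds this
```

Requests keep the scheduler from overpacking a node; limits keep one leaking container from starving the node and triggering MemoryPressure.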

5. Container Runtime Failure

Symptoms

  • Pods failing across the node
  • Runtime errors in events

Common Causes

  • containerd crash
  • Image filesystem corruption

Resolution

  • Platform-managed restart
  • Node replacement if persistent

Step 3: Check Events (Evidence)

kubectl get events -A --sort-by=.lastTimestamp

Useful events to look for:

  • NodeHasDiskPressure
  • EvictionThresholdMet
  • NodeNotReady
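Those event reasons can be grepped straight out of the event stream. The sketch below runs the filter against hard-coded sample lines (illustrative, not real cluster output):

```shell
# Live command:
#   kubectl get events -A --sort-by=.lastTimestamp | grep -E 'NodeNotReady|DiskPressure|EvictionThresholdMet'
# Illustrative sample event lines:
events_sample() {
cat <<'EOF'
5m    Warning   NodeHasDiskPressure   node/node-b   Node node-b status is now: NodeHasDiskPressure
4m    Normal    Pulled                node/node-a   Container image already present on machine
3m    Warning   EvictionThresholdMet  node/node-b   Attempting to reclaim ephemeral-storage
1m    Warning   NodeNotReady          node/node-b   Node node-b status is now: NodeNotReady
EOF
}
# Keep only the reasons relevant to this SOP.
events_sample | grep -E 'NodeNotReady|DiskPressure|EvictionThresholdMet'
```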

Step 4: Recovery Validation

kubectl get nodes

Expected:

<node-name>   Ready
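Instead of re-running `kubectl get nodes` by hand, `kubectl wait` blocks until the condition clears. The loop below shows the same polling idea offline, with a stub standing in for the status query (the stub is illustrative only):

```shell
# Live equivalent:
#   kubectl wait --for=condition=Ready "node/<node-name>" --timeout=300s
# Offline sketch: poll a status function until it reports Ready or we give up.
attempt=0
node_status() {                 # stub: flips to Ready after 3 polls
  [ "$attempt" -ge 3 ] && echo "Ready" || echo "NotReady"
}
until [ "$(node_status)" = "Ready" ]; do
  attempt=$((attempt + 1))
  [ "$attempt" -gt 10 ] && { echo "timed out"; break; }
  sleep 0.1                     # real polling would sleep far longer
done
echo "final status: $(node_status)"
# → final status: Ready
```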

Common Mistakes

  • Restarting kubelet blindly
  • Ignoring DiskPressure
  • Waiting for SSH access
  • Treating NodeNotReady as the root cause instead of a symptom

E2E Best Practices

  • Monitor node conditions continuously
  • Set resource requests & limits
  • Configure log rotation at app level
  • Use centralized logging
  • Alert on NodeNotReady and DiskPressure
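For the log-rotation point, the kubelet itself can cap container log growth via its KubeletConfiguration. A fragment with illustrative values — on E2E-managed clusters these settings may be controlled by the platform rather than editable by you:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "50Mi"   # rotate a container's log once it reaches this size
containerLogMaxFiles: 3       # keep at most this many rotated files per container
```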

Final Note

NodeNotReady is a signal, not a failure. Fix the cause, and the node will recover automatically.