# NodeNotReady

This document provides a standard operating procedure (SOP) to identify, diagnose, and resolve **NodeNotReady** issues in E2E Kubernetes clusters.

## Overview

In Kubernetes, a node is marked **NotReady** when the control plane stops receiving healthy status signals from the node.

When a node goes into `NodeNotReady` state:

- Pods stop scheduling on that node
- Existing pods may get evicted
- Cluster capacity reduces silently (very dangerous in production)

### Why It Breaks (Common Causes)

From real production incidents, **90%** of the time it's due to:

- **kubelet is down or stuck**
- **Node lost network connectivity** (to control plane)
- **DiskPressure / MemoryPressure**
- **Container runtime failure** (Docker / containerd)
- **Node clock skew** (NTP issues)

---

## Step 1: Identify Affected Nodes

```bash
kubectl get nodes
```

---

## Step 2: Describe the Node (Primary Diagnostic)

```bash
kubectl describe node <node-name>
```

Focus on **Conditions**:

| Condition | Meaning |
|---|---|
| KubeletNotReady | kubelet unhealthy or not reporting |
| NetworkUnavailable | CNI / network connectivity issue |
| DiskPressure | Disk space exhausted |
| MemoryPressure | Node out of memory |
| PIDPressure | Process limit reached |

---

## Common Causes & Resolutions

### 1. DiskPressure (Disk Space Exhaustion)

#### How It Happens

- Excessive application logs
- No log rotation
- emptyDir volumes consuming space
- Image cache growth

#### How to Identify

```bash
kubectl describe node <node-name>
```

Look for:

```
DiskPressure=True
```

Check **ephemeral-storage** under Capacity / Allocatable.

---

### 2. kubelet Not Running or Stuck

#### Symptoms

- `Ready=False`
- `KubeletNotReady`

#### Identification

```bash
kubectl describe node <node-name>
```

#### Resolution

- In E2E-managed clusters, restart is handled by the platform
- If persistent, raise a **node-level support request**

---

### 3. Network Connectivity Loss

#### Symptoms

- `NetworkUnavailable=True`
- Multiple nodes NotReady simultaneously

#### Common Causes

- Firewall or security group change
- Control plane connectivity loss
- CNI failure

#### Resolution

- Verify control-plane reachability
- Roll back recent network changes
- Escalate to network team if needed

---

### 4. MemoryPressure

#### Symptoms

- Pods evicted
- Node marked NotReady

#### Identification

```bash
kubectl describe node <node-name>
```

Look for:

```
MemoryPressure=True
```

#### Resolution

- Scale down memory-heavy workloads
- Fix memory leaks
- Add memory requests and limits

---

### 5. Container Runtime Failure

#### Symptoms

- Pods failing across the node
- Runtime errors in events

#### Common Causes

- `containerd` crash
- Image filesystem corruption

#### Resolution

- Platform-managed restart
- Node replacement if persistent

---

## Step 3: Check Events (Evidence)

```bash
kubectl get events -A --sort-by=.lastTimestamp
```

Useful events to look for:

- `NodeHasDiskPressure`
- `EvictionThresholdMet`
- `NodeNotReady`

---

## Step 4: Recovery Validation

```bash
kubectl get nodes
```

Expected:

```
<node-name>   Ready
```

---

## Common Mistakes

- Restarting kubelet blindly
- Ignoring DiskPressure
- Waiting for SSH access
- Treating NodeNotReady as the root issue

---

## E2E Best Practices

- Monitor node conditions continuously
- Set resource requests & limits
- Configure log rotation at app level
- Use centralized logging
- Alert on `NodeNotReady` and `DiskPressure`

---

## Final Note

`NodeNotReady` is a **signal**, not a failure. Fix the **cause**, and the node will recover automatically.


---