---
title: Troubleshoot GPU Nodes
---

# Troubleshoot GPU Nodes

This page covers the NVIDIA driver, CUDA, and GPU-specific command-line issues you can hit on a GPU node. For everything else — node not accessible, networking, disk space, security groups, encryption, monitoring graphs missing, lifecycle actions — use the canonical [Node Troubleshooting](/docs/myaccount/node/troubleshoot/) guides. A GPU node is a regular node, so the same fixes apply.

If the symptom persists after the checks below, contact [cloud-platform@e2enetworks.com](mailto:cloud-platform@e2enetworks.com).

---

## `nvidia-smi` Does Not Work

### `nvidia-smi: command not found`

The NVIDIA driver utilities are not on the host's `PATH`.

```bash
which nvidia-smi
ls /usr/bin/nvidia-smi /usr/local/cuda/bin/nvidia-smi 2>/dev/null
dpkg -l | grep -i nvidia       # Ubuntu / Debian
rpm -qa | grep -i nvidia       # Rocky / RHEL
```

If no NVIDIA packages are installed, the host is missing the datacenter driver. Reinstall it:

```bash
# Ubuntu 22.04 / 24.04
sudo apt update
sudo apt install -y nvidia-driver-580-server   # or the branch your image targets
sudo reboot

# Rocky 9
sudo dnf install -y dnf-plugins-core
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf module install -y nvidia-driver:580-dkms
sudo reboot
```

Pick the driver branch that matches the CUDA version you need (570.x → CUDA 12.8, 580.x → CUDA 13.0). Current E2E GPU nodes run the 580.x branch.

### `NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver`

The userspace tool is present but the kernel module is not loaded (or the loaded module does not match the userspace version).

```bash
lsmod | grep nvidia
cat /proc/driver/nvidia/version 2>/dev/null
dmesg | grep -i nvidia | tail -40
```

Common causes and fixes:

| Cause                                                            | Fix                                                                                       |
| ---------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| Kernel was upgraded but the NVIDIA module was not rebuilt        | `sudo dkms autoinstall && sudo reboot`, or reinstall the matching `nvidia-driver-*` package. |
| Userspace driver version ≠ kernel module version                 | `apt reinstall nvidia-driver-<branch>-server` (Ubuntu) or `dnf reinstall nvidia-driver` (Rocky), then reboot. |
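| `nvidia` module is blacklisted in `/etc/modprobe.d`              | Remove the `blacklist nvidia` entry (leave any `nouveau` blacklist in place, since the NVIDIA driver needs it), then run `sudo update-initramfs -u` (Ubuntu) or `sudo dracut --force` (Rocky) and reboot. |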
| Secure Boot rejected the unsigned module                         | Disable Secure Boot on the node or enroll the NVIDIA signing MOK key.                     |
| Driver was partially installed and `/var/log/nvidia-installer.log` shows errors | Purge and reinstall: `sudo apt purge '*nvidia*' && sudo apt autoremove`, then reinstall. |

After any driver reinstall, reboot and re-run `nvidia-smi`.
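
To tell a stale module from a missing one, it can help to compare the version of the loaded kernel module with the version of the module installed on disk. The sketch below assumes the driver exposes `/sys/module/nvidia/version` and that `modinfo` can locate the `nvidia` module; `compare_driver_versions` is a hypothetical helper name, not an NVIDIA tool.

```bash
# Compare two driver version strings and report whether they match.
# compare_driver_versions is an illustrative helper, not an NVIDIA tool.
compare_driver_versions() {
  local loaded="$1" on_disk="$2"
  if [ -z "$loaded" ]; then
    echo "kernel module not loaded"
    return 2
  fi
  if [ "$loaded" = "$on_disk" ]; then
    echo "match: $loaded"
  else
    echo "mismatch: loaded=$loaded on-disk=$on_disk"
    return 1
  fi
}

# Assumed sources for the two versions on a typical Linux install.
loaded="$(cat /sys/module/nvidia/version 2>/dev/null || true)"
on_disk="$(modinfo -F version nvidia 2>/dev/null || true)"
compare_driver_versions "$loaded" "$on_disk" || true
```

A mismatch usually means the node has not been rebooted since the driver was updated, or the module was not rebuilt for the running kernel.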

---

## GPU Card Not Detected

`nvidia-smi` runs but shows fewer cards than expected, or `No devices were found`.

```bash
lspci | grep -i nvidia
nvidia-smi -L
dmesg | grep -i -E 'nvidia|nvrm|xid' | tail -60
```

| Symptom                                                                   | Likely cause                                              | Fix                                                                                                       |
| ------------------------------------------------------------------------- | --------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| `lspci` shows the card but `nvidia-smi -L` does not                       | Driver loaded but failed to bind to the device.           | Reboot. If it persists, reinstall the driver and check `dmesg` for `Xid` errors.                          |
| Card count is lower than the plan (e.g., 4× plan shows 2 cards)            | One or more cards failed to initialize.                   | Reboot. If still missing, save an image and open a support ticket — this is a host-side issue.            |
| `Xid 79` (GPU fallen off the bus) in `dmesg`                              | Hardware fault or PCIe link reset.                        | Reboot; if the Xid returns, contact support — the host needs to be inspected.                             |
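| `Xid 13`, `Xid 31`, `Xid 43` repeating                                    | Application-side GPU memory or program fault.             | Restart the workload; if reproducible across nodes, debug the application (check kernel launch parameters and shared memory). |
| `Xid 48` (double-bit ECC error) in `dmesg`                                | Uncorrectable GPU memory error.                           | Reboot; if the Xid returns, contact support. Check `nvidia-smi -q -d ECC` for the error counts.           |
| `RmInitAdapter failed` in `dmesg`                                         | The driver could not initialize the card.                 | Reboot. If persistent, reinstall the driver; if it still fails, contact support.                          |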

For repeated `Xid` errors with the same number, capture `nvidia-smi -q`, `dmesg`, and the saved-image ID before opening a support ticket. (`ERROR` is not a valid display type for `nvidia-smi -q -d`; use `nvidia-smi -q -d ECC` to isolate ECC error counts.)
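
To see at a glance which Xid codes repeat, the kernel log can be summarized per code. This is a minimal sketch that assumes Xid lines follow the usual `NVRM: Xid (PCI:...): <code>, ...` layout; `count_xids` is a hypothetical helper name.

```bash
# Count Xid events per code from kernel log text on stdin.
# count_xids is an illustrative helper name.
count_xids() {
  grep -oE 'Xid \([^)]*\): [0-9]+' \
    | awk '{print $NF}' \
    | sort -n \
    | uniq -c
}

# Usage on a node (dmesg may need sudo on some images).
if command -v dmesg >/dev/null 2>&1; then
  dmesg 2>/dev/null | count_xids || true
fi
```

Each output line is a count followed by the Xid code, so a code with a high count is the one to quote in the ticket.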

---

## CUDA Version Issues

### Application reports a CUDA version that the driver does not support

`nvidia-smi` shows the **maximum** CUDA the driver supports. The application's CUDA must be ≤ that number.

```bash
nvidia-smi | head -3
nvcc --version 2>/dev/null
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())" 2>/dev/null
```

| Situation                                                              | Fix                                                                                                  |
| ---------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
| App wants CUDA 12.4, `nvidia-smi` shows `CUDA Version: 12.2`           | Upgrade the driver to a branch that supports CUDA ≥ 12.4 (570.x or newer), then reboot.              |
| App wants CUDA 11.8, `nvidia-smi` shows `CUDA Version: 13.0`           | Backward-compatible: a newer driver runs applications built for older CUDA releases. No driver change needed. If the app links against `libcudart.so.11.0`, install the CUDA 11.8 runtime libraries (part of the toolkit, not the driver). |
| `nvcc --version` differs from `nvidia-smi` CUDA Version                | Expected. `nvcc` shows the installed **toolkit**; `nvidia-smi` shows the driver's supported CUDA. The toolkit can be older. |
| `torch.cuda.is_available()` returns `False`                            | The installed PyTorch wheel is CPU-only or was built for a CUDA version newer than the driver supports. Reinstall a PyTorch build that matches the driver's CUDA; see the [PyTorch install matrix](https://pytorch.org/get-started/locally/). |
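
The "application CUDA must be ≤ driver CUDA" rule can be checked with a version-aware comparison. The sketch below assumes GNU `sort -V` is available and that the `nvidia-smi` header prints `CUDA Version: X.Y`; `cuda_supported` and the `app_cuda` value are placeholders for illustration.

```bash
# Return success if the application's CUDA version is <= the driver's maximum.
# cuda_supported is an illustrative helper; sort -V assumes GNU coreutils.
cuda_supported() {
  local driver_max="$1" app="$2"
  [ "$(printf '%s\n%s\n' "$app" "$driver_max" | sort -V | tail -n 1)" = "$driver_max" ]
}

# Read the driver's maximum CUDA version from the nvidia-smi header.
if command -v nvidia-smi >/dev/null 2>&1; then
  driver_max="$(nvidia-smi | grep -oE 'CUDA Version: [0-9]+\.[0-9]+' | awk '{print $3}')"
  app_cuda="12.4"   # assumption: replace with the version your application needs
  if cuda_supported "$driver_max" "$app_cuda"; then
    echo "driver (max CUDA $driver_max) can run CUDA $app_cuda"
  else
    echo "driver (max CUDA $driver_max) is too old for CUDA $app_cuda"
  fi
fi
```

A version sort is used instead of a numeric comparison so that `12.10` is treated as newer than `12.4`.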

### `nvcc: command not found`

The CUDA **toolkit** is not installed on the host. The toolkit is only needed if you compile CUDA code on the host itself. On container-based images, install the toolkit **inside** the container instead.

```bash
# Ubuntu: add the NVIDIA CUDA apt repo first if it is not already configured:
# https://developer.nvidia.com/cuda-downloads (select Linux → x86_64 → Ubuntu → 24.04 → deb(network))
sudo apt install -y cuda-toolkit-12-8     # match the driver branch; use cuda-toolkit-13-0 for driver 580.x

# Rocky 9: run this after adding the CUDA repo
sudo dnf install -y cuda-toolkit-12-8
```

Add `/usr/local/cuda/bin` to `PATH` and `/usr/local/cuda/lib64` to `LD_LIBRARY_PATH` if not already present:

```bash
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```
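
Running the `echo ... >> ~/.bashrc` lines more than once appends duplicate entries. A guarded append avoids that; the following is a small sketch in which `append_once` is a hypothetical helper name.

```bash
# Append a line to a file only if that exact line is not already present.
# append_once is an illustrative helper name.
append_once() {
  local line="$1" file="$2"
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (single quotes keep $PATH literal so it expands at login time):
#   append_once 'export PATH=/usr/local/cuda/bin:$PATH' ~/.bashrc
#   append_once 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' ~/.bashrc
```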

---

## GPU Out of Memory

```bash
nvidia-smi
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```

| Cause                                            | Fix                                                                                                    |
| ------------------------------------------------ | ------------------------------------------------------------------------------------------------------ |
| A previous job's process is still holding memory | Kill the PID shown by `nvidia-smi` (`sudo kill <pid>`) or reboot the node.                             |
| Batch size or sequence length is too large       | Reduce batch size, enable gradient accumulation, or switch to a higher-memory card.                    |
| KV cache fills up at long context lengths        | Lower `max_model_len` / context window, reduce concurrency, or use a card with more memory.            |
| Memory leak — used memory grows without plateau  | Restart the workload; inspect for tensor accumulation across iterations.                               |
| Fragmentation after many allocations             | For PyTorch, set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` and restart the workload.          |
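
When several cards are present, it can be quicker to list only the ones that are nearly full. This sketch assumes the `index`, `memory.used`, and `memory.total` query fields with `csv,noheader,nounits` output; `flag_full_gpus` and the 90% threshold are illustrative choices.

```bash
# Print GPUs whose used memory is at or above a percentage threshold.
# Expects "index, memory.used, memory.total" CSV rows (no header, no units) on stdin.
flag_full_gpus() {
  local threshold="${1:-90}"
  awk -F', *' -v t="$threshold" '
    $3 > 0 {
      pct = 100 * $2 / $3
      if (pct >= t) printf "GPU %s: %.0f%% used (%s / %s MiB)\n", $1, pct, $2, $3
    }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,memory.used,memory.total \
             --format=csv,noheader,nounits | flag_full_gpus 90
fi
```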

---

## Thermal Throttling and Power Capping

```bash
nvidia-smi --query-gpu=index,temperature.gpu,power.draw,power.limit,clocks_event_reasons.active --format=csv
nvidia-smi -q -d PERFORMANCE | head -40
```

:::note
Older driver branches name this field `clocks_throttle_reasons.active`; newer branches, including 580.x, call it `clocks_event_reasons.active`. The old name still works as an input alias, but the CSV output header shows the new name. Run `nvidia-smi --help-query-gpu` to list the field names your driver accepts.
:::

| Sign                                                            | Meaning / Fix                                                                                  |
| --------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| `clocks_event_reasons.hw_thermal_slowdown: Active`             | Card is at or above its thermal limit. Reduce duty cycle, balance work across cards, or open a support ticket if the temperature is anomalous for the workload. |
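| `clocks_event_reasons.sw_power_cap: Active`                    | The software power limit is capping clocks. Compare `power.limit` with `power.draw`. If `power.limit` is lower than the card's stock TDP, reset it: `sudo nvidia-smi -pl <stock_watts>`. If `hw_power_brake_slowdown` is active instead, the brake is asserted by the host's power hardware; contact support. |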
| Temperature sustained ≥85 °C                                    | Low cooling headroom. Coordinate with support — this is host-side.                             |
| Frequent clock drops with no thermal/power signal               | The GPU is idling between kernels, usually because the input pipeline cannot keep it fed. Profile with `nsys` (Nsight Systems) and fix the data pipeline. |

---

## Persistence Mode and Slow First Iteration

If the first call after a long idle period is unusually slow, the driver may be unloading and reloading between processes.

```bash
nvidia-smi -q | grep Persistence
sudo nvidia-smi -pm 1     # enable persistence mode on all cards
```

Set persistence mode to `Enabled` on long-running inference and training nodes. Add `nvidia-smi -pm 1` to a start script or systemd unit so it survives reboots.
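
One way to make the setting survive reboots is a oneshot systemd unit. The sketch below writes such a unit; the unit name `gpu-persistence-mode.service`, its path, and the `/usr/bin/nvidia-smi` location are assumptions to adjust for your image.

```bash
# Write a oneshot systemd unit that enables persistence mode at boot.
# The unit name and default path are illustrative assumptions.
write_persistence_unit() {
  local dest="${1:-/etc/systemd/system/gpu-persistence-mode.service}"
  cat > "$dest" <<'EOF'
[Unit]
Description=Enable NVIDIA persistence mode

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (as root):
#   write_persistence_unit
#   systemctl daemon-reload
#   systemctl enable --now gpu-persistence-mode.service
```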

### `systemctl enable nvidia-persistenced` Returns "no installation config" Error

On Ubuntu 24.04 with NVIDIA driver 580.x, `nvidia-persistenced.service` is a **static** systemd unit. It starts automatically at boot as a dependency of the NVIDIA driver — you do not need to enable it.

```bash
systemctl is-active nvidia-persistenced   # should print "active"
```

If the output is `active`, the daemon is already running and no `systemctl enable` step is needed. Run `sudo nvidia-smi -pm 1` to turn persistence mode on, then confirm with `nvidia-smi -q | grep Persistence`, and check again after the next reboot. The "no installation config" message from `systemctl enable nvidia-persistenced` is expected for a static unit and can be safely ignored.

On driver 570.x and earlier the unit is not static, so `systemctl enable` is required there.

---

## Docker Cannot Access the GPU

```bash
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

| Error                                                                  | Fix                                                                                                       |
| ---------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| `docker: command not found`                                            | Docker is not pre-installed on Ubuntu 24.04-based GPU images. Install it: `sudo apt install -y docker.io` |
| `unknown flag: --gpus`                                                 | Docker is too old. Upgrade Docker Engine to a current release.                                            |
| `could not select device driver "" with capabilities: [[gpu]]`         | NVIDIA Container Toolkit is missing or not configured. Install and configure it with the steps below this table. |
| `nvidia-container-cli: initialization error`                           | Driver is broken on the host. Fix `nvidia-smi` on the host first (see above).                             |

Install or repair the NVIDIA Container Toolkit:

```bash
# Ubuntu
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Re-run the test container after the install completes.
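
If the test container still fails, it is worth confirming that Docker actually knows about the NVIDIA runtime. The sketch below assumes `nvidia-ctk runtime configure --runtime=docker` registered the runtime in `/etc/docker/daemon.json`; `has_nvidia_runtime` is a hypothetical helper that does a plain text match, not real JSON parsing.

```bash
# Return success if the given Docker daemon config registers an "nvidia" runtime.
# A plain text match keeps the sketch dependency-free; it is not a JSON parser.
has_nvidia_runtime() {
  local config="${1:-/etc/docker/daemon.json}"
  [ -r "$config" ] && grep -q '"nvidia"' "$config"
}

if has_nvidia_runtime; then
  echo "nvidia runtime is registered in daemon.json"
else
  echo "nvidia runtime not found; re-run: sudo nvidia-ctk runtime configure --runtime=docker"
fi
```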

---

## Windows GPU Node: Driver Issues

### Device Manager Shows `Microsoft Basic Display Adapter`

The NVIDIA driver is not installed or did not load. Open PowerShell:

```powershell
Get-PnpDevice -Class Display
nvidia-smi
```

If `nvidia-smi` reports the same "couldn't communicate with the NVIDIA driver" message as Linux, reinstall the driver from the [NVIDIA Datacenter Driver](https://www.nvidia.com/Download/Find.aspx) page for the card and Windows Server version.

### Switching Between WDDM and TCC

For RDP-rendered visualization, the card must be in **WDDM** mode. For headless compute, **TCC** can give slightly better performance but disables the display path.

```powershell
nvidia-smi -dm 0     # WDDM (display + compute)
nvidia-smi -dm 1     # TCC (compute only — RDP desktop will not render through this GPU)
```

Reboot after changing the mode.

---

## When to Open a Support Ticket

Gather this before contacting support — it makes diagnosis significantly faster:

```bash
nvidia-smi -q > nvidia-smi-q.txt
dmesg | grep -i -E 'nvidia|nvrm|xid' > nvidia-dmesg.txt
uname -a > sysinfo.txt
cat /etc/os-release >> sysinfo.txt
cat /proc/driver/nvidia/version >> sysinfo.txt 2>/dev/null
```

Attach those three files plus the GPU node ID, the saved-image ID (if any), and the time range of the failure.
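
The three files can be bundled into a single attachment. This is a minimal sketch; `bundle_diagnostics` and the archive name are illustrative, and files that do not exist are skipped rather than treated as errors.

```bash
# Bundle whichever diagnostic files exist into one gzip-compressed archive.
# bundle_diagnostics is an illustrative helper name.
bundle_diagnostics() {
  local archive="$1"; shift
  local files=() f
  for f in "$@"; do
    if [ -f "$f" ]; then
      files+=("$f")
    fi
  done
  if [ "${#files[@]}" -eq 0 ]; then
    echo "no diagnostic files found" >&2
    return 1
  fi
  tar -czf "$archive" "${files[@]}"
}

# Usage:
#   bundle_diagnostics "gpu-diagnostics-$(date +%Y%m%d-%H%M%S).tar.gz" \
#     nvidia-smi-q.txt nvidia-dmesg.txt sysinfo.txt
```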

---

## Related Resources

| Resource                                                                                 | Use it for                                                       |
| ---------------------------------------------------------------------------------------- | ---------------------------------------------------------------- |
| [Node Troubleshooting](/docs/myaccount/node/troubleshoot/)                               | Networking, disk, security, monitoring graphs, lifecycle issues. |
| [Node Not Accessible](/docs/myaccount/node/troubleshoot/node-not-accessible)             | Can't SSH or RDP into the node at all.                           |
| [Disk Space Issues](/docs/myaccount/node/troubleshoot/disk-space)                        | GPU images and model weights fill the root disk fast.            |
| [Connect to a Linux GPU node](/docs/myaccount/gpu/connect-to-gpu/linux-gpu-node)         | Baseline `nvidia-smi` check after launch.                        |
| [Connect to a Windows GPU node](/docs/myaccount/gpu/connect-to-gpu/windows-gpu-node)     | Baseline driver check on Windows.                                |
| [Manage GPU Nodes](/docs/myaccount/gpu/manage)                                           | GPU-specific lifecycle and monitoring differences.               |