Troubleshoot GPU Nodes

This page covers the NVIDIA driver, CUDA, and GPU-specific command-line issues you can hit on a GPU node. For everything else - node not accessible, networking, disk space, security groups, encryption, monitoring graphs missing, lifecycle actions - use the canonical Node Troubleshooting guides. A GPU node is a regular node, so the same fixes apply.

If the symptom persists after the checks below, contact cloud-platform@e2enetworks.com.

`nvidia-smi` Does Not Work

`nvidia-smi: command not found`

The NVIDIA driver utilities are either not installed or not on the host's `PATH`. Check both:

which nvidia-smi
ls /usr/bin/nvidia-smi /usr/local/cuda/bin/nvidia-smi 2>/dev/null
dpkg -l | grep -i nvidia       # Ubuntu / Debian
rpm -qa | grep -i nvidia       # Rocky / RHEL
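
The checks above can be folded into one helper that reports which of the three situations applies. This is a minimal sketch: the function name `locate_nvidia_smi` is hypothetical, and it only looks in the two locations listed above.

```shell
# Hypothetical helper: report where nvidia-smi lives, if anywhere.
locate_nvidia_smi() {
  # Case 1: the binary is already reachable through PATH.
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "on-path: $(command -v nvidia-smi)"
    return 0
  fi
  # Case 2: the binary exists in one of the locations checked above,
  # but its directory is missing from PATH.
  for candidate in /usr/bin/nvidia-smi /usr/local/cuda/bin/nvidia-smi; do
    if [ -x "$candidate" ]; then
      echo "off-path: $candidate"
      return 0
    fi
  done
  # Case 3: nothing found; the driver packages are probably not installed.
  echo "missing"
  return 1
}
```

An `off-path` result points to a `PATH` problem rather than a missing driver, so a reinstall is likely unnecessary in that case.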

If no NVIDIA packages are installed, the host is missing the datacenter driver. Reinstall it:

# Ubuntu 22.04 / 24.04
sudo apt update
sudo apt install -y nvidia-driver-580-server   # or the branch your image targets
sudo reboot

# Rocky 9
sudo dnf install -y dnf-plugins-core
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf module install -y nvidia-driver:580-dkms
sudo reboot

Pick the driver branch that matches the CUDA version you need (570.x → CUDA 12.8, 580.x → CUDA 13.0). Current E2E GPU nodes run the 580.x branch.
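
The branch-to-CUDA mapping above can be encoded as a small lookup for use in provisioning scripts. This is a sketch under stated assumptions: `max_cuda_for_driver` is a hypothetical name, the version strings passed to it are illustrative, and only the two branches named above are covered.

```shell
# Hypothetical helper: map a driver version string to the CUDA version
# named in the text above. Only the two branches mentioned there are covered.
max_cuda_for_driver() {
  driver_version=$1
  branch=${driver_version%%.*}   # strip everything after the first dot
  case "$branch" in
    570) echo "12.8" ;;
    580) echo "13.0" ;;
    *)   echo "unknown"; return 1 ;;
  esac
}
```

Any other branch returns `unknown` with a non-zero exit status, so a calling script can stop instead of guessing.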

`NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver`

The userspace tool is present but the kernel module is not loaded (or the loaded module does not match the userspace version).

lsmod | grep nvidia
cat /proc/driver/nvidia/version 2>/dev/null
dmesg | grep -i nvidia | tail -40

Common causes and fixes:

Cause	Fix
Kernel was upgraded but the NVIDIA module was not rebuilt	`sudo dkms autoinstall && sudo reboot`, or reinstall the matching `nvidia-driver-*` package.
Userspace driver version ≠ kernel module version	`sudo apt reinstall nvidia-driver-<branch>-server` (Ubuntu) or `sudo dnf reinstall nvidia-driver` (Rocky), then reboot.
The `nvidia` module is blacklisted in `/etc/modprobe.d` (a `nouveau` blacklist is expected and should stay)	Remove the `nvidia` blacklist entry, rebuild the initramfs (`sudo update-initramfs -u` on Ubuntu, `sudo dracut -f` on Rocky), and reboot.
Secure Boot rejected the unsigned module	Disable Secure Boot on the node or enroll the Machine Owner Key (MOK) used to sign the NVIDIA module.
Driver was partially installed and `/var/log/nvidia-installer.log` shows errors	Purge and reinstall (Ubuntu): `sudo apt purge '*nvidia*' && sudo apt autoremove`, then reinstall the driver package.

After any driver reinstall, reboot and re-run `nvidia-smi`.

GPU Card Not Detected

`nvidia-smi` runs but shows fewer cards than expected, or reports `No devices were found`.

lspci | grep -i nvidia
nvidia-smi -L
dmesg | grep -i -E 'nvidia|nvrm|xid' | tail -60

Symptom	Likely cause	Fix
`lspci` shows the card but `nvidia-smi -L` does not	Driver loaded but failed to bind to the device.	Reboot. If it persists, reinstall the driver and check `dmesg` for `Xid` errors.
Card count is lower than the plan (e.g., 4× plan shows 2 cards)	One or more cards failed to initialize.	Reboot. If still missing, save an image and open a support ticket - this is a host-side issue.
`Xid 79` (GPU fallen off the bus) in `dmesg`	Hardware fault or PCIe link reset.	Reboot; if the Xid returns, contact support - the host needs to be inspected.
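`Xid 13`, `Xid 31`, `Xid 43` repeating	Application-side GPU memory or program fault.	Restart the workload; if reproducible across nodes, debug the application (check kernel launch parameters and shared memory).
`Xid 48` repeating	Double-bit (uncorrectable) ECC error in GPU memory, usually a hardware fault.	Reboot; if the Xid returns, contact support - the host needs to be inspected.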
`RmInitAdapter failed` in `dmesg`	Kernel module loaded against the wrong device topology.	Reboot. If persistent, reinstall the driver.

For repeated Xid errors with the same number, capture `nvidia-smi -q`, `dmesg`, and the saved-image ID before opening a support ticket. Note that `nvidia-smi -q -d ERROR` is not a valid display type; use `nvidia-smi -q -d ECC` to isolate ECC error counts.
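
To see at a glance which Xid numbers repeat, the kernel log can be reduced to a per-code count. This is a minimal sketch: `count_xids` is a hypothetical helper, and it assumes log lines shaped like `NVRM: Xid (PCI:0000:3b:00): 79, pid=..., ...`, which may differ between driver versions.

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "<xid> <count>" per line, most frequent first.
count_xids() {
  # Assumed line shape: "NVRM: Xid (PCI:0000:3b:00): 79, pid=..., ..."
  sed -nE 's/.*Xid \([^)]*\): ([0-9]+).*/\1/p' |
    sort -n | uniq -c | sort -k1,1nr -k2,2n |
    awk '{ print $2, $1 }'
}
```

Typical use is `dmesg | count_xids`. Lines that do not match the assumed shape are silently skipped, so an empty result does not prove that no Xid was logged.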

CUDA Version Issues

Application reports a CUDA version that the driver does not support

`nvidia-smi` shows the maximum CUDA version the driver supports. The CUDA version the application was built against must be ≤ that number.

nvidia-smi | head -3
nvcc --version 2>/dev/null
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())" 2>/dev/null

Situation	Fix
App wants CUDA 12.4, `nvidia-smi` shows `CUDA Version: 12.2`	Upgrade the driver to a branch that supports CUDA ≥ 12.4 (570.x or newer), then reboot.
App wants CUDA 11.8, `nvidia-smi` shows `CUDA Version: 13.0`	Backward-compatible: a newer driver runs applications built against older CUDA versions. No driver change needed. If the app links against `libcudart.so.11.0`, install the CUDA 11.8 runtime libraries (the toolkit, not the driver).
`nvcc --version` differs from `nvidia-smi` CUDA Version	Expected. `nvcc` shows the installed toolkit; `nvidia-smi` shows the driver's supported CUDA. The toolkit can be older.
`torch.cuda.is_available()` returns `False`	The installed PyTorch wheel is CPU-only or built for a CUDA version the driver does not support. Reinstall a PyTorch build matching the driver's CUDA - see the PyTorch install matrix.
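
The "application CUDA ≤ driver CUDA" rule can be checked mechanically. The sketch below uses two hypothetical helpers: `driver_cuda_version` pulls the version out of the `nvidia-smi` banner (assuming it contains the text `CUDA Version: X.Y`), and `cuda_is_supported` compares two `major.minor` strings.

```shell
# Hypothetical helpers for comparing "major.minor" CUDA versions.

# Extract the driver's maximum CUDA version from nvidia-smi header text on stdin.
driver_cuda_version() {
  sed -nE 's/.*CUDA Version: ([0-9]+\.[0-9]+).*/\1/p' | head -n 1
}

# Succeed when the application's CUDA version is <= the driver's maximum.
cuda_is_supported() {
  app=$1
  driver=$2
  awk -v a="$app" -v d="$driver" 'BEGIN {
    split(a, A, "."); split(d, D, ".")
    if (A[1] + 0 < D[1] + 0) exit 0
    if (A[1] + 0 == D[1] + 0 && A[2] + 0 <= D[2] + 0) exit 0
    exit 1
  }'
}
```

For example, `cuda_is_supported 12.4 "$(nvidia-smi | driver_cuda_version)"` exits non-zero when the driver is too old for the application. The comparison is numeric per component, so `12.10` is treated as newer than `12.4`.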

`nvcc: command not found`

The CUDA toolkit is not installed on the host. The toolkit is only needed if you compile CUDA code on the host itself. On container-based images, install the toolkit inside the container instead.

# Ubuntu - add the NVIDIA CUDA apt repo first if not already configured:
# https://developer.nvidia.com/cuda-downloads (select Linux → x86_64 → Ubuntu → 24.04 → deb(network))
sudo apt install -y cuda-toolkit-12-8     # match the driver branch; use cuda-toolkit-13-0 for driver 580.x

# Rocky 9 - after adding the CUDA repo:
sudo dnf install -y cuda-toolkit-12-8

Add `/usr/local/cuda/bin` to `PATH` and `/usr/local/cuda/lib64` to `LD_LIBRARY_PATH` if not already present:

echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
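
Running the `echo ... >> ~/.bashrc` lines more than once stacks duplicate entries. One way to avoid that is to append a line only when it is not already in the file. This is a sketch; `append_once` is a hypothetical helper, not part of any toolkit.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present, so repeated runs do not stack duplicates.
append_once() {
  line=$1
  file=$2
  touch "$file"
  # -x matches whole lines, -F treats the line as a fixed string.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

It would be called as `append_once 'export PATH=/usr/local/cuda/bin:$PATH' ~/.bashrc`, with single quotes so that `$PATH` is written literally and expanded at login.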

GPU Out of Memory

nvidia-smi
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

Cause	Fix
A previous job's process is still holding memory	Kill the PID shown by `nvidia-smi` (`sudo kill <pid>`) or reboot the node.
Batch size or sequence length is too large	Reduce batch size, enable gradient accumulation, or switch to a higher-memory card.
KV cache fills up at long context lengths	Lower `max_model_len` / context window, reduce concurrency, or use a card with more memory.
Memory leak - used memory grows without plateau	Restart the workload; inspect for tensor accumulation across iterations.
Fragmentation after many allocations	For PyTorch, set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` and restart the workload.
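
When several processes share a card, it helps to rank them by memory before deciding what to kill. The sketch below assumes the query above is run with `--format=csv,noheader,nounits`, so each row looks like `pid, process_name, memory_in_MiB`; `top_gpu_processes` is a hypothetical helper name.

```shell
# Hypothetical helper: read "pid, process_name, used_memory" rows on stdin
# (csv,noheader,nounits) and print them sorted by memory, largest first.
top_gpu_processes() {
  sort -t, -k3,3nr |
    awk -F', *' '{ printf "%s MiB  pid %s  %s\n", $3, $1, $2 }'
}
```

Typical use is `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits | top_gpu_processes`. Rows whose memory column is not a number sort to the bottom.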

Thermal Throttling and Power Capping

nvidia-smi --query-gpu=index,temperature.gpu,power.draw,power.limit,clocks_event_reasons.active --format=csv
nvidia-smi -q -d PERFORMANCE | head -40

Note: On NVIDIA driver 570.x and earlier, the field was named `clocks_throttle_reasons.active`. Starting with driver 580.x it was renamed to `clocks_event_reasons.active`. The old name still works as an input alias, but the CSV output header always shows the new name.

Sign	Meaning / Fix
`clocks_event_reasons.hw_thermal_slowdown: Active`	Card is at or above its thermal limit. Reduce duty cycle, balance work across cards, or open a support ticket if the temperature is anomalous for the workload.
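`clocks_event_reasons.sw_power_cap: Active`	Power capping: the card is being held at its configured power limit. Check `power.limit` vs `power.draw`. If `power.limit` is lower than the card's stock TDP, reset it: `sudo nvidia-smi -pl <stock_watts>`. If `clocks_event_reasons.hw_power_brake_slowdown: Active` is shown instead, the slowdown comes from an external power-brake signal on the host; open a support ticket.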
Temperature sustained ≥85 °C	Low cooling headroom. Coordinate with support - this is host-side.
Frequent clock drops with no thermal/power signal	Workload is hitting idle periods. Profile with `nsys` (the older `nvprof` does not support Ampere and newer GPUs) and fix the data pipeline.
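
The 85 °C check in the table can be scripted for periodic monitoring. This is a sketch under stated assumptions: `hot_gpus` is a hypothetical helper, and it expects rows produced by `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits`, such as `0, 62`.

```shell
# Hypothetical helper: read "index, temperature.gpu" rows on stdin
# (csv,noheader,nounits) and print the GPUs at or above a threshold.
hot_gpus() {
  threshold=${1:-85}   # default taken from the 85 °C row in the table above
  awk -F', *' -v t="$threshold" \
    '$2 + 0 >= t + 0 { print "GPU " $1 ": " $2 " C" }'
}
```

A single sample says little about sustained temperature, so a result from this helper is best read over several runs a few seconds apart.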

Persistence Mode and Slow First Iteration

If the first call after a long idle period is unusually slow, the driver may be unloading and reloading between processes.

nvidia-smi -q | grep Persistence
sudo nvidia-smi -pm 1     # enable persistence mode on all cards

Set persistence mode to Enabled on long-running inference and training nodes. Add `nvidia-smi -pm 1` to a start script or systemd unit so it survives reboots.
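
One possible shape for such a systemd unit is a oneshot service that runs the command at boot. The sketch below only prints the unit text. The function name, the unit description, and the `/usr/bin/nvidia-smi` path are assumptions, so adjust them to the node.

```shell
# Hypothetical generator for a oneshot systemd unit that runs
# "nvidia-smi -pm 1" at boot. Unit contents and binary path are assumptions.
persistence_unit() {
  cat <<'EOF'
[Unit]
Description=Enable NVIDIA persistence mode on all GPUs

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1

[Install]
WantedBy=multi-user.target
EOF
}
```

To install it, the output could be written to a file such as `/etc/systemd/system/gpu-persistence.service` (a hypothetical name) with `persistence_unit | sudo tee`, followed by `sudo systemctl daemon-reload` and `sudo systemctl enable --now gpu-persistence.service`. This is only worth doing on nodes where the persistence daemon is not already managing the setting.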

`systemctl enable nvidia-persistenced` Returns "no installation config" Error

On Ubuntu 24.04 with NVIDIA driver 580.x, `nvidia-persistenced.service` is a static systemd unit: it has no `[Install]` section, so `systemctl enable` has nothing to act on. It starts automatically at boot as a dependency of the NVIDIA driver - you do not need to enable it.

systemctl is-active nvidia-persistenced   # should print "active"

If the output is `active`, the daemon is already running. Run `sudo nvidia-smi -pm 1` to enable persistence mode for the current session; the daemon keeps the setting across reboots. In this setup `systemctl enable nvidia-persistenced` returns the error above, which can be safely ignored.

On driver 570.x and earlier the unit is not static, so `systemctl enable nvidia-persistenced` is required there.

Docker Cannot Access the GPU

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

Error	Fix
`docker: command not found`	Docker is not pre-installed on Ubuntu 24.04-based GPU images. Install it: `sudo apt install -y docker.io`
`unknown flag: --gpus`	Docker is too old. Upgrade Docker Engine to a current release.
`could not select device driver "" with capabilities: [[gpu]]`	NVIDIA Container Toolkit is missing or not configured. Install and configure it using the steps below this table.
`nvidia-container-cli: initialization error`	Driver is broken on the host. Fix `nvidia-smi` on the host first (see above).

Install or repair the NVIDIA Container Toolkit:

# Ubuntu
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Re-run the test container after the install completes.
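
If the test container still fails, it can help to confirm that `nvidia-ctk runtime configure` actually registered the runtime. The sketch below assumes the runtime is recorded under a `"nvidia"` key in `/etc/docker/daemon.json`; `has_nvidia_runtime` is a hypothetical helper and uses a plain text match rather than a JSON parser.

```shell
# Hypothetical check: succeed when a Docker daemon config file
# declares a runtime named "nvidia".
has_nvidia_runtime() {
  config=${1:-/etc/docker/daemon.json}
  # Match the key form ("nvidia": ...), not a value such as
  # "default-runtime": "nvidia".
  [ -r "$config" ] && grep -q '"nvidia"[[:space:]]*:' "$config"
}
```

For example, `has_nvidia_runtime || echo "nvidia runtime not configured"` prints the message when the key is absent or the file is unreadable.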

Windows GPU Node: Driver Issues

Device Manager Shows `Microsoft Basic Display Adapter`

The NVIDIA driver is not installed or did not load. Open PowerShell:

Get-PnpDevice -Class Display
nvidia-smi

If `nvidia-smi` reports the same "couldn't communicate with the NVIDIA driver" message as Linux, reinstall the driver from the NVIDIA Datacenter Driver page for the card and Windows Server version.

Switching Between WDDM and TCC

For RDP-rendered visualization, the card must be in WDDM mode. For headless compute, TCC can give slightly better performance but disables the display path.

nvidia-smi -dm 0     # WDDM (display + compute)
nvidia-smi -dm 1     # TCC (compute only - RDP desktop will not render through this GPU)

Reboot after changing the mode.

When to Open a Support Ticket

Gather this before contacting support - it makes diagnosis significantly faster:

nvidia-smi -q > nvidia-smi-q.txt
dmesg | grep -i -E 'nvidia|nvrm|xid' > nvidia-dmesg.txt
uname -a > sysinfo.txt
cat /etc/os-release >> sysinfo.txt
cat /proc/driver/nvidia/version >> sysinfo.txt 2>/dev/null

Attach those three files plus the GPU node ID, the saved-image ID (if any), and the time range of the failure.
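
To attach everything in one upload, the three files can be packed into a single archive. This is a minimal sketch: `bundle_support_files` and the archive naming scheme are hypothetical, and the file names are the ones produced by the commands above.

```shell
# Hypothetical helper: pack the three diagnostic files into one archive
# in the current directory and print the archive name.
bundle_support_files() {
  dir=${1:-.}   # directory that holds the three files
  archive="gpu-support-$(date +%Y%m%d-%H%M%S).tar.gz"
  tar -czf "$archive" -C "$dir" \
    nvidia-smi-q.txt nvidia-dmesg.txt sysinfo.txt &&
    echo "$archive"
}
```

Run it from the directory where the files were written, or pass that directory as the first argument. The function fails without printing a name if any of the three files is missing.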

Related Resources

Resource	Use it for
Node Troubleshooting	Networking, disk, security, monitoring graphs, lifecycle issues.
Node Not Accessible	Can't SSH or RDP into the node at all.
Disk Space Issues	GPU images and model weights fill the root disk fast.
Connect to a Linux GPU node	Baseline `nvidia-smi` check after launch.
Connect to a Windows GPU node	Baseline driver check on Windows.
Manage GPU Nodes	GPU-specific lifecycle and monitoring differences.

Last updated on June 26, 2026.
