
Troubleshoot GPU Nodes

This page covers the NVIDIA driver, CUDA, and GPU-specific command-line issues you can hit on a GPU node. For everything else — node not accessible, networking, disk space, security groups, encryption, monitoring graphs missing, lifecycle actions — use the canonical Node Troubleshooting guides. A GPU node is a regular node, so the same fixes apply.

If the symptom persists after the checks below, contact cloud-platform@e2enetworks.com.


nvidia-smi Does Not Work

nvidia-smi: command not found

The NVIDIA driver utilities are not on the host's PATH.

which nvidia-smi
ls /usr/bin/nvidia-smi /usr/local/cuda/bin/nvidia-smi 2>/dev/null
dpkg -l | grep -i nvidia # Ubuntu / Debian
rpm -qa | grep -i nvidia # Rocky / RHEL

If no NVIDIA packages are installed, the host is missing the datacenter driver. Reinstall it:

# Ubuntu 22.04 / 24.04
sudo apt update
sudo apt install -y nvidia-driver-580-server # or the branch your image targets
sudo reboot

# Rocky 9
sudo dnf install -y dnf-plugins-core
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf module install -y nvidia-driver:580-dkms
sudo reboot

Pick the driver branch that matches the CUDA version you need (570.x → CUDA 12.8, 580.x → CUDA 13.0). Current E2E GPU nodes run the 580.x branch.
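The branch-to-CUDA mapping above can be wrapped in a small lookup, for example to sanity-check a node before installing a toolkit. This is only a sketch: the function name max_cuda_for_driver is hypothetical, and it knows just the two branches listed on this page. Other branches would need to be added from NVIDIA's release notes.

```shell
# Hypothetical helper: map a driver version (e.g. 580.65.06) to the highest
# CUDA version that branch supports. Only the two branches named on this
# page are covered; anything else is reported as unknown.
max_cuda_for_driver() {
  case "${1%%.*}" in          # keep only the branch number before the first dot
    570) echo "12.8" ;;
    580) echo "13.0" ;;
    *)   echo "unknown"; return 1 ;;
  esac
}

# Example on a live node (requires a working driver):
#   max_cuda_for_driver "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
```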

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

The userspace tool is present but the kernel module is not loaded (or the loaded module does not match the userspace version).

lsmod | grep nvidia
cat /proc/driver/nvidia/version 2>/dev/null
dmesg | grep -i nvidia | tail -40

Common causes and fixes:

| Cause | Fix |
| --- | --- |
| Kernel was upgraded but the NVIDIA module was not rebuilt | sudo dkms autoinstall && sudo reboot, or reinstall the matching nvidia-driver-* package. |
| Userspace driver version ≠ kernel module version | sudo apt reinstall nvidia-driver-<branch>-server (Ubuntu) or sudo dnf reinstall nvidia-driver (Rocky), then reboot. |
| The nvidia module is blacklisted in /etc/modprobe.d | Remove the nvidia blacklist entry, rebuild the initramfs (sudo update-initramfs -u on Ubuntu, sudo dracut --force on Rocky), and reboot. Leave any nouveau blacklist in place; the NVIDIA driver relies on it. |
| Secure Boot rejected the unsigned module | Disable Secure Boot on the node or enroll the NVIDIA signing MOK key. |
| Driver was partially installed and /var/log/nvidia-installer.log shows errors | Purge and reinstall (Ubuntu): sudo apt purge '*nvidia*' && sudo apt autoremove, then reinstall the driver. |

After any driver reinstall, reboot and re-run nvidia-smi.
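To tell quickly whether the running kernel module and the module installed on disk disagree, the version checks above can be scripted. The sketch below assumes /proc/driver/nvidia/version keeps its usual "NVRM version: ... <version> ..." first line; the function names are hypothetical. It compares the loaded module with the module file that modinfo finds for the current kernel, which approximates the userspace-versus-kernel check but is not identical to it.

```shell
# Extract the driver version from /proc/driver/nvidia/version-style text on
# stdin. Only the "NVRM version" line is considered.
nvrm_version() {
  grep '^NVRM version' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n1
}

# Compare the loaded module version ($1) with the installed one ($2).
# Prints "match", "mismatch", or "not-loaded" (when $1 is empty).
compare_driver_versions() {
  loaded="$1"; installed="$2"
  if [ -z "$loaded" ]; then
    echo "not-loaded"
  elif [ "$loaded" = "$installed" ]; then
    echo "match"
  else
    echo "mismatch"
  fi
}

# Example on a live node:
#   compare_driver_versions \
#     "$(cat /proc/driver/nvidia/version 2>/dev/null | nvrm_version)" \
#     "$(modinfo -F version nvidia 2>/dev/null)"
```

A result of not-loaded points at the module never loading (rebuild, blacklist, or Secure Boot); mismatch usually means a reboot or driver reinstall is still pending.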


GPU Card Not Detected

nvidia-smi runs but shows fewer cards than expected, or No devices were found.

lspci | grep -i nvidia
nvidia-smi -L
dmesg | grep -i -E 'nvidia|nvrm|xid' | tail -60

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| lspci shows the card but nvidia-smi -L does not | Driver loaded but failed to bind to the device. | Reboot. If it persists, reinstall the driver and check dmesg for Xid errors. |
| Card count is lower than the plan (e.g., a 4× plan shows 2 cards) | One or more cards failed to initialize. | Reboot. If cards are still missing, save an image and open a support ticket; this is a host-side issue. |
| Xid 79 (GPU has fallen off the bus) in dmesg | Hardware fault or PCIe link reset. | Reboot; if the Xid returns, contact support so the host can be inspected. |
| Xid 13, Xid 31, or Xid 43 repeating | Application-side GPU memory or program fault. | Restart the workload; if reproducible across nodes, debug the application (check kernel launch parameters and shared memory). |
| Xid 48 repeating | Double-bit ECC error in GPU memory (hardware, not the application). | Check nvidia-smi -q -d ECC and reboot; if the Xid returns, contact support. |
| RmInitAdapter failed in dmesg | Kernel module loaded against the wrong device topology. | Reboot. If persistent, reinstall the driver. |

For repeated Xid errors with the same number, capture nvidia-smi -q, dmesg, and the saved-image ID before opening a support ticket. Note that ERROR is not a valid display type for nvidia-smi -q -d; use nvidia-smi -q -d ECC to isolate ECC error counts.
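To see which Xid numbers repeat, the dmesg output can be reduced to a per-code count. This is a minimal sketch that assumes the common kernel log format "NVRM: Xid (PCI:...): <code>, ..."; the function name count_xids is hypothetical, and lines in other formats are silently skipped.

```shell
# Read kernel log text on stdin and print "<xid> <count>" per Xid code,
# sorted by code. Lines without an Xid entry are ignored.
count_xids() {
  sed -nE 's/.*NVRM: Xid \([^)]*\): ([0-9]+),.*/\1/p' \
    | sort -n | uniq -c | awk '{ print $2, $1 }'
}

# Example on a live node:
#   sudo dmesg | count_xids
```

A code that shows up with a high count is the one worth quoting in the support ticket.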


CUDA Version Issues

Application reports a CUDA version that the driver does not support

nvidia-smi shows the highest CUDA version the driver supports. The CUDA version the application was built against must be less than or equal to that number.

nvidia-smi | head -3
nvcc --version 2>/dev/null
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())" 2>/dev/null

| Situation | Fix |
| --- | --- |
| App wants CUDA 12.4, nvidia-smi shows CUDA Version: 12.2 | Upgrade the driver to a branch that supports CUDA ≥ 12.4 (570.x or newer), then reboot. |
| App wants CUDA 11.8, nvidia-smi shows CUDA Version: 13.0 | Backward-compatible: a newer driver runs applications built for older CUDA versions. No driver change needed. If the app links against libcudart.so.11.0, install the CUDA 11.8 runtime libraries (the toolkit, not the driver). |
| nvcc --version differs from nvidia-smi CUDA Version | Expected. nvcc shows the installed toolkit; nvidia-smi shows the driver's supported CUDA. The toolkit can be older. |
| torch.cuda.is_available() returns False | The installed PyTorch wheel does not match the driver (or is a CPU-only build). Reinstall PyTorch for a CUDA version the driver supports; see the PyTorch install matrix. |
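The "application CUDA must not exceed driver CUDA" rule can be checked mechanically. The sketch below relies on GNU sort -V for dotted-version comparison (an assumption about the host's coreutils); the function names cuda_le and driver_cuda_version are hypothetical.

```shell
# True when version $1 <= version $2 (dotted versions compared with sort -V).
cuda_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Pull the "CUDA Version: X.Y" value out of nvidia-smi header text on stdin.
driver_cuda_version() {
  sed -nE 's/.*CUDA Version: ([0-9]+\.[0-9]+).*/\1/p' | head -n1
}

# Example on a live node, for an application built against CUDA 12.4:
#   if cuda_le 12.4 "$(nvidia-smi | driver_cuda_version)"; then
#     echo "driver is new enough"
#   else
#     echo "driver upgrade needed"
#   fi
```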

nvcc: command not found

The CUDA toolkit is not installed on the host. The toolkit is only needed if you compile CUDA code on the host itself. On container-based images, install the toolkit inside the container instead.

# Ubuntu — add the NVIDIA CUDA apt repo first if not already configured:
# https://developer.nvidia.com/cuda-downloads (select Linux → x86_64 → Ubuntu → 24.04 → deb(network))
sudo apt install -y cuda-toolkit-12-8 # match driver branch; use cuda-toolkit-13-0 for driver 580.x
# or
sudo dnf install -y cuda-toolkit-12-8 # Rocky 9, after adding the CUDA repo

Add /usr/local/cuda/bin to PATH and /usr/local/cuda/lib64 to LD_LIBRARY_PATH if not already present:

echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}' >> ~/.bashrc # avoids a trailing colon when the variable was unset
source ~/.bashrc
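Running the echo commands twice appends duplicate lines to ~/.bashrc. One way to honor the "if not already present" condition is an append-once helper; this is a sketch, and the name append_once is hypothetical.

```shell
# Append line $1 to file $2 only if the file does not already contain
# exactly that line.
append_once() {
  line="$1"; file="$2"
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example:
#   append_once 'export PATH=/usr/local/cuda/bin:$PATH' ~/.bashrc
```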

GPU Out of Memory

nvidia-smi
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

| Cause | Fix |
| --- | --- |
| A previous job's process is still holding memory | Kill the PID shown by nvidia-smi (sudo kill <pid>) or reboot the node. |
| Batch size or sequence length is too large | Reduce batch size, enable gradient accumulation, or switch to a higher-memory card. |
| KV cache fills up at long context lengths | Lower max_model_len / context window, reduce concurrency, or use a card with more memory. |
| Memory leak: used memory grows without plateau | Restart the workload; inspect for tensor accumulation across iterations. |
| Fragmentation after many allocations | For PyTorch, set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True and restart the workload. |
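When several processes share a card, it helps to rank them by memory before deciding what to kill. The sketch below assumes the query is run with --format=csv,noheader,nounits so each row is "pid, process_name, MiB"; the function name top_gpu_procs is hypothetical, and process names containing commas would break the field split.

```shell
# Read "pid, process_name, used_memory_MiB" rows on stdin and print the
# top N rows (default 5) by memory, largest first.
top_gpu_procs() {
  sort -t, -k3,3nr | head -n "${1:-5}"
}

# Example on a live node:
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory \
#     --format=csv,noheader,nounits | top_gpu_procs 3
```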

Thermal Throttling and Power Capping

nvidia-smi --query-gpu=index,temperature.gpu,power.draw,power.limit,clocks_event_reasons.active --format=csv
nvidia-smi -q -d PERFORMANCE | head -40
Note: On NVIDIA driver 570.x and earlier, the field was named clocks_throttle_reasons.active. Starting with driver 580.x it was renamed to clocks_event_reasons.active. The old name still works as an input alias but the CSV output header always shows the new name.

| Sign | Meaning / Fix |
| --- | --- |
| clocks_event_reasons.hw_thermal_slowdown: Active | Card is at or above its thermal limit. Reduce duty cycle, balance work across cards, or open a support ticket if the temperature is anomalous for the workload. |
| clocks_event_reasons.hw_power_brake_slowdown: Active | Power capping. Check power.limit vs power.draw. If power.limit is lower than the card's stock TDP, reset it: sudo nvidia-smi -pl <stock_watts>. |
| Temperature sustained ≥85 °C | Low cooling headroom. Coordinate with support; this is host-side. |
| Frequent clock drops with no thermal/power signal | The GPU is idling between kernels, usually because the data pipeline cannot keep it fed. Profile with Nsight Systems (nsys) and fix the data pipeline. nvprof is deprecated and does not support Ampere or newer GPUs. |
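The 85 °C threshold from the table can be checked across all cards in one pass. This sketch assumes the query is run with --format=csv,noheader,nounits so each row is "index, temperature"; the function name hot_gpus is hypothetical. A single reading is not "sustained", so run it repeatedly (for example under watch) before drawing conclusions.

```shell
# Read "index, temperature_C" rows on stdin and print the index of every
# GPU at or above the limit (default 85).
hot_gpus() {
  awk -F', *' -v limit="${1:-85}" '$2 + 0 >= limit { print $1 }'
}

# Example on a live node:
#   nvidia-smi --query-gpu=index,temperature.gpu \
#     --format=csv,noheader,nounits | hot_gpus 85
```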

Persistence Mode and Slow First Iteration

If the first call after a long idle period is unusually slow, the driver may be unloading and reloading between processes.

nvidia-smi -q | grep Persistence
sudo nvidia-smi -pm 1 # enable persistence mode on all cards

Set persistence mode to Enabled on long-running inference and training nodes. Add nvidia-smi -pm 1 to a start script or systemd unit so it survives reboots.
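For the systemd route, a oneshot unit that runs nvidia-smi -pm 1 at boot is one possible shape. The sketch below only writes the unit file; the unit name gpu-persistence.service and the /usr/bin/nvidia-smi path are assumptions, so adjust them to the node. On images where nvidia-persistenced already runs, this unit is likely unnecessary.

```shell
# Write a oneshot systemd unit that enables persistence mode to path $1.
write_persistence_unit() {
  cat > "$1" <<'EOF'
[Unit]
Description=Enable NVIDIA persistence mode on all GPUs

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1

[Install]
WantedBy=multi-user.target
EOF
}

# Example (hypothetical unit name):
#   write_persistence_unit /tmp/gpu-persistence.service
#   sudo install -m 644 /tmp/gpu-persistence.service /etc/systemd/system/
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now gpu-persistence.service
```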

systemctl enable nvidia-persistenced Returns "no installation config" Error

On Ubuntu 24.04 with NVIDIA driver 580.x, nvidia-persistenced.service is a static systemd unit. It starts automatically at boot as a dependency of the NVIDIA driver — you do not need to enable it.

systemctl is-active nvidia-persistenced   # should print "active"

If the output is active, the daemon already manages persistence mode. Run sudo nvidia-smi -pm 1 to enable it for the current session; the daemon persists the setting across reboots. Running systemctl enable nvidia-persistenced on this setup returns the "no installation config" error, which can be safely ignored.

On driver 570.x and earlier the unit is not static, so systemctl enable is required there.
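Rather than remembering which driver branch ships which kind of unit, the unit state can be read directly: systemctl is-enabled prints static for static units. The sketch below maps that output to the next step; the function name persistenced_action and the returned keywords are hypothetical.

```shell
# Map the output of `systemctl is-enabled nvidia-persistenced` ($1) to an
# action keyword: skip, ok, enable, or inspect.
persistenced_action() {
  case "$1" in
    static)                  echo "skip" ;;     # started as a dependency; nothing to enable
    enabled|enabled-runtime) echo "ok" ;;       # already enabled
    disabled)                echo "enable" ;;   # run: sudo systemctl enable --now nvidia-persistenced
    *)                       echo "inspect" ;;  # masked, missing, or unexpected state
  esac
}

# Example on a live node:
#   persistenced_action "$(systemctl is-enabled nvidia-persistenced 2>/dev/null)"
```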


Docker Cannot Access the GPU

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

| Error | Fix |
| --- | --- |
| docker: command not found | Docker is not pre-installed on Ubuntu 24.04-based GPU images. Install it: sudo apt install -y docker.io |
| unknown flag: --gpus | Docker is too old. Upgrade Docker Engine to a current release. |
| could not select device driver "" with capabilities: [[gpu]] | NVIDIA Container Toolkit is missing or not configured. Install and configure it using the steps below. |
| nvidia-container-cli: initialization error | Driver is broken on the host. Fix nvidia-smi on the host first (see above). |

Install or repair the NVIDIA Container Toolkit:

# Ubuntu
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Re-run the test container after the install completes.
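If the test container still fails, it is worth confirming that nvidia-ctk actually registered the runtime with Docker. The sketch below assumes the default /etc/docker/daemon.json location and simply looks for a reference to nvidia-container-runtime; the function name has_nvidia_runtime is hypothetical, and a text match is weaker than parsing the JSON.

```shell
# Succeed when the Docker daemon config ($1, default /etc/docker/daemon.json)
# mentions the NVIDIA container runtime.
has_nvidia_runtime() {
  grep -q 'nvidia-container-runtime' "${1:-/etc/docker/daemon.json}" 2>/dev/null
}

# Example:
#   if has_nvidia_runtime; then
#     echo "runtime registered; restart docker if you have not already"
#   else
#     echo "runtime missing; re-run: sudo nvidia-ctk runtime configure --runtime=docker"
#   fi
```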


Windows GPU Node: Driver Issues

Device Manager Shows Microsoft Basic Display Adapter

The NVIDIA driver is not installed or did not load. Open PowerShell:

Get-PnpDevice -Class Display
nvidia-smi

If nvidia-smi reports the same "couldn't communicate with the NVIDIA driver" message as on Linux, reinstall the driver from the NVIDIA Datacenter Driver page, choosing the build for your card and Windows Server version.

Switching Between WDDM and TCC

For RDP-rendered visualization, the card must be in WDDM mode. For headless compute, TCC can give slightly better performance but disables the display path.

nvidia-smi -dm 0 # WDDM (display + compute)
nvidia-smi -dm 1 # TCC (compute only; RDP desktop will not render through this GPU)

Reboot after changing the mode.


When to Open a Support Ticket

Gather this before contacting support — it makes diagnosis significantly faster:

nvidia-smi -q > nvidia-smi-q.txt
dmesg | grep -i -E 'nvidia|nvrm|xid' > nvidia-dmesg.txt
uname -a > sysinfo.txt
cat /etc/os-release >> sysinfo.txt
cat /proc/driver/nvidia/version >> sysinfo.txt 2>/dev/null

Attach those three files plus the GPU node ID, the saved-image ID (if any), and the time range of the failure.
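The three files can be packed into a single archive so nothing gets left off the ticket. This is a sketch that assumes the files were created in the current directory by the commands above; the function name and the default archive name are hypothetical.

```shell
# Pack the three diagnostic files from the current directory into a
# gzip-compressed tarball and print its name. $1 overrides the output name.
bundle_support_files() {
  out="${1:-gpu-support-$(uname -n)-$(date +%Y%m%dT%H%M%S).tar.gz}"
  tar -czf "$out" nvidia-smi-q.txt nvidia-dmesg.txt sysinfo.txt && echo "$out"
}

# Example:
#   bundle_support_files            # timestamped archive in the current directory
#   bundle_support_files ticket.tar.gz
```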


| Resource | Use it for |
| --- | --- |
| Node Troubleshooting | Networking, disk, security, monitoring graphs, lifecycle issues. |
| Node Not Accessible | Can't SSH or RDP into the node at all. |
| Disk Space Issues | GPU images and model weights fill the root disk fast. |
| Connect to a Linux GPU node | Baseline nvidia-smi check after launch. |
| Connect to a Windows GPU node | Baseline driver check on Windows. |
| Manage GPU Nodes | GPU-specific lifecycle and monitoring differences. |

Last updated on May 26, 2026.