Troubleshoot GPU Nodes
This page covers the NVIDIA driver, CUDA, and GPU-specific command-line issues you can hit on a GPU node. For everything else — node not accessible, networking, disk space, security groups, encryption, monitoring graphs missing, lifecycle actions — use the canonical Node Troubleshooting guides. A GPU node is a regular node, so the same fixes apply.
If the symptom persists after the checks below, contact cloud-platform@e2enetworks.com.
nvidia-smi Does Not Work
nvidia-smi: command not found
The NVIDIA driver utilities are not on the host's PATH.
which nvidia-smi
ls /usr/bin/nvidia-smi /usr/local/cuda/bin/nvidia-smi 2>/dev/null
dpkg -l | grep -i nvidia # Ubuntu / Debian
rpm -qa | grep -i nvidia # Rocky / RHEL
If no NVIDIA packages are installed, the host is missing the datacenter driver. Reinstall it:
# Ubuntu 22.04 / 24.04
sudo apt update
sudo apt install -y nvidia-driver-580-server # or the branch your image targets
sudo reboot
# Rocky 9
sudo dnf install -y dnf-plugins-core
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf module install -y nvidia-driver:580-dkms
sudo reboot
Pick the driver branch that matches the CUDA version you need (570.x → CUDA 12.8, 580.x → CUDA 13.0). Current E2E GPU nodes run the 580.x branch.
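The branch-to-CUDA pairing above can be turned into a small lookup. This is a minimal sketch: the function name `max_cuda_for_driver` is hypothetical, and it only knows the two branches named in this guide.

```shell
#!/usr/bin/env bash
# Map a driver version string to the CUDA version this guide pairs it with.
# Only the 570.x and 580.x branches from the text above are covered.
max_cuda_for_driver() {
  local branch="${1%%.*}"   # keep the major branch, e.g. 580 from 580.65.06
  case "$branch" in
    570) echo "12.8" ;;
    580) echo "13.0" ;;
    *)   echo "unknown" ;;
  esac
}

# On a node with a working driver, report the installed branch.
if command -v nvidia-smi >/dev/null 2>&1; then
  ver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
  echo "driver ${ver:-unknown} -> max CUDA $(max_cuda_for_driver "$ver")"
fi
```

For any branch not listed here, read the CUDA Version field in the nvidia-smi header instead of extending the table by guesswork.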
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver
The userspace tool is present but the kernel module is not loaded (or the loaded module does not match the userspace version).
lsmod | grep nvidia
cat /proc/driver/nvidia/version 2>/dev/null
dmesg | grep -i nvidia | tail -40
Common causes and fixes:
| Cause | Fix |
|---|---|
| Kernel was upgraded but the NVIDIA module was not rebuilt | sudo dkms autoinstall && sudo reboot, or reinstall the matching nvidia-driver-* package. |
| Userspace driver version ≠ kernel module version | sudo apt reinstall nvidia-driver-<branch>-server (Ubuntu) or sudo dnf reinstall nvidia-driver (Rocky), then reboot. |
| The nvidia module is blacklisted in /etc/modprobe.d (a nouveau blacklist is expected with the NVIDIA driver and should stay) | Remove the nvidia blacklist entry, run sudo update-initramfs -u (Ubuntu) or sudo dracut -f (Rocky), then reboot. |
| Secure Boot rejected the unsigned module | Disable Secure Boot on the node, or enroll the key that signs the NVIDIA module as a Machine Owner Key (MOK). |
| Driver was partially installed and /var/log/nvidia-installer.log shows errors | Purge and reinstall: sudo apt purge '*nvidia*' && sudo apt autoremove, then reinstall. |
After any driver reinstall, reboot and re-run nvidia-smi.
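The checks in this section can be combined into a rough triage helper. This is a sketch under stated assumptions: the function names are hypothetical, `mokutil` may not be installed on every image, and the hints only point back to the table rows above.

```shell
#!/usr/bin/env bash
# Reads `lsmod` output on stdin; succeeds if the core nvidia module is listed.
nvidia_module_loaded() {
  awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}

# Prints a hint from two yes/no inputs, so the logic can be run without a GPU.
driver_hint() {
  local module_loaded="$1" secure_boot="$2"
  if [ "$module_loaded" = "yes" ]; then
    echo "module loaded: compare userspace and kernel module versions"
  elif [ "$secure_boot" = "yes" ]; then
    echo "module not loaded, Secure Boot on: check module signing"
  else
    echo "module not loaded: rebuild with dkms or reinstall the driver"
  fi
}

# Example wiring on a real node.
if command -v lsmod >/dev/null 2>&1; then
  loaded=no; lsmod | nvidia_module_loaded && loaded=yes
  sb=no; mokutil --sb-state 2>/dev/null | grep -qi 'enabled' && sb=yes
  driver_hint "$loaded" "$sb"
fi
```

The helper narrows down which row of the table to start with; it does not replace reading dmesg.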
GPU Card Not Detected
nvidia-smi runs but shows fewer cards than expected, or No devices were found.
lspci | grep -i nvidia
nvidia-smi -L
dmesg | grep -i -E 'nvidia|nvrm|xid' | tail -60
| Symptom | Likely cause | Fix |
|---|---|---|
| lspci shows the card but nvidia-smi -L does not | Driver loaded but failed to bind to the device. | Reboot. If it persists, reinstall the driver and check dmesg for Xid errors. |
| Card count is lower than the plan (e.g., 4× plan shows 2 cards) | One or more cards failed to initialize. | Reboot. If still missing, save an image and open a support ticket; this is a host-side issue. |
| Xid 79 (GPU fallen off the bus) in dmesg | Hardware fault or PCIe link reset. | Reboot; if the Xid returns, contact support, because the host needs to be inspected. |
| Xid 13, Xid 31, Xid 43, Xid 48 repeating | Application-side GPU memory or program fault. | Restart the workload; if reproducible across nodes, debug the application (check kernel launch parameters and shared memory). |
| RmInitAdapter failed in dmesg | Kernel module loaded against the wrong device topology. | Reboot. If persistent, reinstall the driver. |
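To compare what the PCI bus reports with what the driver reports, a sketch like the following may help. The function names are hypothetical, and the class-name filter (3D controller, VGA compatible controller) is an assumption meant to skip NVIDIA audio and bridge functions.

```shell
#!/usr/bin/env bash
# Count NVIDIA GPU functions in `lspci` output read from stdin.
count_pci_gpus() {
  grep -i 'nvidia' | grep -Ec '3D controller|VGA compatible controller'
}

# Count GPUs in `nvidia-smi -L` output read from stdin.
count_driver_gpus() {
  grep -c '^GPU [0-9]'
}

# Report whether the driver bound every GPU that is visible on the bus.
compare_gpu_counts() {
  local pci="$1" drv="$2"
  if [ "$drv" -lt "$pci" ]; then
    echo "driver sees $drv of $pci GPUs: check dmesg for Xid or RmInitAdapter errors"
  else
    echo "driver sees all $pci GPUs on the bus"
  fi
}

if command -v lspci >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
  compare_gpu_counts "$(lspci | count_pci_gpus)" "$(nvidia-smi -L | count_driver_gpus)"
fi
```

If both counts match but are lower than the plan, the missing cards are not on the bus at all, which points to the host-side case in the table.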
For repeated Xid errors with the same number, capture nvidia-smi -q, dmesg, and the saved-image ID before opening a support ticket. nvidia-smi -q -d does not accept ERROR as a display type; to isolate ECC error counts, use nvidia-smi -q -d ECC.
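To see which Xid numbers repeat, the kernel log can be summarized per code. This is a minimal sketch: `count_xids` is a hypothetical name, and the pattern assumes the usual `NVRM: Xid (PCI:...): <number>, ...` line shape.

```shell
#!/usr/bin/env bash
# Read kernel log lines on stdin and print one "Xid <code>: <count>" line per code.
count_xids() {
  sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\).*/\1/p' |
    sort -n | uniq -c | awk '{ print "Xid " $2 ": " $1 }'
}

# Usage on a node (reading the kernel log may need root):
#   sudo dmesg | count_xids
```

A code that shows up many times is the one to quote in the support ticket.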
CUDA Version Issues
Application reports a CUDA version that the driver does not support
nvidia-smi shows the maximum CUDA version the driver supports. The application's CUDA version must be ≤ that number.
nvidia-smi | head -3
nvcc --version 2>/dev/null
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())" 2>/dev/null
| Situation | Fix |
|---|---|
| App wants CUDA 12.4, nvidia-smi shows CUDA Version: 12.2 | Upgrade the driver to a branch that supports CUDA ≥ 12.4 (570.x or newer), then reboot. |
| App wants CUDA 11.8, nvidia-smi shows CUDA Version: 13.0 | Supported: newer drivers are backward-compatible with older CUDA runtimes, so no driver change is needed. If the app links against libcudart.so.11.0, install the CUDA 11.8 runtime libraries (the toolkit, not the driver). |
| nvcc --version differs from nvidia-smi CUDA Version | Expected. nvcc shows the installed toolkit; nvidia-smi shows the highest CUDA version the driver supports. The toolkit can be older. |
| torch.cuda.is_available() prints False | The installed PyTorch wheel does not match the driver (for example, a CPU-only build, or one built for a newer CUDA than the driver supports). Reinstall PyTorch matching the driver's CUDA; see the PyTorch install matrix. |
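The "application CUDA ≤ driver CUDA" rule can be checked mechanically. The sketch below uses hypothetical function names and relies on GNU `sort -V` for version ordering; the required version `12.4` is only an example value.

```shell
#!/usr/bin/env bash
# Extract "CUDA Version: X.Y" from nvidia-smi header text on stdin.
driver_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n1
}

# Succeeds when the application's CUDA version is <= the driver's maximum.
cuda_supported() {
  local app="$1" driver="$2"
  [ "$(printf '%s\n%s\n' "$app" "$driver" | sort -V | head -n1)" = "$app" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  max="$(nvidia-smi | driver_cuda_version)"
  want="12.4"   # hypothetical application requirement
  if [ -n "$max" ] && cuda_supported "$want" "$max"; then
    echo "CUDA $want is supported (driver max $max)"
  else
    echo "CUDA $want is not supported (driver max ${max:-unknown})"
  fi
fi
```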
nvcc: command not found
The CUDA toolkit is not installed on the host. The toolkit is only needed if you compile CUDA code on the host itself. On container-based images, install the toolkit inside the container instead.
# Ubuntu — add the NVIDIA CUDA apt repo first if not already configured:
# https://developer.nvidia.com/cuda-downloads (select Linux → x86_64 → Ubuntu → 24.04 → deb(network))
sudo apt install -y cuda-toolkit-12-8 # match driver branch; use cuda-toolkit-13-0 for driver 580.x
# or
sudo dnf install -y cuda-toolkit-12-8 # Rocky 9, after adding the CUDA repo
Add /usr/local/cuda/bin to PATH and /usr/local/cuda/lib64 to LD_LIBRARY_PATH if not already present:
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
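Running the echo lines above more than once appends duplicate entries. A guarded append avoids that; `append_once` is a hypothetical helper name, sketched here rather than taken from any tool.

```shell
#!/usr/bin/env bash
# Append a line to a file only if that exact line is not already present.
append_once() {
  local line="$1" file="$2"
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage:
#   append_once 'export PATH=/usr/local/cuda/bin:$PATH' ~/.bashrc
#   append_once 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' ~/.bashrc
```

The single quotes keep `$PATH` unexpanded, so the variable is resolved each time the shell starts rather than frozen at the time of the edit.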
GPU Out of Memory
nvidia-smi
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
| Cause | Fix |
|---|---|
| A previous job's process is still holding memory | Kill the PID shown by nvidia-smi (sudo kill <pid>) or reboot the node. |
| Batch size or sequence length is too large | Reduce batch size, enable gradient accumulation, or switch to a higher-memory card. |
| KV cache fills up at long context lengths | Lower max_model_len / context window, reduce concurrency, or use a card with more memory. |
| Memory leak — used memory grows without plateau | Restart the workload; inspect for tensor accumulation across iterations. |
| Fragmentation after many allocations | For PyTorch, set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True and restart the workload. |
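To find the process holding memory from a previous job, the compute-apps query can be filtered by size. This is a sketch: `procs_over_mib` and the 1000 MiB threshold are illustrative choices, and it assumes process names contain no commas.

```shell
#!/usr/bin/env bash
# Read "pid, process_name, used_memory" CSV rows (no header, no units) on stdin
# and print the rows whose memory use is at least the given number of MiB.
procs_over_mib() {
  local threshold="$1"
  awk -F', *' -v t="$threshold" '$3 + 0 >= t { print $1 " " $2 " " $3 " MiB" }'
}

# Usage on a node:
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory \
#     --format=csv,noheader,nounits | procs_over_mib 1000
```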
Thermal Throttling and Power Capping
nvidia-smi --query-gpu=index,temperature.gpu,power.draw,power.limit,clocks_event_reasons.active --format=csv
nvidia-smi -q -d PERFORMANCE | head -40
On NVIDIA driver 570.x and earlier, the field was named clocks_throttle_reasons.active. Starting with driver 580.x it was renamed to clocks_event_reasons.active. The old name still works as an input alias but the CSV output header always shows the new name.
| Sign | Meaning / Fix |
|---|---|
| clocks_event_reasons.hw_thermal_slowdown: Active | Card is at or above its thermal limit. Reduce duty cycle, balance work across cards, or open a support ticket if the temperature is anomalous for the workload. |
| clocks_event_reasons.hw_power_brake_slowdown: Active | Power capping. Check power.limit vs power.draw. If power.limit is lower than the card's stock TDP, reset it: sudo nvidia-smi -pl <stock_watts>. |
| Temperature sustained ≥85 °C | Low cooling headroom. Coordinate with support — this is host-side. |
| Frequent clock drops with no thermal/power signal | Workload is hitting idle periods. Profile with nsys (nvprof is deprecated and does not support Ampere or newer GPUs) and fix the data pipeline. |
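The 85 °C row in the table can be checked from a script. The sketch below flags a single reading; `flag_hot_gpus` is a hypothetical name, and judging "sustained" still requires sampling repeatedly over time.

```shell
#!/usr/bin/env bash
# Read "index, temperature" CSV rows (no header, no units) on stdin and print
# the GPUs at or above the limit (default 85, the figure used in this guide).
flag_hot_gpus() {
  local limit="${1:-85}"
  awk -F', *' -v lim="$limit" '$2 + 0 >= lim { print "GPU " $1 ": " $2 " C" }'
}

# Usage on a node:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | flag_hot_gpus
```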
Persistence Mode and Slow First Iteration
If the first call after a long idle period is unusually slow, the driver may be unloading and reloading between processes.
nvidia-smi -q | grep Persistence
sudo nvidia-smi -pm 1 # enable persistence mode on all cards
Set persistence mode to Enabled on long-running inference and training nodes. Add nvidia-smi -pm 1 to a start script or systemd unit so it survives reboots.
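One way to run nvidia-smi -pm 1 at boot is a oneshot systemd unit. The helper below writes such a unit; the unit name `gpu-persistence-mode.service` and the function name are hypothetical, and the /usr/bin/nvidia-smi path is assumed from the default package layout.

```shell
#!/usr/bin/env bash
# Write a oneshot unit file that enables persistence mode when it starts.
write_persistence_unit() {
  local path="$1"
  cat > "$path" <<'EOF'
[Unit]
Description=Enable NVIDIA persistence mode

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (as root):
#   write_persistence_unit /etc/systemd/system/gpu-persistence-mode.service
#   systemctl daemon-reload && systemctl enable --now gpu-persistence-mode.service
```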
systemctl enable nvidia-persistenced Returns "no installation config" Error
On Ubuntu 24.04 with NVIDIA driver 580.x, nvidia-persistenced.service is a static systemd unit. It starts automatically at boot as a dependency of the NVIDIA driver — you do not need to enable it.
systemctl is-active nvidia-persistenced # should print "active"
If the output is active, the daemon is already managing persistence mode. Run sudo nvidia-smi -pm 1 to enable it for the current session; the daemon will persist the setting across reboots. Running systemctl enable nvidia-persistenced against the static unit returns the error above, which can be safely ignored.
On driver 570.x and earlier the unit is not static, so systemctl enable is required there.
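Because the right action depends on whether the unit is static, a script can branch on what systemctl is-enabled prints. This is a sketch with a hypothetical function name; states it does not recognize are left for manual inspection.

```shell
#!/usr/bin/env bash
# Decide what to do from the state string printed by `systemctl is-enabled`.
persistenced_action() {
  case "$1" in
    static|enabled*) echo "nothing to enable" ;;
    disabled)        echo "run: sudo systemctl enable --now nvidia-persistenced" ;;
    *)               echo "unhandled state '${1:-empty}': inspect with systemctl status" ;;
  esac
}

if command -v systemctl >/dev/null 2>&1; then
  persistenced_action "$(systemctl is-enabled nvidia-persistenced 2>/dev/null)"
fi
```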
Docker Cannot Access the GPU
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
| Error | Fix |
|---|---|
| docker: command not found | Docker is not pre-installed on Ubuntu 24.04-based GPU images. Install it: sudo apt install -y docker.io |
| unknown flag: --gpus | Docker is too old. Upgrade Docker Engine to a current release. |
| could not select device driver "" with capabilities: [[gpu]] | NVIDIA Container Toolkit is missing or not configured. Install and configure it with the steps after this table. |
| nvidia-container-cli: initialization error | Driver is broken on the host. Fix nvidia-smi on the host first (see above). |
Install or repair the NVIDIA Container Toolkit:
# Ubuntu
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Re-run the test container after the install completes.
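Before re-running the container, it can help to confirm that nvidia-ctk registered the runtime. The sketch below assumes the default config path /etc/docker/daemon.json and does a plain text match rather than a JSON parse; `has_nvidia_runtime` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Succeeds if the given Docker daemon config mentions an "nvidia" runtime.
has_nvidia_runtime() {
  local cfg="${1:-/etc/docker/daemon.json}"
  [ -r "$cfg" ] && grep -q '"nvidia"' "$cfg"
}

if has_nvidia_runtime; then
  echo "nvidia runtime is registered in daemon.json"
else
  echo "nvidia runtime not found: run sudo nvidia-ctk runtime configure --runtime=docker"
fi
```

A match only shows the config was written; Docker still has to be restarted for it to take effect.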
Windows GPU Node: Driver Issues
Device Manager Shows Microsoft Basic Display Adapter
The NVIDIA driver is not installed or did not load. Open PowerShell:
Get-PnpDevice -Class Display
nvidia-smi
If nvidia-smi reports the same "couldn't communicate with the NVIDIA driver" message as Linux, reinstall the driver from the NVIDIA Datacenter Driver page for the card and Windows Server version.
Switching Between WDDM and TCC
For RDP-rendered visualization, the card must be in WDDM mode. For headless compute, TCC can give slightly better performance but disables the display path.
nvidia-smi -dm 0 # WDDM (display + compute)
nvidia-smi -dm 1 # TCC (compute only — RDP desktop will not render through this GPU)
Reboot after changing the mode.
When to Open a Support Ticket
Gather this before contacting support — it makes diagnosis significantly faster:
nvidia-smi -q > nvidia-smi-q.txt
dmesg | grep -i -E 'nvidia|nvrm|xid' > nvidia-dmesg.txt
uname -a > sysinfo.txt
cat /etc/os-release >> sysinfo.txt
cat /proc/driver/nvidia/version >> sysinfo.txt 2>/dev/null
Attach those three files plus the GPU node ID, the saved-image ID (if any), and the time range of the failure.
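The collection steps above can be wrapped into one helper that tolerates missing tools and packs the result. This is a sketch: `collect_gpu_diag` and the default directory name `gpu-diag` are hypothetical.

```shell
#!/usr/bin/env bash
# Gather the three diagnostic files into a directory and pack them as a tarball.
# Each command is allowed to fail so a broken driver does not stop collection.
collect_gpu_diag() {
  local out="${1:-gpu-diag}"
  mkdir -p "$out"
  nvidia-smi -q > "$out/nvidia-smi-q.txt" 2>&1 || true
  dmesg 2>/dev/null | grep -i -E 'nvidia|nvrm|xid' > "$out/nvidia-dmesg.txt" || true
  {
    uname -a
    cat /etc/os-release 2>/dev/null || true
    cat /proc/driver/nvidia/version 2>/dev/null || true
  } > "$out/sysinfo.txt"
  tar -czf "$out.tar.gz" -C "$(dirname "$out")" "$(basename "$out")"
  echo "$out.tar.gz"   # path of the archive to attach
}

# Usage (root gives dmesg full access to the kernel log):
#   sudo bash -c "$(declare -f collect_gpu_diag); collect_gpu_diag /tmp/gpu-diag"
```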
Related Resources
| Resource | Use it for |
|---|---|
| Node Troubleshooting | Networking, disk, security, monitoring graphs, lifecycle issues. |
| Node Not Accessible | Can't SSH or RDP into the node at all. |
| Disk Space Issues | GPU images and model weights fill the root disk fast. |
| Connect to a Linux GPU node | Baseline nvidia-smi check after launch. |
| Connect to a Windows GPU node | Baseline driver check on Windows. |
| Manage GPU Nodes | GPU-specific lifecycle and monitoring differences. |