---
title: Troubleshoot GPU Nodes
---

# Troubleshoot GPU Nodes

This page covers the NVIDIA driver, CUDA, and GPU-specific command-line issues you can hit on a GPU node. For everything else (node not accessible, networking, disk space, security groups, encryption, missing monitoring graphs, lifecycle actions), use the canonical [Node Troubleshooting](/docs/myaccount/node/troubleshoot/) guides. A GPU node is a regular node, so the same fixes apply.

If the symptom persists after the checks below, contact [cloud-platform@e2enetworks.com](mailto:cloud-platform@e2enetworks.com).

---

## `nvidia-smi` Does Not Work

### `nvidia-smi: command not found`

The NVIDIA driver utilities are not on the host's `PATH`.

```bash
which nvidia-smi
ls /usr/bin/nvidia-smi /usr/local/cuda/bin/nvidia-smi 2>/dev/null
dpkg -l | grep -i nvidia   # Ubuntu / Debian
rpm -qa | grep -i nvidia   # Rocky / RHEL
```

If no NVIDIA packages are installed, the host is missing the datacenter driver. Reinstall it:

```bash
# Ubuntu 22.04 / 24.04
sudo apt update
sudo apt install -y nvidia-driver-580-server   # or the branch your image targets
sudo reboot

# Rocky 9
sudo dnf install -y dnf-plugins-core
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf module install -y nvidia-driver:580-dkms
sudo reboot
```

Pick the driver branch that matches the CUDA version you need (570.x → CUDA 12.8, 580.x → CUDA 13.0). Current E2E GPU nodes run the 580.x branch.

### `NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver`

The userspace tool is present but the kernel module is not loaded (or the loaded module does not match the userspace version).
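
The "not loaded" and "version mismatch" cases can be told apart with a small helper. The following is a minimal sketch, not an official tool: the function names `nvrm_version` and `check_module_match` are hypothetical, and it assumes that the on-disk module reported by `modinfo -F version nvidia` was installed together with the userspace packages, so it stands in for the userspace version.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative only.

# Read the contents of /proc/driver/nvidia/version on stdin and print the
# first dotted version number found on the "NVRM version" line.
nvrm_version() {
  grep -m1 '^NVRM version' | grep -oE '[0-9]+(\.[0-9]+)+' | head -n1
}

# Compare the loaded module version ($1) with the on-disk module version ($2).
# Prints one of: not-loaded, match, mismatch.
check_module_match() {
  local loaded="$1" ondisk="$2"
  if [ -z "$loaded" ]; then
    echo "not-loaded"
  elif [ "$loaded" = "$ondisk" ]; then
    echo "match"
  else
    echo "mismatch"
  fi
}

# Usage on a GPU node (assumption: modinfo can see the nvidia module):
#   loaded="$(cat /proc/driver/nvidia/version 2>/dev/null | nvrm_version)"
#   ondisk="$(modinfo -F version nvidia 2>/dev/null)"
#   check_module_match "$loaded" "$ondisk"
```

A `mismatch` result usually means the driver packages changed without a reboot, while `not-loaded` points to the module-level causes covered in this section.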

```bash
lsmod | grep nvidia
cat /proc/driver/nvidia/version 2>/dev/null
dmesg | grep -i nvidia | tail -40
```

Common causes and fixes:

| Cause | Fix |
| --- | --- |
| Kernel was upgraded but the NVIDIA module was not rebuilt | `sudo dkms autoinstall && sudo reboot`, or reinstall the matching `nvidia-driver-*` package. |
| Userspace driver version ≠ kernel module version | `sudo apt reinstall nvidia-driver-<branch>-server` (Ubuntu) or `sudo dnf reinstall nvidia-driver` (Rocky), then reboot. |
| The `nvidia` module is blacklisted in `/etc/modprobe.d` | Remove the `nvidia` blacklist entry (a `nouveau` blacklist is expected and should stay), then run `sudo update-initramfs -u` (Ubuntu) or `sudo dracut --force` (Rocky) and reboot. |
| Secure Boot rejected the unsigned module | Disable Secure Boot on the node or enroll the NVIDIA signing MOK key. |
| Driver was partially installed and `/var/log/nvidia-installer.log` shows errors | Purge and reinstall: `sudo apt purge '*nvidia*' && sudo apt autoremove`, then reinstall. |

After any driver reinstall, reboot and re-run `nvidia-smi`.

---

## GPU Card Not Detected

`nvidia-smi` runs but shows fewer cards than expected, or `No devices were found`.

```bash
lspci | grep -i nvidia
nvidia-smi -L
dmesg | grep -i -E 'nvidia|nvrm|xid' | tail -60
```

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| `lspci` shows the card but `nvidia-smi -L` does not | Driver loaded but failed to bind to the device. | Reboot. If it persists, reinstall the driver and check `dmesg` for `Xid` errors. |
| Card count is lower than the plan (e.g., 4× plan shows 2 cards) | One or more cards failed to initialize. | Reboot. If still missing, save an image and open a support ticket; this is a host-side issue. |
| `Xid 79` (GPU fallen off the bus) in `dmesg` | Hardware fault or PCIe link reset. | Reboot; if the Xid returns, contact support, because the host needs to be inspected. |
| `Xid 48` (double-bit ECC error) in `dmesg` | Uncorrectable GPU memory error, usually a hardware fault. | Check `nvidia-smi -q -d ECC` and reboot; if the Xid returns, contact support. |
| `Xid 13`, `Xid 31`, `Xid 43` repeating | Application-side GPU memory or program fault. | Restart the workload; if reproducible across nodes, debug the application (check kernel launch parameters and shared memory). |
| `RmInitAdapter failed` in `dmesg` | Kernel module loaded against the wrong device topology. | Reboot. If persistent, reinstall the driver. |

For repeated `Xid` errors with the same number, capture `nvidia-smi -q`, `dmesg`, and the saved-image ID before opening a support ticket. (`nvidia-smi -q -d ERROR` is not a valid flag; use `nvidia-smi -q -d ECC` to isolate ECC error counts specifically.)

---

## CUDA Version Issues

### Application reports a CUDA version that the driver does not support

`nvidia-smi` shows the **maximum** CUDA version the driver supports. The application's CUDA version must be ≤ that number.

```bash
nvidia-smi | head -3
nvcc --version 2>/dev/null
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())" 2>/dev/null
```

| Situation | Fix |
| --- | --- |
| App wants CUDA 12.4, `nvidia-smi` shows `CUDA Version: 12.2` | Upgrade the driver to a branch that supports CUDA ≥ 12.4 (570.x or newer), then reboot. |
| App wants CUDA 11.8, `nvidia-smi` shows `CUDA Version: 13.0` | Supported. Newer drivers are backward-compatible with applications built for older CUDA versions, so no driver change is needed. If the app links against `libcudart.so.11.0`, install the CUDA 11.8 runtime libraries (the toolkit, not the driver). |
| `nvcc --version` differs from `nvidia-smi` CUDA Version | Expected. `nvcc` shows the installed **toolkit**; `nvidia-smi` shows the driver's supported CUDA. The toolkit can be older. |
| `torch.cuda.is_available()` returns `False` | Wrong wheel installed for the driver (for example, a CPU-only build). Reinstall PyTorch matching the driver's CUDA; see the [PyTorch install matrix](https://pytorch.org/get-started/locally/). |

### `nvcc: command not found`

The CUDA **toolkit** is not installed on the host. The toolkit is only needed if you compile CUDA code on the host itself. On container-based images, install the toolkit **inside** the container instead.

```bash
# Ubuntu: add the NVIDIA CUDA apt repo first if not already configured:
# https://developer.nvidia.com/cuda-downloads (select Linux → x86_64 → Ubuntu → 24.04 → deb(network))
sudo apt install -y cuda-toolkit-12-8   # match driver branch; use cuda-toolkit-13-0 for driver 580.x
# or
sudo dnf install -y cuda-toolkit-12-8   # Rocky 9, after adding the CUDA repo
```

Add `/usr/local/cuda/bin` to `PATH` and `/usr/local/cuda/lib64` to `LD_LIBRARY_PATH` if not already present:

```bash
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```

---

## GPU Out of Memory

```bash
nvidia-smi
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```

| Cause | Fix |
| --- | --- |
| A previous job's process is still holding memory | Kill the PID shown by `nvidia-smi` (`sudo kill <PID>`) or reboot the node. |
| Batch size or sequence length is too large | Reduce batch size, enable gradient accumulation, or switch to a higher-memory card. |
| KV cache fills up at long context lengths | Lower `max_model_len` / context window, reduce concurrency, or use a card with more memory. |
| Memory leak: used memory grows without plateau | Restart the workload; inspect for tensor accumulation across iterations. |
| Fragmentation after many allocations | For PyTorch, set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` and restart the workload. |

---

## Thermal Throttling and Power Capping

```bash
nvidia-smi --query-gpu=index,temperature.gpu,power.draw,power.limit,clocks_event_reasons.active --format=csv
nvidia-smi -q -d PERFORMANCE | head -40
```

:::note
On NVIDIA driver 570.x and earlier, the field was named `clocks_throttle_reasons.active`. Starting with driver 580.x it was renamed to `clocks_event_reasons.active`. The old name still works as an input alias but the CSV output header always shows the new name.
:::

| Sign | Meaning / Fix |
| --- | --- |
| `clocks_event_reasons.hw_thermal_slowdown: Active` | Card is at or above its thermal limit. Reduce duty cycle, balance work across cards, or open a support ticket if the temperature is anomalous for the workload. |
| `clocks_event_reasons.hw_power_brake_slowdown: Active` | Power capping. Check `power.limit` vs `power.draw`. If `power.limit` is lower than the card's stock TDP, reset it: `sudo nvidia-smi -pl <watts>`. |
| Temperature sustained ≥85 °C | Low cooling headroom. Coordinate with support; this is host-side. |
| Frequent clock drops with no thermal/power signal | Workload is hitting idle periods. Profile with `nsys` or `nvprof` and fix the data pipeline. |

---

## Persistence Mode and Slow First Iteration

If the first call after a long idle period is unusually slow, the driver may be unloading and reloading between processes.

```bash
nvidia-smi -q | grep Persistence
sudo nvidia-smi -pm 1   # enable persistence mode on all cards
```

Set persistence mode to `Enabled` on long-running inference and training nodes. Add `nvidia-smi -pm 1` to a start script or systemd unit so it survives reboots.
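
The systemd-unit option mentioned above could look like the following. This is a minimal sketch, not an official unit: the function name `render_persistence_unit` and the unit file name `gpu-persistence-mode.service` are hypothetical, and the default `/usr/bin/nvidia-smi` path is an assumption you should confirm with `which nvidia-smi`.

```bash
#!/usr/bin/env bash
# Print a oneshot systemd unit that runs "nvidia-smi -pm 1" at boot.
# Hypothetical helper; the unit name and description are illustrative only.
render_persistence_unit() {
  # Assumed default location of nvidia-smi; pass another path as $1 if needed.
  local smi_path="${1:-/usr/bin/nvidia-smi}"
  cat <<EOF
[Unit]
Description=Enable NVIDIA persistence mode (illustrative)

[Service]
Type=oneshot
ExecStart=${smi_path} -pm 1
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
}

# Usage on the node (requires root):
#   render_persistence_unit | sudo tee /etc/systemd/system/gpu-persistence-mode.service
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now gpu-persistence-mode.service
```

Printing the unit to standard output keeps the helper side-effect free, so you can review the text before writing it under `/etc/systemd/system`.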

### `systemctl enable nvidia-persistenced` Returns "no installation config" Error

On Ubuntu 24.04 with NVIDIA driver 580.x, `nvidia-persistenced.service` is a **static** systemd unit. It starts automatically at boot as a dependency of the NVIDIA driver, so you do not need to enable it.

```bash
systemctl is-active nvidia-persistenced   # should print "active"
```

If the output is `active`, persistence mode is already managed by the daemon. Simply run `sudo nvidia-smi -pm 1` to enable it for the current session; the daemon will persist the setting across reboots. Running `systemctl enable nvidia-persistenced` returns the error above, which can be safely ignored.

On driver 570.x and earlier the unit is not static, so `systemctl enable` is required there.

---

## Docker Cannot Access the GPU

```bash
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

| Error | Fix |
| --- | --- |
| `docker: command not found` | Docker is not pre-installed on Ubuntu 24.04-based GPU images. Install it: `sudo apt install -y docker.io` |
| `unknown flag: --gpus` | Docker is too old. Upgrade Docker Engine to a current release. |
| `could not select device driver "" with capabilities: [[gpu]]` | NVIDIA Container Toolkit is missing or not configured. Install and configure it using the steps after this table. |
| `nvidia-container-cli: initialization error` | Driver is broken on the host. Fix `nvidia-smi` on the host first (see above). |

Install or repair the NVIDIA Container Toolkit:

```bash
# Ubuntu
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Re-run the test container after the install completes.

---

## Windows GPU Node: Driver Issues

### Device Manager Shows `Microsoft Basic Display Adapter`

The NVIDIA driver is not installed or did not load. Open PowerShell:

```powershell
Get-PnpDevice -Class Display
nvidia-smi
```

If `nvidia-smi` reports the same "couldn't communicate with the NVIDIA driver" message as Linux, reinstall the driver from the [NVIDIA Datacenter Driver](https://www.nvidia.com/Download/Find.aspx) page for the card and Windows Server version.

### Switching Between WDDM and TCC

For RDP-rendered visualization, the card must be in **WDDM** mode. For headless compute, **TCC** can give slightly better performance but disables the display path.

```powershell
nvidia-smi -dm 0   # WDDM (display + compute)
nvidia-smi -dm 1   # TCC (compute only; RDP desktop will not render through this GPU)
```

Reboot after changing the mode.

---

## When to Open a Support Ticket

Gather this before contacting support; it makes diagnosis significantly faster:

```bash
nvidia-smi -q > nvidia-smi-q.txt
dmesg | grep -i -E 'nvidia|nvrm|xid' > nvidia-dmesg.txt
uname -a > sysinfo.txt
cat /etc/os-release >> sysinfo.txt
cat /proc/driver/nvidia/version >> sysinfo.txt 2>/dev/null
```

Attach those three files plus the GPU node ID, the saved-image ID (if any), and the time range of the failure.
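
If you prefer a single attachment, the same three files can be gathered into one archive. The following is a minimal sketch under stated assumptions: the function name `collect_gpu_diagnostics` and the default directory name `gpu-diagnostics` are hypothetical, and each command is allowed to fail so that a node with a broken driver still produces a bundle.

```bash
#!/usr/bin/env bash
# Collect the three diagnostic files into one directory and archive it.
# Hypothetical helper; function and directory names are illustrative only.
collect_gpu_diagnostics() {
  local out_dir="${1:-gpu-diagnostics}"
  mkdir -p "$out_dir" || return 1

  # Failures are tolerated: the error text itself is useful to support.
  nvidia-smi -q > "$out_dir/nvidia-smi-q.txt" 2>&1 || true
  dmesg 2>/dev/null | grep -i -E 'nvidia|nvrm|xid' > "$out_dir/nvidia-dmesg.txt" || true
  {
    uname -a
    cat /etc/os-release 2>/dev/null
    cat /proc/driver/nvidia/version 2>/dev/null
  } > "$out_dir/sysinfo.txt" || true

  # Produce <out_dir>.tar.gz next to the directory.
  tar -czf "${out_dir}.tar.gz" -C "$(dirname "$out_dir")" "$(basename "$out_dir")"
}

# Usage:
#   collect_gpu_diagnostics gpu-diagnostics
#   # attach gpu-diagnostics.tar.gz to the ticket
```

On nodes where `dmesg` is restricted to root, run the function with `sudo` privileges or the kernel log file will be empty.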

---

## Related Resources

| Resource | Use it for |
| --- | --- |
| [Node Troubleshooting](/docs/myaccount/node/troubleshoot/) | Networking, disk, security, monitoring graphs, lifecycle issues. |
| [Node Not Accessible](/docs/myaccount/node/troubleshoot/node-not-accessible) | Can't SSH or RDP into the node at all. |
| [Disk Space Issues](/docs/myaccount/node/troubleshoot/disk-space) | GPU images and model weights fill the root disk fast. |
| [Connect to a Linux GPU node](/docs/myaccount/gpu/connect-to-gpu/linux-gpu-node) | Baseline `nvidia-smi` check after launch. |
| [Connect to a Windows GPU node](/docs/myaccount/gpu/connect-to-gpu/windows-gpu-node) | Baseline driver check on Windows. |
| [Manage GPU Nodes](/docs/myaccount/gpu/manage) | GPU-specific lifecycle and monitoring differences. |