---
title: Troubleshooting and FAQs
---

## FAQs

**1. What is the difference between a Training Cluster and an On-Demand Instance?**

An On-Demand Instance is a single containerized compute environment billed per hour while running. A Training Cluster is a dedicated pool of GPU nodes with Slurm-native scheduling, billed at a fixed plan rate. All Slurm jobs running on the cluster are included at no additional charge, and the cluster supports multi-node distributed training, elastic scaling, and high availability.

---

**2. How do I run a training job on my cluster?**

SSH into the cluster using the command shown in the **Details** tab under **Connection Details** (or click the **Connect** button for step-by-step instructions), then submit jobs using standard Slurm commands (`sbatch`, `srun`). The **Jobs** tab shows live `squeue` output (running, pending, completed, and failed jobs) without requiring SSH.

---

**3. What images are available?**

Training Cluster provides **Ubuntu Slurm** pre-built images in multiple versions. Each image includes pre-installed GPU drivers, CUDA, NCCL, and the Slurm runtime. You can also run any Docker or OCI-compatible image using Enroot; specify it during cluster creation in the **Image** field.

---

**4. Can I change the image after the cluster is created?**

Yes. Use **Update Image** from the Actions menu to change the container image on a running cluster without recreating it. The change applies after a node restart.

---

**5. Can I stop billing without terminating the cluster?**

No. Billing runs continuously while the cluster exists. To stop billing, use **Terminate Cluster** from the Actions menu. See [Billing](/docs/tir/TrainingCluster/tc-billing) for details.

---

**6. Where should I save checkpoints and training output?**

Save checkpoints and logs to your attached storage volumes (PFS, SFS, or Datasets), managed from the **Volumes** tab. Data stored only on a node's local filesystem does not persist across restarts.

---

**7. Can I scale the cluster without terminating it?**

Yes. Use **Scale Cluster** from the Actions menu to add nodes to a running cluster. Existing jobs and node reservations are preserved during scale-up. Attaching new storage volumes after creation requires a cluster restart.

---

**8. What happens if a node fails?**

Failures are isolated: other nodes continue running unaffected. The cluster resumes automatically after the failed node restarts. Use the **Nodes** tab and filter by **Failed** or **XID Errors** to identify the affected node. Checkpoint jobs regularly so they can resume from the last saved state if needed.

---

**9. What does "Convert to Committed" mean?**

It switches your cluster from On-Demand hourly pricing to a Committed pricing plan. Use **Convert to Committed** from the Actions menu. See [Billing](/docs/tir/TrainingCluster/tc-billing) for plan comparison details.

---

**10. How do I update SSH keys after cluster creation?**

Use **Update SSH Keys** from the Actions menu to add or replace SSH keys on all cluster nodes.

---

**11. Can I clone an existing cluster?**

Yes. Use **Clone Cluster** from the Actions menu to create a new cluster with the same configuration (image, plan, node count, and settings).

---

## Troubleshooting

### 1. Cluster Stuck in Creating State

**Cause:** Node provisioning is taking longer than usual or the requested plan is temporarily unavailable.

**Resolution:** Wait up to **10 minutes** for the cluster to reach **Running** state. If it remains stuck, note the **Cluster ID** from the **Details** tab and contact support.

---

### 2. SSH Connection Refused

**Cause:** The security group does not allow inbound TCP on port 22, or cluster nodes are not yet fully started.

**Resolution:**

1. Go to **Network & Security** → **Security Groups** and verify port 22 is open for inbound traffic.
2. Wait for the cluster status to show **Running** before attempting SSH.
3. Confirm the correct SSH key is associated with the cluster from **Connection Details** in the **Details** tab, or use **Update SSH Keys** from the Actions menu.

---

### 3. Jobs Stuck in Pending State

**Cause:** The requested resources exceed what is currently available on idle nodes, or all nodes are fully allocated.

**Resolution:**

1. Open the **Cluster Overview** tab and check **Idle Nodes** and **Allocated GPUs** to understand available capacity.
2. Open the **Jobs** tab to identify any long-running jobs holding resources.
3. If the cluster lacks capacity, use **Scale Cluster** from the Actions menu to add more nodes.

---

### 4. CUDA Out of Memory

**Cause:** The training batch size or model size exceeds the GPU memory on the allocated node.

**Resolution:**

- Reduce the batch size in your training script.
- Enable gradient accumulation to simulate a larger effective batch size with less GPU memory.
- Use mixed precision training (`torch.cuda.amp`) to reduce memory footprint.
- Open the **Nodes** tab to check per-GPU memory utilization across nodes.

---

### 5. NCCL Timeout or Initialization Error

**Cause:** Network communication between nodes failed during NCCL initialization, typically due to misconfigured environment variables or blocked inter-node traffic.

**Resolution:**

1. Confirm all nodes show **IDLE** or **ALLOCATED** in the **Cluster Overview** tab before launching multi-node jobs.
2. Ensure `MASTER_ADDR` and `MASTER_PORT` are correctly set in your job script.
3. Go to **Network & Security** → **Security Groups** and confirm inter-node communication ports are open.

---

### 6. XID Errors on a Node

**Cause:** NVIDIA GPU hardware error detected on one or more nodes.

**Resolution:**

1. Open the **Nodes** tab and filter by **XID Errors** to identify the affected node.
2. Use **Restart All Workers** from the Actions menu to restart compute nodes. If the error persists, use **Restart Cluster**.
3. If XID errors continue after restart, note the Cluster ID and contact support.

---

### 7. Storage Not Accessible

**Cause:** A volume was attached after cluster creation without restarting the cluster, or was not correctly mounted.

**Resolution:**

1. Open the **Volumes** tab and verify the volume shows **Mounted** status.
2. If it shows **Unmounted**, attach it from the **Volumes** tab. New volume attachments after creation require a cluster restart to take effect.

---

### 8. Logs Not Showing Expected Output

**Cause:** The wrong replica is selected, or the log buffer has not refreshed.

**Resolution:**

1. Open the **Logs** tab and confirm the correct replica is selected from the **Select Replica** dropdown (e.g., `slurm-controller`).
2. Enable **Auto Refresh** to stream logs continuously.
3. Adjust **Value of N** to load more lines if recent output is not visible.

---

## System Health Best Practices

- **Check node health before submitting large jobs:** Open the **Cluster Overview** tab and confirm nodes are IDLE and Slurm partitions are UP before submitting resource-intensive workloads.
- **Use the Jobs tab for status checks:** The **Jobs** tab shows full `squeue` output in real time, so there is no need to SSH in just to check job status.
- **Save to attached storage:** Always write checkpoints and logs to PFS, SFS, or Dataset volumes. Node-local data does not persist across restarts.
- **Checkpoint frequently:** Enable periodic checkpointing so jobs can resume from the last saved state after an unexpected node restart.
- **Contact support for stuck states:** If your cluster remains in a transitional state for more than **15 minutes**, note the Cluster ID and contact the support team.

---
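
The checkpointing advice above (write to attached storage, save periodically, resume from the last saved state) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a platform API: the function names `save_checkpoint`, `load_latest_checkpoint`, and `train` are hypothetical, JSON stands in for whatever format your framework uses, and the checkpoint directory is assumed to sit on an attached volume that supports POSIX-style atomic renames.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(ckpt_dir, step, state):
    """Write a checkpoint atomically so a node restart never leaves a partial file."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step_{step:08d}.json"
    # Write to a temporary file in the same directory, then rename it into place.
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, final_path)
    return final_path


def load_latest_checkpoint(ckpt_dir):
    """Return (step, state) from the newest checkpoint, or (0, None) if none exists."""
    ckpt_dir = Path(ckpt_dir)
    # Zero-padded step numbers make lexicographic order match numeric order.
    candidates = sorted(ckpt_dir.glob("step_*.json")) if ckpt_dir.is_dir() else []
    if not candidates:
        return 0, None
    with open(candidates[-1]) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(ckpt_dir, total_steps, save_every):
    """Toy loop: resume from the last checkpoint and save every `save_every` steps."""
    step, state = load_latest_checkpoint(ckpt_dir)
    state = state or {"loss": 1.0}
    while step < total_steps:
        step += 1
        state["loss"] *= 0.9  # stand-in for a real training step
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(ckpt_dir, step, state)
    return step, state
```

In a real job, `ckpt_dir` would point at the mount path of an attached PFS, SFS, or Dataset volume (the exact path depends on your cluster configuration), and the toy update would be replaced by your framework's own save and load calls. Resubmitting the same job after a node restart then continues from the newest checkpoint instead of starting from step zero.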