---
title: Troubleshooting and FAQs
---

## FAQs

**1. What is the difference between a Training Cluster and an On-Demand Instance?**

An On-Demand Instance is a single containerized compute environment billed per hour while running. A Training Cluster is a dedicated pool of GPU nodes with Slurm-native scheduling, billed at a fixed plan rate. All Slurm jobs running on the cluster are included at no additional charge, and the cluster supports multi-node distributed training, elastic scaling, and high availability.

---

**2. How do I run a training job on my cluster?**

SSH into the cluster using the command shown in the **Details** tab under **Connection Details** (or click the **Connect** button for step-by-step instructions), then submit jobs using standard Slurm commands (`sbatch`, `srun`). The **Jobs** tab shows live `squeue` output (running, pending, completed, and failed jobs) without requiring SSH.
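As a rough illustration of the submit step, the sketch below builds a small batch script and hands it to `sbatch` on standard input. The job name, node and GPU counts, and `train.py` are placeholders rather than values taken from this platform, and `--gres=gpu:N` assumes GPUs are exposed to Slurm as a generic resource on your cluster.

```python
import shutil
import subprocess


def build_batch_script(job_name, nodes, gpus_per_node, command, time_limit="01:00:00"):
    """Return the text of a Slurm batch script (all values are caller-supplied)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --gres=gpu:{gpus_per_node}",  # assumes GPUs are configured as a gres
        f"#SBATCH --time={time_limit}",
        "#SBATCH --output=%x-%j.out",  # %x = job name, %j = job ID
        "",
        f"srun {command}",
        "",
    ]
    return "\n".join(lines)


def submit(script_text):
    """Pipe the script to sbatch on stdin; return its reply, or None if sbatch is absent."""
    if shutil.which("sbatch") is None:
        return None
    result = subprocess.run(
        ["sbatch"], input=script_text, text=True, capture_output=True, check=True
    )
    return result.stdout.strip()


# Placeholder values for illustration only.
script = build_batch_script("demo-train", nodes=2, gpus_per_node=8, command="python train.py")
print(script)
```

Calling `submit(script)` from an SSH session on the cluster would queue the job, after which it should appear in the **Jobs** tab.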

---

**3. What images are available?**

Training Cluster provides **Ubuntu Slurm** pre-built images in multiple versions. Each image includes pre-installed GPU drivers, CUDA, NCCL, and the Slurm runtime. You can also run any Docker or OCI-compatible image using Enroot — specify it during cluster creation in the **Image** field.

---

**4. Can I change the image after the cluster is created?**

Yes. Use **Update Image** from the Actions menu to change the container image on a running cluster without recreating it. The change applies after a node restart.

---

**5. Can I stop billing without terminating the cluster?**

No. Billing runs continuously while the cluster exists. To stop billing, use **Terminate Cluster** from the Actions menu. See [Billing](/docs/tir/TrainingCluster/tc-billing) for details.

---

**6. Where should I save checkpoints and training output?**

Save checkpoints and logs to your attached storage volumes — PFS, SFS, or Datasets — managed from the **Volumes** tab. Data stored only on a node's local filesystem does not persist across restarts.
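A minimal sketch of a save-and-resume pattern for an attached volume follows. It assumes the volume is mounted at a directory you pass in (the path is yours to supply), and it uses JSON as a stand-in for whatever serialization your training framework provides.

```python
import json
import os
import tempfile


def save_checkpoint(ckpt_dir, step, state):
    """Write state for `step` to a temp file, then rename it into place."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, f"step-{step:08d}.json")
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename keeps readers from ever seeing a half-written checkpoint.
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(ckpt_dir):
    """Return the newest checkpoint dict, or None when starting fresh."""
    if not os.path.isdir(ckpt_dir):
        return None
    names = sorted(
        n for n in os.listdir(ckpt_dir) if n.startswith("step-") and n.endswith(".json")
    )
    if not names:
        return None
    # Zero-padded step numbers sort correctly as plain strings.
    with open(os.path.join(ckpt_dir, names[-1])) as f:
        return json.load(f)
```

At job start, call `load_latest_checkpoint` first and continue from the returned `step` if it is not `None`; this lets a resubmitted job pick up where the previous run stopped.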

---

**7. Can I scale the cluster without terminating it?**

Yes. Use **Scale Cluster** from the Actions menu to add nodes to a running cluster. Existing jobs and node reservations are preserved during scale-up. Attaching new storage volumes after creation requires a cluster restart.

---

**8. What happens if a node fails?**

Failures are isolated — other nodes continue running unaffected. The cluster resumes automatically after the failed node restarts. Use the **Nodes** tab and filter by **Failed** or **XID Errors** to identify the affected node. Checkpoint jobs regularly so they can resume from the last saved state if needed.

---

**9. What does "Convert to Committed" mean?**

It switches your cluster from On-Demand hourly pricing to a Committed pricing plan. Use **Convert to Committed** from the Actions menu. See [Billing](/docs/tir/TrainingCluster/tc-billing) for plan comparison details.

---

**10. How do I update SSH keys after cluster creation?**

Use **Update SSH Keys** from the Actions menu to add or replace SSH keys on all cluster nodes.

---

**11. Can I clone an existing cluster?**

Yes. Use **Clone Cluster** from the Actions menu to create a new cluster with the same configuration — image, plan, node count, and settings.

---

## Troubleshooting

### 1. Cluster Stuck in Creating State

**Cause:** Node provisioning is taking longer than usual or the requested plan is temporarily unavailable.

**Resolution:** Wait up to **10 minutes** for the cluster to reach **Running** state. If it remains stuck, note the **Cluster ID** from the **Details** tab and contact support.

---

### 2. SSH Connection Refused

**Cause:** The security group does not allow inbound TCP on port 22, or cluster nodes are not yet fully started.

**Resolution:**
1. Go to **Network & Security** → **Security Groups** and verify port 22 is open for inbound traffic.
2. Wait for the cluster status to show **Running** before attempting SSH.
3. Confirm the correct SSH key is associated with the cluster from **Connection Details** in the **Details** tab, or use **Update SSH Keys** from the Actions menu.
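To separate a network problem from a key problem, it can help to test whether port 22 is reachable at all. The sketch below is one way to do that from your workstation; the host value is whatever address appears in **Connection Details**.

```python
import socket


def port_is_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A `True` result only shows that something accepts connections on the port; a rejected login after that points to the SSH key rather than the security group.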

---

### 3. Jobs Stuck in Pending State

**Cause:** The requested resources exceed what is currently available on idle nodes, or all nodes are fully allocated.

**Resolution:**
1. Open the **Cluster Overview** tab — check **Idle Nodes** and **Allocated GPUs** to understand available capacity.
2. Open the **Jobs** tab to identify any long-running jobs holding resources.
3. If the cluster lacks capacity, use **Scale Cluster** from the Actions menu to add more nodes.
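If you are already connected over SSH, the reason Slurm records for each pending job can narrow down the cause. The sketch below summarizes those reasons from `squeue` output; the `|`-separated format string is a choice made for this example, not something the platform requires.

```python
import shutil
import subprocess
from collections import Counter

SQUEUE_FORMAT = "%i|%T|%r"  # job ID | state | reason


def read_squeue():
    """Run squeue without a header; return its text, or None if squeue is absent."""
    if shutil.which("squeue") is None:
        return None
    result = subprocess.run(
        ["squeue", "-h", "-o", SQUEUE_FORMAT], text=True, capture_output=True, check=True
    )
    return result.stdout


def pending_reasons(squeue_text):
    """Count PENDING jobs by the reason Slurm reports for them."""
    reasons = Counter()
    for line in squeue_text.splitlines():
        parts = line.strip().split("|")
        if len(parts) == 3 and parts[1] == "PENDING":
            reasons[parts[2]] += 1
    return reasons
```

A reason of `Resources` means the job is waiting for nodes or GPUs to free up, while `Priority` means other queued jobs are ahead of it.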

---

### 4. CUDA Out of Memory

**Cause:** The model, together with the chosen batch size, needs more memory than the GPUs on the allocated node provide.

**Resolution:**
- Reduce the batch size in your training script.
- Enable gradient accumulation to simulate a larger effective batch size with less GPU memory.
- Use mixed precision training (`torch.amp`, exposed as `torch.cuda.amp` in older PyTorch releases) to reduce memory footprint.
- Open the **Nodes** tab to check per-GPU memory utilization across nodes.
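The idea behind gradient accumulation can be checked with a small framework-free example. The sketch below uses a one-parameter least-squares model (chosen only for illustration) and shows that averaging equal-sized micro-batch gradients reproduces the full-batch gradient, so only one micro-batch has to fit in memory at a time.

```python
import math


def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch):
    """Accumulate micro-batch gradients, each scaled by 1 / number of micro-batches."""
    if len(xs) % micro_batch != 0:
        raise ValueError("batch size must be a multiple of micro_batch")
    steps = len(xs) // micro_batch
    total = 0.0
    for i in range(steps):
        lo = i * micro_batch
        hi = lo + micro_batch
        # Only this slice needs to be resident in memory at once.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch=2)
print(full, accumulated, math.isclose(full, accumulated))
```

In a real training loop the same pattern applies: divide each micro-batch loss by the number of accumulation steps and run the optimizer step only after the last micro-batch.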

---

### 5. NCCL Timeout or Initialization Error

**Cause:** Network communication between nodes failed during NCCL initialization — typically due to misconfigured environment variables or blocked inter-node traffic.

**Resolution:**
1. Confirm all nodes show **IDLE** or **ALLOCATED** in the **Cluster Overview** tab before launching multi-node jobs.
2. Ensure `MASTER_ADDR` and `MASTER_PORT` are correctly set in your job script.
3. Go to **Network & Security** → **Security Groups** and confirm inter-node communication ports are open.
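One common way to keep `MASTER_ADDR` and `MASTER_PORT` consistent across ranks is to derive both from values Slurm already sets for the job. The sketch below takes that approach; the port rule (a base port plus the job ID modulo a span) is an assumption for illustration, so pick a range that your security group actually allows.

```python
import os
import shutil
import subprocess


def expand_nodelist(nodelist):
    """Expand a Slurm nodelist expression via scontrol; return None if scontrol is absent."""
    if shutil.which("scontrol") is None:
        return None
    result = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        text=True, capture_output=True, check=True,
    )
    return result.stdout.split()


def rendezvous_env(hostnames, job_id, base_port=20000, port_span=10000):
    """Use the first host as master and derive a per-job port (hypothetical port rule)."""
    if not hostnames:
        raise ValueError("hostnames must not be empty")
    return {
        "MASTER_ADDR": hostnames[0],
        "MASTER_PORT": str(base_port + int(job_id) % port_span),
    }


# Inside a running job, Slurm provides both inputs as environment variables.
if "SLURM_JOB_NODELIST" in os.environ and "SLURM_JOB_ID" in os.environ:
    hosts = expand_nodelist(os.environ["SLURM_JOB_NODELIST"])
    if hosts:
        os.environ.update(rendezvous_env(hosts, os.environ["SLURM_JOB_ID"]))
```

Because every rank sees the same node list and job ID, each one computes the same address and port without any extra coordination.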

---

### 6. XID Errors on a Node

**Cause:** NVIDIA GPU hardware error detected on one or more nodes.

**Resolution:**
1. Open the **Nodes** tab and filter by **XID Errors** to identify the affected node.
2. Use **Restart All Workers** from the Actions menu to restart compute nodes. If the error persists, use **Restart Cluster**.
3. If XID errors continue after restart, note the Cluster ID and contact support.
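When gathering details for a support request, the Xid code itself is useful. The NVIDIA driver reports these events in the kernel log on lines containing `NVRM: Xid`, and the sketch below extracts the PCI address and code from such text (for example, output of `dmesg` captured on the affected node, which may need elevated privileges). The sample line in the usage note is illustrative, not taken from a real cluster.

```python
import re

# Matches the PCI address and numeric code in lines such as "NVRM: Xid (PCI:...): 79, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9A-Fa-f:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return (pci_address, xid_code) pairs found in kernel log text."""
    return [(m.group(1), int(m.group(2))) for m in XID_PATTERN.finditer(log_text)]
```

For instance, passing text that contains `NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.` yields the pair `("PCI:0000:3b:00", 79)`, which can be quoted alongside the Cluster ID.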

---

### 7. Storage Not Accessible

**Cause:** A volume was attached after cluster creation without restarting the cluster, or was not correctly mounted.

**Resolution:**
1. Open the **Volumes** tab and verify the volume shows **Mounted** status.
2. If it shows **Unmounted**, attach it from the **Volumes** tab, then restart the cluster. Volumes attached after creation are not mounted until the cluster restarts.
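From inside a node, a quick check of the mount path can confirm what the **Volumes** tab reports. The sketch below assumes you know the directory where the volume should appear; that path is not specified here, so pass your own.

```python
import os
import tempfile


def check_volume(path):
    """Report whether `path` exists, is a mount point, and accepts writes."""
    status = {"exists": os.path.isdir(path), "is_mount": False, "writable": False}
    if not status["exists"]:
        return status
    status["is_mount"] = os.path.ismount(path)
    try:
        # Creating and auto-deleting a temp file proves write access.
        with tempfile.NamedTemporaryFile(dir=path):
            status["writable"] = True
    except OSError:
        pass
    return status
```

Note that `is_mount` is `False` for a subdirectory of a mounted volume, so check the mount root itself before concluding that the volume is missing.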

---

### 8. Logs Not Showing Expected Output

**Cause:** The wrong replica is selected, or the log buffer has not refreshed.

**Resolution:**
1. Open the **Logs** tab and confirm the correct replica is selected from the **Select Replica** dropdown (e.g., `slurm-controller`).
2. Enable **Auto Refresh** to stream logs continuously.
3. Increase **Value of N** to load more lines if the output you expect is not visible.

---

## System Health Best Practices

- **Check node health before submitting large jobs** — Open the **Cluster Overview** tab and confirm nodes are IDLE and Slurm partitions are UP before submitting resource-intensive workloads.
- **Use the Jobs tab for status checks** — The Jobs tab shows full squeue output in real time — no need to SSH in just to check job status.
- **Save to attached storage** — Always write checkpoints and logs to PFS, SFS, or Dataset volumes. Node-local data does not persist across restarts.
- **Checkpoint frequently** — Enable periodic checkpointing so jobs can resume from the last saved state after an unexpected node restart.
- **Contact support for stuck states** — If your cluster remains in a transitional state for more than **15 minutes**, note the Cluster ID and contact the support team.
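The pre-submission health check described above can also be scripted from an SSH session. The sketch below counts idle nodes per available partition from `sinfo` output; the format string is chosen for this example, and the exact-match on `idle` deliberately ignores nodes whose state carries a suffix flag.

```python
import shutil
import subprocess

SINFO_FORMAT = "%P|%a|%t|%D"  # partition | availability | node state | node count


def read_sinfo():
    """Run sinfo without a header; return its text, or None if sinfo is absent."""
    if shutil.which("sinfo") is None:
        return None
    result = subprocess.run(
        ["sinfo", "-h", "-o", SINFO_FORMAT], text=True, capture_output=True, check=True
    )
    return result.stdout


def idle_nodes_by_partition(sinfo_text):
    """Sum idle node counts for partitions whose availability is 'up'."""
    idle = {}
    for line in sinfo_text.splitlines():
        parts = line.strip().split("|")
        if len(parts) != 4:
            continue
        partition, avail, state, count = parts
        partition = partition.rstrip("*")  # sinfo marks the default partition with '*'
        if avail == "up" and state == "idle":
            idle[partition] = idle.get(partition, 0) + int(count)
    return idle
```

Comparing the idle count with the `--nodes` value of the job you are about to submit gives an early hint of whether it will start immediately or wait in the queue.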


---