Troubleshooting and FAQs
FAQs
1. What is the difference between a Training Cluster and an On-Demand Instance?
An On-Demand Instance is a single containerized compute environment billed per hour while running. A Training Cluster is a dedicated pool of GPU nodes with Slurm-native scheduling, billed at a fixed plan rate. All Slurm jobs running on the cluster are included at no additional charge, and the cluster supports multi-node distributed training, elastic scaling, and high availability.
2. How do I run a training job on my cluster?
SSH into the cluster using the command shown in the Details tab under Connection Details (or click the Connect button for step-by-step instructions), then submit jobs using standard Slurm commands (sbatch, srun). The Jobs tab shows live squeue output — running, pending, completed, and failed jobs — without requiring SSH.
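For example, a minimal batch script might look like the sketch below (the GPU count, script name, and the /mnt/pfs output path are placeholders; adjust them to your cluster and attached volumes):

```bash
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
# Write logs to attached storage, not the node-local disk (path is a placeholder)
#SBATCH --output=/mnt/pfs/logs/%x-%j.out

srun python train.py
```

Submit it with sbatch train.sbatch, then track it from the Jobs tab or with squeue.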
3. What images are available?
Training Cluster provides Ubuntu Slurm pre-built images in multiple versions. Each image includes pre-installed GPU drivers, CUDA, NCCL, and the Slurm runtime. You can also run any Docker or OCI-compatible image using Enroot — specify it during cluster creation in the Image field.
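If you want to pull and test a container interactively on a node, Enroot's standard workflow looks roughly like the sketch below. The NGC PyTorch image and tag are only an illustration, and direct Enroot CLI usage on the nodes is an assumption rather than a documented requirement; the supported path is specifying the image in the Image field at cluster creation.

```bash
# Pull an OCI image and convert it to an Enroot squashfs file
enroot import docker://nvcr.io/nvidia/pytorch:24.05-py3
# Create a container from the imported file and run a quick GPU check inside it
enroot create --name pytorch nvcr.io+nvidia+pytorch+24.05-py3.sqsh
enroot start pytorch python -c "import torch; print(torch.cuda.is_available())"
```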
4. Can I change the image after the cluster is created?
Yes. Use Update Image from the Actions menu to change the container image on a running cluster without recreating it. The change applies after a node restart.
5. Can I stop billing without terminating the cluster?
No. Billing runs continuously while the cluster exists. To stop billing, use Terminate Cluster from the Actions menu. See Billing for details.
6. Where should I save checkpoints and training output?
Save checkpoints and logs to your attached storage volumes — PFS, SFS, or Datasets — managed from the Volumes tab. Data stored only on a node's local filesystem does not persist across restarts.
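A sketch of how a job script might direct output to an attached volume (the /mnt/pfs mount path is a placeholder, and --checkpoint-dir is a hypothetical flag of your own training script):

```bash
# Keep checkpoints on an attached volume so they survive node restarts
CKPT_DIR=/mnt/pfs/checkpoints/$SLURM_JOB_ID
mkdir -p "$CKPT_DIR"
srun python train.py --checkpoint-dir "$CKPT_DIR"
```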
7. Can I scale the cluster without terminating it?
Yes. Use Scale Cluster from the Actions menu to add nodes to a running cluster. Existing jobs and node reservations are preserved during scale-up. Attaching new storage volumes after creation requires a cluster restart.
8. What happens if a node fails?
Failures are isolated — other nodes continue running unaffected. The cluster resumes automatically after the failed node restarts. Use the Nodes tab and filter by Failed or XID Errors to identify the affected node. Checkpoint jobs regularly so they can resume from the last saved state if needed.
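To cross-check node state from the command line, standard Slurm tools can help (a sketch; node names come from the Nodes tab):

```bash
# List nodes that Slurm has marked down or drained, with the recorded reason
sinfo -R
# Inspect the detailed state of a specific node (replace <nodename>)
scontrol show node <nodename>
```

Adding #SBATCH --requeue to job scripts lets Slurm resubmit a job automatically if its node fails, which pairs well with regular checkpointing.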
9. What does "Convert to Committed" mean?
It switches your cluster from On-Demand hourly pricing to a Committed pricing plan. Use Convert to Committed from the Actions menu. See Billing for plan comparison details.
10. How do I update SSH keys after cluster creation?
Use Update SSH Keys from the Actions menu to add or replace SSH keys on all cluster nodes.
11. Can I clone an existing cluster?
Yes. Use Clone Cluster from the Actions menu to create a new cluster with the same configuration — image, plan, node count, and settings.
Troubleshooting
1. Cluster Stuck in Creating State
Cause: Node provisioning is taking longer than usual or the requested plan is temporarily unavailable.
Resolution: Wait up to 10 minutes for the cluster to reach Running state. If it remains stuck, note the Cluster ID from the Details tab and contact support.
2. SSH Connection Refused
Cause: The security group does not allow inbound TCP on port 22, or cluster nodes are not yet fully started.
Resolution:
- Go to Network & Security → Security Groups and verify port 22 is open for inbound traffic.
- Wait for the cluster status to show Running before attempting SSH.
- Confirm the correct SSH key is associated with the cluster from Connection Details in the Details tab, or use Update SSH Keys from the Actions menu.
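From your local machine, two quick checks can help tell these causes apart (a sketch; replace the placeholders with the host and user shown in Connection Details):

```bash
# Is port 22 reachable at all? Fails fast if the security group blocks it
nc -zv <cluster-host> 22
# Verbose SSH output shows which key is offered and why authentication fails
ssh -v <user>@<cluster-host>
```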
3. Jobs Stuck in Pending State
Cause: The requested resources exceed what is currently available on idle nodes, or all nodes are fully allocated.
Resolution:
- Open the Cluster Overview tab — check Idle Nodes and Allocated GPUs to understand available capacity.
- Open the Jobs tab to identify any long-running jobs holding resources.
- If the cluster lacks capacity, use Scale Cluster from the Actions menu to add more nodes.
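From the command line, the squeue reason column usually explains why a job is waiting (a sketch using standard Slurm format options):

```bash
# Pending jobs with their reason (e.g. Resources, Priority)
squeue -u $USER -t PENDING -o "%.10i %.20j %.10T %.20R"
# Per-partition availability and node states
sinfo -o "%.15P %.10a %.6D %.15T"
```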
4. CUDA Out of Memory
Cause: The training batch size or model size exceeds the GPU memory on the allocated node.
Resolution:
- Reduce the batch size in your training script.
- Enable gradient accumulation to simulate a larger effective batch size with less GPU memory.
- Use mixed precision training (torch.cuda.amp) to reduce memory footprint.
- Open the Nodes tab to check per-GPU memory utilization across nodes.
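To confirm where memory is being exhausted, you can also query GPU memory directly on the node hosting your job (standard nvidia-smi usage):

```bash
# One line per GPU: index, memory in use, total memory
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
```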
5. NCCL Timeout or Initialization Error
Cause: Network communication between nodes failed during NCCL initialization — typically due to misconfigured environment variables or blocked inter-node traffic.
Resolution:
- Confirm all nodes show IDLE or ALLOCATED in the Cluster Overview tab before launching multi-node jobs.
- Ensure MASTER_ADDR and MASTER_PORT are correctly set in your job script.
- Go to Network & Security → Security Groups and confirm inter-node communication ports are open.
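A common pattern is to derive the rendezvous address from the Slurm node list inside the job script, as in this sketch (the port value is arbitrary and train.py stands in for your own entry point):

```bash
# Use the first node in the allocation as the rendezvous host
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500   # any free port; 29500 is a common convention
export NCCL_DEBUG=INFO     # verbose NCCL logging helps pinpoint where init fails
srun python train.py
```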
6. XID Errors on a Node
Cause: NVIDIA GPU hardware error detected on one or more nodes.
Resolution:
- Open the Nodes tab and filter by XID Errors to identify the affected node.
- Use Restart All Workers from the Actions menu to restart compute nodes. If the error persists, use Restart Cluster.
- If XID errors continue after restart, note the Cluster ID and contact support.
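If you have SSH access to the affected node, the driver records Xid events in the kernel log, which can be useful detail to include in a support request (reading the log may require root privileges):

```bash
# Show GPU Xid events with human-readable timestamps
sudo dmesg -T | grep -i xid
```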
7. Storage Not Accessible
Cause: A volume was attached after cluster creation without restarting the cluster, or was not correctly mounted.
Resolution:
- Open the Volumes tab and verify the volume shows Mounted status.
- If it shows Unmounted, attach it from the Volumes tab — new volume attachments after creation require a cluster restart to take effect.
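You can also verify the mount directly on a node (a sketch; /mnt/pfs is a placeholder for the mount path shown in the Volumes tab):

```bash
# Confirm the volume is mounted and has free space
df -h /mnt/pfs
mount | grep /mnt/pfs
```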
8. Logs Not Showing Expected Output
Cause: The wrong replica is selected, or the log buffer has not refreshed.
Resolution:
- Open the Logs tab and confirm the correct replica is selected from the Select Replica dropdown (e.g., slurm-controller).
- Enable Auto Refresh to stream logs continuously.
- Increase the Value of N setting to load more lines if recent output is not visible.
System Health Best Practices
- Check node health before submitting large jobs — Open the Cluster Overview tab and confirm nodes are IDLE and Slurm partitions are UP before submitting resource-intensive workloads.
- Use the Jobs tab for status checks — The Jobs tab shows full squeue output in real time — no need to SSH in just to check job status.
- Save to attached storage — Always write checkpoints and logs to PFS, SFS, or Dataset volumes. Node-local data does not persist across restarts.
- Checkpoint frequently — Enable periodic checkpointing so jobs can resume from the last saved state after an unexpected node restart.
- Contact support for stuck states — If your cluster remains in a transitional state for more than 15 minutes, note the Cluster ID and contact the support team.