Running Training Jobs
Training jobs are submitted to your cluster with Slurm after connecting over SSH. The job's container environment is determined by the image selected at cluster creation, or by a custom image run via Enroot.
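If the cluster's Slurm installation includes the pyxis plugin (an assumption; check with your administrator), a custom Enroot image can be selected per job with `--container-image`. A minimal sketch, with an illustrative image URI and mount path:

```bash
# Sketch: run a job inside a custom Enroot container image.
# Assumes the pyxis Slurm plugin is installed; image and mount paths are illustrative.
srun --gpus=1 \
     --container-image=nvcr.io/nvidia/pytorch:24.05-py3 \
     --container-mounts=/mnt/shared:/workspace \
     python /workspace/train.py
```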
Supported Frameworks
Choose a framework guide based on your workload and training approach.
| Framework | Best For |
|---|---|
| PyTorch Distributed | Multi-GPU DDP training across nodes |
| PyTorch Lightning | High-level distributed training abstraction |
| Slurm | Batch job scheduling for multi-node workloads |
| OpenMPI | MPI-based multi-node distributed training |
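As a concrete starting point, the sketch below shows a minimal multi-node PyTorch DDP submission using `torchrun`. The job name, partition defaults, node and GPU counts, and script name are illustrative assumptions; adapt them to your cluster and the relevant framework guide.

```bash
#!/bin/bash
# Minimal multi-node PyTorch DDP sketch (illustrative values; adjust to your cluster).
#SBATCH --job-name=ddp-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1      # one torchrun launcher per node
#SBATCH --gpus-per-node=8
#SBATCH --time=04:00:00

# Use the first allocated node as the rendezvous host.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
    train.py
```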
Submitting Jobs
Connect to your cluster via SSH, then submit jobs using standard Slurm commands:
```bash
# Submit a batch job
sbatch my_training_job.sh

# Run an interactive job
srun --gpus=1 python train.py

# Check job status
squeue
```
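For reference, `my_training_job.sh` could look roughly like the following. The script name, resource requests, and training command are assumptions, not prescribed defaults:

```bash
#!/bin/bash
# Hypothetical contents of my_training_job.sh (resource values are illustrative).
#SBATCH --job-name=my-training-job
#SBATCH --gpus=1
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00
#SBATCH --output=%x-%j.out       # write output to <job-name>-<job-id>.out

srun python train.py --epochs 10
```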
The Jobs tab on the cluster dashboard shows all running, pending, completed, and failed jobs in real time — no SSH required for status monitoring.
Monitoring Jobs
| Where | What you see |
|---|---|
| Jobs tab | Full squeue — job ID, name, user, partition, nodes, GPUs, run time, state, priority |
| Monitoring tab | Per-node GPU utilisation % over selectable time windows |
| Nodes tab | Node health, DCGM metrics, XID error detection |
| Logs tab | Slurm controller and node logs |
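The same information is also available from the command line with standard Slurm tools, for example:

```bash
# Inspect a single job's full configuration and state (replace <jobid>).
scontrol show job <jobid>

# Show accounting history for completed and failed jobs, if job accounting is enabled.
sacct --format=JobID,JobName,State,Elapsed,AllocTRES
```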
Storage for Training Data
All cluster nodes have access to attached storage volumes. Save training data, checkpoints, and logs to these paths — not to the node's local filesystem, which does not persist across restarts.
| Storage Type | Use Case |
|---|---|
| Parallel File System (PFS) | High-throughput I/O for large datasets and checkpoints |
| Shared File System (SFS) | POSIX-compatible shared storage across nodes |
| Datasets | Managed versioned datasets |
Manage attached volumes from the Volumes tab on the cluster detail page.
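Inside a job script, point data and checkpoint paths at an attached volume's mount point rather than a node-local directory. The mount paths below are placeholders (assumptions); use the actual paths shown for your volumes.

```bash
# Placeholder mount points: substitute the real paths of your attached volumes.
DATA_DIR=/mnt/pfs/datasets/my-dataset
CKPT_DIR=/mnt/pfs/checkpoints/my-run

mkdir -p "$CKPT_DIR"
srun python train.py --data-dir "$DATA_DIR" --checkpoint-dir "$CKPT_DIR"
```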