Running Training Jobs

Training jobs are submitted to your cluster through Slurm after you connect over SSH. A job's container environment is determined by the image selected at cluster creation, or by a custom image run via Enroot.
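
For a custom image, a typical Enroot workflow is to import the image, create a container filesystem from it, and start it. The image name and tag below are illustrative; substitute whatever image your training code needs, and note that the generated .sqsh filename follows the image name.

# Import a container image from a registry (produces a .sqsh file)
enroot import docker://nvcr.io#nvidia/pytorch:24.05-py3

# Create a container root filesystem from the imported image
enroot create --name pytorch nvidia+pytorch+24.05-py3.sqsh

# Start the container and run a command inside it
enroot start pytorch python --version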

Supported Frameworks

Choose a framework guide based on your workload and training approach.

Framework | Best For
PyTorch Distributed | Multi-GPU DDP training across nodes
PyTorch Lightning | High-level distributed training abstraction
Slurm | Batch job scheduling for multi-node workloads
OpenMPI | MPI-based multi-node distributed training
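
As a sketch of how a framework and Slurm combine, the batch script below launches PyTorch DDP with torchrun across two nodes. The node and GPU counts, rendezvous port, and train.py are illustrative placeholders; adapt them to your cluster and code.

#!/bin/bash
#SBATCH --job-name=ddp-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --time=04:00:00
#SBATCH --output=%x-%j.out

# The first node in the allocation acts as the rendezvous host
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# One torchrun launcher per node; torchrun spawns one worker per GPU
srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${MASTER_ADDR}:29500" \
    --rdzv_id="$SLURM_JOB_ID" \
    train.py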

Submitting Jobs

Connect to your cluster via SSH, then submit jobs using standard Slurm commands:

# Submit a batch job
sbatch my_training_job.sh

# Run an interactive job
srun --gpus=1 python train.py

# Check job status
squeue
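
A minimal my_training_job.sh might look like the following sketch; the job name, GPU count, time limit, and train.py are placeholders to adapt to your workload.

#!/bin/bash
#SBATCH --job-name=my-training-job
#SBATCH --gpus=1
#SBATCH --time=01:00:00
#SBATCH --output=%x-%j.out

# Launch the training script on the allocated GPU
srun python train.py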

The Jobs tab on the cluster dashboard shows all running, pending, completed, and failed jobs in real time — no SSH required for status monitoring.


Monitoring Jobs

Where | What you see
Jobs tab | Full squeue output: job ID, name, user, partition, nodes, GPUs, run time, state, priority
Monitoring tab | Per-node GPU utilisation (%) over selectable time windows
Nodes tab | Node health, DCGM metrics, XID error detection
Logs tab | Slurm controller and node logs
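
Much of the same job information is also available over SSH with standard Slurm tools; the job ID below is a placeholder.

# Jobs for the current user
squeue -u $USER

# Accounting details for a running or finished job
sacct -j <JOBID> --format=JobID,JobName,State,Elapsed,NodeList

# Node state across the cluster
sinfo -N -l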

Storage for Training Data

All cluster nodes have access to attached storage volumes. Save training data, checkpoints, and logs to the mounted volume paths rather than to a node's local filesystem, which does not persist across restarts.

Storage Type | Use Case
Parallel File System (PFS) | High-throughput I/O for large datasets and checkpoints
Shared File System (SFS) | POSIX-compatible shared storage across nodes
Datasets | Managed, versioned datasets
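
Inside a batch script, point data and checkpoint paths at the mounted volumes rather than local disk. The mount points and command-line flags below are hypothetical placeholders; substitute the paths of your attached volumes and the options your training script actually accepts.

# Hypothetical mount points for attached PFS and SFS volumes
DATA_DIR=/mnt/pfs/datasets/my-dataset
CKPT_DIR=/mnt/sfs/checkpoints/$SLURM_JOB_ID

mkdir -p "$CKPT_DIR"

# Pass the shared paths to the training script (flags are illustrative)
srun python train.py --data-dir "$DATA_DIR" --checkpoint-dir "$CKPT_DIR"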

Manage attached volumes from the Volumes tab on the cluster detail page.