Running Training Jobs

Training jobs are submitted to your cluster through Slurm after you connect over SSH. A job's container environment is determined by the image selected at cluster creation, or by a custom image run via Enroot.
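
For a custom image, a typical Enroot workflow is to import the image, create a container filesystem from it, and start it. The image name and tag below are illustrative; substitute whatever image your training code needs, and note that the generated .sqsh filename follows the image name.

# Import a container image from a registry (produces a .sqsh file)
enroot import docker://nvcr.io#nvidia/pytorch:24.05-py3

# Create a container root filesystem from the imported image
enroot create --name pytorch nvidia+pytorch+24.05-py3.sqsh

# Start the container and run a command inside it
enroot start pytorch python --version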

Supported Frameworks

Choose a framework guide based on your workload and training approach.

Framework | Best For
PyTorch Distributed | Multi-GPU DDP training across nodes
PyTorch Lightning | High-level distributed training abstraction
Slurm | Batch job scheduling for multi-node workloads
OpenMPI | MPI-based multi-node distributed training
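
As a sketch of how a framework and Slurm combine, the batch script below launches PyTorch DDP with torchrun across two nodes. The node and GPU counts, rendezvous port, and train.py are illustrative placeholders; adapt them to your cluster and code.

#!/bin/bash
#SBATCH --job-name=ddp-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --time=04:00:00
#SBATCH --output=%x-%j.out

# The first node in the allocation acts as the rendezvous host
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# One torchrun launcher per node; torchrun spawns one worker per GPU
srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${MASTER_ADDR}:29500" \
    --rdzv_id="$SLURM_JOB_ID" \
    train.py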

Submitting Jobs

Connect to your cluster via SSH, then submit jobs using standard Slurm commands:

# Submit a batch job
sbatch my_training_job.sh

# Run an interactive job
srun --gpus=1 python train.py

# Check job status
squeue
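
A minimal my_training_job.sh might look like the following sketch; the job name, GPU count, time limit, and train.py are placeholders to adapt to your workload.

#!/bin/bash
#SBATCH --job-name=my-training-job
#SBATCH --gpus=1
#SBATCH --time=01:00:00
#SBATCH --output=%x-%j.out

# Launch the training script on the allocated GPU
srun python train.py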

The Jobs tab on the cluster dashboard shows all running, pending, completed, and failed jobs in real time — no SSH required for status monitoring.


Monitoring Jobs

Where | What you see
Jobs tab | Full squeue output: job ID, name, user, partition, nodes, GPUs, run time, state, priority
Monitoring tab | Per-node GPU utilisation (%) over selectable time windows
Nodes tab | Node health, DCGM metrics, XID error detection
Logs tab | Slurm controller and node logs
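
Much of the same job information is also available over SSH with standard Slurm tools; the job ID below is a placeholder.

# Jobs for the current user
squeue -u $USER

# Accounting details for a running or finished job
sacct -j <JOBID> --format=JobID,JobName,State,Elapsed,NodeList

# Node state across the cluster
sinfo -N -l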

Storage for Training Data

All cluster nodes have access to attached storage volumes. Save training data, checkpoints, and logs to the mounted volume paths rather than to a node's local filesystem, which does not persist across restarts.

Storage Type | Use Case
Parallel File System (PFS) | High-throughput I/O for large datasets and checkpoints
Shared File System (SFS) | POSIX-compatible shared storage across nodes
Datasets | Managed, versioned datasets
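
Inside a batch script, point data and checkpoint paths at the mounted volumes rather than local disk. The mount points and command-line flags below are hypothetical placeholders; substitute the paths of your attached volumes and the options your training script actually accepts.

# Hypothetical mount points for attached PFS and SFS volumes
DATA_DIR=/mnt/pfs/datasets/my-dataset
CKPT_DIR=/mnt/sfs/checkpoints/$SLURM_JOB_ID

mkdir -p "$CKPT_DIR"

# Pass the shared paths to the training script (flags are illustrative)
srun python train.py --data-dir "$DATA_DIR" --checkpoint-dir "$CKPT_DIR"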

Manage attached volumes from the Volumes tab on the cluster detail page.