---
title: Running Training Jobs
---

Training jobs are submitted to your cluster through Slurm after you connect over SSH. The job's container environment is determined by the image selected at cluster creation, or by a custom image run via Enroot.

import { DeploymentCards } from '@site/docs/tir/TrainingCluster/deployments/deployments'

## Supported Frameworks

Choose a framework guide based on your workload and training approach.

| Framework | Best For |
|-----------|----------|
| **PyTorch Distributed** | Multi-GPU DDP training across nodes |
| **PyTorch Lightning** | High-level distributed training abstraction |
| **Slurm** | Batch job scheduling for multi-node workloads |
| **OpenMPI** | MPI-based multi-node distributed training |

---

## Submitting Jobs

Connect to your cluster via SSH, then submit jobs using standard Slurm commands:

```bash
# Submit a batch job
sbatch my_training_job.sh

# Run an interactive job
srun --gpus=1 python train.py

# Check job status
squeue
```

The **Jobs** tab on the cluster dashboard shows all running, pending, completed, and failed jobs in real time, so no SSH session is required for status monitoring.

---

## Monitoring Jobs

| Where | What you see |
|-------|-------------|
| **Jobs tab** | Full `squeue` output: job ID, name, user, partition, nodes, GPUs, run time, state, priority |
| **Monitoring tab** | Per-node GPU utilisation % over selectable time windows |
| **Nodes tab** | Node health, DCGM metrics, XID error detection |
| **Logs tab** | Slurm controller and node logs |

---

## Storage for Training Data

All cluster nodes have access to attached storage volumes. Save training data, checkpoints, and logs to these paths, not to the node's local filesystem, which does not persist across restarts.
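
As a minimal sketch of the advice above, a job can build its output directory under an attached volume rather than on node-local disk. `STORAGE_ROOT` and its default `/mnt/shared` are hypothetical placeholders (use the mount path of your own volume), and `run_dir` is an illustrative helper, not part of the platform. `SLURM_JOB_ID` is the job ID that Slurm sets inside a running job.

```bash
# run_dir: create and print a per-job directory on attached storage.
# STORAGE_ROOT is an assumed mount point; override it for your volume.
run_dir() {
  local root="${STORAGE_ROOT:-/mnt/shared}"
  # Fall back to "local" when the script runs outside a Slurm job.
  local dir="${root}/runs/${SLURM_JOB_ID:-local}"
  mkdir -p "${dir}/checkpoints" "${dir}/logs" || return 1
  printf '%s\n' "${dir}"
}
```

In a batch script, this could be used as `CKPT_DIR="$(run_dir)/checkpoints"` before launching training with `srun`, so that each job writes to its own directory on persistent storage.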

| Storage Type | Use Case |
|-------------|----------|
| [**Parallel File System (PFS)**](/docs/tir/pfs/) | High-throughput I/O for large datasets and checkpoints |
| [**Shared File System (SFS)**](/docs/tir/sfs/) | POSIX-compatible shared storage across nodes |
| [**Datasets**](/docs/tir/Datasets/) | Managed versioned datasets |

Manage attached volumes from the **Volumes** tab on the cluster detail page.

---