---
title: Running Training Jobs
---

Training jobs run on your cluster through Slurm: connect over SSH, then submit the job. Each job runs in the container image selected at cluster creation, or in a custom image launched with Enroot.

import { DeploymentCards } from '@site/docs/tir/TrainingCluster/deployments/deployments'

## Supported Frameworks

Choose a framework guide based on your workload and training approach.

<DeploymentCards/>

| Framework | Best For |
|-----------|----------|
| **PyTorch Distributed** | Multi-GPU DDP training across nodes |
| **PyTorch Lightning** | High-level distributed training abstraction |
| **Slurm** | Batch job scheduling for multi-node workloads |
| **OpenMPI** | MPI-based multi-node distributed training |

---

## Submitting Jobs

Connect to your cluster via SSH, then submit jobs using standard Slurm commands:

```bash
# Submit a batch job
sbatch my_training_job.sh

# Run an interactive job
srun --gpus=1 python train.py

# Check job status
squeue
```
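For reference, a batch script such as `my_training_job.sh` might look like the one generated below. This is a minimal sketch, not a template shipped with the cluster: the job name, resource values, time limit, and the `train.py` entry point are all assumptions to replace with your own. The snippet writes the script to the current directory and submits it only when `sbatch` is available.

```bash
#!/usr/bin/env bash
# Write a minimal Slurm batch script. Every #SBATCH value below is illustrative.
cat > my_training_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=train-demo
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --time=01:00:00
#SBATCH --output=train-%j.out

# %j in the output file name expands to the Slurm job ID.
# train.py is a placeholder for your own training entry point.
srun python train.py
EOF

echo "Wrote my_training_job.sh"

# Submit only where Slurm is installed (for example, in an SSH session on the cluster).
if command -v sbatch >/dev/null 2>&1; then
  sbatch my_training_job.sh
else
  echo "sbatch not found; skipping submission"
fi
```

With these settings, the job's standard output lands in `train-<jobid>.out` in the directory the job was submitted from.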

The **Jobs** tab on the cluster dashboard shows all running, pending, completed, and failed jobs in real time, so you can monitor job status without an SSH session.

---

## Monitoring Jobs

| Where | What you see |
|-------|-------------|
| **Jobs tab** | Full `squeue` output: job ID, name, user, partition, nodes, GPUs, run time, state, priority |
| **Monitoring tab** | Per-node GPU utilisation % over selectable time windows |
| **Nodes tab** | Node health, DCGM metrics, XID error detection |
| **Logs tab** | Slurm controller and node logs |
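If you are already in an SSH session, a rough per-state job count can also be pulled from `squeue` directly. The sketch below is one possible approach, not a dashboard feature; `summarize_states` is a hypothetical helper name, and the snippet assumes `squeue` is on the `PATH`.

```bash
#!/usr/bin/env bash
# Count jobs per state from a stream that has one Slurm state name per line.
summarize_states() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

if command -v squeue >/dev/null 2>&1; then
  # -h drops the header row; -o "%T" prints each job's state in long form.
  squeue -h -o "%T" | summarize_states
else
  echo "squeue not found; run this from an SSH session on the cluster" >&2
fi
```

Each output line is a state name followed by the number of jobs currently in that state.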

---

## Storage for Training Data

All cluster nodes have access to attached storage volumes. Save training data, checkpoints, and logs to these volumes rather than to a node's local filesystem, which does not persist across restarts.

| Storage Type | Use Case |
|-------------|----------|
| [**Parallel File System (PFS)**](/docs/tir/pfs/) | High-throughput I/O for large datasets and checkpoints |
| [**Shared File System (SFS)**](/docs/tir/sfs/) | POSIX-compatible shared storage across nodes |
| [**Datasets**](/docs/tir/Datasets/) | Managed versioned datasets |

Manage attached volumes from the **Volumes** tab on the cluster detail page.
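One way to keep job output on an attached volume is to create a per-job directory there before training starts. The following is a minimal sketch under stated assumptions: `STORAGE_ROOT` and `make_run_dir` are hypothetical names, and the actual mount path of your volume is not specified here, so set `STORAGE_ROOT` to it yourself. `SLURM_JOB_ID` is the variable Slurm sets inside a running job.

```bash
#!/usr/bin/env bash
# make_run_dir ROOT: create ROOT/runs/<job id>/{checkpoints,logs} and print the run directory.
make_run_dir() {
  local root="$1"
  # SLURM_JOB_ID is set by Slurm inside a job; "manual" is a fallback for runs outside Slurm.
  local run_dir="${root}/runs/${SLURM_JOB_ID:-manual}"
  mkdir -p "${run_dir}/checkpoints" "${run_dir}/logs" || return 1
  printf '%s\n' "${run_dir}"
}

# STORAGE_ROOT is hypothetical: point it at the mount path of an attached volume.
# The fallback under the current directory exists only so the sketch can be dry-run anywhere.
RUN_DIR="$(make_run_dir "${STORAGE_ROOT:-$PWD/storage-demo}")"
echo "Checkpoints: ${RUN_DIR}/checkpoints"
echo "Logs:        ${RUN_DIR}/logs"
```

Placed near the top of a batch script, this gives each job its own checkpoint and log directories keyed by job ID, which can then be passed to the training command.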


---