---
title: Slurm
---

**Slurm (Simple Linux Utility for Resource Management)** is a job scheduler designed for cluster environments. On TIR, a Slurm deployment provisions workers that form a Slurm cluster — you submit batch jobs using standard `sbatch` commands, and Slurm handles scheduling and resource allocation across your nodes.

---

## Environment

### Cluster Setup

* All nodes in the deployment share the same TIR-provided container image with CUDA, NCCL, and PyTorch pre-installed.
* The Slurm controller runs on the **Master Worker**. Submit jobs from there.

### Connect to the Master Worker

```bash
ssh $hostname   # replace $hostname with the Master Worker's hostname
```

All other workers are accessible from the master via SSH using their worker hostnames.

### Shared Storage

Datasets, scripts, logs, and checkpoints must be placed in shared storage so all nodes can access them:

```bash
/mnt/shared
```

:::info
At least one storage volume (SFS, PFS, or Dataset) must be attached when creating a Slurm deployment.
:::
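
The job scripts on this page write their logs to `/mnt/shared/logs`, and Slurm does not create a missing `--output` or `--error` directory for you, so it can help to set up the directory layout once before submitting anything. A minimal sketch: the `prepare_shared_dirs` helper name and the `checkpoints` directory are illustrative assumptions, not TIR conventions.

```bash
# Hypothetical helper: create working directories under the shared mount.
# The root defaults to /mnt/shared; pass another path to try it elsewhere.
prepare_shared_dirs() {
  local root="${1:-/mnt/shared}"
  mkdir -p "$root/logs" "$root/checkpoints"
}

# Usage on the Master Worker:
#   prepare_shared_dirs
```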

---

## Slurm Basics

| Concept | Description |
|---------|-------------|
| **Node** | A compute instance in the cluster |
| **Partition** | A named group of nodes used for scheduling |
| **Job** | A task submitted to the scheduler for execution |
| **Task** | A single process within a job |

### Essential Commands

```bash
sinfo                          # View node and partition status
squeue                         # List all queued and running jobs
squeue -u $USER                # List your jobs only
sbatch job_script.sh           # Submit a batch job
scancel JOB_ID                 # Cancel a job
scontrol show job JOB_ID       # Detailed job information
```
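
When many jobs are queued, a per-state count can be easier to read than the raw list. A minimal sketch: `squeue -h -o "%T"` prints one state name per job with no header line, and the hypothetical `count_states` helper tallies them.

```bash
# Hypothetical helper: read one job state per line on stdin and print
# "<count> <state>" pairs, most frequent first.
count_states() {
  sort | uniq -c | sort -rn | awk '{print $1, $2}'
}

# Usage:
#   squeue -u "$USER" -h -o "%T" | count_states
```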

---

## Single-Node Job Script

For training on a single node with 8 GPUs, create `train_job.sh`:

```bash
#!/bin/bash
#SBATCH --job-name=single-node-training
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8
#SBATCH --partition=all
#SBATCH --time=02:00:00
#SBATCH --output=/mnt/shared/logs/%x_%j.out
#SBATCH --error=/mnt/shared/logs/%x_%j.err

export NCCL_DEBUG=INFO
export MASTER_ADDR=localhost
export MASTER_PORT=12345

srun torchrun --nproc_per_node=8 /mnt/shared/train.py
```

Submit it:

```bash
sbatch train_job.sh
```

### Key Directives

| Directive | Description |
|-----------|-------------|
| `--nodes` | Number of nodes to allocate |
| `--ntasks` | Total number of tasks (processes) |
| `--cpus-per-task` | CPU cores per task |
| `--gres=gpu:N` | Number of GPUs per node |
| `--partition` | Partition name (use `all` for default) |
| `--time` | Max job runtime in `hh:mm:ss` |
| `--output` / `--error` | Paths for stdout and stderr logs |
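
Options passed on the `sbatch` command line take precedence over the matching `#SBATCH` directives in the script, which is handy for a short trial run without editing the file. A minimal sketch: the `smoke_test` helper name and the 10-minute limit are illustrative choices.

```bash
# Hypothetical helper: submit a job script with a short time limit and a
# recognizable job name, overriding the script's own #SBATCH values.
smoke_test() {
  local script="$1"
  sbatch --time=00:10:00 --job-name="smoke-$(basename "$script" .sh)" "$script"
}

# Usage:
#   smoke_test train_job.sh
```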

---

## Multi-Node Job Script

For distributed training across multiple nodes, create `multinode_job.sh`:

```bash
#!/bin/bash
#SBATCH --job-name=multi-node-training
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8
#SBATCH --partition=all
#SBATCH --time=04:00:00
#SBATCH --output=/mnt/shared/logs/%x_%j.out
#SBATCH --error=/mnt/shared/logs/%x_%j.err

export NCCL_DEBUG=INFO
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=12345

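# Wrap torchrun in `bash -c '...'` so $SLURM_NODEID is expanded inside each
# srun task (once per node), not once in the batch shell, where it is always 0.
srun bash -c 'torchrun \
  --nproc_per_node=8 \
  --nnodes="$SLURM_JOB_NUM_NODES" \
  --node_rank="$SLURM_NODEID" \
  --master_addr="$MASTER_ADDR" \
  --master_port="$MASTER_PORT" \
  /mnt/shared/train.py'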
```
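
With a static launch like the one above, each node starts 8 worker processes, and every worker's global rank follows from its node rank and local rank. The arithmetic can be sketched as follows. The function names are illustrative, and the formula assumes an explicit `--node_rank` and the same `--nproc_per_node` on every node.

```bash
# World size = number of nodes x processes per node.
world_size() {
  local nnodes="$1" nproc_per_node="$2"
  echo $(( nnodes * nproc_per_node ))
}

# Global rank of a worker = node_rank * nproc_per_node + local_rank.
global_rank() {
  local node_rank="$1" nproc_per_node="$2" local_rank="$3"
  echo $(( node_rank * nproc_per_node + local_rank ))
}

# For the 2-node, 8-GPU-per-node job above:
world_size 2 8      # prints 16
global_rank 1 8 3   # prints 11
```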

---

## Advanced Slurm Features

### Job Arrays

Run the same job script multiple times with different parameters — useful for hyperparameter sweeps:

```bash
#SBATCH --array=1-10
```

Access the current task index in your script:

```bash
echo "Running task $SLURM_ARRAY_TASK_ID"
```
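
One common way to turn the task index into a hyperparameter is to keep one value per line in a file on shared storage and read the line that matches `SLURM_ARRAY_TASK_ID`. A minimal sketch: the `param_for_task` helper, the `/mnt/shared/params.txt` file, and the `--lr` flag of `train.py` are assumptions for illustration.

```bash
# Hypothetical helper: print line N (1-based) of a parameter file.
param_for_task() {
  local file="$1" task_id="$2"
  sed -n "${task_id}p" "$file"
}

# Usage inside an array job script (with #SBATCH --array=1-10 and a
# 10-line parameter file):
#   LR=$(param_for_task /mnt/shared/params.txt "$SLURM_ARRAY_TASK_ID")
#   srun python /mnt/shared/train.py --lr "$LR"
```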

### Job Dependencies

Run jobs in sequence — the second job starts only after the first completes successfully:

```bash
JOB1=$(sbatch --parsable preprocess.sh)   # --parsable prints only the job ID
sbatch --dependency=afterok:"$JOB1" train_job.sh
```
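
The same pattern extends to longer pipelines. A minimal sketch of a hypothetical `submit_chain` helper that submits scripts in order, each one depending on the previous job through `afterok`. It relies on `sbatch --parsable`, which prints just the job ID on a single-cluster setup.

```bash
# Hypothetical helper: submit each script so that it starts only after the
# previous one finished successfully. Prints the job IDs in submission order.
submit_chain() {
  local prev="" script job_id
  for script in "$@"; do
    if [ -z "$prev" ]; then
      job_id=$(sbatch --parsable "$script") || return 1
    else
      job_id=$(sbatch --parsable --dependency="afterok:$prev" "$script") || return 1
    fi
    echo "$job_id"
    prev="$job_id"
  done
}

# Usage (evaluate.sh is a placeholder for a third stage):
#   submit_chain preprocess.sh train_job.sh evaluate.sh
```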

---

## Monitoring Jobs

### View Logs

Stdout and stderr are written to the paths defined in `--output` and `--error`:

```bash
tail -f /mnt/shared/logs/single-node-training_JOB_ID.out
```
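
If you do not have the job ID at hand, following the most recently modified log is often enough. A minimal sketch: `latest_log` is a hypothetical helper and assumes log file names without spaces or newlines.

```bash
# Hypothetical helper: print the newest *.out file in a log directory.
latest_log() {
  local dir="${1:-/mnt/shared/logs}"
  ls -t "$dir"/*.out 2>/dev/null | head -n 1
}

# Usage:
#   tail -f "$(latest_log)"
```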

### Check GPU Utilization

```bash
# --overlap (Slurm 20.11+) lets this step share resources the job already holds
srun --jobid=JOB_ID --overlap nvidia-smi
```

Or connect to a worker and run:

```bash
watch -n 2 nvidia-smi
```

---

## Troubleshooting

| Issue | Cause | Resolution |
|-------|-------|------------|
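| **Job stuck in queue (PD state)** | Insufficient resources or unsatisfied dependency | Run `squeue -j JOB_ID` and read the pending reason in the `NODELIST(REASON)` column; check `sinfo` for node availability; verify the dependency job's status |
| **Job fails immediately** | Script error, missing file, or a nonexistent `--output`/`--error` directory | Check the `.err` log; run `scontrol show job JOB_ID` for details; if no log was written, confirm the log directory exists |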
| **NCCL timeout on multi-node** | Network or firewall issue between nodes | Ensure workers can reach each other; set `NCCL_DEBUG=INFO` |
| **Disk full** | Logs or checkpoints filling `/mnt/shared` | Delete old output files or increase storage quota |

---

## Resources

* [Official Slurm Documentation](https://slurm.schedmd.com/documentation.html)
* [Slurm `sbatch` Reference](https://slurm.schedmd.com/sbatch.html)


---