
Slurm

Slurm (Simple Linux Utility for Resource Management) is a job scheduler designed for cluster environments. On TIR, a Slurm deployment provisions workers that form a Slurm cluster — you submit batch jobs using standard sbatch commands, and Slurm handles scheduling and resource allocation across your nodes.


Environment

Cluster Setup

  • All nodes in the deployment share the same TIR-provided container image with CUDA, NCCL, and PyTorch pre-installed.
  • The Slurm controller runs on the Master Worker. Submit jobs from there.

Connect to the Master Worker

ssh $hostname    # $hostname = your Master Worker's hostname

All other workers are accessible from the master via SSH using their worker hostnames.
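
To list those hostnames, you can query Slurm from the master (the node names Slurm registers, which should match the worker hostnames):

sinfo -N -h -o "%N"            # one node name per line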

Shared Storage

Datasets, scripts, logs, and checkpoints must be placed in shared storage so all nodes can access them:

/mnt/shared
Note: At least one storage volume (SFS, PFS, or Dataset) must be attached when creating a Slurm deployment.
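
The job scripts below write logs to /mnt/shared/logs, and Slurm does not create missing output directories, so create the layout once up front. The directory names beyond logs are only a suggestion:

mkdir -p /mnt/shared/logs /mnt/shared/checkpoints /mnt/shared/datasets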


Slurm Basics

Concept      Description
Node         A compute instance in the cluster
Partition    A named group of nodes used for scheduling
Job          A task submitted to the scheduler for execution
Task         A single process within a job

Essential Commands

sinfo                          # View node and partition status
squeue                         # List all queued and running jobs
squeue -u $USER                # List your jobs only
sbatch job_script.sh           # Submit a batch job
scancel JOB_ID                 # Cancel a job
scontrol show job JOB_ID       # Detailed job information

Single-Node Job Script

For training on a single node with 8 GPUs, create train_job.sh:

#!/bin/bash
#SBATCH --job-name=single-node-training
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8
#SBATCH --partition=all
#SBATCH --time=02:00:00
#SBATCH --output=/mnt/shared/logs/%x_%j.out
#SBATCH --error=/mnt/shared/logs/%x_%j.err

export NCCL_DEBUG=INFO
export MASTER_ADDR=localhost
export MASTER_PORT=12345

srun torchrun --nproc_per_node=8 /mnt/shared/train.py

Submit it:

sbatch train_job.sh
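
sbatch prints the ID of the new job (Submitted batch job <id>); that ID is what squeue, scancel, and scontrol show job expect.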

Key Directives

Directive             Description
--nodes               Number of nodes to allocate
--ntasks              Total number of tasks (processes)
--cpus-per-task       CPU cores per task
--gres=gpu:N          Number of GPUs per node
--partition           Partition name (use all for default)
--time                Max job runtime in hh:mm:ss
--output / --error    Paths for stdout and stderr logs

Multi-Node Job Script

For distributed training across multiple nodes, create multinode_job.sh:

#!/bin/bash
#SBATCH --job-name=multi-node-training
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8
#SBATCH --partition=all
#SBATCH --time=04:00:00
#SBATCH --output=/mnt/shared/logs/%x_%j.out
#SBATCH --error=/mnt/shared/logs/%x_%j.err

export NCCL_DEBUG=INFO
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=12345

# SLURM_NODEID differs per node, so it must be expanded inside each srun
# task's shell. Expanding it in the batch script would yield 0 on every node,
# since the batch script runs only on the first node.
srun bash -c 'torchrun \
    --nproc_per_node=8 \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    /mnt/shared/train.py'
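
Submit it the same way:

sbatch multinode_job.sh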

Advanced Slurm Features

Job Arrays

Run the same job script multiple times with different parameters — useful for hyperparameter sweeps:

#SBATCH --array=1-10

Access the current task index in your script:

echo "Running task $SLURM_ARRAY_TASK_ID"

Job Dependencies

Run jobs in sequence — the second job starts only after the first completes successfully:

JOB1=$(sbatch preprocess.sh | awk '{print $4}')
sbatch --dependency=afterok:$JOB1 train_job.sh
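
sbatch also has a --parsable flag that prints just the job ID, which avoids parsing the output with awk:

JOB1=$(sbatch --parsable preprocess.sh)
sbatch --dependency=afterok:"$JOB1" train_job.sh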

Monitoring Jobs

View Logs

Stdout and stderr are written to the paths defined in --output and --error:

tail -f /mnt/shared/logs/single-node-training_JOBID.out

Check GPU Utilization

srun --jobid=JOB_ID nvidia-smi
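
On Slurm 20.11 and newer, the job's resources may already be fully allocated to the running step, in which case the extra step needs --overlap:

srun --overlap --jobid=JOB_ID nvidia-smi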

Or connect to a worker and run:

watch -n 2 nvidia-smi

Troubleshooting

Job stuck in queue (PD state)
    Cause: insufficient resources or an unsatisfied dependency.
    Resolution: check sinfo for node availability; verify the status of the dependency job.

Job fails immediately
    Cause: script error or missing file.
    Resolution: check the .err log; run scontrol show job JOB_ID for details.

NCCL timeout on multi-node
    Cause: network or firewall issue between nodes.
    Resolution: ensure workers can reach each other; set NCCL_DEBUG=INFO.

Disk full
    Cause: logs or checkpoints filling /mnt/shared.
    Resolution: delete old output files or increase the storage quota; see the usage check below.
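
For the disk-full case, a quick check of usage (the checkpoint path follows the layout suggested earlier):

df -h /mnt/shared                                  # volume-level usage
du -sh /mnt/shared/logs /mnt/shared/checkpoints    # per-directory totals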
