Slurm
Slurm (Simple Linux Utility for Resource Management) is a job scheduler designed for cluster environments. On TIR, a Slurm deployment provisions workers that form a Slurm cluster — you submit batch jobs using standard sbatch commands, and Slurm handles scheduling and resource allocation across your nodes.
Environment
Cluster Setup
- All nodes in the deployment share the same TIR-provided container image with CUDA, NCCL, and PyTorch pre-installed.
- The Slurm controller runs on the Master Worker. Submit jobs from there.
Connect to the Master Worker
ssh <master-worker-hostname>
All other workers are accessible from the master via SSH using their worker hostnames.
Shared Storage
Datasets, scripts, logs, and checkpoints must be placed in shared storage so all nodes can access them:
/mnt/shared
At least one storage volume (SFS, PFS, or Dataset) must be attached when creating a Slurm deployment.
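Slurm does not create missing output directories, so create the shared paths used by the job scripts below before submitting; this is a minimal setup, and the subdirectory names are only a suggestion:
mkdir -p /mnt/shared/logs /mnt/shared/checkpoints /mnt/shared/datasets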
Slurm Basics
| Concept | Description |
|---|---|
| Node | A compute instance in the cluster |
| Partition | A named group of nodes used for scheduling |
| Job | A task submitted to the scheduler for execution |
| Task | A single process within a job |
Essential Commands
sinfo # View node and partition status
squeue # List all queued and running jobs
squeue -u $USER # List your jobs only
sbatch job_script.sh # Submit a batch job
scancel JOB_ID # Cancel a job
scontrol show job JOB_ID # Detailed job information
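Before writing a full batch script, you can sanity-check that GPUs are schedulable with a one-off interactive job step (assuming the default all partition and at least one free GPU):
srun --partition=all --nodes=1 --gres=gpu:1 nvidia-smi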
Single-Node Job Script
For training on a single node with 8 GPUs, create train_job.sh:
#!/bin/bash
#SBATCH --job-name=single-node-training
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8
#SBATCH --partition=all
#SBATCH --time=02:00:00
#SBATCH --output=/mnt/shared/logs/%x_%j.out
#SBATCH --error=/mnt/shared/logs/%x_%j.err
export NCCL_DEBUG=INFO
export MASTER_ADDR=localhost
export MASTER_PORT=12345
srun torchrun --nproc_per_node=8 /mnt/shared/train.py
Submit it:
sbatch train_job.sh
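A typical follow-up uses the commands above; JOBID is whatever ID sbatch printed:
squeue -u $USER    # wait for the job state to change from PD (pending) to R (running)
tail -f /mnt/shared/logs/single-node-training_JOBID.out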
Key Directives
| Directive | Description |
|---|---|
| --nodes | Number of nodes to allocate |
| --ntasks | Total number of tasks (processes) |
| --cpus-per-task | CPU cores per task |
| --gres=gpu:N | Number of GPUs per node |
| --partition | Partition name (use all for the default partition) |
| --time | Maximum job runtime in hh:mm:ss |
| --output / --error | Paths for stdout and stderr logs |
Multi-Node Job Script
For distributed training across multiple nodes, create multinode_job.sh:
#!/bin/bash
#SBATCH --job-name=multi-node-training
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8
#SBATCH --partition=all
#SBATCH --time=04:00:00
#SBATCH --output=/mnt/shared/logs/%x_%j.out
#SBATCH --error=/mnt/shared/logs/%x_%j.err
export NCCL_DEBUG=INFO
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=12345
# Single-quote the command so SLURM_NODEID expands on each node, not once on the batch node.
srun bash -c 'torchrun \
    --nproc_per_node=8 \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    /mnt/shared/train.py'
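Submit it the same way and confirm that both nodes were allocated (JOB_ID is the ID printed by sbatch):
sbatch multinode_job.sh
squeue -u $USER                            # the NODES column should show 2
scontrol show job JOB_ID | grep NodeList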
Advanced Slurm Features
Job Arrays
Run the same job script multiple times with different parameters — useful for hyperparameter sweeps:
#SBATCH --array=1-10
Access the current task index in your script:
echo "Running task $SLURM_ARRAY_TASK_ID"
Job Dependencies
Run jobs in sequence — the second job starts only after the first completes successfully:
JOB1=$(sbatch preprocess.sh | awk '{print $4}')
sbatch --dependency=afterok:$JOB1 train_job.sh
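The same pattern extends to longer pipelines; sbatch --parsable prints only the job ID, which avoids the awk step. Here evaluate.sh is a hypothetical third stage:
PREP=$(sbatch --parsable preprocess.sh)
TRAIN=$(sbatch --parsable --dependency=afterok:$PREP train_job.sh)
sbatch --parsable --dependency=afterok:$TRAIN evaluate.sh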
Monitoring Jobs
View Logs
Stdout and stderr are written to the paths defined in --output and --error:
tail -f /mnt/shared/logs/single-node-training_JOBID.out
Check GPU Utilization
srun --jobid=JOB_ID nvidia-smi
Or connect to a worker and run:
watch -n 2 nvidia-smi
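If job accounting is enabled on the deployment, sacct gives a post-hoc view of job state and runtime:
sacct -j JOB_ID --format=JobID,JobName,State,Elapsed,ExitCode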
Troubleshooting
| Issue | Cause | Resolution |
|---|---|---|
| Job stuck in queue (PD state) | Insufficient resources or unsatisfied dependency | Check sinfo for node availability; verify dependency job status |
| Job fails immediately | Script error or missing file | Check the .err log; run scontrol show job JOB_ID for details |
| NCCL timeout on multi-node | Network or firewall issue between nodes | Ensure workers can reach each other; set NCCL_DEBUG=INFO |
| Disk full | Logs or checkpoints filling /mnt/shared | Delete old output files or increase storage quota |
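For the NCCL timeout case, a quick way to confirm that two nodes can be allocated and reached at all is a trivial two-node job step; if that works but NCCL still times out, pinning the communication interface with NCCL_SOCKET_IFNAME (the interface name depends on your deployment) is a common next step:
srun --partition=all --nodes=2 --ntasks-per-node=1 hostname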