Slurm

Welcome to the SLURM guide for managing and running distributed training jobs on your cluster. This guide assumes a 2-node setup with 8 GPUs per node.


Slurm Configuration

1. SLURM Basics

SLURM is a job scheduler for clusters. Key components (see the example after the list):

  • Nodes: Machines in the cluster.
  • Partitions: Groups of nodes.
  • Jobs: Tasks submitted to the scheduler.
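
To see how these components look on a real cluster, the sketch below lists partitions and nodes with sinfo and inspects one node with scontrol; the node name gpu-node-01 is a placeholder for whatever your cluster calls its machines.

# Show partitions and the nodes assigned to them
sinfo

# Node-oriented, long listing: one line per node with CPUs, memory, and state
sinfo -N -l

# Inspect a single node in detail (gpu-node-01 is a placeholder node name)
scontrol show node gpu-node-01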

2. SLURM Commands

Here are some essential commands; a short end-to-end example follows the list:

  • Check node status:
sinfo
  • Submit a job:
sbatch job_script.sh
  • Monitor jobs:
squeue
  • Cancel a job:
scancel JOB_ID
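
Putting these together, a typical session looks roughly like the sketch below; job_script.sh and the job ID 12345 are placeholders.

# Submit a batch script; sbatch prints the assigned job ID
sbatch job_script.sh

# Watch only your own jobs in the queue
squeue -u $USER

# Inspect a specific job's state, reason, and allocated resources
scontrol show job 12345

# Cancel the job if needed
scancel 12345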

Setting Up a Job Script

Example: Distributed Training Job Script

Create a file named distributed_job.sh:

#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8
#SBATCH --partition=all
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
# Load modules
module load cuda/11.8
module load pytorch/2.0

# Environment variables
export NCCL_DEBUG=INFO
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=12345

# Run the job: srun starts one torchrun launcher per node,
# and each launcher spawns 8 training processes (one per GPU)
srun torchrun --nproc_per_node=8 --nnodes=2 --node_rank=$SLURM_PROCID \
    --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT train.py
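
The MASTER_ADDR line works because scontrol show hostnames expands Slurm's compressed node-list syntax into one hostname per line; you can try it by hand with a made-up node range:

# Expands e.g. "node[01-02]" into "node01" and "node02", one per line;
# head -n 1 then picks the first node as the rendezvous master
scontrol show hostnames "node[01-02]"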

Key Directives Explained

  • --nodes: Number of nodes for the job.
  • --ntasks: Total number of tasks (1 per node for PyTorch distributed).
  • --gres=gpu:8: Request 8 GPUs per node.
  • --partition: Partition name (e.g., all).
  • --time: Maximum runtime (hh:mm:ss).
  • --output and --error: Log output and error files.
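
The same 2-node, 8-GPU-per-node request can also be written per node rather than in totals. This is an equivalent sketch, assuming your Slurm version is new enough to support --gpus-per-node:

#SBATCH --nodes=2                # same two nodes
#SBATCH --ntasks-per-node=1      # one launcher task per node (replaces --ntasks=2)
#SBATCH --gpus-per-node=8        # per-node GPU request (alternative to --gres=gpu:8)
#SBATCH --cpus-per-task=8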

Advanced SLURM Features

1. Job Array

For batch processing, use a job array:

#SBATCH --array=1-10

This creates 10 jobs with IDs 1 through 10. Access the array task ID with $SLURM_ARRAY_TASK_ID.
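
As a concrete sketch, the array script below processes one input file per array task; the inputs/ directory and process.py are hypothetical names.

#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --array=1-10
#SBATCH --output=%x_%A_%a.out

# Each array task gets its own SLURM_ARRAY_TASK_ID (1..10)
# and uses it here to select one input file (hypothetical paths)
srun python process.py --input inputs/sample_${SLURM_ARRAY_TASK_ID}.txt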

2. Job Dependencies

Submit jobs that depend on others:

sbatch --dependency=afterok:JOB_ID job_script.sh
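
To chain jobs without copying IDs by hand, sbatch --parsable prints just the job ID, which you can capture in a shell variable; preprocess.sh and train_job.sh are placeholder script names.

# Capture the first job's ID, then make the second job wait for it to finish successfully
JOB1=$(sbatch --parsable preprocess.sh)
sbatch --dependency=afterok:$JOB1 train_job.sh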

Submitting the Job

Submit the job script to SLURM:

sbatch distributed_job.sh

Monitoring the Job

Check Job Status

View your job in the queue:

squeue -u $USER

View Logs

After the job starts, logs appear in the file named by --output, where %x expands to the job name and %j to the job ID:

less distributed-training_<JOBID>.out
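
While the job is running, you can follow the log as it is written and keep a compact view of the job's state; the format string below is just one reasonable squeue layout.

# Follow the output log as it grows
tail -f distributed-training_<JOBID>.out

# Compact queue view: job ID, partition, name, state, elapsed time, node count, reason/nodelist
squeue -u $USER -o "%.10i %.9P %.25j %.8T %.10M %.6D %R"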

Troubleshooting

1. Common Errors

  • Insufficient Resources: Ensure enough GPUs are available in the partition.
  • Job Stuck in Queue: Check partition availability using sinfo.
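
Both situations can be checked quickly from the command line; here, all is the partition name from the job script and JOBID is a placeholder.

# Are there idle nodes in the partition, or is everything allocated/down?
sinfo -p all

# Why is the job pending? The last column shows Slurm's reason (e.g. Resources, Priority)
squeue -j JOBID -o "%.10i %.8T %R"

# Slurm's estimated start time for the pending job
squeue -j JOBID --start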

2. Debugging Tips

Inspect the full job record (state, pending reason, requested and allocated resources):

scontrol show job JOBID

Check for errors in .err files.
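
If job accounting is enabled on your cluster, sacct can also summarize finished jobs, including their exit code; the field list below is one common selection.

# State, exit code, runtime, and peak memory of a completed (or failed) job
sacct -j JOBID --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS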


Happy SLURMing! 🚀