# Slurm
Welcome to the SLURM guide for managing and running distributed training jobs on your cluster. This guide assumes a 2-node setup with 8 GPUs per node.
## Slurm Configuration
### 1. SLURM Basics
SLURM is a job scheduler for clusters. Key components:
- Nodes: Machines in the cluster.
- Partitions: Groups of nodes.
- Jobs: Tasks submitted to the scheduler.
### 2. SLURM Commands
Here are some essential commands:

- Check node status: `sinfo`
- Submit a job: `sbatch job_script.sh`
- Monitor jobs: `squeue`
- Cancel a job: `scancel JOB_ID`
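For context, on the 2-node cluster assumed in this guide, `sinfo` might print something like the following (the partition and node names here are hypothetical):

```bash
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2   idle node[01-02]
```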
## Setting Up a Job Script
### Example: Distributed Training Job Script
Create a file named `distributed_job.sh`:
```bash
#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8
#SBATCH --partition=all
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

# Load modules
module load cuda/11.8
module load pytorch/2.0

# Environment variables
export NCCL_DEBUG=INFO
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_NODELIST" | head -n 1)
export MASTER_PORT=12345

# Run the job: srun launches one torchrun per node, each spawning 8 workers
srun torchrun \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=$SLURM_PROCID \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train.py
```
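Before launching the full training run, it can be worth a quick sanity check that both nodes are reachable and that the rendezvous address resolves as expected. Below is a minimal sketch, assuming the same 2-node `all` partition; the job name is arbitrary:

```bash
#!/bin/bash
#SBATCH --job-name=preflight-check
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --partition=all
#SBATCH --time=00:05:00

# Print the hostname of every allocated node (one line per task)
srun hostname

# Show which node the training script would pick as MASTER_ADDR
echo "MASTER_ADDR would be: $(scontrol show hostnames "$SLURM_NODELIST" | head -n 1)"
```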
### Key Directives Explained
- `--nodes`: Number of nodes for the job.
- `--ntasks`: Total number of tasks (one per node for PyTorch distributed).
- `--gres=gpu:8`: Request 8 GPUs per node.
- `--partition`: Partition name (e.g., `all`).
- `--time`: Maximum runtime (hh:mm:ss).
- `--output` and `--error`: Output and error log files.
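To illustrate how these directives scale down, here is a sketch of the same header reduced to a single-node, single-GPU debug run; the time limit is an assumption, not a value from this guide:

```bash
#!/bin/bash
#SBATCH --job-name=debug-run
#SBATCH --nodes=1               # a single machine
#SBATCH --ntasks=1              # one launcher task
#SBATCH --cpus-per-task=8       # CPU cores for data loading
#SBATCH --gres=gpu:1            # one GPU is enough for debugging
#SBATCH --partition=all
#SBATCH --time=00:30:00         # short wall-clock limit for quick iteration
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

srun torchrun --nproc_per_node=1 --nnodes=1 train.py
```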
## Advanced SLURM Features
### 1. Job Arrays
For batch processing, use a job array:

```bash
#SBATCH --array=1-10
```

This creates 10 array tasks with indices 1 through 10. Access the task index with `$SLURM_ARRAY_TASK_ID`.
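As a sketch of how the task index is typically consumed, the array job below processes one data shard per task; `preprocess.py` and the shard layout are hypothetical:

```bash
#!/bin/bash
#SBATCH --job-name=preprocess-shards
#SBATCH --array=1-10
#SBATCH --cpus-per-task=8
#SBATCH --partition=all
#SBATCH --time=01:00:00
#SBATCH --output=%x_%A_%a.out   # %A = array job ID, %a = array task index

# Each array task handles exactly one shard, selected by its task index.
srun python preprocess.py --shard "data/shard_${SLURM_ARRAY_TASK_ID}.jsonl"
```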
### 2. Job Dependencies
Submit jobs that depend on others:
```bash
sbatch --dependency=afterok:JOB_ID job_script.sh
```
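A common pattern is to capture the first job's ID with `sbatch --parsable` (which prints only the job ID) and chain the next stage onto it. The script names below are placeholders:

```bash
# Submit preprocessing first and record its job ID
PREP_ID=$(sbatch --parsable preprocess_job.sh)

# Training starts only if preprocessing exits successfully (afterok)
sbatch --dependency=afterok:${PREP_ID} distributed_job.sh
```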
## Submitting the Job
Submit the job script to SLURM:
```bash
sbatch distributed_job.sh
```
## Monitoring the Job
### Check Job Status
View your job in the queue:
```bash
squeue -u $USER
```
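To keep an eye on the queue without re-running the command, you can wrap it in `watch`; the custom column layout below is just one possible format string:

```bash
# Refresh the queue view every 30 seconds
watch -n 30 "squeue -u $USER"

# Job ID, name, state, elapsed time, and nodes/reason in fixed-width columns
squeue -u $USER -o "%.10i %.25j %.8T %.10M %R"
```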
### View Logs
After the job starts, logs will appear in the specified output file:
```bash
less distributed-training_JOBID.out
```
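To follow the log live while the job is running (substitute your actual job ID):

```bash
# Stream new lines as they are written; press Ctrl+C to stop
tail -f distributed-training_JOBID.out
```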
## Troubleshooting
### 1. Common Errors
- Insufficient Resources: Ensure enough GPUs are available in the partition.
- Job Stuck in Queue: Check partition availability using `sinfo` (see the snippet below for surfacing the pending reason).
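If a job stays pending, SLURM records the reason, and a quick way to surface it is shown below (`JOBID` is a placeholder):

```bash
# List only your pending jobs
squeue -u $USER -t PENDING

# Show the scheduler's stated reason for a specific job (e.g., Resources, Priority)
scontrol show job JOBID | grep -i reason
```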
### 2. Debugging Tips
Inspect the full details of a job as SLURM sees them:

```bash
scontrol show job JOBID
```

Check for errors in the `.err` files.
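Because the job script sets `NCCL_DEBUG=INFO`, communication problems usually leave traces in the error log; a simple filter (the file name follows the `%x_%j` pattern used above):

```bash
# Scan the error log for NCCL- and CUDA-related messages
grep -iE "nccl|cuda|error" distributed-training_JOBID.err
```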
## Additional Resources

- [Official SLURM documentation](https://slurm.schedmd.com/documentation.html)
- [sbatch man page](https://slurm.schedmd.com/sbatch.html)
Happy SLURMing! 🚀