Slurm
Slurm (Simple Linux Utility for Resource Management) is a job scheduler designed for cluster environments. On TIR, a Slurm deployment provisions workers that form a Slurm cluster — you submit batch jobs using standard sbatch commands, and Slurm handles scheduling and resource allocation across your nodes.
Environment
Cluster Setup
- All nodes in the deployment share the same TIR-provided container image with CUDA, NCCL, and PyTorch pre-installed.
- The Slurm controller runs on the Master Worker. Submit jobs from there.
Connect to the Master Worker
ssh <master-worker-hostname>
All other workers are accessible from the master via SSH using their worker hostnames.
Shared Storage
Datasets, scripts, logs, and checkpoints must be placed in shared storage so all nodes can access them:
/mnt/shared
At least one storage volume (SFS, PFS, or Dataset) must be attached when creating a Slurm deployment.
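Slurm does not create missing output directories, so create the shared paths used by the job scripts below before submitting; this is a minimal setup, and the subdirectory names are only a suggestion:
mkdir -p /mnt/shared/logs /mnt/shared/checkpoints /mnt/shared/datasets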
Slurm Basics
| Concept | Description |
|---|---|
| Node | A compute instance in the cluster |
| Partition | A named group of nodes used for scheduling |
| Job | A task submitted to the scheduler for execution |
| Task | A single process within a job |
Essential Commands
sinfo # View node and partition status
squeue # List all queued and running jobs
squeue -u $USER # List your jobs only
sbatch job_script.sh # Submit a batch job
scancel JOB_ID # Cancel a job
scontrol show job JOB_ID # Detailed job information
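Before writing a full batch script, you can sanity-check that GPUs are schedulable with a one-off interactive job step (assuming the default all partition and at least one free GPU):
srun --partition=all --nodes=1 --gres=gpu:1 nvidia-smi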
Single-Node Job Script
For training on a single node with 8 GPUs, create train_job.sh:
#!/bin/bash
#SBATCH --job-name=single-node-training
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8
#SBATCH --partition=all
#SBATCH --time=02:00:00
#SBATCH --output=/mnt/shared/logs/%x_%j.out
#SBATCH --error=/mnt/shared/logs/%x_%j.err
export NCCL_DEBUG=INFO
export MASTER_ADDR=localhost
export MASTER_PORT=12345
srun torchrun --nproc_per_node=8 /mnt/shared/train.py
Submit it:
sbatch train_job.sh
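A typical follow-up uses the commands above; JOBID is whatever ID sbatch printed:
squeue -u $USER    # wait for the job state to change from PD (pending) to R (running)
tail -f /mnt/shared/logs/single-node-training_JOBID.out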
Key Directives
| Directive | Description |
|---|---|
| --nodes | Number of nodes to allocate |
| --ntasks | Total number of tasks (processes) |
| --cpus-per-task | CPU cores per task |
| --gres=gpu:N | Number of GPUs per node |
| --partition | Partition name (use all for the default partition) |
| --time | Maximum job runtime in hh:mm:ss |
| --output / --error | Paths for stdout and stderr logs |
Multi-Node Job Script
For distributed training across multiple nodes, create multinode_job.sh:
#!/bin/bash
#SBATCH --job-name=multi-node-training
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8
#SBATCH --partition=all
#SBATCH --time=04:00:00
#SBATCH --output=/mnt/shared/logs/%x_%j.out
#SBATCH --error=/mnt/shared/logs/%x_%j.err
export NCCL_DEBUG=INFO
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=12345
# Single-quote the command so SLURM_NODEID expands on each node, not once on the batch node.
srun bash -c 'torchrun \
    --nproc_per_node=8 \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    /mnt/shared/train.py'
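Submit it the same way and confirm that both nodes were allocated (JOB_ID is the ID printed by sbatch):
sbatch multinode_job.sh
squeue -u $USER                            # the NODES column should show 2
scontrol show job JOB_ID | grep NodeList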
Advanced Slurm Features
Job Arrays
Run the same job script multiple times with different parameters — useful for hyperparameter sweeps:
#SBATCH --array=1-10
Access the current task index in your script:
echo "Running task $SLURM_ARRAY_TASK_ID"
Job Dependencies
Run jobs in sequence — the second job starts only after the first completes successfully:
JOB1=$(sbatch preprocess.sh | awk '{print $4}')
sbatch --dependency=afterok:$JOB1 train_job.sh
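The same pattern extends to longer pipelines; sbatch --parsable prints only the job ID, which avoids the awk step. Here evaluate.sh is a hypothetical third stage:
PREP=$(sbatch --parsable preprocess.sh)
TRAIN=$(sbatch --parsable --dependency=afterok:$PREP train_job.sh)
sbatch --parsable --dependency=afterok:$TRAIN evaluate.sh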
Monitoring Jobs
View Logs
Stdout and stderr are written to the paths defined in --output and --error:
tail -f /mnt/shared/logs/single-node-training_JOBID.out
Check GPU Utilization
srun --jobid=JOB_ID nvidia-smi
Or connect to a worker and run:
watch -n 2 nvidia-smi
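If job accounting is enabled on the deployment, sacct gives a post-hoc view of job state and runtime:
sacct -j JOB_ID --format=JobID,JobName,State,Elapsed,ExitCode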
Troubleshooting
| Issue | Cause | Resolution |
|---|---|---|
| Job stuck in queue (PD state) | Insufficient resources or unsatisfied dependency | Check sinfo for node availability; verify dependency job status |
| Job fails immediately | Script error or missing file | Check the .err log; run scontrol show job JOB_ID for details |
| NCCL timeout on multi-node | Network or firewall issue between nodes | Ensure workers can reach each other; set NCCL_DEBUG=INFO |
| Disk full | Logs or checkpoints filling /mnt/shared | Delete old output files or increase storage quota |
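For the NCCL timeout case, a quick way to confirm that two nodes can be allocated and reached at all is a trivial two-node job step; if that works but NCCL still times out, pinning the communication interface with NCCL_SOCKET_IFNAME (the interface name depends on your deployment) is a common next step:
srun --partition=all --nodes=2 --ntasks-per-node=1 hostname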