Slurm

Welcome to the Slurm guide for managing and running training jobs on your cluster.
This guide assumes a single-node setup with 8 GPUs.

Slurm Configuration

1. SLURM Basics

Slurm (originally an acronym for Simple Linux Utility for Resource Management) is a powerful open-source job scheduler used in cluster environments.
Key components include:

  • Nodes – Machines or compute instances in the cluster.
  • Partitions – Groups of nodes for scheduling jobs.
  • Jobs – Tasks submitted to the scheduler for execution.

2. Common SLURM Commands

Here are some essential SLURM commands you’ll use frequently:

Check node status

sinfo

Submit a job

sbatch job_script.sh

Monitor jobs

squeue

Cancel a job

scancel JOB_ID
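These commands compose well in scripts. As a minimal sketch, `sbatch --parsable` prints only the job ID at submission, so you can capture it for later `squeue` or `scancel` calls (the `command -v` guard is just to keep the sketch runnable on a machine without Slurm):

```shell
#!/bin/bash
# Sketch: capture the job ID at submission so later commands can
# reference it. The fallback branch only exists so this snippet
# does not error on a machine without Slurm installed.
if command -v sbatch >/dev/null 2>&1; then
  JOB_ID=$(sbatch --parsable job_script.sh)   # prints just the job ID
else
  JOB_ID=0   # placeholder when no Slurm is available
fi
echo "submitted job ${JOB_ID}"
# e.g. later: squeue -j "$JOB_ID"   or   scancel "$JOB_ID"
```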

Setting Up a Job Script

Example: Single-Node Training Job Script

Create a file named train_job.sh with the following content:

#!/bin/bash
#SBATCH --job-name=single-node-training
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8
#SBATCH --partition=all
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

# Load modules
module load cuda/11.8
module load pytorch/2.0

# Environment variables
export NCCL_DEBUG=INFO
export MASTER_ADDR=localhost
export MASTER_PORT=12345

# Run the job
srun torchrun --nproc_per_node=8 train.py
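One caveat with the hard-coded MASTER_PORT above: if two jobs ever land on the same node, both would try to bind port 12345. A common workaround (a sketch, not part of the script above) is to derive the port from the job ID, which Slurm exposes as SLURM_JOB_ID inside a job:

```shell
#!/bin/bash
# Sketch: derive a per-job rendezvous port from SLURM_JOB_ID so two
# jobs sharing a node don't collide on a fixed port. The base port
# 12000 and the modulus are illustrative choices.
JOB_ID=${SLURM_JOB_ID:-0}   # set by Slurm inside a job; 0 outside one
export MASTER_PORT=$((12000 + JOB_ID % 1000))
echo "MASTER_PORT=${MASTER_PORT}"
```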

Key Directives Explained

  • --nodes – Number of nodes to use for the job (1 for single node).
  • --ntasks – Total number of tasks (1 for single node).
  • --gres=gpu:8 – Request 8 GPUs for training.
  • --partition – Partition or queue name (e.g., all).
  • --time – Maximum job runtime (hh:mm:ss).
  • --output, --error – Define output and error log file names.
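The %x and %j placeholders in --output and --error expand to the job name and the numeric job ID. A quick sketch of the resulting filename (the job ID 42 is illustrative):

```shell
#!/bin/bash
# Sketch of how the %x_%j.out pattern from the script expands:
# %x -> job name, %j -> numeric job ID (42 here is illustrative).
JOB_NAME="single-node-training"
JOB_ID=42
OUT_FILE="${JOB_NAME}_${JOB_ID}.out"
echo "$OUT_FILE"   # single-node-training_42.out
```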

Advanced SLURM Features

1. Job Arrays

For batch processing or hyperparameter tuning, use a Job Array:

#SBATCH --array=1-10

This creates a single array job with 10 tasks, with array indices 1 through 10. You can access the array task ID with the variable:

$SLURM_ARRAY_TASK_ID
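For example, each array task can pick its own hyperparameter from a list indexed by the task ID. A minimal sketch (the learning-rate list and the train.py flag are illustrative, not part of Slurm):

```shell
#!/bin/bash
# Sketch: map SLURM_ARRAY_TASK_ID (1-10, per --array=1-10) to a
# learning rate for a hyperparameter sweep. The values are illustrative.
LRS=(1e-1 3e-2 1e-2 3e-3 1e-3 3e-4 1e-4 3e-5 1e-5 3e-6)
TASK_ID=${SLURM_ARRAY_TASK_ID:-1}   # set by Slurm inside an array task
LR=${LRS[$((TASK_ID - 1))]}         # task IDs start at 1, arrays at 0
echo "task ${TASK_ID}: lr=${LR}"
# e.g. srun python train.py --lr "$LR"
```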

2. Job Dependencies

To run jobs sequentially or conditionally, use dependencies:

sbatch --dependency=afterok:JOB_ID job_script.sh

This ensures the new job starts only after the specified job completes successfully.
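Combined with --parsable, this can be wrapped in a small helper that chains two scripts, e.g. preprocessing then training. The helper below is a hypothetical sketch (the function name and script names are placeholders), not a Slurm built-in:

```shell
#!/bin/bash
# Hypothetical helper: submit two scripts so the second runs only if
# the first exits successfully. --parsable makes sbatch print just the
# job ID, which is fed into --dependency=afterok.
submit_chain() {
  local first_id
  first_id=$(sbatch --parsable "$1") || return 1
  sbatch --parsable --dependency="afterok:${first_id}" "$2"
}
# On a cluster: submit_chain preprocess.sh train_job.sh
```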


Submitting the Job

Submit your SLURM job using:

sbatch train_job.sh

Once submitted, SLURM will queue and execute your job based on resource availability.


Monitoring the Job

Check Job Status

View jobs currently in the queue:

squeue -u $USER

View Logs

After the job starts, output logs will appear in the .out file defined in your script:

less single-node-training_JOBID.out

You can also check the .err file for any errors.


Troubleshooting

1. Common Issues

  • Insufficient Resources – Ensure the requested number of GPUs is available in the chosen partition.
  • Job Stuck in Queue – Check partition status using sinfo or verify dependency constraints.

2. Debugging Tips

View detailed job information:

scontrol show job JOBID

Check the .err file for error messages and SLURM-specific issues.


References

  • Official Slurm Documentation