# Slurm

Welcome to the SLURM guide for managing and running training jobs on your cluster. This guide assumes a single-node setup with 8 GPUs.
## Slurm Configuration

### 1. SLURM Basics
SLURM (Simple Linux Utility for Resource Management) is a powerful job scheduler used in cluster environments.
Key components include:
- Nodes – Machines or compute instances in the cluster.
- Partitions – Groups of nodes for scheduling jobs.
- Jobs – Tasks submitted to the scheduler for execution.
### 2. Common SLURM Commands

Here are some essential SLURM commands you’ll use frequently:

```bash
# Check node status
sinfo

# Submit a job
sbatch job_script.sh

# Monitor jobs
squeue

# Cancel a job
scancel JOB_ID
```
## Setting Up a Job Script

### Example: Single-Node Training Job Script

Create a file named `train_job.sh` with the following content:
```bash
#!/bin/bash
#SBATCH --job-name=single-node-training
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8
#SBATCH --partition=all
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

# Load modules
module load cuda/11.8
module load pytorch/2.0

# Environment variables
export NCCL_DEBUG=INFO
export MASTER_ADDR=localhost
export MASTER_PORT=12345

# Run the job: one srun task launches torchrun, which spawns 8 training processes
srun torchrun --nproc_per_node=8 train.py
```
### Key Directives Explained

| Directive | Description |
|---|---|
| `--nodes` | Number of nodes to use for the job (1 for single-node). |
| `--ntasks` | Total number of tasks (1 for single-node). |
| `--gres=gpu:8` | Request 8 GPUs for training. |
| `--partition` | Partition (queue) name (e.g., `all`). |
| `--time` | Maximum job runtime (`hh:mm:ss`). |
| `--output`, `--error` | Output and error log file names (`%x` expands to the job name, `%j` to the job ID). |
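Directives set in the script act as defaults; the same options passed to `sbatch` on the command line override them. A sketch, reusing `train_job.sh` from above:

```bash
# Command-line options override #SBATCH directives in the script:
# here the 2-hour limit from train_job.sh becomes 4 hours for this run only.
sbatch --time=04:00:00 train_job.sh
```

This is handy for one-off runs that need more time or different resources without editing the script.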
## Advanced SLURM Features

### 1. Job Arrays

For batch processing or hyperparameter tuning, use a job array:

```bash
#SBATCH --array=1-10
```

This creates 10 array tasks with indices 1 through 10. Inside each task, the index is available in the environment variable `$SLURM_ARRAY_TASK_ID`.
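A common pattern is to use `$SLURM_ARRAY_TASK_ID` to index into a list of settings. A minimal sketch of a learning-rate sweep (the values are illustrative; with `#SBATCH --array=1-3` each task would get one of them, and outside SLURM the variable is unset so the sketch falls back to task 1):

```shell
# Pick a learning rate based on the array task ID (1-based).
LRS=(0.1 0.01 0.001)
TASK_ID=${SLURM_ARRAY_TASK_ID:-1}   # defaults to 1 when run outside SLURM
LR=${LRS[$((TASK_ID - 1))]}         # bash arrays are 0-based, hence the -1
echo "array task $TASK_ID trains with lr=$LR"
```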
### 2. Job Dependencies

To run jobs sequentially or conditionally, use dependencies:

```bash
sbatch --dependency=afterok:JOB_ID job_script.sh
```

With `afterok`, the new job starts only after the specified job completes successfully (exit code 0).
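Dependencies combine naturally with `sbatch --parsable`, which prints only the new job's ID so it can be captured in a script. A sketch of a two-stage pipeline (the script names `prep.sh` and `train_job.sh` are assumptions):

```bash
# Stage 1: preprocessing; --parsable makes sbatch print just the job ID.
PREP_ID=$(sbatch --parsable prep.sh)

# Stage 2: training starts only if preprocessing exits successfully.
sbatch --dependency=afterok:"$PREP_ID" train_job.sh
```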
## Submitting the Job

Submit your SLURM job using:

```bash
sbatch train_job.sh
```

On success, `sbatch` prints `Submitted batch job <job_id>`. SLURM then queues and executes the job based on resource availability.
## Monitoring the Job

### Check Job Status

View your jobs currently in the queue:

```bash
squeue -u $USER
```

### View Logs

After the job starts, output appears in the `.out` file defined in your script:

```bash
less single-node-training_JOBID.out
```

You can also check the `.err` file for any errors.
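To watch output as it is written rather than after the fact, `tail -f` works on the same file (`JOBID` stands in for the numeric job ID, as above):

```bash
# Follow the log live; Ctrl-C stops tailing without affecting the job.
tail -f single-node-training_JOBID.out
```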
## Troubleshooting

### 1. Common Issues

- Insufficient resources – Ensure the requested number of GPUs is available in the chosen partition.
- Job stuck in queue – Check partition status with `sinfo` and verify any dependency constraints.
### 2. Debugging Tips

View detailed job information:

```bash
scontrol show job JOBID
```

Check the `.err` file for error messages and SLURM-specific issues.