---
title: Slurm
---

**Slurm (Simple Linux Utility for Resource Management)** is a job scheduler designed for cluster environments. On TIR, a Slurm deployment provisions workers that form a Slurm cluster. You submit batch jobs with standard `sbatch` commands, and Slurm handles scheduling and resource allocation across your nodes.

---

## Environment

### Cluster Setup

* All nodes in the deployment share the same TIR-provided container image with CUDA, NCCL, and PyTorch pre-installed.
* The Slurm controller runs on the **Master Worker**. Submit jobs from there.

### Connect to the Master Worker

```bash
ssh $hostname
```

All other workers are reachable from the master over SSH using their worker hostnames.

### Shared Storage

Datasets, scripts, logs, and checkpoints must be placed in shared storage so that all nodes can access them:

```bash
/mnt/shared
```

:::info
At least one storage volume (SFS, PFS, or Dataset) must be attached when creating a Slurm deployment.
:::

---

## Slurm Basics

| Concept | Description |
|---------|-------------|
| **Node** | A compute instance in the cluster |
| **Partition** | A named group of nodes used for scheduling |
| **Job** | A task submitted to the scheduler for execution |
| **Task** | A single process within a job |

### Essential Commands

```bash
sinfo                      # View node and partition status
squeue                     # List all queued and running jobs
squeue -u $USER            # List your jobs only
sbatch job_script.sh       # Submit a batch job
scancel JOB_ID             # Cancel a job
scontrol show job JOB_ID   # Detailed job information
```

---

## Single-Node Job Script

For training on a single node with 8 GPUs, create `train_job.sh`:

```bash
#!/bin/bash
#SBATCH --job-name=single-node-training
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8
#SBATCH --partition=all
#SBATCH --time=02:00:00
#SBATCH --output=/mnt/shared/logs/%x_%j.out
#SBATCH --error=/mnt/shared/logs/%x_%j.err

export NCCL_DEBUG=INFO
export MASTER_ADDR=localhost
export MASTER_PORT=12345

# One Slurm task launches torchrun, which starts one process per GPU.
srun torchrun --nproc_per_node=8 /mnt/shared/train.py
```

Submit it:

```bash
sbatch train_job.sh
```

### Key Directives

| Directive | Description |
|-----------|-------------|
| `--nodes` | Number of nodes to allocate |
| `--ntasks` | Total number of tasks (processes) |
| `--cpus-per-task` | CPU cores per task |
| `--gres=gpu:N` | Number of GPUs per node |
| `--partition` | Partition name (use `all` for default) |
| `--time` | Maximum job runtime in `hh:mm:ss` |
| `--output` / `--error` | Paths for stdout and stderr logs (`%x` is the job name, `%j` the job ID) |

---

## Multi-Node Job Script

For distributed training across multiple nodes, create `multinode_job.sh`:

```bash
#!/bin/bash
#SBATCH --job-name=multi-node-training
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8
#SBATCH --partition=all
#SBATCH --time=04:00:00
#SBATCH --output=/mnt/shared/logs/%x_%j.out
#SBATCH --error=/mnt/shared/logs/%x_%j.err

export NCCL_DEBUG=INFO
# Use the first node in the allocation as the rendezvous host.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=12345

# Single quotes defer expansion to each node, so every task reads its own
# SLURM_NODEID instead of the value seen by the batch script on the first node.
srun bash -c 'torchrun \
  --nproc_per_node=8 \
  --nnodes="$SLURM_JOB_NUM_NODES" \
  --node_rank="$SLURM_NODEID" \
  --master_addr="$MASTER_ADDR" \
  --master_port="$MASTER_PORT" \
  /mnt/shared/train.py'
```

---

## Advanced Slurm Features

### Job Arrays

Run the same job script multiple times with different parameters, which is useful for hyperparameter sweeps:

```bash
#SBATCH --array=1-10
```

Access the current task index in your script:

```bash
echo "Running task $SLURM_ARRAY_TASK_ID"
```

### Job Dependencies

Run jobs in sequence: the second job starts only after the first completes successfully.

```bash
# sbatch prints "Submitted batch job <ID>"; the fourth field is the job ID.
JOB1=$(sbatch preprocess.sh | awk '{print $4}')
sbatch --dependency=afterok:$JOB1 train_job.sh
```

---

## Monitoring Jobs

### View Logs

Stdout and stderr are written to the paths defined in `--output` and `--error`:

```bash
tail -f /mnt/shared/logs/single-node-training_JOBID.out
```

### Check GPU Utilization

```bash
srun --jobid=JOB_ID nvidia-smi
```

Or connect to a worker and run:

```bash
watch -n 2 nvidia-smi
```

---

## Troubleshooting

| Issue | Cause | Resolution |
|-------|-------|------------|
| **Job stuck in queue (PD state)** | Insufficient resources or unsatisfied dependency | Check `sinfo` for node availability; verify dependency job status |
| **Job fails immediately** | Script error or missing file | Check the `.err` log; run `scontrol show job JOB_ID` for details |
| **NCCL timeout on multi-node** | Network or firewall issue between nodes | Ensure workers can reach each other; set `NCCL_DEBUG=INFO` |
| **Disk full** | Logs or checkpoints filling `/mnt/shared` | Delete old output files or increase storage quota |

---

## Resources

* [Official Slurm Documentation](https://slurm.schedmd.com/documentation.html)
* [Slurm `sbatch` Reference](https://slurm.schedmd.com/sbatch.html)

---
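
As a supplement to the Job Arrays section, the following is a minimal sketch of how an array task index might be turned into a hyperparameter. The `pick_lr` helper and the learning-rate values are hypothetical and not part of TIR or Slurm; only `SLURM_ARRAY_TASK_ID` comes from Slurm itself.

```shell
#!/bin/sh
# Hypothetical helper: map a 1-based array task ID to one learning rate.
# The grid values below are illustrative assumptions, not recommendations.
pick_lr() {
  task_id="$1"
  # Load the grid into the function's positional parameters.
  set -- 0.1 0.03 0.01 0.003 0.001
  # Wrap around if there are more array tasks than grid entries.
  index=$(( (task_id - 1) % $# ))
  shift "$index"
  echo "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID inside array jobs.
# Default to 1 so the script can be dry-run outside Slurm.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"
LR="$(pick_lr "$TASK_ID")"
echo "Task $TASK_ID uses learning rate $LR"
```

In a batch script submitted with `#SBATCH --array=1-10`, the value in `$LR` could then be passed to the training script, assuming the script accepts such an argument.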