---
title: PyTorch Distributed
---

**PyTorch Distributed Data Parallel (DDP)** lets you train a model across multiple GPUs on a single node by running one process per GPU and averaging gradients across those processes during each backward pass. On TIR, your Training Cluster node comes fully pre-configured: drivers, CUDA, NCCL, and PyTorch are ready to use.

---

## Environment

Each TIR Training Cluster node comes pre-configured with:

* **PyTorch**, **CUDA**, and **NCCL** installed and optimized
* GPU drivers and high-bandwidth interconnects (NVLink or PCIe)
* Identical software environments across all GPUs

### Connect to the Node

```bash
ssh $hostname
```

### Shared Storage

All datasets, checkpoints, and logs should be written to the shared directory so they persist after the deployment ends:

```bash
/mnt/shared
```

---

## Training Guide

### Step 1: Install Dependencies

The TIR-provided image includes PyTorch. If you are using a custom environment, install it manually:

```bash
pip install torch torchvision torchaudio
```

---

### Step 2: Write a Distributed Training Script

Save the following as `train.py`. It initializes the NCCL process group, wraps the model in DDP, and runs a short training loop on synthetic data in every process:

```python
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup():
    dist.init_process_group(backend="nccl")


def cleanup():
    dist.destroy_process_group()


def main():
    setup()

    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

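    # Model, optimizer, loss. .cuda() targets the device selected above.
    model = nn.Linear(10, 1).cuda()
    # DDP all-reduces (averages) gradients across processes during backward().
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)
    criterion = nn.MSELoss()

    for epoch in range(5):
        optimizer.zero_grad()
        # Synthetic data: each process draws its own random batch of 32 samples.
        inputs = torch.randn(32, 10).cuda()
        targets = torch.randn(32, 1).cuda()
        outputs = ddp_model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        # Log from one process only; the value shown is rank 0's local loss.
        if local_rank == 0:
            print(f"Epoch {epoch} | Loss: {loss.item():.4f}")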

    cleanup()


if __name__ == "__main__":
    main()
```
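
The script above gives every process its own random batch, so the effective global batch is 32 times the number of processes. With a real dataset, each process should instead read a different slice of the data, which PyTorch handles through `torch.utils.data.DistributedSampler`. The sketch below is a simplified, pure-Python illustration of the strided partitioning idea without shuffling. It is not the library's implementation, and `shard_indices` is a hypothetical helper name.

```python
import math


def shard_indices(dataset_len, rank, world_size):
    """Return the sample indices one rank would process (no shuffling).

    Hypothetical helper: wraps around to the start of the dataset so every
    rank receives the same number of samples, then takes every
    world_size-th index starting at this rank.
    """
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    # Indices past the end wrap around, which repeats a few early samples.
    padded = [i % dataset_len for i in range(total)]
    return padded[rank:total:world_size]


if __name__ == "__main__":
    # Show how 10 samples would be split across 4 processes.
    for rank in range(4):
        print(rank, shard_indices(10, rank, 4))
```

In a real training script you would typically pass `sampler=DistributedSampler(dataset)` to the `DataLoader` and call `sampler.set_epoch(epoch)` at the start of each epoch so the shuffle order changes between epochs.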

---

### Step 3: Launch Training

Use `torchrun` to launch the script. It starts `--nproc_per_node` processes and sets the rank environment variables each one needs:

```bash
torchrun --nproc_per_node=4 train.py
```

| Flag | Description |
|------|-------------|
| `--nproc_per_node` | Number of processes to launch on the node; set it to the number of GPUs (e.g., `4` or `8`) |
| `train.py` | Your training script |
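
`torchrun` exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to every process it starts, which is where `train.py` gets `LOCAL_RANK` from. As a minimal sketch, the values could be collected in one place as shown below. `LaunchInfo` and `read_launch_info` are hypothetical names, and the single-process defaults are an assumption made for debugging convenience; the helper only reads the variables and does not initialize the process group.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchInfo:
    rank: int        # global index of this process
    local_rank: int  # index of this process on its node (selects the GPU)
    world_size: int  # total number of processes in the job


def read_launch_info(env=None):
    """Read the rank variables that torchrun exports to each process.

    Hypothetical helper. Missing variables fall back to single-process
    values so the result is still usable outside torchrun.
    """
    env = os.environ if env is None else env
    return LaunchInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


if __name__ == "__main__":
    print(read_launch_info())
```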

---

## Data Management

### Import Datasets

Place datasets in shared storage before launching training:

```bash
/mnt/shared/datasets
```

### Save Checkpoints

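Save checkpoints from the rank-0 process only to avoid write conflicts. Save the state of the unwrapped `model` rather than `ddp_model` so the keys carry no `module.` prefix, and make sure the target directory exists before writing: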

```python
if local_rank == 0:
    torch.save(model.state_dict(), "/mnt/shared/checkpoints/model_epoch_5.pt")
```
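
A job that is interrupted in the middle of a write can leave a truncated checkpoint behind. One common way to reduce that risk is to write to a temporary name and rename the file into place once the write has finished. The sketch below assumes the shared filesystem supports renames within a directory, which the source does not state; `save_atomically` is a hypothetical helper.

```python
import os


def save_atomically(save_fn, path):
    """Write a file under a temporary name, then rename it into place.

    Hypothetical helper: `save_fn` is any callable that writes to the path
    it is given, for example `lambda p: torch.save(state, p)`.
    """
    # Create the checkpoint directory if it does not exist yet.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    tmp_path = path + ".tmp"
    save_fn(tmp_path)
    # os.replace overwrites any existing file at the destination.
    os.replace(tmp_path, path)
    return path
```

From the rank-0 branch of the training script this might be called as `save_atomically(lambda p: torch.save(model.state_dict(), p), "/mnt/shared/checkpoints/model_epoch_5.pt")`.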

---

## Monitoring

### TensorBoard

Write logs to shared storage and access TensorBoard remotely via SSH port forwarding:

```bash
# On the worker node
tensorboard --logdir /mnt/shared/logs --port 6006

# On your local machine
ssh -L 6006:localhost:6006 $hostname
```

### Weights & Biases

```bash
pip install wandb
```

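```python
import wandb

# Log from rank 0 only so the run is recorded once.
# Offline mode writes runs to local disk; upload them later with `wandb sync`.
if local_rank == 0:
    wandb.init(project="pytorch-ddp", mode="offline")
    wandb.log({"loss": loss.item()})
```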

### GPU & System Utilization

```bash
watch -n 2 nvidia-smi   # GPU utilization
htop                    # CPU and memory
```
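
For logging utilization alongside training metrics, `nvidia-smi` can also emit machine-readable CSV through its `--query-gpu` and `--format` options. The following is a rough sketch of parsing that output; `parse_gpu_stats` and `query_gpus` are hypothetical helpers, the sample values are made up, and the parser assumes every field is numeric (some GPUs report `[N/A]` for certain fields).

```python
import subprocess

QUERY = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, used, total = (field.strip() for field in line.split(","))
        stats.append({
            "gpu": int(index),
            "util_percent": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return stats


def query_gpus():
    """Run nvidia-smi on the node and return parsed per-GPU stats."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)


if __name__ == "__main__":
    # Made-up sample in the expected layout, so the parser can be tried
    # on a machine without a GPU.
    sample = "0, 97, 30512, 81920\n1, 95, 30498, 81920\n"
    print(parse_gpu_stats(sample))
```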

---

## Mixed Precision Training

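Automatic Mixed Precision (AMP) runs eligible operations in reduced precision, which can lower GPU memory usage and speed up training. The `torch.amp` imports below require a recent PyTorch release; older versions expose the same classes under `torch.cuda.amp`: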

```python
from torch.amp import autocast, GradScaler

scaler = GradScaler(device="cuda")

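for epoch in range(5):
    optimizer.zero_grad()
    # Run the forward pass and loss in reduced precision where it is safe.
    with autocast(device_type="cuda"):
        outputs = ddp_model(inputs)
        loss = criterion(outputs, targets)
    # Scale the loss so small float16 gradients do not underflow.
    scaler.scale(loss).backward()
    # step() unscales gradients and skips the update if they contain inf/NaN.
    scaler.step(optimizer)
    # Adjust the scale factor for the next iteration.
    scaler.update()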
```

---

## Troubleshooting

| Issue | Cause | Resolution |
|-------|-------|------------|
| **CUDA Out of Memory** | Batch size too large for GPU memory | Reduce batch size or enable AMP |
| **NCCL Timeout** | A process crashed or stalled, or GPU communication failed | Confirm all GPUs are visible via `nvidia-smi`; rerun with `NCCL_DEBUG=INFO` for detailed NCCL logs |
| **Disk Full** | Checkpoints or logs filling `/mnt/shared` | Delete old files or increase storage quota |
| **Process group init failure** | `RANK`, `WORLD_SIZE`, or `LOCAL_RANK` not set because the script was started with `python` | Launch with `torchrun` instead of `python` |
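
For the **Disk Full** case, one option is to check free space and keep only the newest few checkpoints. The sketch below shows that idea; `free_gib` and `prune_checkpoints` are hypothetical helpers, and the `keep=3` and `.pt` defaults are illustrative rather than a TIR policy. It deletes files, so try it on a test directory first.

```python
import os
import shutil


def free_gib(path):
    """Return free space in GiB on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 2**30


def prune_checkpoints(directory, keep=3, suffix=".pt"):
    """Delete all but the `keep` most recently modified checkpoint files.

    Hypothetical helper. Returns the list of removed paths.
    """
    files = [
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(suffix)
    ]
    files.sort(key=os.path.getmtime, reverse=True)  # newest first
    removed = files[keep:]
    for path in removed:
        os.remove(path)
    return removed
```

As with saving, run the pruning step from the rank-0 process only, for example `prune_checkpoints("/mnt/shared/checkpoints", keep=3)`.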

---

## FAQ

**Q: Why use DDP instead of DataParallel?**

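DDP runs one process per GPU and overlaps the NCCL gradient all-reduce with the backward pass, so it typically scales better as GPUs are added. `DataParallel` drives all GPUs from a single process, which adds Python GIL contention and re-replicates the model on every forward pass. The PyTorch documentation recommends DDP over `DataParallel` even on a single node.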
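
To illustrate why averaging per-process gradients behaves like training on one larger batch, here is a toy pure-Python sketch with a scalar weight and two simulated processes. It is not how DDP is implemented: `all_reduce_mean` is a hypothetical stand-in for the NCCL all-reduce, and the equality relies on both shards having the same size.

```python
def grad_mse(w, xs, ts):
    """Gradient of mean((w*x - t)^2) with respect to the scalar weight w."""
    return sum(2 * x * (w * x - t) for x, t in zip(xs, ts)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for an all-reduce that averages one value per process."""
    return sum(values) / len(values)


if __name__ == "__main__":
    w = 0.5
    xs = [1.0, 2.0, 3.0, 4.0]
    ts = [2.0, 4.0, 6.0, 8.0]

    # Two simulated processes, each holding half of the global batch.
    local_grads = [grad_mse(w, xs[:2], ts[:2]), grad_mse(w, xs[2:], ts[2:])]

    # The averaged local gradients match the gradient of the full batch.
    print(all_reduce_mean(local_grads), grad_mse(w, xs, ts))
```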

**Q: How do I save a checkpoint safely with multiple workers?**

Only the rank-0 process should write checkpoints. Wrap your save call with `if dist.get_rank() == 0:` to avoid write conflicts from other workers. On a single node this is equivalent to the `local_rank == 0` check used in the examples above.

**Q: Where are logs and checkpoints stored?**

Under `/mnt/shared/logs` and `/mnt/shared/checkpoints` by convention. Use these paths to ensure data persists after the deployment ends.


---