---
title: PyTorch Lightning
---

**PyTorch Lightning** is a high-level framework built on top of PyTorch that removes boilerplate from distributed training. It handles device placement, gradient synchronization, and checkpointing automatically, so you can focus on the model rather than the training loop. On TIR, a Training Cluster node comes pre-configured so you can start training immediately.

---

## Environment

### Cluster Setup

* **Pre-installed:** CUDA, NCCL, PyTorch, and PyTorch Lightning are available in the TIR-provided image.
* **Single-node:** PyTorch Lightning deployments on TIR run across multiple GPUs on a single node.

### Connect to the Node

```bash
ssh $hostname
```

### Shared Storage

Use the shared directory for datasets, checkpoints, and logs so data persists after the deployment ends:

```bash
/mnt/shared
```
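
One way to keep the shared directory organized is to create its subdirectories once before the first run. The following is a minimal sketch: `ensure_shared_layout` is a hypothetical helper, and the `logs` subdirectory name is an assumption (`datasets` and `checkpoints` match paths used elsewhere in this guide).

```python
from pathlib import Path


def ensure_shared_layout(root: str = "/mnt/shared") -> dict[str, Path]:
    """Create datasets/, checkpoints/, and logs/ under the shared root.

    Hypothetical helper: the subdirectory names are a convention,
    not a TIR requirement.
    """
    base = Path(root)
    layout = {name: base / name for name in ("datasets", "checkpoints", "logs")}
    for path in layout.values():
        # exist_ok=True makes the call safe to repeat on every run
        path.mkdir(parents=True, exist_ok=True)
    return layout
```

Calling `ensure_shared_layout()` at the top of a training script returns the three paths, which can then be passed to data loading and Trainer configuration.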

---

## Training Guide

### Step 1: Install Dependencies

The TIR-provided image includes PyTorch Lightning. To install manually in a custom environment:

```bash
pip install lightning
```

---

### Step 2: Define a LightningModule

Create a `LightningModule` encapsulating your model, loss, optimizer, and training logic:

```python
import torch
import torch.nn as nn
import lightning as L


class SimpleModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)
        self.criterion = nn.MSELoss()

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        inputs, targets = batch
        outputs = self(inputs)
        loss = self.criterion(outputs, targets)
        self.log("train_loss", loss, on_step=True, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
```

---

### Step 3: Configure the Trainer

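Set `strategy="ddp"` and `devices` to the number of GPUs on your node. The snippet continues the script from Step 2, so it reuses the `torch` and `L` imports and the `SimpleModel` class: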

```python
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset
dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
dataloader = DataLoader(dataset, batch_size=32)

model = SimpleModel()

trainer = L.Trainer(
    accelerator="gpu",
    devices=8,          # Number of GPUs on the node
    strategy="ddp",
    num_nodes=1,
    max_epochs=10,
    default_root_dir="/mnt/shared/checkpoints",
)

trainer.fit(model, dataloader)
```
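
Hardcoding `devices=8` only fits an 8-GPU node. A hedged sketch of deriving the Trainer arguments from a detected GPU count follows. `trainer_kwargs_for` is a hypothetical helper, and the CPU fallback is an assumption added so the script can still run off-cluster.

```python
def trainer_kwargs_for(num_gpus: int, root_dir: str = "/mnt/shared/checkpoints") -> dict:
    """Build keyword arguments for L.Trainer from a GPU count (hypothetical helper)."""
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    kwargs = {"num_nodes": 1, "max_epochs": 10, "default_root_dir": root_dir}
    if num_gpus == 0:
        # Assumption: fall back to CPU so the script still runs without GPUs
        kwargs.update(accelerator="cpu", devices=1)
    elif num_gpus == 1:
        # A single GPU needs no distributed strategy
        kwargs.update(accelerator="gpu", devices=1)
    else:
        # Multiple GPUs on one node: one DDP process per device
        kwargs.update(accelerator="gpu", devices=num_gpus, strategy="ddp")
    return kwargs
```

In `train.py` this could be used as `trainer = L.Trainer(**trainer_kwargs_for(torch.cuda.device_count()))`.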

---

### Step 4: Launch Training

```bash
python train.py
```

With `strategy="ddp"`, Lightning launches one training process per configured GPU on its own, so a plain `python` invocation is enough and no `torchrun` wrapper is needed.
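
Under DDP each GPU process trains on its own shard of the data, so the `batch_size` passed to the `DataLoader` is a per-GPU value. The arithmetic below is a sketch that assumes Lightning's default behavior of adding a distributed sampler that pads the dataset to divide evenly across GPUs, and a `DataLoader` that keeps the last partial batch. `ddp_schedule` is a hypothetical helper.

```python
import math


def ddp_schedule(num_samples: int, batch_size: int, num_gpus: int) -> dict:
    """Estimate per-epoch work under DDP (hypothetical helper)."""
    # Each GPU receives an equal share; the sampler pads when the split is uneven
    samples_per_gpu = math.ceil(num_samples / num_gpus)
    # The last partial batch is kept, hence the ceiling
    steps_per_epoch = math.ceil(samples_per_gpu / batch_size)
    return {
        "samples_per_gpu": samples_per_gpu,
        "steps_per_epoch": steps_per_epoch,
        # Gradients are averaged across all GPUs on every step
        "effective_batch_size": batch_size * num_gpus,
    }
```

For the dummy dataset in Step 3 (256 samples, `batch_size=32`, 8 GPUs), `ddp_schedule(256, 32, 8)` gives 32 samples per GPU, 1 optimizer step per epoch, and an effective batch size of 256.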

---

## Data Management

### Import Datasets

Store datasets under the shared directory so they persist after the deployment ends:

```bash
/mnt/shared/datasets
```

### Save Checkpoints

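By default, Lightning saves a checkpoint at the end of each training epoch under `default_root_dir`, in `lightning_logs/version_N/checkpoints/`, and keeps only the most recent one. To save manually: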

```python
trainer.save_checkpoint("/mnt/shared/checkpoints/model_final.ckpt")
```

To resume from a checkpoint:

```python
trainer.fit(model, dataloader, ckpt_path="/mnt/shared/checkpoints/model_final.ckpt")
```
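
When the checkpoint filename is not known in advance, the newest file can be located before resuming. This is a minimal sketch, assuming checkpoints use the `.ckpt` extension and that the most recently modified file is the one to resume from. `latest_checkpoint` is a hypothetical helper, not a Lightning API.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(root: str = "/mnt/shared/checkpoints") -> Optional[str]:
    """Return the most recently modified .ckpt file under root, or None."""
    # rglob also searches lightning_logs/version_N/checkpoints/
    candidates = list(Path(root).rglob("*.ckpt"))
    if not candidates:
        return None
    newest = max(candidates, key=lambda p: p.stat().st_mtime)
    return str(newest)
```

The result can be passed straight through, for example `trainer.fit(model, dataloader, ckpt_path=latest_checkpoint())`. When no checkpoint exists the helper returns `None`, and training starts from scratch.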

---

## Monitoring

### TensorBoard

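Lightning logs metrics in TensorBoard format by default when the `tensorboard` package is installed. The logs are written to `lightning_logs/` under `default_root_dir`: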

```bash
# On the worker node
tensorboard --logdir /mnt/shared/checkpoints/lightning_logs --port 6006

# On your local machine
ssh -L 6006:localhost:6006 $hostname
```

### Weights & Biases

```python
import lightning as L
from lightning.pytorch.loggers import WandbLogger

logger = WandbLogger(project="lightning-training", save_dir="/mnt/shared/wandb")
# Pass the logger alongside the other Trainer arguments from Step 3
trainer = L.Trainer(logger=logger)
```

### System Monitoring

```bash
nvidia-smi    # GPU utilization
htop          # CPU and memory
df -h         # Disk usage
```
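
Disk usage can also be checked from inside a training script, for example before writing a large checkpoint. The sketch below uses only the standard library; `disk_usage_percent` and `is_nearly_full` are hypothetical helpers, and the 90% threshold is an arbitrary assumption.

```python
import shutil


def disk_usage_percent(path: str = "/mnt/shared") -> float:
    """Return the used share of the filesystem containing path, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path: str = "/mnt/shared", threshold_percent: float = 90.0) -> bool:
    """True when usage is at or above the threshold (assumed default: 90%)."""
    return disk_usage_percent(path) >= threshold_percent
```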

---

## Troubleshooting

| Issue | Cause | Resolution |
|-------|-------|------------|
| **CUDA Out of Memory** | Batch size too large for GPU memory | Reduce `batch_size` or enable `precision="16-mixed"` in Trainer |
| **NCCL Initialization Error** | Orphaned processes from an earlier run still holding the GPUs | List the processes with `sudo fuser -v /dev/nvidia*`, then stop the orphaned ones with `kill <PID>` |
| **Disk Full** | Checkpoints filling `/mnt/shared` | Delete old checkpoint files or adjust `save_top_k` in ModelCheckpoint |
| **SSH Connection Failed** | Port 22 not open in security group | Verify the attached security group allows inbound TCP on port 22 |
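
For the **Disk Full** case, deleting old checkpoint files can be scripted. The following is a minimal sketch, assuming that file modification time reflects checkpoint age and that the newest files are the ones worth keeping. `prune_checkpoints` is a hypothetical helper, and it defaults to a dry run so nothing is deleted until `dry_run=False` is passed.

```python
from pathlib import Path


def prune_checkpoints(root: str, keep: int = 3, dry_run: bool = True) -> list[str]:
    """Return all but the `keep` newest .ckpt files under root; delete them unless dry_run."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Newest first, so everything past index `keep` is stale
    ckpts = sorted(
        Path(root).rglob("*.ckpt"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    stale = ckpts[keep:]
    if not dry_run:
        for path in stale:
            path.unlink()
    return [str(p) for p in stale]
```
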

---

## FAQ

**Q: How do I enable mixed precision training?**

Set `precision="16-mixed"` in the Trainer:

```python
trainer = L.Trainer(accelerator="gpu", devices=8, precision="16-mixed")
```
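
On GPUs that support bfloat16, `"bf16-mixed"` is another precision value Lightning accepts. A small sketch of choosing between the two follows; `choose_precision` is a hypothetical helper, and preferring bfloat16 whenever it is available is an assumption rather than a TIR recommendation.

```python
def choose_precision(bf16_supported: bool) -> str:
    """Pick a Lightning precision string (hypothetical helper)."""
    # bfloat16 keeps float32's exponent range, so it avoids the loss scaling
    # that float16 mixed precision relies on
    return "bf16-mixed" if bf16_supported else "16-mixed"
```

In a training script the flag could come from PyTorch, for example `precision=choose_precision(torch.cuda.is_bf16_supported())`.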

**Q: How do I use multiple GPUs on a single node?**

Set `devices=N` where N is the number of available GPUs (e.g., `devices=8`), and `strategy="ddp"`.

**Q: Where are checkpoints and logs stored?**

By default, under `default_root_dir` (set to `/mnt/shared/checkpoints` in the example above). Logs go to `lightning_logs/version_N/`, and checkpoints go to the `checkpoints/` folder inside that directory.


---