PyTorch Lightning
PyTorch Lightning is a high-level framework built on top of PyTorch that removes boilerplate from distributed training. It handles device placement, gradient synchronization, and checkpointing automatically — letting you focus on the model rather than the training loop. On TIR, a Training Cluster node comes pre-configured so you can start training immediately.
Environment
Cluster Setup
- Pre-installed: CUDA, NCCL, PyTorch, and PyTorch Lightning are available in the TIR-provided image.
- Single-node: PyTorch Lightning deployments on TIR run across multiple GPUs on a single node.
Connect to the Node
ssh $hostname
Shared Storage
Use the shared directory for datasets, checkpoints, and logs so data persists after the deployment ends:
/mnt/shared
Training Guide
Step 1: Install Dependencies
The TIR-provided image includes PyTorch Lightning. To install manually in a custom environment:
pip install lightning
Step 2: Define a LightningModule
Create a LightningModule encapsulating your model, loss, optimizer, and training logic:
import torch
import torch.nn as nn
import lightning as L

class SimpleModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)
        self.criterion = nn.MSELoss()

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        inputs, targets = batch
        outputs = self(inputs)
        loss = self.criterion(outputs, targets)
        self.log("train_loss", loss, on_step=True, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
Step 3: Configure the Trainer
Set strategy="ddp" and devices to the number of GPUs on your node:
from torch.utils.data import DataLoader, TensorDataset
# Dummy dataset
dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
dataloader = DataLoader(dataset, batch_size=32)
model = SimpleModel()
trainer = L.Trainer(
    accelerator="gpu",
    devices=8,  # Number of GPUs on the node
    strategy="ddp",
    num_nodes=1,
    max_epochs=10,
    default_root_dir="/mnt/shared/checkpoints",
)
trainer.fit(model, dataloader)
Step 4: Launch Training
python train.py
With strategy="ddp", Lightning automatically launches one training process per configured GPU; no external launcher such as torchrun is needed. Run the script once with plain python and Lightning handles process spawning and gradient synchronization.
Data Management
Import Datasets
Place datasets in the shared directory so they persist across deployments and are readable by every training process:
/mnt/shared/datasets
Save Checkpoints
Lightning saves checkpoints automatically to default_root_dir. To save manually:
trainer.save_checkpoint("/mnt/shared/checkpoints/model_final.ckpt")
To resume from a checkpoint:
trainer.fit(model, dataloader, ckpt_path="/mnt/shared/checkpoints/model_final.ckpt")
Monitoring
TensorBoard
Lightning logs metrics to TensorBoard by default:
# On the worker node
tensorboard --logdir /mnt/shared/checkpoints/lightning_logs --port 6006
# On your local machine
ssh -L 6006:localhost:6006 $hostname
Weights & Biases
from lightning.pytorch.loggers import WandbLogger
logger = WandbLogger(project="lightning-training", save_dir="/mnt/shared/wandb")
trainer = L.Trainer(logger=logger, ...)
System Monitoring
nvidia-smi # GPU utilization
htop # CPU and memory
df -h # Disk usage
Troubleshooting
| Issue | Cause | Resolution |
|---|---|---|
| CUDA Out of Memory | Batch size too large for GPU memory | Reduce batch_size or enable precision="16-mixed" in Trainer |
| NCCL Initialization Error | Conflicting GPU processes | Kill orphaned processes: sudo fuser -v /dev/nvidia* |
| Disk Full | Checkpoints filling /mnt/shared | Delete old checkpoint files or adjust save_top_k in ModelCheckpoint |
| SSH Connection Failed | Port 22 not open in security group | Verify the attached security group allows inbound TCP on port 22 |
FAQ
Q: How do I enable mixed precision training?
Set precision="16-mixed" in the Trainer:
trainer = L.Trainer(accelerator="gpu", devices=8, precision="16-mixed")
Q: How do I use multiple GPUs on a single node?
Set devices=N where N is the number of available GPUs (e.g., devices=8), and strategy="ddp".
Q: Where are checkpoints and logs stored?
By default, under default_root_dir (set to /mnt/shared/checkpoints in the example above) inside a lightning_logs/ subdirectory.