
PyTorch Lightning

PyTorch Lightning is a high-level framework built on top of PyTorch that removes boilerplate from distributed training. It handles device placement, gradient synchronization, and checkpointing automatically — letting you focus on the model rather than the training loop. On TIR, a Training Cluster node comes pre-configured so you can start training immediately.


Environment

Cluster Setup

  • Pre-installed: CUDA, NCCL, PyTorch, and PyTorch Lightning are available in the TIR-provided image.
  • Single-node: PyTorch Lightning deployments on TIR run across multiple GPUs on a single node.

Connect to the Node

ssh $hostname

Shared Storage

Use the shared directory for datasets, checkpoints, and logs so data persists after the deployment ends:

/mnt/shared
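
For example, you can create per-purpose subdirectories under the shared mount before training. A minimal sketch; the subdirectory names below are only illustrative, not required by TIR:

import os

SHARED = "/mnt/shared"
for sub in ("datasets", "checkpoints", "logs"):
    # Creates the directory if missing; does nothing if it already exists
    os.makedirs(os.path.join(SHARED, sub), exist_ok=True)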

Training Guide

Step 1: Install Dependencies

The TIR-provided image includes PyTorch Lightning. To install manually in a custom environment:

pip install lightning

Step 2: Define a LightningModule

Create a LightningModule encapsulating your model, loss, optimizer, and training logic:

import torch
import torch.nn as nn
import lightning as L


class SimpleModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)
        self.criterion = nn.MSELoss()

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        inputs, targets = batch
        outputs = self(inputs)
        loss = self.criterion(outputs, targets)
        self.log("train_loss", loss, on_step=True, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

Step 3: Configure the Trainer

Set strategy="ddp" and devices to the number of GPUs on your node:

from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset
dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
dataloader = DataLoader(dataset, batch_size=32)

model = SimpleModel()

trainer = L.Trainer(
    accelerator="gpu",
    devices=8,               # Number of GPUs on the node
    strategy="ddp",
    num_nodes=1,
    max_epochs=10,
    default_root_dir="/mnt/shared/checkpoints",
)

trainer.fit(model, dataloader)

Step 4: Launch Training

python train.py

When strategy="ddp" is set, Lightning launches one process per configured GPU automatically; you do not need to invoke torchrun yourself.
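
Because strategy="ddp" runs one copy of your script per GPU, side effects you want exactly once (prints, downloads, artifact uploads) should be guarded by rank. A minimal sketch, assuming you add this hook to the SimpleModel from Step 2:

class SimpleModel(L.LightningModule):
    # ... __init__, forward, training_step, configure_optimizers as in Step 2 ...

    def on_train_start(self):
        # global_rank is 0 only in the first DDP process; the other GPU processes skip this block
        if self.global_rank == 0:
            print(f"Training with {self.trainer.world_size} process(es)")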


Data Management

Import Datasets

Place datasets in the shared directory so every training process can read them:

/mnt/shared/datasets
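
A minimal sketch of reading a dataset from the shared directory; train.pt is a hypothetical file, not something TIR provides:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical file containing an (inputs, targets) tuple of tensors
inputs, targets = torch.load("/mnt/shared/datasets/train.pt")
dataloader = DataLoader(TensorDataset(inputs, targets), batch_size=32, num_workers=4)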

Save Checkpoints

Lightning saves checkpoints automatically to default_root_dir. To save manually:

trainer.save_checkpoint("/mnt/shared/checkpoints/model_final.ckpt")

To resume from a checkpoint:

trainer.fit(model, dataloader, ckpt_path="/mnt/shared/checkpoints/model_final.ckpt")
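
To control which and how many checkpoints are kept (see save_top_k in Troubleshooting below), attach a ModelCheckpoint callback. A minimal sketch, monitoring the train_loss metric logged in Step 2:

from lightning.pytorch.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath="/mnt/shared/checkpoints",
    monitor="train_loss",  # metric logged in training_step
    save_top_k=3,          # keep only the 3 best checkpoints
    mode="min",
)

trainer = L.Trainer(
    accelerator="gpu",
    devices=8,
    strategy="ddp",
    callbacks=[checkpoint_callback],
    default_root_dir="/mnt/shared/checkpoints",
)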

Monitoring

TensorBoard

Lightning logs metrics to TensorBoard by default:

# On the training node
tensorboard --logdir /mnt/shared/checkpoints/lightning_logs --port 6006

# On your local machine
ssh -L 6006:localhost:6006 $hostname

Weights & Biases

from lightning.pytorch.loggers import WandbLogger

logger = WandbLogger(project="lightning-training", save_dir="/mnt/shared/wandb")
trainer = L.Trainer(logger=logger, ...)

System Monitoring

nvidia-smi    # GPU utilization
htop          # CPU and memory
df -h         # Disk usage

Troubleshooting

Issue | Cause | Resolution
CUDA Out of Memory | Batch size too large for GPU memory | Reduce batch_size or enable precision="16-mixed" in Trainer
NCCL Initialization Error | Conflicting GPU processes | Find orphaned processes with sudo fuser -v /dev/nvidia* and kill them
Disk Full | Checkpoints filling /mnt/shared | Delete old checkpoint files or adjust save_top_k in ModelCheckpoint
SSH Connection Failed | Port 22 not open in security group | Verify the attached security group allows inbound TCP on port 22

FAQ

Q: How do I enable mixed precision training?

Set precision="16-mixed" in the Trainer:

trainer = L.Trainer(accelerator="gpu", devices=8, precision="16-mixed")

Q: How do I use multiple GPUs on a single node?

Set devices=N where N is the number of available GPUs (e.g., devices=8), and strategy="ddp".
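
For example, on an 8-GPU node:

trainer = L.Trainer(accelerator="gpu", devices=8, strategy="ddp")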

Q: Where are checkpoints and logs stored?

By default, under default_root_dir (set to /mnt/shared/checkpoints in the example above) inside a lightning_logs/ subdirectory.