PyTorch Lightning
PyTorch Lightning is a high-level deep learning framework built on top of PyTorch.
It simplifies the training workflow, enabling easy scaling across multiple GPUs on a single node within the Training Cluster (TIR) environment.
Topology
Cluster Setup
- Environment: Ensure your node has compatible versions of CUDA, NCCL, and PyTorch Lightning installed.
- Communication: No inter-node communication setup is required since this is a single-node configuration.
Connect
Access your Training Cluster node via SSH:
ssh $hostname
Shared Storage
Use Shared File System (SFS) or Dataset storage accessible within your node for managing datasets, checkpoints, and logs.
Training Guide
Getting Started
- Install Dependencies
pip install pytorch-lightning
- Initialize Your Lightning Project
Create a LightningModule that defines:
- Model architecture
- Training, validation, and testing steps
- Optimizers and schedulers
- Configure Trainer for Single-Node, Multi-GPU Training
from pytorch_lightning import Trainer

trainer = Trainer(
    accelerator="gpu",
    devices=8,          # number of GPUs on your node
    strategy="ddp",     # Distributed Data Parallel strategy
    num_nodes=1,
    max_epochs=10,
)
trainer.fit(model, datamodule)
- Run the Training Script
Execute the script on your node. With strategy="ddp", Lightning launches one process per GPU on its own, so a plain invocation is enough:
python script.py
Alternatively, launch through torchrun; Lightning picks up the environment variables it sets:
python -m torch.distributed.run --nproc_per_node=8 script.py
Import/Export Data
- Import: Place your datasets in the Shared File System (SFS) or Dataset storage mounted on the node (e.g., /mnt/shared).
- Export: Save model checkpoints and logs to shared paths for persistence and easy access.
Training Metrics
You can integrate monitoring tools such as TensorBoard or Weights & Biases (W&B) for real-time insights.
TensorBoard
- Write logs to /mnt/shared/logs to keep all training metrics centralized.
- View them using:
tensorboard --logdir /mnt/shared/logs --port 6006
Access remotely:
ssh -L 6006:localhost:6006 $hostname
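A sketch of pointing the Trainer at the shared log directory via Lightning's TensorBoardLogger (the experiment name is a placeholder):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

# Logs land under /mnt/shared/logs/my_experiment/version_*/
logger = TensorBoardLogger(save_dir="/mnt/shared/logs", name="my_experiment")
# trainer = Trainer(logger=logger, accelerator="gpu", devices=8, strategy="ddp")
```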
Weights & Biases
- Enable offline logging by setting your W&B directory to /mnt/shared/wandb for consistent access across sessions.
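A sketch of offline W&B logging into the shared directory (the project name is a placeholder; runs can be uploaded later with `wandb sync`):

```python
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(
    project="tir-training",        # hypothetical project name
    save_dir="/mnt/shared/wandb",  # shared directory assumed above
    offline=True,                  # no network needed during training
)
# Pass it to the Trainer: Trainer(logger=wandb_logger, ...)
```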
Administration Guide
- User Management: Ensure all users accessing the Training Cluster have valid SSH keys and permissions for the node.
- Cluster Monitoring: Use tools like:
nvidia-smi # GPU monitoring
htop # CPU and process monitoring
df -h # Disk space usage
- Storage Management: Periodically clean up old logs and checkpoints from /mnt/shared to maintain free space.
Troubleshooting Guide
Common Issues and Solutions
1. CUDA Out of Memory
Cause: Batch size too large for available GPU memory.
Fix: Reduce the batch size or enable gradient accumulation.
2. NCCL Initialization Errors
Cause: Stale or conflicting processes still holding the GPUs.
Fix: Kill orphaned GPU processes, or reboot the node if that fails. List the holders first, then kill:
sudo fuser -v /dev/nvidia* # list processes using the GPUs
sudo fuser -k /dev/nvidia* # kill them (use with care)
3. Disk Space Errors
Cause: Checkpoints or logs filling up storage.
Fix: Delete files you no longer need (double-check the path first; deletion is irreversible):
rm -rf /mnt/shared/logs/*
4. SSH Connection Issues
Cause: Network connectivity problems or SSH configuration issues.
Fix: Verify network connectivity to the node and check your SSH keys and client configuration.
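As a sketch of the gradient-accumulation fix for issue 1 above, reusing the eight-GPU Trainer configuration from the training guide (values are illustrative):

```python
from pytorch_lightning import Trainer

# Accumulating gradients over 4 smaller batches keeps the effective
# batch size while cutting per-step GPU memory roughly 4x.
trainer = Trainer(
    accelerator="gpu",
    devices=8,
    strategy="ddp",
    max_epochs=10,
    accumulate_grad_batches=4,  # optimizer steps once every 4 batches
)
```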
FAQ
Q: How do I use multiple GPUs on a single node?
A: Set devices in the Trainer to the number of GPUs available (e.g., devices=8).
Q: What strategy should I use?
A: Use "ddp" for optimal single-node multi-GPU training performance.
Q: Where are logs and checkpoints stored?
A: They are saved under your configured shared path, typically /mnt/shared/checkpoints or /mnt/shared/logs.
Q: Can I monitor training progress remotely?
A: Yes, by setting up TensorBoard or W&B and tunneling ports over SSH.