---
title: PyTorch Lightning
---

**PyTorch Lightning** is a high-level framework built on top of PyTorch that removes boilerplate from distributed training. It handles device placement, gradient synchronization, and checkpointing automatically, letting you focus on the model rather than the training loop. On TIR, a Training Cluster node comes pre-configured so you can start training immediately.

---

## Environment

### Cluster Setup

* **Pre-installed:** CUDA, NCCL, PyTorch, and PyTorch Lightning are available in the TIR-provided image.
* **Single-node:** PyTorch Lightning deployments on TIR run across multiple GPUs on a single node.

### Connect to the Node

```bash
ssh $hostname
```

### Shared Storage

Use the shared directory for datasets, checkpoints, and logs so data persists after the deployment ends:

```bash
/mnt/shared
```

---

## Training Guide

### Step 1: Install Dependencies

The TIR-provided image includes PyTorch Lightning. To install it manually in a custom environment:

```bash
pip install lightning
```

---

### Step 2: Define a LightningModule

Create a `LightningModule` that encapsulates your model, loss, optimizer, and training logic:

```python
import torch
import torch.nn as nn
import lightning as L


class SimpleModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)
        self.criterion = nn.MSELoss()

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        inputs, targets = batch
        outputs = self(inputs)
        loss = self.criterion(outputs, targets)
        # Log the per-step loss and show it in the progress bar
        self.log("train_loss", loss, on_step=True, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
```

---

### Step 3: Configure the Trainer

Set `strategy="ddp"` and `devices` to the number of GPUs on your node:

```python
# Continues train.py from Step 2 (reuses torch, L, and SimpleModel)
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset
dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
dataloader = DataLoader(dataset, batch_size=32)

model = SimpleModel()

trainer = L.Trainer(
    accelerator="gpu",
    devices=8,  # Number of GPUs on the node
    strategy="ddp",
    num_nodes=1,
    max_epochs=10,
    default_root_dir="/mnt/shared/checkpoints",
)

trainer.fit(model, dataloader)
```

---

### Step 4: Launch Training

```bash
python train.py
```

With `strategy="ddp"`, Lightning launches one process per configured GPU itself, so a plain `python` invocation is enough and no `torchrun` wrapper is needed.

---

## Data Management

### Import Datasets

Store datasets under the shared directory:

```bash
/mnt/shared/datasets
```

### Save Checkpoints

Lightning saves checkpoints automatically under `default_root_dir`. To save manually:

```python
trainer.save_checkpoint("/mnt/shared/checkpoints/model_final.ckpt")
```

To resume from a checkpoint:

```python
trainer.fit(model, dataloader, ckpt_path="/mnt/shared/checkpoints/model_final.ckpt")
```

---

## Monitoring

### TensorBoard

Lightning logs metrics to TensorBoard by default:

```bash
# On the worker node
tensorboard --logdir /mnt/shared/checkpoints/lightning_logs --port 6006

# On your local machine
ssh -L 6006:localhost:6006 $hostname
```

### Weights & Biases

```python
import lightning as L
from lightning.pytorch.loggers import WandbLogger

logger = WandbLogger(project="lightning-training", save_dir="/mnt/shared/wandb")

# Pass the logger alongside your other Trainer arguments
trainer = L.Trainer(logger=logger)
```

### System Monitoring

```bash
nvidia-smi   # GPU utilization
htop         # CPU and memory
df -h        # Disk usage
```

---

## Troubleshooting

| Issue | Cause | Resolution |
|-------|-------|------------|
| **CUDA Out of Memory** | Batch size too large for GPU memory | Reduce `batch_size` or enable `precision="16-mixed"` in the Trainer |
| **NCCL Initialization Error** | Conflicting GPU processes | List processes holding the GPUs with `sudo fuser -v /dev/nvidia*`, then kill the orphaned ones |
| **Disk Full** | Checkpoints filling `/mnt/shared` | Delete old checkpoint files or adjust `save_top_k` in `ModelCheckpoint` |
| **SSH Connection Failed** | Port 22 not open in security group | Verify the attached security group allows inbound TCP on port 22 |

---

## FAQ

**Q: How do I enable mixed precision training?**

Set `precision="16-mixed"` in the Trainer:

```python
trainer = L.Trainer(accelerator="gpu", devices=8, precision="16-mixed")
```

**Q: How do I use multiple GPUs on a single node?**

Set `devices=N`, where N is the number of available GPUs (e.g., `devices=8`), and `strategy="ddp"`.

**Q: Where are checkpoints and logs stored?**

By default, under `default_root_dir` (set to `/mnt/shared/checkpoints` in the example above), inside a `lightning_logs/` subdirectory.

---
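
**Q: How can I clear out old checkpoint files when `/mnt/shared` fills up?**

The Troubleshooting table suggests deleting old checkpoint files. One way to do that is sketched below using only the Python standard library. The helper name `prune_checkpoints` and its `keep` and `dry_run` parameters are hypothetical and are not part of Lightning or TIR. The sketch also assumes that file modification time is a good enough proxy for "newest", which may not match the checkpoint with the best metric.

```python
from pathlib import Path


def prune_checkpoints(ckpt_dir, keep=3, dry_run=True):
    """Find .ckpt files older than the `keep` newest ones.

    Hypothetical helper: returns the stale paths and deletes them
    only when dry_run is False.
    """
    files = sorted(
        Path(ckpt_dir).rglob("*.ckpt"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    stale = files[keep:]
    for path in stale:
        if not dry_run:
            path.unlink()
    return stale


if __name__ == "__main__":
    # Directory taken from the examples in this guide.
    # The dry run only lists candidates and deletes nothing.
    for path in prune_checkpoints("/mnt/shared/checkpoints", keep=3, dry_run=True):
        print(f"would delete: {path}")
```

Run it with `dry_run=True` first and review the list before switching to `dry_run=False`. If you would rather keep the best checkpoints than the newest, the `save_top_k` option of `ModelCheckpoint` mentioned in the Troubleshooting table is likely the better fit.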