
PyTorch Lightning

PyTorch Lightning is a high-level deep learning framework built on top of PyTorch.
It simplifies the training workflow, enabling easy scaling across multiple GPUs on a single node within the Training Cluster (TIR) environment.

Topology

Cluster Setup

  • Environment: Ensure your node has compatible versions of CUDA, NCCL, and PyTorch Lightning installed.
  • Communication: No inter-node communication setup is required since this is a single-node configuration.

Connect

Access your Training Cluster node via SSH:

ssh $hostname

Shared Storage

Use Shared File System (SFS) or Dataset storage accessible within your node for managing datasets, checkpoints, and logs.


Training Guide

Getting Started

  1. Install Dependencies

pip install pytorch-lightning

  2. Initialize Your Lightning Project

Create a LightningModule that defines:

  • Model architecture
  • Training, validation, and testing steps
  • Optimizers and schedulers

  3. Configure Trainer for Single-Node, Multi-GPU Training

from pytorch_lightning import Trainer

trainer = Trainer(
    accelerator="gpu",
    devices=8,        # Number of GPUs on your node
    strategy="ddp",   # Distributed Data Parallel strategy
    num_nodes=1,
    max_epochs=10,
)
trainer.fit(model, datamodule)

  4. Run the Training Script

Execute the script on your node:

python -m torch.distributed.run --nproc_per_node=8 script.py

With strategy="ddp", Lightning can also launch the worker processes itself, so a plain python script.py works as well; use torch.distributed.run when you want explicit control over process launching.

Import/Export Data

  • Import: Place your datasets in the Shared File System (SFS) or Dataset storage mounted on the node (e.g., /mnt/shared).
  • Export: Save model checkpoints and logs to shared paths for persistence and easy access.

Training Metrics

You can integrate monitoring tools such as TensorBoard or Weights & Biases (W&B) for real-time insights.

TensorBoard

  • Write logs to /mnt/shared/logs to keep all training metrics centralized.
  • View them using:
tensorboard --logdir /mnt/shared/logs --port 6006

Access remotely:

ssh -L 6006:localhost:6006 $hostname

Weights & Biases

  • Enable offline logging by setting your W&B directory to /mnt/shared/wandb for consistent access across sessions.
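A configuration sketch for offline W&B logging; the project name is a placeholder, and offline runs can be uploaded later with wandb sync:

```python
from pytorch_lightning.loggers import WandbLogger

# Offline mode writes run data under the shared mount instead of
# uploading immediately; sync later with `wandb sync`.
wandb_logger = WandbLogger(
    project="my-project",        # hypothetical project name
    save_dir="/mnt/shared/wandb",
    offline=True,
)
# trainer = Trainer(..., logger=wandb_logger)
```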

Administration Guide

  1. User Management: Ensure all users accessing the Training Cluster have valid SSH keys and permissions for the node.

  2. Cluster Monitoring: Use tools like:

    nvidia-smi   # GPU monitoring
    htop         # CPU and process monitoring
    df -h        # Disk space usage
  3. Storage Management: Periodically clean up old logs and checkpoints from /mnt/shared to maintain free space.


Troubleshooting Guide

Common Issues and Solutions

1. CUDA Out of Memory

Cause: Batch size too large for available GPU memory. Fix: Reduce the batch size or enable gradient accumulation.
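In Lightning, gradient accumulation is a single Trainer argument. A configuration sketch building on the Trainer from the training guide: halve the per-GPU batch size in your DataLoader, then accumulate over 2 batches to keep the same effective batch size at roughly half the peak memory:

```python
from pytorch_lightning import Trainer

trainer = Trainer(
    accelerator="gpu",
    devices=8,
    strategy="ddp",
    accumulate_grad_batches=2,  # optimizer steps every 2 batches
    max_epochs=10,
)
```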

2. NCCL Initialization Errors

Cause: Conflicts with other processes using the GPUs. Fix: Identify and kill orphaned GPU processes, or reboot the node:

sudo fuser -v /dev/nvidia*   # list processes holding the GPU devices
sudo fuser -k /dev/nvidia*   # kill them (use with care)

3. Disk Space Errors

Cause: Checkpoints or logs filling up storage. Fix: Delete files you no longer need, for example old logs (note this removes everything under the directory):

rm -rf /mnt/shared/logs/*

4. SSH Connection Issues

Cause: Network connectivity problems or SSH configuration issues. Fix: Verify the node is reachable from your network, confirm your SSH key is authorized on the node, and run ssh -v $hostname to see where the connection fails.


FAQ

Q: How do I use multiple GPUs on a single node?

A: Set devices in the Trainer to the number of GPUs available (e.g., devices=8).

Q: What strategy should I use?

A: Use "ddp" for optimal single-node multi-GPU training performance.

Q: Where are logs and checkpoints stored?

A: They are saved under your configured shared path, typically /mnt/shared/checkpoints or /mnt/shared/logs.

Q: Can I monitor training progress remotely?

A: Yes, by setting up TensorBoard or W&B and tunneling ports over SSH.