PyTorch Lightning
PyTorch Lightning is a high-level deep learning framework built on top of PyTorch.
It simplifies the training workflow, enabling easy scaling across multiple GPUs on a single node within the Training Cluster (TIR) environment.
Topology
Cluster Setup
- Environment: Ensure your node has compatible versions of CUDA, NCCL, and PyTorch Lightning installed.
- Communication: No inter-node communication setup is required since this is a single-node configuration.
Connect
Access your Training Cluster node via SSH:
ssh $hostname
Shared Storage
Use Shared File System (SFS) or Dataset storage accessible within your node for managing datasets, checkpoints, and logs.
Training Guide
Getting Started
- Install Dependencies
pip install pytorch-lightning
- Initialize Your Lightning Project
Create a LightningModule that defines:
- Model architecture
- Training, validation, and testing steps
- Optimizers and schedulers
- Configure Trainer for Single-Node, Multi-GPU Training
from pytorch_lightning import Trainer

trainer = Trainer(
    accelerator="gpu",
    devices=8,          # number of GPUs on your node
    strategy="ddp",     # Distributed Data Parallel strategy
    num_nodes=1,
    max_epochs=10,
)
trainer.fit(model, datamodule)
- Run the Training Script
Execute the script on your node. With strategy="ddp", Lightning launches one process per GPU on its own, so a plain invocation is enough:
python script.py
Alternatively, launch through torchrun; Lightning picks up the environment variables it sets:
python -m torch.distributed.run --nproc_per_node=8 script.py
Import/Export Data
- Import: Place your datasets in the Shared File System (SFS) or Dataset storage mounted on the node (e.g., /mnt/shared).
- Export: Save model checkpoints and logs to shared paths for persistence and easy access.
Training Metrics
You can integrate monitoring tools such as TensorBoard or Weights & Biases (W&B) for real-time insights.
TensorBoard
- Write logs to /mnt/shared/logs to keep all training metrics centralized.
- View them using:
tensorboard --logdir /mnt/shared/logs --port 6006
Access remotely:
ssh -L 6006:localhost:6006 $hostname
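A sketch of pointing the Trainer at the shared log directory via Lightning's TensorBoardLogger (the experiment name is a placeholder):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

# Logs land under /mnt/shared/logs/my_experiment/version_*/
logger = TensorBoardLogger(save_dir="/mnt/shared/logs", name="my_experiment")
# trainer = Trainer(logger=logger, accelerator="gpu", devices=8, strategy="ddp")
```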
Weights & Biases
- Enable offline logging by setting your W&B directory to /mnt/shared/wandb for consistent access across sessions.
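A sketch of offline W&B logging into the shared directory (the project name is a placeholder; runs can be uploaded later with `wandb sync`):

```python
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(
    project="tir-training",        # hypothetical project name
    save_dir="/mnt/shared/wandb",  # shared directory assumed above
    offline=True,                  # no network needed during training
)
# Pass it to the Trainer: Trainer(logger=wandb_logger, ...)
```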
Administration Guide
- User Management: Ensure all users accessing the Training Cluster have valid SSH keys and permissions for the node.
- Cluster Monitoring: Use tools like:
nvidia-smi # GPU monitoring
htop # CPU and process monitoring
df -h # Disk space usage
- Storage Management: Periodically clean up old logs and checkpoints from /mnt/shared to maintain free space.
Troubleshooting Guide
Common Issues and Solutions
1. CUDA Out of Memory
Cause: Batch size too large for available GPU memory.
Fix: Reduce the batch size or enable gradient accumulation.
2. NCCL Initialization Errors
Cause: Stale or conflicting processes still holding the GPUs.
Fix: Kill orphaned GPU processes, or reboot the node if that fails. List the holders first, then kill:
sudo fuser -v /dev/nvidia* # list processes using the GPUs
sudo fuser -k /dev/nvidia* # kill them (use with care)
3. Disk Space Errors
Cause: Checkpoints or logs filling up storage.
Fix: Delete files you no longer need (double-check the path first; deletion is irreversible):
rm -rf /mnt/shared/logs/*
4. SSH Connection Issues
Cause: Network connectivity problems or SSH configuration issues.
Fix: Verify network connectivity to the node and check your SSH keys and client configuration.
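As a sketch of the gradient-accumulation fix for issue 1 above, reusing the eight-GPU Trainer configuration from the training guide (values are illustrative):

```python
from pytorch_lightning import Trainer

# Accumulating gradients over 4 smaller batches keeps the effective
# batch size while cutting per-step GPU memory roughly 4x.
trainer = Trainer(
    accelerator="gpu",
    devices=8,
    strategy="ddp",
    max_epochs=10,
    accumulate_grad_batches=4,  # optimizer steps once every 4 batches
)
```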
FAQ
Q: How do I use multiple GPUs on a single node?
A: Set devices in the Trainer to the number of GPUs available (e.g., devices=8).
Q: What strategy should I use?
A: Use "ddp" for optimal single-node multi-GPU training performance.
Q: Where are logs and checkpoints stored?
A: They are saved under your configured shared path, typically /mnt/shared/checkpoints or /mnt/shared/logs.
Q: Can I monitor training progress remotely?
A: Yes, by setting up TensorBoard or W&B and tunneling ports over SSH.