PyTorch

PyTorch is a deep learning framework designed for professional AI researchers and machine learning engineers. You can deploy a multi-node PyTorch setup on TIR within seconds.


Topology

Cluster Setup

  • Environment: Ensure identical software environments across all nodes, including the same CUDA, NCCL, and PyTorch versions (a quick version check is sketched below).
  • Communication: Passwordless SSH between nodes is required for seamless inter-node communication.
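
A quick way to confirm version consistency is to run the same check on every node and compare the output. This is a minimal sketch, not a TIR-specific tool:

import torch

# These versions must match across all nodes in the cluster
print("PyTorch:", torch.__version__)
print("CUDA:", torch.version.cuda)
print("NCCL:", torch.cuda.nccl.version())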

Connect

ssh $hostname

Shared Storage

  • Use NFS or Dataset storage that is accessible to all nodes in the cluster for shared data and model checkpoints; a checkpointing sketch follows.
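
A common pattern is to write checkpoints from rank 0 only, so a single copy lands on the shared mount and nodes do not race on the same file. The sketch below assumes the DDP-wrapped model from the training script further down; save_checkpoint and the checkpoint path are illustrative names:

import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, epoch, path="/mnt/shared/checkpoints/latest.pt"):
    # Only rank 0 writes the file
    if dist.get_rank() == 0:
        torch.save({
            "epoch": epoch,
            "model": model.module.state_dict(),  # unwrap the DDP wrapper
            "optimizer": optimizer.state_dict(),
        }, path)
    # Wait until the file exists before any other rank tries to load it
    dist.barrier()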

Training Guide

Getting Started

  1. Install the required dependencies:
pip install ....
  2. Initialize your project by creating a model class (a subclass of torch.nn.Module) and defining your training, validation, and test steps.

  3. Write a training script for distributed training. For example:

import os

import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize the process group (torch.distributed.run sets RANK, WORLD_SIZE, etc.)
torch.distributed.init_process_group("nccl")

# Pin this process to its own GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Define your model and wrap it in DistributedDataParallel
model = nn.Linear(10, 1).to(local_rank)
model = DDP(model, device_ids=[local_rank])

# Define optimizer and loss function
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# Training loop (random data used here for illustration)
for epoch in range(10):
    optimizer.zero_grad()
    inputs = torch.randn(32, 10).to(local_rank)
    targets = torch.randn(32, 1).to(local_rank)
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

# Clean up the process group
torch.distributed.destroy_process_group()
  4. Run the training script across all nodes:
python -m torch.distributed.run --nproc_per_node=8 --nnodes=2 --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR script.py
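
torch.distributed.run launches one process per GPU and exports RANK, LOCAL_RANK, and WORLD_SIZE to each of them; the script above reads LOCAL_RANK to pick its GPU. A minimal sketch for confirming the topology at startup:

import os
import torch

# Environment variables set by torch.distributed.run for every worker process
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])
print(f"rank={rank} local_rank={local_rank} world_size={world_size} "
      f"gpus_on_node={torch.cuda.device_count()}")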

Import/Export Data

  • Import: Place datasets in the shared storage system (/mnt/shared) or access preloaded datasets; a data-loading sketch follows this list.
  • Export: Save model checkpoints and logs to the shared file system for easy access across nodes.
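
When every node reads training data from the same shared mount, pair the DataLoader with a DistributedSampler so each rank sees a distinct shard of the dataset. A minimal sketch, assuming a hypothetical tensor file at /mnt/shared/data/train.pt that stores a (features, labels) tuple:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Load the dataset from shared storage (path and file format are illustrative)
features, labels = torch.load("/mnt/shared/data/train.pt")
dataset = TensorDataset(features, labels)

# DistributedSampler splits the dataset across all ranks in the process group
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(10):
    sampler.set_epoch(epoch)  # reshuffle shards each epoch
    for inputs, targets in loader:
        ...  # forward/backward as in the training loop above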

Training Metrics

You can integrate monitoring tools like TensorBoard or Weights & Biases for real-time insights into your training process.

TensorBoard/MLflow
  • Write logs to the shared file system (/mnt/shared) so metrics from all workers are collected in one place; a logging sketch follows the commands below.
  • Monitor training progress using:
ssh -L 6006:localhost:6006 $hostname
tensorboard --logdir /mnt/shared/logs
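
A minimal logging sketch for the setup above: rank 0 writes TensorBoard event files under /mnt/shared/logs so the dashboard picks them up from any node. The run subdirectory and the log_loss helper are illustrative:

import torch.distributed as dist
from torch.utils.tensorboard import SummaryWriter

# Only rank 0 creates a writer, so a single set of event files lands on the shared mount
writer = SummaryWriter(log_dir="/mnt/shared/logs/run1") if dist.get_rank() == 0 else None

def log_loss(value, step):
    # Call from the training loop; no-op on non-zero ranks
    if writer is not None:
        writer.add_scalar("train/loss", value, step)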
Weights & Biases
  • For offline mode, store metrics in /mnt/shared to access them from any node. This ensures seamless analysis and reproducibility.
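
A minimal sketch of offline logging to the shared mount, assuming the wandb package is installed (the project name and directory are illustrative):

import wandb

# Offline mode writes run data locally instead of syncing to wandb.ai;
# pointing dir at the shared mount makes the run readable from every node.
run = wandb.init(project="tir-multinode", mode="offline", dir="/mnt/shared/wandb")
run.log({"train/loss": 0.42})
run.finish()

Offline runs can later be uploaded with the wandb sync command, pointed at the run directory created under that path.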

Administration Guide

  1. User Management: Ensure all users have passwordless SSH set up and identical environments on all nodes.
  2. Cluster Health Monitoring: Use tools like nvidia-smi or htop to monitor GPU and CPU usage across nodes.
  3. Storage Management: Regularly check disk usage on /mnt/shared to ensure enough space for logs and checkpoints.
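
As a quick complement to nvidia-smi and htop, a small Python sketch (the mount path is taken from above) can report GPU memory and shared-disk usage from any node:

import shutil
import torch

# Free/total memory for each GPU visible on this node
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")

# Disk usage on the shared mount used for logs and checkpoints
usage = shutil.disk_usage("/mnt/shared")
print(f"/mnt/shared: {usage.free / 1e9:.1f} GB free of {usage.total / 1e9:.1f} GB")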

Troubleshooting Guide

Common Issues and Solutions

SSH Issues

  • Error: "Permission denied (publickey)."
  • Solution: Ensure passwordless SSH is correctly configured for all nodes.

NCCL Initialization Errors

  • Error: "NCCL connection timed out."
  • Solution: Verify that all nodes are in the same network and can communicate via their IPs.
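
Setting NCCL_DEBUG before the process group is initialized prints connection details that help pinpoint the failing node or interface. This is a general NCCL diagnostic, not TIR-specific; the interface name below is illustrative:

import os

# Must be set before torch.distributed.init_process_group("nccl")
os.environ["NCCL_DEBUG"] = "INFO"
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # pick the cluster's private interface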

Disk Space Errors

  • Error: "No space left on device."
  • Solution: Clean up old checkpoints and logs from /mnt/shared.

FAQ

Q: How do I ensure the same environment across nodes?

A: Use Docker containers or environment management tools like Conda to replicate the setup across all nodes.

Q: What strategy should I use for multi-node training?

A: Use the Distributed Data Parallel (DDP) strategy provided by PyTorch.

Q: How do I access logs stored on shared storage?

A: Mount the /mnt/shared file system on any node or access it directly via SSH.


With this guide, you’re ready to leverage the power of PyTorch for scalable and efficient deep learning on TIR!