OpenMPI
This guide explains how to set up and run distributed training with PyTorch on an OpenMPI cluster.
Topology
Cluster Setup
- Environment: Ensure all nodes have identical software environments, including PyTorch, OpenMPI, CUDA, and NCCL versions.
- Communication: Passwordless SSH is required for seamless communication between nodes.
Connect
ssh $hostname
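If passwordless SSH is not set up yet, a minimal sketch looks like this (key type and node names are examples; adapt them to your cluster):
# Generate a key pair on the launch node (skip if one already exists)
ssh-keygen -t ed25519
# Copy the public key to every node in the cluster
ssh-copy-id node1
ssh-copy-id node2
# Verify that login works without a password prompt
ssh node1 hostname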
Shared Storage
- Use NFS or another shared storage system that is accessible by all nodes in the cluster to store datasets, logs, and checkpoints.
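As a sketch, mounting an NFS export at /mnt/shared on each node could look like the following (server name and export path are placeholders):
# Create the mount point and mount the NFS export (run on every node)
sudo mkdir -p /mnt/shared
sudo mount -t nfs storage-server:/export/shared /mnt/shared
# Confirm the mount is visible
df -h /mnt/shared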
Training Guide
Getting Started
- Install the required dependencies:
pip install torch torchvision
- Write a training script for OpenMPI distributed training. For example:
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize the process group using the rank information provided by OpenMPI
os.environ["MASTER_ADDR"] = "node1"  # Replace with the master node's hostname or IP
os.environ["MASTER_PORT"] = "29500"
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
torch.distributed.init_process_group(backend="nccl", rank=rank, world_size=world_size)

# Pin each process to one GPU and define your model
device = torch.device(f"cuda:{local_rank}")
torch.cuda.set_device(device)
model = nn.Linear(10, 1).to(device)
model = DDP(model, device_ids=[local_rank])

# Define optimizer and loss function
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# Training loop on random data (replace with a real dataset)
for epoch in range(10):
    optimizer.zero_grad()
    inputs = torch.randn(32, 10).to(device)
    targets = torch.randn(32, 1).to(device)
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

torch.distributed.destroy_process_group()
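The loop above trains on random tensors. With a real dataset, each rank should read a distinct shard of the data; a minimal sketch using torch.utils.data.DistributedSampler follows (the TensorDataset is a stand-in for your data; model, optimizer, criterion, and device are reused from the script above):
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Stand-in dataset; replace with your own Dataset
dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
sampler = DistributedSampler(dataset)          # shards the data across ranks
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(10):
    sampler.set_epoch(epoch)                   # different shuffle each epoch
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()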
- Generate an MPI hostfile listing all the nodes in your cluster. Example:
# Save this as hostfile
node1 slots=8
node2 slots=8
- Launch the training script using mpirun:
mpirun --hostfile hostfile -np 16 python script.py
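The -np value should equal the total number of slots in the hostfile (2 nodes x 8 slots = 16 here). If your OpenMPI version supports it, --map-by can make the per-node placement explicit; a hedged example:
# Launch 16 processes, 8 per node (one per GPU)
mpirun --hostfile hostfile -np 16 --map-by ppr:8:node python script.py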
Import/Export Data
- Import: Place datasets in the shared storage system (/mnt/shared) or access preloaded datasets.
- Export: Save model checkpoints and logs to the shared file system for easy access across nodes, as sketched below.
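A common pattern, sketched here with a placeholder checkpoint path, is to let only rank 0 write checkpoints so workers do not overwrite each other (model is the DDP-wrapped model from the training script):
import os
import torch

checkpoint_dir = "/mnt/shared/checkpoints"  # placeholder path on shared storage
if torch.distributed.get_rank() == 0:
    os.makedirs(checkpoint_dir, exist_ok=True)
    # Unwrap DDP before saving so the checkpoint loads into a plain model
    torch.save(model.module.state_dict(), os.path.join(checkpoint_dir, "model.pt"))
torch.distributed.barrier()  # keep other ranks from racing ahead of the save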
Training Metrics
You can integrate monitoring tools like TensorBoard or Weights & Biases for real-time insights into your training process.
TensorBoard/MLflow
- Write logs to the shared file system (/mnt/shared) for combined metrics from all workers, as in the sketch below.
- Monitor training progress using:
ssh -L 6006:localhost:6006 $hostname
tensorboard --logdir /mnt/shared/logs
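A minimal TensorBoard logging sketch, assuming each rank writes to its own subdirectory under the shared log path and the process group from the training script is already initialized (the directory layout is an assumption, not a requirement):
import torch
from torch.utils.tensorboard import SummaryWriter

rank = torch.distributed.get_rank()
writer = SummaryWriter(log_dir=f"/mnt/shared/logs/rank{rank}")  # per-rank subdirectory
for step in range(100):
    # Use your real training loss here; a random value is a stand-in
    writer.add_scalar("train/loss", torch.rand(1).item(), step)
writer.close()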
Weights & Biases
- For offline mode, store metrics in /mnt/shared to access them from any node. This ensures seamless analysis and reproducibility.
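A hedged sketch of offline logging (project name and directory are placeholders; offline runs can be synced later with the wandb CLI):
import os
import wandb

os.environ["WANDB_MODE"] = "offline"           # do not contact the W&B servers
os.environ["WANDB_DIR"] = "/mnt/shared/wandb"  # keep run files on shared storage

run = wandb.init(project="openmpi-ddp-example")  # placeholder project name
for step in range(100):
    wandb.log({"train/loss": 0.0}, step=step)    # replace 0.0 with your loss value
run.finish()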
Administration Guide
- User Management: Ensure all users have passwordless SSH set up and identical environments on all nodes.
- Cluster Health Monitoring: Use tools like nvidia-smi or htop to monitor GPU and CPU usage across nodes.
- Storage Management: Regularly check disk usage on /mnt/shared to ensure enough space for logs and checkpoints.
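For example, quick checks of shared-storage usage could look like this (paths match the ones used elsewhere in this guide):
# Free space on the shared file system
df -h /mnt/shared
# Size of each top-level directory under /mnt/shared, smallest to largest
du -sh /mnt/shared/* | sort -h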
Troubleshooting Guide
Common Issues and Solutions
SSH Issues
- Error: "Permission denied (publickey)."
- Solution: Ensure passwordless SSH is correctly configured for all nodes.
MPI Communication Errors
- Error: "Connection refused."
- Solution: Ensure nodes can communicate over the specified port and that firewalls are configured to allow traffic.
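To test basic reachability of the rendezvous port, something like the following can help (node name and port match the examples above):
# From any worker node, check whether the master port on node1 is reachable
nc -zv node1 29500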
NCCL Initialization Errors
- Error: "NCCL connection timed out."
- Solution: Verify that all nodes are in the same network and have NCCL configured correctly.
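Enabling NCCL's debug output and pinning the network interface often helps diagnose timeouts; a hedged example (the interface name is cluster-specific):
mpirun --hostfile hostfile -np 16 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_SOCKET_IFNAME=eth0 \
    python script.py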
Disk Space Errors
- Error: "No space left on device."
- Solution: Clean up old checkpoints and logs from /mnt/shared.
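As a sketch, removing checkpoints older than a week might look like this (path and retention period are placeholders; review the list before deleting):
# List, then delete, checkpoint files older than 7 days
find /mnt/shared/checkpoints -name "*.pt" -mtime +7 -print
find /mnt/shared/checkpoints -name "*.pt" -mtime +7 -delete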
FAQ
Q: How do I ensure the same environment across nodes?
A: Use Docker containers or environment management tools like Conda to replicate the setup across all nodes.
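With Conda, one way to replicate an environment (the file location is an example) is:
# On a configured node, capture the environment
conda env export > /mnt/shared/environment.yml
# On every other node, recreate it
conda env create -f /mnt/shared/environment.yml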
Q: How do I configure OpenMPI for distributed training?
A: Install OpenMPI, create a hostfile with the cluster nodes, and use mpirun to launch the script.
Q: How do I access logs stored on shared storage?
A: Mount the /mnt/shared file system on any node or access it directly via SSH.
With this guide, you’re ready to use OpenMPI for scalable and efficient PyTorch distributed training on your cluster!