# OpenMPI Deployment

OpenMPI provides a flexible and efficient way to run distributed training across multiple nodes. This guide demonstrates how to configure and execute PyTorch distributed jobs using OpenMPI on TIR.
## Topology

### Cluster Setup

- **Environment:** Ensure that all nodes have identical environments, including matching versions of PyTorch, OpenMPI, CUDA, and NCCL.
- **Communication:** Configure passwordless SSH between all nodes for seamless communication.
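Before launching jobs, it can help to verify that every node actually accepts key-based logins. A minimal sketch — the `check_passwordless_ssh` helper and the host names are hypothetical, not part of TIR or OpenMPI:

```python
import subprocess


def check_passwordless_ssh(hosts, run=subprocess.run):
    """Return the hosts that do NOT accept key-based SSH logins.

    BatchMode=yes makes ssh exit with an error instead of prompting
    for a password, so a nonzero return code means passwordless SSH
    is not configured for that host.
    """
    failed = []
    for host in hosts:
        result = run(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
             host, "true"],
            capture_output=True,
        )
        if result.returncode != 0:
            failed.append(host)
    return failed
```

The `run` parameter exists only so the helper can be exercised without a live cluster; in practice call it with the default.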
### Connect to a Node

```bash
ssh $hostname
```
### Shared Storage

Use NFS or any shared dataset storage system accessible by all nodes, mounted at a common path such as `/mnt/shared`, to store:

- Training datasets
- Logs
- Model checkpoints
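A quick sanity check that the shared mount is actually usable from the current node can save a failed run later. A sketch, assuming the `/mnt/shared` mount point above (the helper name is hypothetical):

```python
import os


def shared_storage_ready(mount_point="/mnt/shared"):
    """True if the path exists, is a real mount point, and is writable.

    Catching a missing or read-only mount here is cheaper than
    debugging a mid-training checkpoint failure across many ranks.
    """
    return (
        os.path.isdir(mount_point)
        and os.path.ismount(mount_point)
        and os.access(mount_point, os.W_OK)
    )
```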
## Training Guide

### Getting Started

1. **Install Required Dependencies**

   Ensure that PyTorch and its dependencies are installed on all nodes:

   ```bash
   pip install torch torchvision
   ```
2. **Write a Training Script for OpenMPI**

   Save the following example as `train_mpi.py`. Note that under `mpirun`, PyTorch does not discover the rank and world size on its own, so the script maps OpenMPI's environment variables to the ones `init_process_group` expects:

   ```python
   import os

   import torch
   import torch.nn as nn
   import torch.optim as optim
   from torch.nn.parallel import DistributedDataParallel as DDP

   # Map OpenMPI's environment variables to the ones PyTorch expects
   rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
   world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
   local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
   os.environ["RANK"] = str(rank)
   os.environ["WORLD_SIZE"] = str(world_size)
   os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # Replace with master node IP
   os.environ.setdefault("MASTER_PORT", "29500")

   # Initialize the process group and pin this process to its GPU
   torch.distributed.init_process_group(backend="nccl")
   torch.cuda.set_device(local_rank)
   device = torch.device("cuda", local_rank)

   # Define the model and wrap it for distributed training
   model = nn.Linear(10, 1).to(device)
   model = DDP(model, device_ids=[local_rank])

   # Define optimizer and loss function
   optimizer = optim.SGD(model.parameters(), lr=0.01)
   criterion = nn.MSELoss()

   # Training loop
   for epoch in range(10):
       optimizer.zero_grad()
       inputs = torch.randn(32, 10, device=device)
       targets = torch.randn(32, 1, device=device)
       outputs = model(inputs)
       loss = criterion(outputs, targets)
       loss.backward()
       optimizer.step()
       if rank == 0:
           print(f"Epoch {epoch} | Loss: {loss.item():.4f}")

   torch.distributed.destroy_process_group()
   ```
3. **Generate an MPI Hostfile**

   Create a hostfile listing all cluster nodes:

   ```text
   # Save this as hostfile
   node1 slots=8
   node2 slots=8
   ```
4. **Launch the Training Script with `mpirun`**

   Run the job across all nodes using:

   ```bash
   mpirun --hostfile hostfile -np 16 python train_mpi.py
   ```

   Arguments:

   | Flag | Description |
   |---|---|
   | `--hostfile` | Path to the file listing cluster nodes |
   | `-np` | Total number of processes (e.g., 8 GPUs × 2 nodes = 16) |
   | `python train_mpi.py` | Training script to execute |
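The `-np` value should equal the total slots declared in the hostfile. A small sketch that derives one from the other so the two cannot drift apart (`total_slots` is a hypothetical helper, not part of OpenMPI):

```python
def total_slots(hostfile_text):
    """Sum the slots declared in an MPI hostfile.

    A line like 'node1 slots=8' contributes 8. A bare hostname
    contributes 1 here (OpenMPI's own default may differ by
    version). Comments and blank lines are ignored.
    """
    total = 0
    for line in hostfile_text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments
        if not line:
            continue
        slots = 1
        for token in line.split()[1:]:
            if token.startswith("slots="):
                slots = int(token.split("=", 1)[1])
        total += slots
    return total
```

Passing `total_slots(open("hostfile").read())` as the process count keeps the launch command consistent with the hostfile.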
## Data Management

### Import Data

Place datasets in the shared storage system so they are accessible from all nodes, for example under `/mnt/shared/datasets`.
### Export Checkpoints and Logs

Save trained models and logs to shared storage for easy access:

```python
torch.save(model.state_dict(), "/mnt/shared/checkpoints/model_epoch_10.pt")
```
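When many ranks share one filesystem, another node can pick up a half-written checkpoint. A minimal sketch of an atomic write, using a hypothetical `save_checkpoint_atomically` helper (serialize the state dict to bytes first, e.g. with `torch.save` into an `io.BytesIO`):

```python
import os
import tempfile


def save_checkpoint_atomically(state_bytes, path):
    """Write to a temp file in the same directory, then rename.

    os.replace is atomic on POSIX filesystems, so readers on other
    nodes see either the old checkpoint or the new one, never a
    partial file. Call this from rank 0 only.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(state_bytes)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```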
## Training Metrics

You can use TensorBoard or Weights & Biases (W&B) for training visualization and monitoring.

### TensorBoard

- Store logs in the shared directory `/mnt/shared/logs` for centralized tracking.
- To start TensorBoard:

  ```bash
  tensorboard --logdir /mnt/shared/logs
  ```

- To access it remotely, forward the port over SSH:

  ```bash
  ssh -L 6006:localhost:6006 $hostname
  ```
### Weights & Biases

For offline usage, log metrics locally under `/mnt/shared`:

```python
import wandb

# dir= points W&B's offline run files at the shared mount
wandb.init(project="pytorch-mpi", mode="offline", dir="/mnt/shared")
wandb.log({"loss": loss.item()})
```
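If W&B is not available at all, a plain JSON-lines file on the shared mount preserves the same metrics for later inspection. A hypothetical stand-in, not a W&B API:

```python
import json
import time


def log_metrics(path, step, **metrics):
    """Append one JSON object per call; any node can tail the file."""
    record = {"step": step, "time": time.time(), **metrics}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Called from rank 0 each epoch (e.g. `log_metrics("/mnt/shared/logs/run1.jsonl", epoch, loss=loss.item())`), the file can be plotted later with any tool.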
## Administration Guide

- **User Management:** Configure passwordless SSH and ensure identical environments for all users on every node.
- **Cluster Health Monitoring:** Use standard monitoring tools:

  ```bash
  nvidia-smi  # Monitor GPU utilization
  htop        # Monitor CPU and memory usage
  ```

- **Storage Management:** Regularly check and clean up `/mnt/shared` to prevent space exhaustion.
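Cleanup of old checkpoints and logs can be automated. A sketch that deletes files older than a cutoff (`prune_old_files` is a hypothetical helper; adjust the age to your retention policy):

```python
import os
import time


def prune_old_files(directory, max_age_days=14):
    """Delete regular files older than max_age_days; return removed paths."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(path)
    return removed
```

Run periodically (e.g. from cron on one node) against `/mnt/shared/checkpoints` and `/mnt/shared/logs`.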
## Troubleshooting Guide

### Common Issues
| Issue | Description | Solution |
|---|---|---|
| Permission denied (publickey) | SSH not configured correctly | Set up passwordless SSH for all nodes |
| MPI connection refused | Network port unavailable | Verify ports and firewall configurations |
| NCCL connection timed out | Node communication issue | Ensure all nodes are on the same network |
| No space left on device | Shared storage full | Delete old checkpoints or logs |
## FAQ

**How do I ensure the same environment across all nodes?**
Use containerized environments (e.g., Docker) or Conda to replicate configurations across all nodes.

**How do I configure OpenMPI for distributed training?**
Install OpenMPI, create a valid hostfile listing your nodes, and launch training using the `mpirun` command.

**How do I access logs stored on shared storage?**
Mount `/mnt/shared` on any node or access it directly via SSH.