
OpenMPI

This guide explains how to set up and run distributed training with PyTorch on an OpenMPI cluster.


Topology

Cluster Setup

  • Environment: Ensure all nodes have identical software environments, including PyTorch, OpenMPI, CUDA, and NCCL versions.
  • Communication: Passwordless SSH is required for seamless communication between nodes.

Connect

ssh $hostname

Shared Storage

  • Use NFS or another shared storage system accessible by all nodes in the cluster (mounted at /mnt/shared in the examples below) to store datasets, logs, and checkpoints.

Training Guide

Getting Started

  1. Install the required dependencies:
pip install torch torchvision
  2. Write a training script for OpenMPI distributed training. For example:
import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

# OpenMPI exposes the rank of each launched process through these environment variables
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

# Initialize the process group
os.environ["MASTER_ADDR"] = "127.0.0.1"  # Replace with the master node's IP
os.environ["MASTER_PORT"] = "29500"
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

# Pin each process to one GPU and wrap the model in DDP
torch.cuda.set_device(local_rank)
model = nn.Linear(10, 1).to(local_rank)
model = DDP(model, device_ids=[local_rank])

# Define optimizer and loss function
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# Training loop (random data, for illustration)
for epoch in range(10):
    optimizer.zero_grad()
    inputs = torch.randn(32, 10).to(local_rank)
    targets = torch.randn(32, 1).to(local_rank)
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

dist.destroy_process_group()
  3. Generate an MPI hostfile listing all the nodes in your cluster and the number of processes (slots) each can run. Example:
# Save this as hostfile
node1 slots=8
node2 slots=8
  4. Launch the training script with mpirun, setting -np to the total number of slots in the hostfile (2 × 8 = 16 here):
mpirun --hostfile hostfile -np 16 python script.py
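
Before starting a long run, it can help to confirm that every rank can reach the others. The sketch below (a hypothetical check_cluster.py, using the same OpenMPI environment variables and MASTER_ADDR placeholder as the training script) performs a single all_reduce and prints the result on each rank; launch it with the same mpirun command, substituting it for script.py.

# check_cluster.py -- minimal connectivity check (illustrative sketch)
import os
import torch
import torch.distributed as dist

rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # Replace with the master node's IP
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)

# Each rank contributes its rank number; the sum should be 0 + 1 + ... + (world_size - 1)
t = torch.tensor([float(rank)], device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}/{world_size}: all_reduce sum = {t.item()}")

dist.destroy_process_group()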

Import/Export Data

  • Import: Place datasets in the shared storage system (/mnt/shared) or access preloaded datasets.
  • Export: Save model checkpoints and logs to the shared file system for easy access across nodes.
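
A common pattern is to write checkpoints from rank 0 only, so workers on different nodes do not overwrite each other on the shared mount. The sketch below assumes the shared file system is mounted at /mnt/shared and that model and optimizer are defined as in the training script above.

import os
import torch
import torch.distributed as dist

CKPT_DIR = "/mnt/shared/checkpoints"  # assumed shared path

def save_checkpoint(model, optimizer, epoch):
    # Only rank 0 writes; the barrier keeps other ranks from racing ahead before the file exists
    if dist.get_rank() == 0:
        os.makedirs(CKPT_DIR, exist_ok=True)
        torch.save(
            {
                "epoch": epoch,
                "model_state_dict": model.module.state_dict(),  # unwrap the DDP wrapper
                "optimizer_state_dict": optimizer.state_dict(),
            },
            os.path.join(CKPT_DIR, f"ckpt_epoch_{epoch}.pt"),
        )
    dist.barrier()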

Training Metrics

You can integrate monitoring tools like TensorBoard or Weights & Biases for real-time insights into your training process.

TensorBoard/MLflow

  • Write logs to the shared file system (/mnt/shared) so metrics from all workers can be viewed together.
  • Monitor training progress using:
ssh -L 6006:localhost:6006 $hostname
tensorboard --logdir /mnt/shared/logs
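
For example, each rank can write its own event files under a per-rank subdirectory of the shared log directory. A minimal sketch, using the same /mnt/shared/logs path passed to tensorboard --logdir above:

import torch.distributed as dist
from torch.utils.tensorboard import SummaryWriter

# One writer per rank (after init_process_group, as in the training script), all under the shared log directory
rank = dist.get_rank()
writer = SummaryWriter(log_dir=f"/mnt/shared/logs/rank_{rank}")

# Inside the training loop:
# writer.add_scalar("loss/train", loss.item(), global_step=epoch)

writer.close()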

Weights & Biases

  • For offline mode, store run data in /mnt/shared so metrics can be accessed from any node and kept for later analysis and reproducibility.
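
A minimal sketch of offline initialization, assuming the run data lives on the shared mount (the project name is a placeholder):

import wandb

# Offline mode: metrics are written to local files instead of being streamed to the W&B servers
run = wandb.init(
    project="openmpi-ddp-example",  # placeholder project name
    mode="offline",
    dir="/mnt/shared/wandb",        # assumed shared path
)

# Inside the training loop:
# run.log({"loss": loss.item(), "epoch": epoch})

run.finish()

The offline run directories under /mnt/shared/wandb can later be uploaded with wandb sync from any node that has internet access.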

Administration Guide

  1. User Management: Ensure all users have passwordless SSH set up and identical environments on all nodes.
  2. Cluster Health Monitoring: Use tools like nvidia-smi or htop to monitor GPU and CPU usage across nodes.
  3. Storage Management: Regularly check disk usage on /mnt/shared to ensure enough space for logs and checkpoints.
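
As a lightweight check for the storage point above, a helper like the following sketch (standard library only, assuming the mount point /mnt/shared) can be run at the start of a job to warn when free space is low:

import shutil

def check_shared_storage(path="/mnt/shared", min_free_gb=50):
    # shutil.disk_usage reports total, used, and free bytes for the file system containing `path`
    usage = shutil.disk_usage(path)
    free_gb = usage.free / (1024 ** 3)
    if free_gb < min_free_gb:
        print(f"WARNING: only {free_gb:.1f} GiB free on {path}")
    return free_gb

check_shared_storage()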

Troubleshooting Guide

Common Issues and Solutions

SSH Issues

  • Error: "Permission denied (publickey)."
  • Solution: Ensure passwordless SSH is correctly configured for all nodes.

MPI Communication Errors

  • Error: "Connection refused."
  • Solution: Ensure nodes can communicate over the specified port and that firewalls are configured to allow traffic.

NCCL Initialization Errors

  • Error: "NCCL connection timed out."
  • Solution: Verify that all nodes are in the same network and have NCCL configured correctly.
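
When diagnosing these timeouts, it often helps to turn on NCCL's logging and pin it to the network interface that connects the nodes. A sketch, set before init_process_group in the training script (eth0 is a placeholder; these variables can also be exported to all ranks with mpirun -x):

import os

# Print NCCL initialization and communication details to stdout
os.environ["NCCL_DEBUG"] = "INFO"
# Force NCCL onto a specific network interface; "eth0" is a placeholder for your cluster's interconnect
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"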

Disk Space Errors

  • Error: "No space left on device."
  • Solution: Clean up old checkpoints and logs from /mnt/shared.
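
A small pruning sketch that keeps only the most recent checkpoints (assuming the ckpt_epoch_N.pt naming used in the checkpointing example above):

import glob
import os

def prune_checkpoints(ckpt_dir="/mnt/shared/checkpoints", keep=3):
    # Sort checkpoints newest first by modification time and delete the rest
    ckpts = sorted(
        glob.glob(os.path.join(ckpt_dir, "ckpt_epoch_*.pt")),
        key=os.path.getmtime,
        reverse=True,
    )
    for old in ckpts[keep:]:
        os.remove(old)
        print(f"Removed old checkpoint: {old}")

prune_checkpoints()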

FAQ

Q: How do I ensure the same environment across nodes?

A: Use Docker containers or environment management tools like Conda to replicate the setup across all nodes.

Q: How do I configure OpenMPI for distributed training?

A: Install OpenMPI, create a hostfile with the cluster nodes, and use mpirun to launch the script.

Q: How do I access logs stored on shared storage?

A: Mount the /mnt/shared file system on any node or access it directly via SSH.


With this guide, you’re ready to use OpenMPI for scalable and efficient PyTorch distributed training on your cluster!