OpenMPI Deployment

OpenMPI provides a flexible and efficient method to run distributed training across multiple nodes.
This guide demonstrates how to configure and execute PyTorch distributed jobs using OpenMPI on TIR.


Topology

Cluster Setup

  • Environment:
    Ensure that all nodes have identical environments, including matching versions of PyTorch, OpenMPI, CUDA, and NCCL.

  • Communication:
    Configure passwordless SSH between all nodes for seamless communication.
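To verify that environments match, a small script run on every node can report installed package versions for comparison. This is a minimal sketch covering the packages this guide installs; extend the tuple with any others you depend on:

```python
# Print installed versions of the packages this guide depends on;
# run this on every node and compare the output.
from importlib import metadata

for pkg in ("torch", "torchvision"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```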

Connect to a Node

ssh $hostname

Shared Storage

Use NFS or any shared dataset storage system accessible by all nodes to store:

  • Training datasets
  • Logs
  • Model checkpoints
/mnt/shared
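The layout above can be created programmatically. In this sketch a temporary directory stands in for /mnt/shared so the snippet runs anywhere; on the cluster you would point `base` at the actual mount:

```python
import os
import tempfile

# A temporary directory stands in for the shared mount (/mnt/shared on the cluster)
base = tempfile.mkdtemp()

# Create the subdirectories the guide expects on shared storage
for sub in ("datasets", "logs", "checkpoints"):
    os.makedirs(os.path.join(base, sub), exist_ok=True)

print(sorted(os.listdir(base)))  # ['checkpoints', 'datasets', 'logs']
```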

Training Guide

Getting Started

  1. Install Required Dependencies

    Ensure that PyTorch and its dependencies are installed on all nodes:

    pip install torch torchvision
  2. Write a Training Script for OpenMPI

    Save the following example as train_mpi.py:

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    import torch.optim as optim
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Read the rank information that OpenMPI exports to each launched process
    rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
    local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

    # Initialize the process group
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # Replace with the master node's IP
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    # Pin this process to its local GPU
    torch.cuda.set_device(local_rank)

    # Define the model
    model = nn.Linear(10, 1).to("cuda")
    model = DDP(model, device_ids=[local_rank])

    # Define optimizer and loss function
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.MSELoss()

    # Training loop
    for epoch in range(10):
        optimizer.zero_grad()
        inputs = torch.randn(32, 10).to("cuda")
        targets = torch.randn(32, 1).to("cuda")
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        if dist.get_rank() == 0:
            print(f"Epoch {epoch} | Loss: {loss.item():.4f}")

    dist.destroy_process_group()
  3. Generate an MPI Hostfile

    Create a hostfile listing all cluster nodes:

    # Save this as hostfile
    node1 slots=8
    node2 slots=8
  4. Launch the Training Script with mpirun

    Run the job across all nodes using:

    mpirun --hostfile hostfile -np 16 python train_mpi.py

    Arguments:

    Flag                  Description
    --hostfile            Path to the file listing cluster nodes
    -np                   Total number of processes (e.g., 8 GPUs × 2 nodes = 16)
    python train_mpi.py   Training script to execute
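The `-np` count corresponds to one process per GPU. OpenMPI tells each launched process who it is through environment variables, which is how train_mpi.py derives its rank. A minimal sketch, with values simulating one process of the 16-process job above:

```python
import os

# Simulate the variables OpenMPI sets for one launched process
# (illustrative values: process 9 of 16, second GPU on its node)
os.environ["OMPI_COMM_WORLD_RANK"] = "9"
os.environ["OMPI_COMM_WORLD_SIZE"] = "16"
os.environ["OMPI_COMM_WORLD_LOCAL_RANK"] = "1"

rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

print(f"global rank {rank}/{world_size}, local GPU index {local_rank}")
```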

Data Management

Import Data

Place datasets in the shared storage system so they are accessible from all nodes:

/mnt/shared/datasets

Export Checkpoints and Logs

Save trained models and logs to shared storage for easy access:

torch.save(model.state_dict(), "/mnt/shared/checkpoints/model_epoch_10.pt")
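In a multi-process job, only rank 0 should write the checkpoint so that all ranks do not write the same file concurrently. A minimal sketch of the save/reload round trip, using /tmp here in place of /mnt/shared/checkpoints:

```python
import torch
import torch.nn as nn

# Build the same toy model as in train_mpi.py
model = nn.Linear(10, 1)
ckpt_path = "/tmp/model_epoch_10.pt"  # stands in for the shared checkpoints dir

# In a real DDP job, guard this with: if dist.get_rank() == 0
torch.save(model.state_dict(), ckpt_path)

# Load on CPU first; each rank can then move the weights to its own GPU
state = torch.load(ckpt_path, map_location="cpu")
restored = nn.Linear(10, 1)
restored.load_state_dict(state)
print(sorted(state.keys()))  # ['bias', 'weight']
```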

Training Metrics

You can use TensorBoard or Weights & Biases (W&B) for training visualization and monitoring.

TensorBoard

  • Store logs in the shared directory /mnt/shared/logs for centralized tracking.

  • To start TensorBoard:

    tensorboard --logdir /mnt/shared/logs
  • To access it remotely:

    ssh -L 6006:localhost:6006 $hostname

Weights & Biases

  • For offline usage, log metrics locally under /mnt/shared:

    import wandb
    wandb.init(project="pytorch-mpi", mode="offline", dir="/mnt/shared/wandb")
    wandb.log({"loss": loss.item()})

Administration Guide

  1. User Management: Configure passwordless SSH and ensure identical environments for all users on every node.

  2. Cluster Health Monitoring: Use standard monitoring tools:

    nvidia-smi    # Monitor GPU utilization
    htop          # Monitor CPU and memory usage
  3. Storage Management: Regularly check and clean up /mnt/shared to prevent space exhaustion.


Troubleshooting Guide

Common Issues

Issue                          Description                     Solution
Permission denied (publickey)  SSH not configured correctly    Set up passwordless SSH for all nodes
MPI connection refused         Network port unavailable        Verify ports and firewall configurations
NCCL connection timed out      Node communication issue        Ensure all nodes are on the same network
No space left on device        Shared storage full             Delete old checkpoints or logs

FAQ

How do I ensure the same environment across all nodes?

Use containerized environments (e.g., Docker) or Conda to replicate configurations across all nodes.

How do I configure OpenMPI for distributed training?

Install OpenMPI, create a valid hostfile listing your nodes, and launch training using the mpirun command.

How do I access logs stored on shared storage?

Mount /mnt/shared on any node or access it directly via SSH.