Overview

What is a Training Cluster?

A Training Cluster is a dedicated GPU compute environment for running distributed AI/ML training workloads on E2E AI Cloud. You provision a cluster with a fixed plan — specifying the number of nodes, GPU type, CPU, and RAM — and run Slurm jobs on it using any supported container image.

Billing is fixed per cluster plan regardless of resource utilization. All jobs running on the cluster are included at no additional compute charge.


How It Works

You create a cluster in a single setup flow — selecting the image, plan, node count, SSH access, and storage together. Once the cluster is running, you connect via SSH and submit training jobs using Slurm. The TIR dashboard gives you full visibility into job status, node health, and GPU metrics without needing to SSH in for monitoring.
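For example, once connected over SSH, a multi-node job can be submitted with a standard Slurm batch script. The sketch below is illustrative, not an E2E-specific default: the script name, GPU count, and the torchrun rendezvous setup are assumptions you would adapt to your plan and training code.

```shell
#!/bin/bash
# train.sbatch -- minimal multi-node Slurm batch script (illustrative values)
#SBATCH --job-name=llm-train
#SBATCH --nodes=2               # number of cluster nodes to use
#SBATCH --ntasks-per-node=1     # one launcher task per node
#SBATCH --gpus-per-node=8      # adjust to your plan's GPU count
#SBATCH --output=%x-%j.out      # log file named after job name and job id

# Pick the first allocated node as the rendezvous host for torchrun
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)

# srun starts one launcher per node; torchrun spawns one process per GPU
srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${head_node}:29500" \
    train.py
```

Submit it with `sbatch train.sbatch` and check progress with `squeue`, or from the TIR dashboard without SSH.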

| Concept | Description |
| --- | --- |
| Cluster | A fixed pool of GPU nodes scheduled by Slurm, provisioned at a predictable hourly rate |
| Node | An individual compute unit within the cluster, running your container image |
| Job | A Slurm workload submitted to the cluster — PyTorch, MPI, shell scripts, or any containerized process |

Who Should Use a Training Cluster?

Training Clusters are best suited for teams that:

  • Run large-scale distributed model training workloads
  • Need multi-GPU or multi-node training environments
  • Want predictable, fixed-rate billing for training infrastructure
  • Need to run custom container environments with the full power of Slurm scheduling — job queuing, priority management, and multi-node coordination — via Enroot
  • Require shared storage accessible across all nodes
  • Need elastic node scaling without disrupting running jobs
  • Require fault-isolated, auto-recovering infrastructure
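The custom-container workflow above can be sketched with the Pyxis plugin's `srun` flags, which use Enroot to import and run Docker/OCI images. The image tag and mount paths here are illustrative assumptions, not cluster defaults:

```shell
# Run a containerized job through the Pyxis Slurm plugin (Enroot under the hood).
# nvcr.io#nvidia/pytorch:24.05-py3 is an example image; /shared/datasets is an
# assumed shared-storage mount point -- substitute your own.
srun --nodes=2 --gpus-per-node=8 \
     --container-image=nvcr.io#nvidia/pytorch:24.05-py3 \
     --container-mounts=/shared/datasets:/data \
     python /data/train.py
```

Because the image is pulled and unpacked by Enroot on each node, no Docker daemon is needed on the cluster.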

Key Benefits

  • Fixed Pricing — Billed per cluster plan, not per GPU hour or utilization
  • No Per-Job Charges — All Slurm jobs run on the cluster at no extra cost
  • Slurm-Native Scheduling — All workloads scheduled through Slurm via Slinky and Pyxis
  • Flexible Container Support — Ubuntu Slurm pre-built images or any Docker/OCI container via Enroot
  • Ubuntu Slurm Images — Pre-built images that bring up a fully configured multi-node Slurm environment instantly, with GPU drivers, CUDA, and NCCL ready out of the box — no manual cluster setup required
  • Elastic Scaling — Scale nodes up on a live cluster without teardown or workload disruption
  • High Availability — Node failures are isolated; the cluster recovers automatically
  • Jobs Dashboard — Full squeue visibility in the TIR dashboard — no SSH required for monitoring
  • Shared Storage — Datasets, SFS, and PFS mounted across all cluster nodes
  • DCGM Node Metrics — Per-node GPU health monitoring with XID error detection
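Alongside the dashboard, the same job and GPU-health information is available from the command line. This assumes standard Slurm and DCGM tooling on the nodes, which the pre-built images are described as providing:

```shell
squeue --me                         # your queued and running jobs
sinfo -N -l                         # per-node state across the cluster

# DCGM checks run on a compute node (srun targets one node of the cluster)
srun --nodes=1 dcgmi discovery -l   # list the GPUs DCGM can see
srun --nodes=1 dcgmi diag -r 1      # quick GPU health diagnostic
```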

Training Cluster vs On-Demand Instance

| Feature | Training Cluster | On-Demand Instance |
| --- | --- | --- |
| Billing model | Fixed per cluster plan | Per-hour while running |
| GPU availability | Reserved for the cluster | Subject to inventory |
| Multi-node training | Yes — Slurm across multiple nodes | Single node only |
| Job scheduling | Slurm-native with dashboard visibility | Manual |
| Container images | Ubuntu Slurm images + any Docker/OCI via Enroot | Any (custom setup) |
| Shared storage | Datasets, SFS, PFS across all nodes | Attached per instance |
| Elastic scaling | Live node scale-up, no teardown | No |
| High availability | Node failures isolated, auto-resume | No |
| Best for | Distributed training at scale | Experimentation, single-node jobs |