Overview
What is a Training Cluster?
A Training Cluster is a dedicated GPU compute environment for running distributed AI/ML training workloads on E2E AI Cloud. You provision a cluster with a fixed plan — specifying the number of nodes, GPU type, CPU, and RAM — and run Slurm jobs on it using any supported container image.
Billing is fixed per cluster plan regardless of resource utilization. All jobs running on the cluster are included at no additional compute charge.
How It Works
You create a cluster in a single setup flow — selecting the image, plan, node count, SSH access, and storage together. Once the cluster is running, you connect via SSH and submit training jobs using Slurm. The TIR dashboard gives you full visibility into job status, node health, and GPU metrics without needing to SSH in for monitoring.
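As a minimal sketch of that submission flow, the sbatch script below assumes a two-node allocation with eight GPUs per node and a hypothetical PyTorch entry point `train.py` on shared storage; the exact directives will depend on your cluster plan.

```bash
#!/bin/bash
# Hypothetical multi-node submission script (train.sbatch) — a sketch, not a
# fixed template; node count, GPU count, and paths are assumptions.
#SBATCH --job-name=llm-pretrain
#SBATCH --nodes=2                 # spread the job across two cluster nodes
#SBATCH --ntasks-per-node=1       # one launcher process per node
#SBATCH --gres=gpu:8              # request all eight GPUs on each node
#SBATCH --output=%x-%j.out        # log file named after job name and job ID

# Launch a distributed PyTorch run; torchrun reads the node layout from Slurm.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1):29500" \
  train.py --data /shared/datasets/my-corpus
```

Submitting it with `sbatch train.sbatch` and then watching the job in the TIR Jobs dashboard (or with `squeue`) is the whole loop.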
| Concept | Description |
|---|---|
| Cluster | A fixed pool of GPU nodes scheduled by Slurm, provisioned at a predictable hourly rate |
| Node | An individual compute unit within the cluster, running your container image |
| Job | A Slurm workload submitted to the cluster — PyTorch, MPI, shell scripts, or any containerized process |
Who Should Use a Training Cluster?
Training Clusters are best suited for teams that:
- Run large-scale distributed model training workloads
- Need multi-GPU or multi-node training environments
- Want predictable, fixed-rate billing for training infrastructure
- Need to run custom container environments via Enroot with the full power of Slurm scheduling — job queuing, priority management, and multi-node coordination (see the container example after this list)
- Require shared storage accessible across all nodes
- Need elastic node scaling without disrupting running jobs
- Require fault-isolated, auto-recovering infrastructure
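One way the Enroot-based container workflow referenced above can look in practice is the `srun` call below. It assumes the Pyxis Slurm plugin is enabled on the cluster (which provides the `--container-*` flags) and uses placeholder image and mount paths.

```bash
# Run a containerized job across two nodes; --container-image and
# --container-mounts come from the Pyxis plugin, which pulls and unpacks the
# image through Enroot on each allocated node. Image tag and paths are examples.
srun --nodes=2 --gres=gpu:8 \
     --container-image=nvcr.io/nvidia/pytorch:24.05-py3 \
     --container-mounts=/shared/datasets:/data \
     python train.py --data /data/my-corpus
```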
Key Benefits
- Fixed Pricing — Billed per cluster plan, not per GPU hour or utilization
- No Per-Job Charges — All Slurm jobs run on the cluster at no extra cost
- Slurm-Native Scheduling — All workloads scheduled through Slurm via Slinky and Pyxis
- Flexible Container Support — Ubuntu Slurm pre-built images or any Docker/OCI container via Enroot
- Ubuntu Slurm Images — Pre-built images that bring up a fully configured multi-node Slurm environment instantly, with GPU drivers, CUDA, and NCCL ready out of the box — no manual cluster setup required
- Elastic Scaling — Scale nodes up on a live cluster without teardown or workload disruption
- High Availability — Node failures are isolated; the cluster recovers automatically
- Jobs Dashboard — Full squeue visibility in the TIR dashboard — no SSH required for monitoring
- Shared Storage — Datasets, SFS, and PFS mounted across all cluster nodes
- DCGM Node Metrics — Per-node GPU health monitoring with XID error detection
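For reference, the node and job state that the dashboard surfaces corresponds to standard Slurm client commands, so an equivalent check over SSH might look like the following (the job ID is a placeholder):

```bash
# Cluster and job state from a login node — the same information the
# Jobs dashboard shows without SSH.
sinfo -N -l                      # per-node state and partition availability
squeue -u "$USER" -l             # your queued and running jobs
sacct -j <job_id> --format=JobID,State,Elapsed,MaxRSS   # accounting for a completed job
```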
Training Cluster vs On-Demand Instance
| Feature | Training Cluster | On-Demand Instance |
|---|---|---|
| Billing model | Fixed per cluster plan | Per-hour while running |
| GPU availability | Reserved for the cluster | Subject to inventory |
| Multi-node training | Yes — Slurm across multiple nodes | Single node only |
| Job scheduling | Slurm-native with dashboard visibility | Manual |
| Container images | Ubuntu Slurm images + any Docker/OCI via Enroot | Any (custom setup) |
| Shared storage | Datasets, SFS, PFS across all nodes | Attached per instance |
| Elastic scaling | Live node scale-up, no teardown | No |
| High availability | Node failures isolated, auto-resume | No |
| Best for | Distributed training at scale | Experimentation, single-node jobs |