Overview
What is a Training Cluster?
A Training Cluster is a dedicated GPU compute environment for running distributed AI/ML training workloads on E2E AI Cloud. You provision a cluster with a fixed plan — specifying the number of nodes, GPU type, CPU, and RAM — and run Slurm jobs on it using any supported container image.
Billing is fixed per cluster plan regardless of resource utilization. All jobs running on the cluster are included at no additional compute charge.
How It Works
You create a cluster in a single setup flow — selecting the image, plan, node count, SSH access, and storage together. Once the cluster is running, you connect via SSH and submit training jobs using Slurm. The TIR dashboard gives you full visibility into job status, node health, and GPU metrics without needing to SSH in for monitoring.
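As a minimal sketch of that submission flow, the sbatch script below assumes a two-node allocation with eight GPUs per node and a hypothetical PyTorch entry point `train.py` on shared storage; the exact directives will depend on your cluster plan.

```bash
#!/bin/bash
# Hypothetical multi-node submission script (train.sbatch) — a sketch, not a
# fixed template; node count, GPU count, and paths are assumptions.
#SBATCH --job-name=llm-pretrain
#SBATCH --nodes=2                 # spread the job across two cluster nodes
#SBATCH --ntasks-per-node=1       # one launcher process per node
#SBATCH --gres=gpu:8              # request all eight GPUs on each node
#SBATCH --output=%x-%j.out        # log file named after job name and job ID

# Launch a distributed PyTorch run; torchrun reads the node layout from Slurm.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1):29500" \
  train.py --data /shared/datasets/my-corpus
```

Submitting it with `sbatch train.sbatch` and then watching the job in the TIR Jobs dashboard (or with `squeue`) is the whole loop.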
| Concept | Description |
|---|---|
| Cluster | A fixed pool of GPU nodes scheduled by Slurm, provisioned at a predictable hourly rate |
| Node | An individual compute unit within the cluster, running your container image |
| Job | A Slurm workload submitted to the cluster — PyTorch, MPI, shell scripts, or any containerized process |
Who Should Use a Training Cluster?
Training Clusters are best suited for teams that:
- Run large-scale distributed model training workloads
- Need multi-GPU or multi-node training environments
- Want predictable, fixed-rate billing for training infrastructure
- Need to run custom container environments via Enroot with the full power of Slurm scheduling — job queuing, priority management, and multi-node coordination (see the container example after this list)
- Require shared storage accessible across all nodes
- Need elastic node scaling without disrupting running jobs
- Require fault-isolated, auto-recovering infrastructure
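One way the Enroot-based container workflow referenced above can look in practice is the `srun` call below. It assumes the Pyxis Slurm plugin is enabled on the cluster (which provides the `--container-*` flags) and uses placeholder image and mount paths.

```bash
# Run a containerized job across two nodes; --container-image and
# --container-mounts come from the Pyxis plugin, which pulls and unpacks the
# image through Enroot on each allocated node. Image tag and paths are examples.
srun --nodes=2 --gres=gpu:8 \
     --container-image=nvcr.io/nvidia/pytorch:24.05-py3 \
     --container-mounts=/shared/datasets:/data \
     python train.py --data /data/my-corpus
```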
Key Benefits
- Fixed Pricing — Billed per cluster plan, not per GPU hour or utilization
- No Per-Job Charges — All Slurm jobs run on the cluster at no extra cost
- Slurm-Native Scheduling — All workloads scheduled through Slurm via Slinky and Pyxis
- Flexible Container Support — Ubuntu Slurm pre-built images or any Docker/OCI container via Enroot
- Ubuntu Slurm Images — Pre-built images that bring up a fully configured multi-node Slurm environment instantly, with GPU drivers, CUDA, and NCCL ready out of the box — no manual cluster setup required
- Elastic Scaling — Scale nodes up on a live cluster without teardown or workload disruption
- High Availability — Node failures are isolated; the cluster recovers automatically
- Jobs Dashboard — Full squeue visibility in the TIR dashboard — no SSH required for monitoring
- Shared Storage — Datasets, SFS, and PFS mounted across all cluster nodes
- DCGM Node Metrics — Per-node GPU health monitoring with XID error detection
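For reference, the node and job state that the dashboard surfaces corresponds to standard Slurm client commands, so an equivalent check over SSH might look like the following (the job ID is a placeholder):

```bash
# Cluster and job state from a login node — the same information the
# Jobs dashboard shows without SSH.
sinfo -N -l                      # per-node state and partition availability
squeue -u "$USER" -l             # your queued and running jobs
sacct -j <job_id> --format=JobID,State,Elapsed,MaxRSS   # accounting for a completed job
```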
Training Cluster vs On-Demand Instance
| Feature | Training Cluster | On-Demand Instance |
|---|---|---|
| Billing model | Fixed per cluster plan | Per-hour while running |
| GPU availability | Reserved for the cluster | Subject to inventory |
| Multi-node training | Yes — Slurm across multiple nodes | Single node only |
| Job scheduling | Slurm-native with dashboard visibility | Manual |
| Container images | Ubuntu Slurm images + any Docker/OCI via Enroot | Any (custom setup) |
| Shared storage | Datasets, SFS, PFS across all nodes | Attached per instance |
| Elastic scaling | Live node scale-up, no teardown | No |
| High availability | Node failures isolated, auto-resume | No |
| Best for | Distributed training at scale | Experimentation, single-node jobs |