Quick Start
This guide walks you through creating a Training Cluster and connecting to it.
Step 1: Navigate to Training Cluster
- Go to the TIR Dashboard.
- In the left sidebar, click Training Cluster.
- Click Create Training Cluster.
Step 2: Configure Your Cluster
Enter a name for your cluster.
Image
Select the image and version for your cluster nodes.
- Ubuntu Slurm and NeMo Framework images are available in multiple versions — each includes pre-installed GPU drivers, CUDA, NCCL, and the Slurm runtime.
- The version dropdown shows available releases for the selected image.
For custom Docker or OCI images, you can pull and convert them using Enroot, then cache the resulting squash files on your shared storage (PFS/SFS) for reuse across jobs — avoiding repeated pulls on every run.
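The Enroot workflow described above might look like the following sketch. The image URI (`nvcr.io#nvidia/pytorch:24.05-py3`) and the mount path (`/mnt/pfs`) are placeholders, not values from this guide — substitute your own registry image and your actual PFS/SFS mount point:

```shell
# Pull a Docker/OCI image and convert it to an Enroot squash file,
# writing the result to shared storage so later jobs can reuse it.
# Image name and /mnt/pfs path are assumed placeholders.
enroot import -o /mnt/pfs/images/pytorch-24.05.sqsh \
    docker://nvcr.io#nvidia/pytorch:24.05-py3

# Subsequent jobs create a container root from the cached squash file
# instead of pulling the image again on every run.
enroot create --name pytorch /mnt/pfs/images/pytorch-24.05.sqsh
enroot start pytorch python --version
```

Note that Enroot's `docker://` URIs use `#` rather than `/` to separate the registry host from the image path.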
Plan Configuration
We recommend setting up a Private Cluster before creating a Training Cluster — it reserves a dedicated pool of nodes for your team, ensuring GPU capacity is always available and provisioning is faster. For more details, refer to the Private Cluster documentation.
Choose between the GPU and Private Cluster tabs:
- GPU — Select from available GPU plan cards. Each card shows the GPU type, CPU count, RAM, and hourly rate. Use the Workers counter to set the number of nodes.
- Private Cluster — Nodes are reserved exclusively for your workloads. This is ideal if you frequently create and recreate clusters, since the reserved node capacity is held for you between clusters. See the Private Cluster documentation for setup details.
A Pricing summary appears automatically based on your selected plan and node count.
Access
| Field | Required | Description |
|---|---|---|
| SSH Keys | Yes | SSH key used to connect to the login node, from which workloads are scheduled and submitted to the worker nodes |
| Parallel File System | Yes | PFS volume mounted on all cluster nodes |
Advanced Settings
Expand Advanced Settings to configure:
| Field | Required | Description |
|---|---|---|
| Security Group | Yes | Controls inbound and outbound network access to the cluster |
| Lifecycle Script | No | Script that runs on each node after the cluster is created |
| Shared File System | No | SFS volume mounted on cluster nodes |
| Dataset Storage | No | Dataset attached to cluster nodes — mounted as read-only |
At least one storage volume must be mounted on the cluster. Use PFS or SFS for read-write access. Datasets are mounted as read-only and are suitable for loading training data but cannot be used to write checkpoints or logs.
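A Lifecycle Script is an ordinary shell script executed on each node after the cluster is created, typically used to install extra packages or prepare shared directories. A minimal sketch is below; the package names and the `/mnt/pfs` mount path are assumptions for illustration, not values mandated by the platform:

```shell
#!/bin/bash
# Hypothetical lifecycle script: runs once on each node after provisioning.
set -euo pipefail

# Install extra OS packages (example packages; adjust to your needs).
apt-get update && apt-get install -y htop tmux

# Prepare a shared scratch directory on the PFS volume.
# "/mnt/pfs" is an assumed mount point -- use your actual PFS path.
mkdir -p /mnt/pfs/scratch
```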
Once all required fields are filled, click Create Training Cluster.
Provisioning typically completes within a few minutes. Node count, SSH keys, and container image can be updated after creation without recreating the cluster.
You can also create a Training Cluster using the API. Refer to the Training Cluster API Reference for parameters and examples.
Step 3: Connect to Your Cluster
Once the cluster status shows Running:
- Click the Connect button (terminal icon) in the top-right area of the Cluster Details page.
- Follow the connection instructions in the sidebar, or use the SSH Command displayed in the Details tab directly:
ssh root@<floating-ip>
The SSH command and Floating IP are also visible under Connection Details in the Details tab.
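After connecting, you can confirm that Slurm on the login node sees all worker nodes. `sinfo` and `srun` are standard Slurm commands; the node count shown here (`-N2`) is an assumption and should match your own cluster:

```shell
# SSH into the login node using the Floating IP from the Details tab.
ssh root@<floating-ip>

# On the login node: list partitions and worker node states.
sinfo

# Run a trivial command on every worker to verify scheduling works
# (-N2 assumes a two-node cluster; adjust to your node count).
srun -N2 hostname
```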