Quick Start

This guide walks you through creating a Training Cluster and connecting to it.

Step 1: Navigate to Training Cluster

  1. Go to the TIR Dashboard.
  2. In the left sidebar, click Training Cluster.
  3. Click Create Training Cluster.

Step 2: Configure Your Cluster

Enter a name for your cluster.

Image

Select the image and version for your cluster nodes.

  • Ubuntu Slurm and NeMo Framework images are available in multiple versions — each includes pre-installed GPU drivers, CUDA, NCCL, and the Slurm runtime.
  • The version dropdown shows available releases for the selected image.

For custom Docker or OCI images, you can pull and convert them using Enroot, then cache the resulting squash files on your shared storage (PFS/SFS) for reuse across jobs — avoiding repeated pulls on every run.
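The Enroot workflow above can be sketched as follows. This is a minimal example, not the platform's exact procedure: the image tag and the `/mnt/pfs` mount path are assumptions, so substitute your own image and your cluster's actual PFS mount point.

```shell
# Pull a Docker image and convert it to a squash file in one step
# (image tag is an example; -o names the output file):
enroot import -o pytorch-24.05.sqsh docker://nvcr.io/nvidia/pytorch:24.05-py3

# Cache the squash file on shared storage so every node can reuse it
# (/mnt/pfs is an assumed mount path; check your cluster's actual PFS mount):
mkdir -p /mnt/pfs/images
mv pytorch-24.05.sqsh /mnt/pfs/images/

# On later runs, create and start a container from the cached file
# instead of pulling the image again:
enroot create --name pytorch /mnt/pfs/images/pytorch-24.05.sqsh
enroot start pytorch
```

Because the squash file lives on shared storage, every node and every subsequent job can start containers from it without touching the registry again.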

Plan Configuration

Recommended

We recommend setting up a Private Cluster before creating a Training Cluster — it reserves a dedicated pool of nodes for your team, ensuring GPU capacity is always available and provisioning is faster. For more details, refer to the Private Cluster documentation.

Choose between the GPU and Private Cluster tabs:

  • GPU — Select from available GPU plan cards. Each card shows the GPU type, CPU count, RAM, and hourly rate. Use the Workers counter to set the number of nodes.
  • Private Cluster — Select nodes from a pool reserved exclusively for your team. Because the capacity stays reserved between clusters, this is ideal if you frequently create and recreate clusters. See the Private Cluster documentation for setup details.

A Pricing summary appears automatically based on your selected plan and node count.

Access

| Field | Required | Description |
| --- | --- | --- |
| SSH Keys | Yes | SSH key for connecting to the login node, which is used to schedule and submit workloads to the worker nodes |
| Parallel File System | Yes | PFS volume mounted on all cluster nodes |
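If you don't already have an SSH key to register, you can generate one locally. A minimal sketch (the filename and comment are arbitrary examples):

```shell
# Generate an ed25519 key pair with no passphrase, non-interactively
# (the filename tir_cluster_key is an example, not a required name):
ssh-keygen -q -t ed25519 -f ./tir_cluster_key -N "" -C "tir-training-cluster"

# The public half is what you register in the SSH Keys field:
cat ./tir_cluster_key.pub
```

Keep the private key (`tir_cluster_key`) on your workstation; only the `.pub` file is uploaded.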

Advanced Settings

Expand Advanced Settings to configure:

| Field | Required | Description |
| --- | --- | --- |
| Security Group | Yes | Controls inbound and outbound network access to the cluster |
| Lifecycle Script | No | Script that runs on each node after the cluster is created |
| Shared File System | No | SFS volume mounted on cluster nodes |
| Dataset Storage | No | Dataset attached to cluster nodes, mounted as read-only |
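A lifecycle script is an ordinary shell script executed once per node after provisioning. The sketch below shows the general shape; the package names and log path are illustrative assumptions, not platform requirements.

```shell
#!/bin/bash
# Example lifecycle script: runs on each node after the cluster is created.
# Fail fast on any error so a broken node is easy to spot:
set -euo pipefail

# Install extra tooling (assumes an Ubuntu-based image with apt; uncomment to use):
# apt-get update && apt-get install -y htop nvtop

# Record that per-node setup finished, tagged with the node's hostname:
echo "lifecycle setup complete on $(hostname)" >> /tmp/lifecycle.log
```

Keep lifecycle scripts idempotent where possible, since a node may be re-provisioned and run the script again.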
Info: At least one storage volume must be mounted on the cluster. Use PFS or SFS for read-write access. Datasets are mounted as read-only and are suitable for loading training data, but cannot be used to write checkpoints or logs.

Once all required fields are filled, click Create Training Cluster.

Info: Provisioning typically completes within a few minutes. Node count, SSH keys, and container image can be updated after creation without recreating the cluster.

You can also create a Training Cluster using the API. Refer to the Training Cluster API Reference for parameters and examples.


Step 3: Connect to Your Cluster

Once the cluster status shows Running:

  1. Click the Connect button (terminal icon) in the top-right area of the Cluster Details page.
  2. Follow the connection instructions in the sidebar, or use the SSH Command displayed in the Details tab directly:
```shell
ssh root@<floating-ip>
```

The SSH command and Floating IP are also visible under Connection Details in the Details tab.
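Once connected to the login node, a small batch job is an easy way to confirm that scheduling works. A minimal Slurm job script sketch (the node count and the `/mnt/pfs` log path are assumptions; adjust them to your cluster):

```shell
#!/bin/bash
#SBATCH --job-name=smoke-test
#SBATCH --nodes=2                          # adjust to your worker count
#SBATCH --output=/mnt/pfs/logs/%x-%j.out   # write logs to shared storage (example path)

# Print each allocated node's hostname to confirm jobs land on the workers:
srun hostname
```

Save this as `smoke_test.sh` on the login node, submit it with `sbatch smoke_test.sh`, and check its status with `squeue`; `sinfo` shows partition and node states.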


Next Steps

  • Features — Scheduling, images, monitoring, scaling, and more
  • Actions — Manage your cluster
  • Billing — Understand cluster pricing