
Features

1. Slurm-Native Scheduling

All Training Cluster workloads are scheduled through Slurm — a job scheduler purpose-built for HPC and large-scale distributed training. It is delivered via Slinky (Kubernetes-native Slurm) and Pyxis, which together remove the need for deep HPC expertise while providing full Slurm capabilities: job queuing, priority management, resource allocation, and multi-node coordination across all nodes in the cluster.
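
For reference, a minimal sbatch submission might look like the sketch below. The partition name, GPU count per node, and training entrypoint are assumptions — check the Slurm Partition Info table for your cluster's actual values:

```bash
#!/bin/bash
# Minimal multi-node GPU job sketch. The partition name, GPU count
# per node, and train.py entrypoint are illustrative assumptions.
#SBATCH --job-name=train-demo
#SBATCH --partition=slinky
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --time=04:00:00

# Launch the training step under Slurm's control
srun python train.py
```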


2. Images & Container Support

Pre-Built Images

Training Cluster provides pre-built Ubuntu Slurm images in multiple versions. Each image includes:

  • Pre-installed GPU drivers, CUDA, NCCL, and Slurm runtime
  • Getting-started and setup scripts for rapid deployment

Select the image and version during cluster creation from the Image dropdowns.

Custom Containers via Enroot

Any Docker or OCI-compatible container image can run on Training Cluster using Enroot. Enroot converts the image into a squash file stored on a shared volume and mounts it at job start — with no per-node image pulling or duplication. All containers run with full CUDA access.
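
As a rough sketch, importing and running a container could look like this. The registry path, squash-file location, and mount paths are illustrative, not cluster defaults:

```bash
# Convert an OCI image into an Enroot squash file on shared storage.
# The NGC image tag and the /mnt/shared path are assumptions.
enroot import -o /mnt/shared/pytorch.sqsh docker://nvcr.io#nvidia/pytorch:24.05-py3

# Run it via the pyxis Slurm plugin; --container-image and
# --container-mounts are standard pyxis options.
srun --container-image=/mnt/shared/pytorch.sqsh \
     --container-mounts=/mnt/shared:/workspace \
     nvidia-smi
```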

Specify your custom image in the Image field during cluster creation. You can also update the cluster image at any time via Update Image in the Actions menu, without recreating the cluster.


3. Elastic Scaling

Adjust the node count of a live cluster — no teardown, no disruption to running workloads.

| Operation | Requires Restart |
| --- | --- |
| Scale nodes up | No |
| Update container image | No |
| Attach additional storage volumes | Yes |

To scale, use Scale Cluster from the Actions menu.


4. High Availability

Training Cluster provides node-level fault tolerance:

  • Isolated failures — A failure on one node does not affect workloads running on other nodes
  • Automatic recovery — The cluster resumes automatically after a node restart, without manual intervention

For job-level resilience, checkpoint your training periodically to shared storage so jobs can resume from the last checkpoint after a node failure.
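
One way to wire this up is to combine Slurm's requeue behavior with a checkpoint directory on shared storage. A minimal sketch, assuming a PFS mount at /mnt/pfs and a training script that resumes from an existing checkpoint:

```bash
#!/bin/bash
# Sketch: Slurm requeues the job after a node failure, and training
# resumes from the last checkpoint on shared storage.
#SBATCH --job-name=resilient-train
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --requeue

# /mnt/pfs is an assumed shared-storage mount; see the Volumes tab.
CKPT_DIR=/mnt/pfs/checkpoints/$SLURM_JOB_NAME
mkdir -p "$CKPT_DIR"

# train.py is assumed to write periodic checkpoints to --ckpt-dir
# and to resume automatically when one already exists.
srun python train.py --ckpt-dir "$CKPT_DIR"
```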


5. Cluster Overview

The Cluster Overview tab provides a live summary of your cluster's resource state and Slurm topology.

Summary Cards

| Card | Description |
| --- | --- |
| Total GPUs | Total GPU count across all nodes in the cluster |
| Allocated GPUs | GPUs currently assigned to running jobs |
| GPU Util | Average GPU utilization across active nodes |
| GPU Memory | Average GPU memory used per GPU |
| Running Jobs | Active jobs, with pending queue count |
| Idle Nodes | Nodes available for workloads |

Node Health

Displays the health state of every node in the cluster:

| State | Description |
| --- | --- |
| IDLE | Node is available and ready for workloads |
| ALLOCATED | Node is running one or more jobs |
| MIXED | Node has some resources allocated and some idle |
| UNKNOWN | Node is unreachable or its health status cannot be determined |

Slurm Partition Info

Shows the Slurm partition configuration sourced directly from slurmrestd:

| Column | Description |
| --- | --- |
| Partition | Slurm partition name (e.g., slinky, all) |
| Nodes | Number of nodes in the partition |
| GPU Type | GPU type assigned to the partition |
| Avail / Total | Available nodes out of total |
| Default Time | Default job time limit |
| Max Time | Maximum allowed job time |
| State | Partition state (UP / DOWN) |
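
The same partition view is available from a shell on the cluster; a quick sketch using standard sinfo format specifiers:

```bash
# Partition, node count, state, default time, max time, and GRES
sinfo -o "%P %D %t %L %l %G"
```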

6. Jobs

The Jobs tab surfaces full Slurm squeue visibility across all scheduled workloads — no SSH required.

Job Counters

Real-time counts for each job state:

| State | Description |
| --- | --- |
| Running | Jobs actively executing on nodes |
| Pending | Jobs queued and waiting for available resources |
| Completed | Jobs that finished successfully |
| Failed | Jobs that terminated with an error |
| Unknown | Jobs in an indeterminate state |

Job Table

| Column | Description |
| --- | --- |
| Job ID | Unique Slurm job identifier |
| Name | Name of the submitted job |
| User | User who submitted the job |
| Partition | Slurm partition the job is assigned to |
| Nodes | Worker nodes allocated to the job |
| GPUs | Number of GPUs allocated to the job |
| Run Time | Elapsed time since job start |
| Time Limit | Maximum allowed runtime |
| State | Current job state |
| Priority | Scheduling priority relative to other jobs |

Use the filter tabs to view jobs by state: ALL, RUNNING, PENDING, COMPLETED, FAILED, or UNKNOWN.
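
The filter tabs map onto standard squeue state filters; from a shell, the equivalents look like:

```bash
# Running jobs only
squeue --states=RUNNING

# Pending jobs, with the columns the Job Table shows
squeue --states=PENDING -o "%i %j %u %P %D %T %p"
```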


7. Monitoring

The Monitoring tab provides per-node GPU metrics with time-interval controls.

Job Summary

| Card | Description |
| --- | --- |
| Running Jobs | Number of actively executing jobs |
| Pending Jobs | Number of jobs waiting in queue |
| Completed | Total completed jobs |
| CPU Alloc | Latest CPU allocation value |

GPU Metrics — All Nodes Overlay

GPU Utilization % is plotted per node over the selected time interval. Use the time selector to switch between 5m, 15m, 1h, 6h, and 1d windows. Select a specific node from the Select Node dropdown to isolate metrics for a single node.
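
To sample the same signal directly on a node (for example over SSH), nvidia-smi can poll utilization and memory with its standard query flags:

```bash
# Print per-GPU utilization and memory every 5 seconds
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
           --format=csv -l 5
```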


8. Nodes

The Nodes tab provides health visibility and DCGM (Data Center GPU Manager) metrics for every node in the cluster.
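
If the DCGM CLI is present on the node image (an assumption; it ships with many GPU images), the underlying counters can also be watched from a shell:

```bash
# Watch GPU utilization (field 203) and framebuffer memory used
# (field 252), sampling every 5000 ms
dcgmi dmon -e 203,252 -d 5000
```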

Node Summary Cards

| Card | Description |
| --- | --- |
| Total Nodes | Total nodes in the cluster pool |
| Allocated | Nodes currently running jobs |
| Avg GPU Util | Average GPU utilization across active nodes |
| Total GPUs | Total GPU capacity and memory |

Node Filters

Filter nodes by health state:

| Filter | Description |
| --- | --- |
| All | Show all nodes |
| Failed | Nodes that have failed or are unresponsive |
| XID Errors | Nodes reporting NVIDIA XID hardware errors |
| Healthy | Nodes operating normally |

Click any node row to drill down into per-GPU metrics.

XID Error Visibility & Node Restart

The XID Errors filter surfaces nodes reporting NVIDIA GPU hardware errors in real time, giving you immediate visibility without needing to inspect logs manually. From the same view, you can restart an affected node directly — use Restart All Workers from the Actions menu to recover nodes reporting XID errors without restarting the entire cluster.
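
To confirm an XID event on a suspect node before restarting it, the kernel log is the usual source:

```bash
# The NVIDIA driver reports XID errors to the kernel log
dmesg -T | grep -i xid
```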


9. Logs

The Logs tab provides access to Slurm controller and node logs directly from the dashboard.

  • Select Replica — Switch between the Slurm controller and individual node replicas
  • Auto Refresh — Enable continuous log streaming
  • Filter By — Filter to the last N lines of output
  • Download — Export logs for offline analysis
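
If you prefer a shell, the controller log can also be tailed over SSH. The path below is a common Slurm default and may differ on your image:

```bash
# Follow the last 200 lines of the Slurm controller log
tail -n 200 -f /var/log/slurm/slurmctld.log
```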

10. Shared Storage (Volumes)

The Volumes tab manages storage attached to the cluster. Volumes provide persistent storage accessible from all cluster nodes.

Storage Types

| Type | Description |
| --- | --- |
| Datasets | Managed datasets with mount/unmount controls per cluster |
| Shared File System (SFS) | POSIX-compatible shared file storage |
| Parallel File System (PFS) | High-throughput parallel I/O for large-scale training data |

Each storage section shows Mounted and Unmounted volumes, with columns for storage type, encryption, mount status, mount path, and actions.

Info: Parallel File System is required at cluster creation time. SFS and Datasets can be attached from the Volumes tab after creation (a cluster restart is required for the change to take effect).
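
After attaching a volume, you can confirm it is visible from a node. The mount path is whatever the Volumes tab reports; /mnt/pfs here is an assumption:

```bash
# Check that the shared volume is mounted and has capacity
df -h /mnt/pfs
```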


11. Network & Security

The Network & Security tab manages IP addressing and firewall rules for your cluster.

Reserve IP

Attach a reserved public IP to your cluster for stable, persistent external access.

  1. Open the Network & Security tab → Reserve IP.
  2. Select an available IP from the dropdown.
  3. Click Attach IP.

To create a new reserved IP, click Reserve New IP.

Security Groups

Manage the security groups attached to your cluster from the Security Groups sub-section. Security groups control inbound and outbound traffic to all cluster nodes.

Connection Details

The cluster's current Floating IP and SSH command are visible in the Details tab under Connection Details. You can convert the Floating IP to a Reserved IP directly from there.


12. SSH Access

Every cluster node exposes SSH access. The SSH key is selected at cluster creation and can be updated at any time via Update SSH Keys in the Actions menu.

Connect to the Cluster

  1. Open the Cluster Details page.
  2. Click the Connect button (terminal icon) in the top-right area for connection instructions.
  3. Alternatively, use the SSH command from the Details tab directly:
ssh root@<floating-ip>

Best Practices for Training Clusters

Save checkpoints to attached storage

Store model checkpoints and training logs on PFS or SFS volumes. Data inside a node's local filesystem is lost on restart — only attached storage persists.

Use the Jobs tab for status checks

The Jobs tab shows full squeue output — running, pending, failed, and completed jobs — without needing to SSH into the cluster.

Checkpoint frequently for job resilience

Enable periodic checkpointing in your training script so jobs can resume from the last checkpoint after an unexpected node restart or failure.

Use Private Cluster for dedicated capacity

Choose Private Cluster in Plan Configuration to reserve nodes exclusively for your workloads — recommended for teams with consistent or long-running training schedules.