
Features

1. Slurm-Native Scheduling

All Training Cluster workloads are scheduled through Slurm — a job scheduler purpose-built for HPC and large-scale distributed training. It is delivered via Slinky (Kubernetes-native Slurm) and Pyxis, which together remove the need for deep HPC expertise while providing full Slurm capabilities: job queuing, priority management, resource allocation, and multi-node coordination across all nodes in the cluster.
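
For reference, a minimal sbatch submission might look like the sketch below. The partition name, GPU count per node, and training entrypoint are assumptions — check the Slurm Partition Info table for your cluster's actual values:

```bash
#!/bin/bash
# Minimal multi-node GPU job sketch. The partition name, GPU count
# per node, and train.py entrypoint are illustrative assumptions.
#SBATCH --job-name=train-demo
#SBATCH --partition=slinky
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --time=04:00:00

# Launch the training step under Slurm's control
srun python train.py
```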


2. Images & Container Support

Pre-Built Images

Training Cluster provides pre-built Ubuntu Slurm images in multiple versions. Each image includes:

  • Pre-installed GPU drivers, CUDA, NCCL, and Slurm runtime
  • Getting-started and setup scripts for rapid deployment

Select the image and version during cluster creation from the Image dropdowns.

Custom Containers via Enroot

Any Docker or OCI-compatible container image can run on Training Cluster using Enroot. Enroot converts the image into a squash file stored on a shared volume and mounts it at job start — with no per-node image pulling or duplication. All containers run with full CUDA access.
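
As a rough sketch, importing and running a container could look like this. The registry path, squash-file location, and mount paths are illustrative, not cluster defaults:

```bash
# Convert an OCI image into an Enroot squash file on shared storage.
# The NGC image tag and the /mnt/shared path are assumptions.
enroot import -o /mnt/shared/pytorch.sqsh docker://nvcr.io#nvidia/pytorch:24.05-py3

# Run it via the pyxis Slurm plugin; --container-image and
# --container-mounts are standard pyxis options.
srun --container-image=/mnt/shared/pytorch.sqsh \
     --container-mounts=/mnt/shared:/workspace \
     nvidia-smi
```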

Specify your custom image in the Image field during cluster creation. You can also update the cluster image at any time via Update Image in the Actions menu, without recreating the cluster.


3. Elastic Scaling

Adjust the node count of a live cluster — no teardown, no disruption to running workloads.

| Operation | Requires Restart |
| --- | --- |
| Scale nodes up | No |
| Update container image | No |
| Attach additional storage volumes | Yes |

To scale, use Scale Cluster from the Actions menu.


4. High Availability

Training Cluster provides node-level fault tolerance:

  • Isolated failures — A failure on one node does not affect workloads running on other nodes
  • Automatic recovery — The cluster resumes automatically after a node restart, without manual intervention

For job-level resilience, checkpoint your training periodically to shared storage so jobs can resume from the last checkpoint after a node failure.
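
One way to wire this up is to combine Slurm's requeue behavior with a checkpoint directory on shared storage. A minimal sketch, assuming a PFS mount at /mnt/pfs and a training script that resumes from an existing checkpoint:

```bash
#!/bin/bash
# Sketch: Slurm requeues the job after a node failure, and training
# resumes from the last checkpoint on shared storage.
#SBATCH --job-name=resilient-train
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --requeue

# /mnt/pfs is an assumed shared-storage mount; see the Volumes tab.
CKPT_DIR=/mnt/pfs/checkpoints/$SLURM_JOB_NAME
mkdir -p "$CKPT_DIR"

# train.py is assumed to write periodic checkpoints to --ckpt-dir
# and to resume automatically when one already exists.
srun python train.py --ckpt-dir "$CKPT_DIR"
```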


5. Cluster Overview

The Cluster Overview tab provides a live summary of your cluster's resource state and Slurm topology.

Summary Cards

| Card | Description |
| --- | --- |
| Total GPUs | Total GPU count across all nodes in the cluster |
| Allocated GPUs | GPUs currently assigned to running jobs |
| GPU Util | Average GPU utilization across active nodes |
| GPU Memory | Average GPU memory used per GPU |
| Running Jobs | Active jobs, with pending queue count |
| Idle Nodes | Nodes available for workloads |

Node Health

Displays the health state of every node in the cluster:

| State | Description |
| --- | --- |
| IDLE | Node is available and ready for workloads |
| ALLOCATED | Node is running one or more jobs |
| MIXED | Node has some resources allocated and some idle |
| UNKNOWN | Node is unreachable or its health status cannot be determined |

Slurm Partition Info

Shows the Slurm partition configuration sourced directly from slurmrestd:

| Column | Description |
| --- | --- |
| Partition | Slurm partition name (e.g., slinky, all) |
| Nodes | Number of nodes in the partition |
| GPU Type | GPU type assigned to the partition |
| Avail / Total | Available nodes out of total |
| Default Time | Default job time limit |
| Max Time | Maximum allowed job time |
| State | Partition state (UP / DOWN) |
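
The same partition view is available from a shell on the cluster; a quick sketch using standard sinfo format specifiers:

```bash
# Partition, node count, state, default time, max time, and GRES
sinfo -o "%P %D %t %L %l %G"
```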

6. Jobs

The Jobs tab surfaces full Slurm squeue visibility across all scheduled workloads — no SSH required.

Job Counters

Real-time counts for each job state:

| State | Description |
| --- | --- |
| Running | Jobs actively executing on nodes |
| Pending | Jobs queued and waiting for available resources |
| Completed | Jobs that finished successfully |
| Failed | Jobs that terminated with an error |
| Unknown | Jobs in an indeterminate state |

Job Table

| Column | Description |
| --- | --- |
| Job ID | Unique Slurm job identifier |
| Name | Name of the submitted job |
| User | User who submitted the job |
| Partition | Slurm partition the job is assigned to |
| Nodes | Worker nodes allocated to the job |
| GPUs | Number of GPUs allocated to the job |
| Run Time | Elapsed time since job start |
| Time Limit | Maximum allowed runtime |
| State | Current job state |
| Priority | Scheduling priority relative to other jobs |

Use the filter tabs to view jobs by state: ALL, RUNNING, PENDING, COMPLETED, FAILED, or UNKNOWN.
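
The filter tabs map onto standard squeue state filters; from a shell, the equivalents look like:

```bash
# Running jobs only
squeue --states=RUNNING

# Pending jobs, with the columns the Job Table shows
squeue --states=PENDING -o "%i %j %u %P %D %T %p"
```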


7. Monitoring

The Monitoring tab provides per-node GPU metrics with time-interval controls.

Job Summary

| Card | Description |
| --- | --- |
| Running Jobs | Number of actively executing jobs |
| Pending Jobs | Number of jobs waiting in queue |
| Completed | Total completed jobs |
| CPU Alloc | Latest CPU allocation value |

GPU Metrics — All Nodes Overlay

GPU Utilization % is plotted per node over the selected time interval. Use the time selector to switch between 5m, 15m, 1h, 6h, and 1d windows. Select a specific node from the Select Node dropdown to isolate metrics for a single node.
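
To sample the same signal directly on a node (for example over SSH), nvidia-smi can poll utilization and memory with its standard query flags:

```bash
# Print per-GPU utilization and memory every 5 seconds
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
           --format=csv -l 5
```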


8. Nodes

The Nodes tab provides health visibility and DCGM (Data Center GPU Manager) metrics for every node in the cluster.
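
If the DCGM CLI is present on the node image (an assumption; it ships with many GPU images), the underlying counters can also be watched from a shell:

```bash
# Watch GPU utilization (field 203) and framebuffer memory used
# (field 252), sampling every 5000 ms
dcgmi dmon -e 203,252 -d 5000
```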

Node Summary Cards

| Card | Description |
| --- | --- |
| Total Nodes | Total nodes in the cluster pool |
| Allocated | Nodes currently running jobs |
| Avg GPU Util | Average GPU utilization across active nodes |
| Total GPUs | Total GPU capacity and memory |

Node Filters

Filter nodes by health state:

| Filter | Description |
| --- | --- |
| All | Show all nodes |
| Failed | Nodes that have failed or are unresponsive |
| XID Errors | Nodes reporting NVIDIA XID hardware errors |
| Healthy | Nodes operating normally |

Click any node row to drill down into per-GPU metrics.

XID Error Visibility & Node Restart

The XID Errors filter surfaces nodes reporting NVIDIA GPU hardware errors in real time, giving you immediate visibility without needing to inspect logs manually. From the same view, you can restart an affected node directly — use Restart All Workers from the Actions menu to recover nodes reporting XID errors without restarting the entire cluster.
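
To confirm an XID event on a suspect node before restarting it, the kernel log is the usual source:

```bash
# The NVIDIA driver reports XID errors to the kernel log
dmesg -T | grep -i xid
```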


9. Logs

The Logs tab provides access to Slurm controller and node logs directly from the dashboard.

  • Select Replica — Switch between the Slurm controller and individual node replicas
  • Auto Refresh — Enable continuous log streaming
  • Filter By — Filter to the last N lines of output
  • Download — Export logs for offline analysis
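
If you prefer a shell, the controller log can also be tailed over SSH. The path below is a common Slurm default and may differ on your image:

```bash
# Follow the last 200 lines of the Slurm controller log
tail -n 200 -f /var/log/slurm/slurmctld.log
```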

10. Shared Storage (Volumes)

The Volumes tab manages storage attached to the cluster. Volumes provide persistent storage accessible from all cluster nodes.

Storage Types

| Type | Description |
| --- | --- |
| Datasets | Managed datasets with mount/unmount controls per cluster |
| Shared File System (SFS) | POSIX-compatible shared file storage |
| Parallel File System (PFS) | High-throughput parallel I/O for large-scale training data |

Each storage section shows Mounted and Unmounted volumes, with columns for storage type, encryption, mount status, mount path, and actions.

Info: Parallel File System is required at cluster creation time. SFS and Datasets can be attached from the Volumes tab after creation (a cluster restart is required for the change to take effect).
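
After attaching a volume, you can confirm it is visible from a node. The mount path is whatever the Volumes tab reports; /mnt/pfs here is an assumption:

```bash
# Check that the shared volume is mounted and has capacity
df -h /mnt/pfs
```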


11. Network & Security

The Network & Security tab manages IP addressing and firewall rules for your cluster.

Reserve IP

Attach a reserved public IP to your cluster for stable, persistent external access.

  1. Open the Network & Security tab → Reserve IP.
  2. Select an available IP from the dropdown.
  3. Click Attach IP.

To create a new reserved IP, click Reserve New IP.

Security Groups

Manage the security groups attached to your cluster from the Security Groups sub-section. Security groups control inbound and outbound traffic to all cluster nodes.

Connection Details

The cluster's current Floating IP and SSH command are visible in the Details tab under Connection Details. You can convert the Floating IP to a Reserved IP directly from there.


12. SSH Access

Every cluster node exposes SSH access. The SSH key is selected at cluster creation and can be updated at any time via Update SSH Keys in the Actions menu.

Connect to the Cluster

  1. Open the Cluster Details page.
  2. Click the Connect button (terminal icon) in the top-right area for connection instructions.
  3. Alternatively, use the SSH command from the Details tab directly:
ssh root@<floating-ip>

Best Practices for Training Clusters

Save checkpoints to attached storage

Store model checkpoints and training logs on PFS or SFS volumes. Data inside a node's local filesystem is lost on restart — only attached storage persists.

Use the Jobs tab for status checks

The Jobs tab shows full squeue output — running, pending, failed, and completed jobs — without needing to SSH into the cluster.

Checkpoint frequently for job resilience

Enable periodic checkpointing in your training script so jobs can resume from the last checkpoint after an unexpected node restart or failure.

Use Private Cluster for dedicated capacity

Choose Private Cluster in Plan Configuration to reserve nodes exclusively for your workloads — recommended for teams with consistent or long-running training schedules.