Features
Slurm-Native Scheduling
All workloads scheduled through Slurm via Slinky and Pyaxis — purpose-built for HPC and large-scale distributed training.
Images & Container Support
Run Ubuntu Slurm pre-built images or bring any Docker/OCI container via Enroot — no per-node image pulling.
Elastic Scaling & High Availability
Scale nodes on a live cluster without teardown. Node failures are isolated and the cluster recovers automatically.
Jobs & Monitoring
Full squeue visibility in the Jobs tab, per-node GPU metrics in Monitoring, and DCGM node health in the Nodes tab.
Storage (Volumes)
Attach Datasets, SFS, and PFS to cluster nodes. Manage mounted and unmounted volumes from the Volumes tab.
SSH Access & Network Security
Connect to cluster nodes via SSH, manage security groups, and attach reserved public IPs.
1. Slurm-Native Scheduling
All Training Cluster workloads are scheduled through Slurm — a job scheduler purpose-built for HPC and large-scale distributed training. It is delivered via Slinky (Kubernetes-native Slurm) and Pyaxis, which remove the need for deep HPC expertise while providing full Slurm capabilities: job queuing, priority management, resource allocation, and multi-node coordination across all nodes in the cluster.
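As a quick illustration of the standard Slurm workflow, here is a minimal multi-node batch script; the job name, node count, GPU count, and time limit are placeholder values, not cluster defaults.

```bash
#!/bin/bash
# train.sbatch -- minimal multi-node submission (illustrative values only)
#SBATCH --job-name=pretrain
#SBATCH --nodes=4                # request four worker nodes
#SBATCH --ntasks-per-node=1      # one launcher task per node
#SBATCH --gres=gpu:8             # eight GPUs per node
#SBATCH --time=24:00:00          # job time limit

# srun starts one task on each allocated node; Slurm handles placement
srun bash -c 'echo "running on $(hostname)"'
```

Submit the script with sbatch train.sbatch; the job then appears in the Jobs tab with its queue state and priority.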
2. Images & Container Support
Pre-Built Images
Training Cluster provides Ubuntu Slurm pre-built images, available in multiple versions. Each image includes:
- Pre-installed GPU drivers, CUDA, NCCL, and Slurm runtime
- Getting-started and setup scripts for rapid deployment
Select the image and version during cluster creation from the Image dropdowns.
Custom Containers via Enroot
Any Docker or OCI-compatible container image can run on Training Cluster using Enroot. Enroot converts the image into a squashfs file stored on a shared volume and mounts it at job start, with no per-node image pulling or duplication. All containers run with full CUDA access.
Specify your custom image in the Image field during cluster creation. You can also update the cluster image at any time via Update Image in the Actions menu, without recreating the cluster.
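As a sketch, submitting a containerized job could look like the following, assuming Pyaxis follows the standard pyxis-style container flags; the image tag and mount path are placeholders.

```bash
# Run an OCI image through Enroot (flags assume the pyxis convention)
srun --container-image=nvcr.io/nvidia/pytorch:24.07-py3 \
     --container-mounts=/mnt/shared:/workspace \
     --gres=gpu:1 \
     python -c "import torch; print(torch.cuda.is_available())"
```

Because Enroot converts the image once on shared storage, repeated jobs reuse the same squashfs file rather than pulling the image again.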
3. Elastic Scaling
Scale your cluster's node count while it is live, with no teardown and no disruption to running workloads.
| Operation | Requires Restart |
|---|---|
| Scale nodes up | No |
| Update container image | No |
| Attach additional storage volumes | Yes |
To scale, use Scale Cluster from the Actions menu.
4. High Availability
Training Cluster provides node-level fault tolerance:
- Isolated failures — A failure on one node does not affect workloads running on other nodes
- Automatic recovery — The cluster resumes automatically after a node restart, without manual intervention
For job-level resilience, checkpoint your training periodically to shared storage so jobs can resume from the last checkpoint after a node failure.
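A common pattern, sketched below, is a batch script that asks Slurm to requeue the job and resumes from the newest checkpoint on shared storage; the checkpoint directory, train.py, and the --resume flag belong to your own training script and are purely illustrative.

```bash
#!/bin/bash
#SBATCH --job-name=resumable-train
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --requeue                 # let Slurm requeue the job after a node failure

CKPT_DIR=/mnt/pfs/checkpoints     # shared volume: survives node restarts
LATEST=$(ls -t "$CKPT_DIR"/*.pt 2>/dev/null | head -n 1)

# Pass the newest checkpoint if one exists; otherwise start fresh
# (train.py and --resume are your script's own, shown for illustration)
srun python train.py ${LATEST:+--resume "$LATEST"} --checkpoint-dir "$CKPT_DIR"
```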
5. Cluster Overview
The Cluster Overview tab provides a live summary of your cluster's resource state and Slurm topology.
Summary Cards
| Card | Description |
|---|---|
| Total GPUs | Total GPU count across all nodes in the cluster |
| Allocated GPUs | GPUs currently assigned to running jobs |
| GPU Util | Average GPU utilization across active nodes |
| GPU Memory | Average GPU memory used per GPU |
| Running Jobs | Active jobs, with pending queue count |
| Idle Nodes | Nodes available for workloads |
Node Health
Displays the health state of every node in the cluster:
| State | Description |
|---|---|
| IDLE | Node is available and ready for workloads |
| ALLOCATED | Node is running one or more jobs |
| MIXED | Node has some resources allocated and some idle |
| UNKNOWN | Node is unreachable or its health status is unknown |
Slurm Partition Info
Shows the Slurm partition configuration sourced directly from slurmrestd:
| Column | Description |
|---|---|
| Partition | Slurm partition name (e.g., slinky, all) |
| Nodes | Number of nodes in the partition |
| GPU Type | GPU type assigned to the partition |
| Avail / Total | Available nodes out of total |
| Default Time | Default job time limit |
| Max Time | Maximum allowed job time |
| State | Partition state (UP / DOWN) |
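The same fields can be cross-checked over SSH with standard sinfo format options; the selection below mirrors the table's columns.

```bash
# Partition, node count, GPU gres, availability, default time, max time
sinfo -o "%P %D %G %a %L %l"
```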
6. Jobs
The Jobs tab surfaces full Slurm squeue visibility across all scheduled workloads — no SSH required.
Job Counters
Real-time counts for each job state:
| State | Description |
|---|---|
| Running | Jobs actively executing on nodes |
| Pending | Jobs queued and waiting for available resources |
| Completed | Jobs that finished successfully |
| Failed | Jobs that terminated with an error |
| Unknown | Jobs in an indeterminate state |
Job Table
| Column | Description |
|---|---|
| Job ID | Unique Slurm job identifier |
| Name | Name of the submitted job |
| User | User who submitted the job |
| Partition | Slurm partition the job is assigned to |
| Nodes | Worker nodes allocated to the job |
| GPUs | Number of GPUs allocated to the job |
| Run Time | Elapsed time since job start |
| Time Limit | Maximum allowed runtime |
| State | Current job state |
| Priority | Scheduling priority relative to other jobs |
Use the filter tabs to view jobs by state: ALL, RUNNING, PENDING, COMPLETED, FAILED, or UNKNOWN.
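If you do SSH in, roughly the same view is available from squeue itself; the format string below is one standard selection matching the table's columns.

```bash
# Job ID, name, user, partition, node count, run time, limit, state, priority
squeue -o "%i %j %u %P %D %M %l %T %Q"
```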
7. Monitoring
The Monitoring tab provides per-node GPU metrics with time-interval controls.
Job Summary
| Card | Description |
|---|---|
| Running Jobs | Number of actively executing jobs |
| Pending Jobs | Number of jobs waiting in queue |
| Completed | Total completed jobs |
| CPU Alloc | Latest CPU allocation value |
GPU Metrics — All Nodes Overlay
GPU Utilization % is plotted per node over the selected time interval. Use the time selector to switch between 5m, 15m, 1h, 6h, and 1d windows. Select a specific node from the Select Node dropdown to isolate metrics for a single node.
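For raw numbers on a specific node, the same figures can be sampled over SSH with standard NVIDIA tooling; the 5-second interval is just an example.

```bash
# Sample per-GPU utilization and memory every 5 seconds
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
           --format=csv -l 5
```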
8. Nodes
The Nodes tab provides health visibility and DCGM (Data Center GPU Manager) metrics for every node in the cluster.
Node Summary Cards
| Card | Description |
|---|---|
| Total Nodes | Total nodes in the cluster pool |
| Allocated | Nodes currently running jobs |
| Avg GPU Util | Average GPU utilization across active nodes |
| Total GPUs | Total GPU capacity and memory |
Node Filters
Filter nodes by health state:
| Filter | Description |
|---|---|
| All | Show all nodes |
| Failed | Nodes that have failed or are unresponsive |
| XID Errors | Nodes reporting NVIDIA XID hardware errors |
| Healthy | Nodes operating normally |
Click any node row to drill down into per-GPU metrics.
XID Error Visibility & Node Restart
The XID Errors filter surfaces nodes reporting NVIDIA GPU hardware errors in real time, giving you immediate visibility without needing to inspect logs manually. From the same view, you can restart an affected node directly — use Restart All Workers from the Actions menu to recover nodes reporting XID errors without restarting the entire cluster.
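The manual check this replaces looks roughly like the following on each node, since the NVIDIA driver reports XID events to the kernel log.

```bash
# Inspect the kernel log for NVIDIA XID events on a single node
dmesg -T | grep -i "NVRM: Xid"
```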
9. Logs
The Logs tab provides access to Slurm controller and node logs directly from the dashboard.
- Select Replica — Switch between the Slurm controller and individual node replicas
- Auto Refresh — Enable continuous log streaming
- Filter By — Filter to the last N lines of output
- Download — Export logs for offline analysis
10. Shared Storage (Volumes)
The Volumes tab manages storage attached to the cluster. Volumes provide persistent storage accessible from all cluster nodes.
Storage Types
| Type | Description |
|---|---|
| Datasets | Managed datasets with mount/unmount controls per cluster |
| Shared File System (SFS) | POSIX-compatible shared file storage |
| Parallel File System (PFS) | High-throughput parallel I/O for large-scale training data |
Each storage section shows Mounted and Unmounted volumes, with columns for storage type, encryption, mount status, mount path, and actions.
Parallel File System is required at cluster creation time. SFS and Datasets can be attached from the Volumes tab after creation (requires a cluster restart to take effect).
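After attaching a volume (and restarting where required), you can confirm it from any node; the mount path below is a placeholder for the path shown in the Volumes tab.

```bash
# Confirm the shared volume is mounted on this node
df -hT /mnt/pfs    # replace /mnt/pfs with the mount path from the Volumes tab
```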
11. Network & Security
The Network & Security tab manages IP addressing and firewall rules for your cluster.
Reserve IP
Attach a reserved public IP to your cluster for stable, persistent external access.
- Open the Network & Security tab → Reserve IP.
- Select an available IP from the dropdown.
- Click Attach IP.
To create a new reserved IP, click Reserve New IP.
Security Groups
Manage the security groups attached to your cluster from the Security Groups sub-section. Security groups control inbound and outbound traffic to all cluster nodes.
Connection Details
The cluster's current Floating IP and SSH command are visible in the Details tab under Connection Details. You can convert the Floating IP to a Reserved IP directly from there.
12. SSH Access
Every cluster node exposes SSH access. The SSH key is selected at cluster creation and can be updated at any time via Update SSH Keys in the Actions menu.
Connect to the Cluster
- Open the Cluster Details page.
- Click the Connect button (terminal icon) in the top-right area for connection instructions.
- Or use the SSH command from the Details tab directly:
```bash
ssh root@<floating-ip>
```
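If the matching private key is not your default identity, pass it explicitly; the key path below is a placeholder for whichever key you selected at creation.

```bash
# Use the private key that matches the cluster's configured SSH key
ssh -i ~/.ssh/my_cluster_key root@<floating-ip>
```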
Best Practices for Training Clusters
- Store model checkpoints and training logs on PFS or SFS volumes. Data inside a node's local filesystem is lost on restart; only attached storage persists.
- The Jobs tab shows full squeue output (running, pending, failed, and completed jobs) without needing to SSH into the cluster.
- Enable periodic checkpointing in your training script so jobs can resume from the last checkpoint after an unexpected node restart or failure.
- Choose Private Cluster in Plan Configuration to reserve nodes exclusively for your workloads; recommended for teams with consistent or long-running training schedules.