# Actions
Manage your Training Cluster through the Actions menu and the quick-action buttons on the Cluster Details page.
## Actions Menu
The Actions menu is available in two places:
- Training Cluster list — click the ⋮ icon on any cluster row.
- Cluster Details page — click the Actions button in the top-right area.
| Action | Description |
|---|---|
| Update SSH Keys | Add or replace SSH keys on the cluster nodes |
| Update Image | Change the container image running on cluster nodes without recreating the cluster |
| Scale Cluster | Increase the number of nodes on a running cluster |
| Convert to Committed | Switch from On-Demand hourly pricing to a Committed plan |
| Restart All Workers | Restart all compute nodes in the cluster simultaneously |
| Restart Cluster | Restart the full cluster, including the Slurm controller and all nodes |
| Clone Cluster | Create a new cluster with the same configuration |
| Terminate Cluster | Stop the cluster and release all associated resources |
| Delete Training Cluster | Permanently delete a terminated cluster and all its records |
> **Danger:** Terminating a cluster is irreversible. Ensure no active workloads are running before terminating.

> **Warning:** Only terminated clusters can be deleted. Terminate the cluster first if it is still running.
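Before terminating, you can verify from the cluster's login node that no Slurm workloads are active. A minimal sketch, assuming standard Slurm tools (`squeue`) are on the PATH:

```shell
# Count running and pending Slurm jobs; prints 0 if the queue is empty
# (or if squeue is unavailable on this machine).
active=$(squeue --states=RUNNING,PENDING --noheader 2>/dev/null | wc -l)
echo "Active or pending jobs: $active"
```

A non-zero count means jobs are still queued or running and would be lost on termination.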
## Quick Actions
The top-right area of the Cluster Details page also provides shortcut buttons:
| Button | Description |
|---|---|
| Refresh | Reload the current cluster status and metrics |
| Connect | Open a connection help panel with SSH instructions for the cluster |
| Restart All Workers | Restart all compute nodes without restarting the Slurm controller |
| Restart Cluster | Restart the entire cluster including the Slurm controller |
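The Connect panel shows an SSH command assembled from the cluster's connection details. A sketch of what such a command looks like — the floating IP, key path, and `ubuntu` username below are placeholders, not values from your cluster; take the real values from the Details tab:

```shell
# Placeholders: substitute your cluster's floating IP and SSH key path.
FLOATING_IP="203.0.113.10"
KEY_PATH="$HOME/.ssh/cluster_key"

# Print the SSH command in the form the Connect panel provides.
echo "ssh -i $KEY_PATH ubuntu@$FLOATING_IP"
```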
## Cluster Tabs Reference
Each cluster's detail page is organized into the following tabs:
| Tab | Description |
|---|---|
| Details | Cluster name, image version, status, node count, creator, and creation time; plan details (plan name, price, CPU, memory, GPU); connection details (SSH keys, floating IP, SSH command) |
| Cluster Overview | GPU and job summary cards, node health status (IDLE / ALLOCATED / MIXED / UNKNOWN), and Slurm partition table |
| Nodes | Node-level DCGM metrics; filter nodes by All, Failed, XID Errors, or Healthy |
| Jobs | All Slurm jobs with Running / Pending / Completed / Failed / Unknown counters and a detailed job table |
| Monitoring | Per-node GPU metrics overlay with time-interval controls (5m, 15m, 1h, 6h, 1d) and job summary stats |
| Logs | Slurm controller and node logs; select replica, set auto-refresh, and filter by last N lines |
| Volumes | Manage attached storage — Datasets, Shared File System (SFS), and Parallel File System (PFS) |
| Network & Security | VPC configuration, Reserve IP management, and Security Group assignment |
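The node states shown in the Cluster Overview tab (IDLE, ALLOCATED, MIXED, UNKNOWN) come from Slurm, so you can also inspect them from the login node. A minimal sketch, assuming `sinfo` is on the PATH:

```shell
# List each node with its Slurm state (idle, allocated, mixed, ...),
# mirroring the node health view in the Cluster Overview tab.
# Falls back to a message on machines without Slurm installed.
sinfo --Node --format="%N %T" 2>/dev/null || echo "sinfo not available"
```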