Actions

Manage your Training Cluster through the Actions menu and the quick-action buttons on the Cluster Details page.


Actions Menu

The Actions menu is available in two places:

  • Training Cluster list: click the ⋮ icon on any cluster row.
  • Cluster Details page: click the Actions button in the top-right area.

| Action | Description |
| --- | --- |
| Update SSH Keys | Add or replace SSH keys on the cluster nodes |
| Update Image | Change the container image running on cluster nodes without recreating the cluster |
| Scale Cluster | Increase the number of nodes on a running cluster |
| Convert to Committed | Switch from On-Demand hourly pricing to a Committed plan |
| Restart All Workers | Restart all compute nodes in the cluster simultaneously |
| Restart Cluster | Restart the full cluster, including the Slurm controller and all nodes |
| Clone Cluster | Create a new cluster with the same configuration |
| Terminate Cluster | Stop the cluster and release all associated resources |
| Delete Training Cluster | Permanently delete a terminated cluster and all its records |

Danger: Terminating a cluster is irreversible. Ensure no active workloads are running before terminating.

Warning: Only terminated clusters can be deleted. Terminate the cluster first if it is still running.
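
The ordering rule in the two notes above (terminate first, then delete) can be sketched as a small guard. This is only an illustration: the `ClusterStatus` values and both function names are hypothetical and are not part of any platform API described on this page.

```python
from enum import Enum


class ClusterStatus(Enum):
    # Hypothetical status names; the platform's real values may differ.
    RUNNING = "running"
    TERMINATED = "terminated"


def can_delete(status: ClusterStatus) -> bool:
    """Deletion is allowed only once the cluster has been terminated."""
    return status is ClusterStatus.TERMINATED


def next_step_for_removal(status: ClusterStatus) -> str:
    """Return the menu action to use next when the goal is to remove a cluster."""
    if can_delete(status):
        return "Delete Training Cluster"
    # A cluster that is still running must be terminated before it can be deleted.
    return "Terminate Cluster"
```

For a running cluster, `next_step_for_removal(ClusterStatus.RUNNING)` returns `"Terminate Cluster"`, which mirrors the warning above.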


Quick Actions

The top-right area of the Cluster Details page also provides direct shortcut buttons:

| Button | Description |
| --- | --- |
| Refresh | Reload the current cluster status and metrics |
| Connect | Open a connection help panel with SSH instructions for the cluster |
| Restart All Workers | Restart all compute nodes without restarting the Slurm controller |
| Restart Cluster | Restart the entire cluster, including the Slurm controller |
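
The difference in scope between the two restart buttons can be illustrated with a short sketch. The `CONTROLLER` label, the worker names, and the `restart_targets` helper are all hypothetical and exist only for this example; they do not describe how the platform performs the restart.

```python
# Hypothetical label for the Slurm controller, used only in this illustration.
CONTROLLER = "slurm-controller"


def restart_targets(action: str, workers: list[str]) -> list[str]:
    """Return which components a given restart action touches."""
    if action == "Restart All Workers":
        # Compute nodes only; the Slurm controller keeps running.
        return list(workers)
    if action == "Restart Cluster":
        # The controller and every compute node.
        return [CONTROLLER, *workers]
    raise ValueError(f"Not a restart action: {action!r}")
```

With workers `["worker-0", "worker-1"]`, "Restart All Workers" yields only the two workers, while "Restart Cluster" also includes the controller.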

Cluster Tabs Reference

Each cluster's detail page is organized into the following tabs:

| Tab | Description |
| --- | --- |
| Details | Cluster name, image version, status, node count, created by/at; Plan details (plan name, price, CPU, memory, GPU); Connection details (SSH keys, Floating IP, SSH command) |
| Cluster Overview | GPU and job summary cards, node health status (IDLE / ALLOCATED / MIXED / UNKNOWN), and Slurm partition table |
| Nodes | Node-level DCGM metrics; filter nodes by All, Failed, XID Errors, or Healthy |
| Jobs | All Slurm jobs with Running / Pending / Completed / Failed / Unknown counters and a detailed job table |
| Monitoring | Per-node GPU metrics overlay with time-interval controls (5m, 15m, 1h, 6h, 1d) and job summary stats |
| Logs | Slurm controller and node logs; select replica, set auto-refresh, and filter by last N lines |
| Volumes | Manage attached storage: Datasets, Shared File System (SFS), and Parallel File System (PFS) |
| Network & Security | VPC configuration, Reserve IP management, and Security Group assignment |
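
The node health counts shown on the Cluster Overview tab can be cross-checked from an SSH session on the cluster. The sketch below tallies node states from text in the shape that Slurm's `sinfo -h -N -o "%N %T"` prints (node name, then extended state). The sample node names are made up, and grouping every other state under UNKNOWN is an assumption for this example, not a documented rule of the tab.

```python
from collections import Counter

# Sample text in the shape produced by `sinfo -h -N -o "%N %T"`.
# The node names are invented for this illustration.
SAMPLE = """\
node-1 idle
node-2 allocated
node-3 mixed
node-4 allocated
node-5 down*
"""

# States named on the Cluster Overview tab, apart from UNKNOWN.
KNOWN_STATES = ("IDLE", "ALLOCATED", "MIXED")


def summarize_node_states(sinfo_text: str) -> Counter:
    """Count nodes per state, bucketing unrecognized states as UNKNOWN."""
    counts = Counter()
    for line in sinfo_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        # Drop Slurm's state suffix flags (for example "*" or "~") before comparing.
        state = "".join(ch for ch in parts[1] if ch.isalpha()).upper()
        # Assumption: anything outside the three named states counts as UNKNOWN.
        counts[state if state in KNOWN_STATES else "UNKNOWN"] += 1
    return counts


if __name__ == "__main__":
    for state, count in sorted(summarize_node_states(SAMPLE).items()):
        print(f"{state}: {count}")
```

Note that `sinfo -N` lists a node once per partition it belongs to, so on a cluster where partitions overlap the input would need de-duplicating by node name first.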