Features

1. Node Lifecycle States

Every node in a Private Cluster transitions through the following states:

State | Description
Free | Available for allocation to a project
Allocated | Assigned to a project but not yet running a workload
Occupied | Actively running a workload (Node, Inference Endpoint, Training Cluster, or Vector Database)

Note: Occupied nodes cannot be deallocated. Stop or delete the running workload before freeing the node.

Understanding these states helps you manage cluster capacity without disrupting running services.
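
The three states can be read as a small state machine. The sketch below is illustrative only: the names `NodeState`, `ALLOWED_TRANSITIONS`, and `can_transition` are hypothetical and not part of any Private Cluster API. It also assumes that stopping a workload returns the node to Allocated rather than directly to Free, and that a Free node must be allocated before it can run a workload.

```python
from enum import Enum


class NodeState(Enum):
    FREE = "Free"
    ALLOCATED = "Allocated"
    OCCUPIED = "Occupied"


# Allowed direct moves, read off the state descriptions above.
# Occupied -> Free is deliberately absent: the workload must be stopped
# or deleted first. That this lands the node in Allocated is an assumption.
ALLOWED_TRANSITIONS = {
    NodeState.FREE: {NodeState.ALLOCATED},
    NodeState.ALLOCATED: {NodeState.FREE, NodeState.OCCUPIED},
    NodeState.OCCUPIED: {NodeState.ALLOCATED},
}


def can_transition(current: NodeState, target: NodeState) -> bool:
    """Return True if a node may move directly from current to target."""
    return target in ALLOWED_TRANSITIONS[current]


print(can_transition(NodeState.ALLOCATED, NodeState.FREE))  # True
print(can_transition(NodeState.OCCUPIED, NodeState.FREE))   # False
```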

2. Node Monitoring

The Node Monitoring view provides real-time visibility into the health and performance of every node in your cluster.

Available Metrics

Metric | Description
GPU Usage | Current GPU utilization as a percentage
Memory | Memory consumption and availability
Uptime | Node uptime percentage indicating reliability
Power | Current power consumption in watts

How to Access

  1. Open your Private Cluster.
  2. Navigate to the Cluster Nodes tab.
  3. Select a node to view its detailed metrics.

Benefits

  • Proactive Management – Identify underutilized or overloaded nodes before issues arise
  • Cost Optimization – Make informed allocation decisions based on real usage data
  • Performance Tracking – Ensure service reliability through uptime monitoring
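
As a rough illustration of how these metrics could drive the decisions listed above, the sketch below flags nodes from a snapshot of their values. Everything in it is an assumption: the `NodeMetrics` fields, the threshold constants, and the `triage` helper are hypothetical, the thresholds are arbitrary placeholders, and the snapshot is expected to be copied by hand from the Cluster Nodes view rather than fetched from an API.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    name: str
    gpu_usage_pct: float    # GPU Usage
    memory_used_pct: float  # Memory, expressed here as percent used
    uptime_pct: float       # Uptime
    power_watts: float      # Power


# Placeholder thresholds; tune these for your own workloads.
LOW_GPU_PCT = 10.0
HIGH_GPU_PCT = 90.0
MIN_UPTIME_PCT = 99.0


def triage(node: NodeMetrics) -> list[str]:
    """Return human-readable flags for a single node snapshot."""
    flags = []
    if node.gpu_usage_pct < LOW_GPU_PCT:
        flags.append("underutilized")
    if node.gpu_usage_pct > HIGH_GPU_PCT:
        flags.append("overloaded")
    if node.uptime_pct < MIN_UPTIME_PCT:
        flags.append("low uptime")
    return flags


# Made-up sample values, for illustration only.
idle_node = NodeMetrics("node-1", 5.0, 20.0, 99.9, 300.0)
print(idle_node.name, triage(idle_node))  # node-1 ['underutilized']
```

An underutilized node is a candidate for deallocation once its workload is stopped, while a node flagged for low uptime is one to investigate before placing new workloads on it.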

3. Access Control

Private Cluster access is governed by IAM roles, ensuring only authorized users can manage cluster capacity and node allocation.

Role-Based Access Matrix

Role | View Cluster | Create / Update Cluster | Allocate Nodes | Deallocate Nodes | Scope
Admin / Owner | Yes | Yes | Yes | Yes | Cluster (CRN)
Project Manager | Yes | No | No | No | Assigned projects
Project Lead | Yes | No | No | No | Assigned project only
Member | Yes (Read-only) | No | No | No | As per IAM policy

Common Access Scenarios

  • I am a Project Manager and want to free GPUs from a project → Not allowed
  • I am a Project Lead and want to resize the cluster → Not allowed
  • I am a Member and want to view cluster usage → Allowed
  • I am an Admin and want to allocate nodes to a project → Allowed
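
The matrix and scenarios above can be expressed as a lookup table. This is a minimal sketch for reasoning about the rules, not the platform's IAM implementation: the action keys (`view`, `create_update`, `allocate`, `deallocate`) and the `is_allowed` helper are hypothetical names, and the Scope column is not modeled.

```python
# Role -> action -> permitted, transcribed from the access matrix.
ACCESS_MATRIX = {
    "Admin / Owner": {
        "view": True, "create_update": True, "allocate": True, "deallocate": True,
    },
    "Project Manager": {
        "view": True, "create_update": False, "allocate": False, "deallocate": False,
    },
    "Project Lead": {
        "view": True, "create_update": False, "allocate": False, "deallocate": False,
    },
    "Member": {
        "view": True, "create_update": False, "allocate": False, "deallocate": False,
    },
}


def is_allowed(role: str, action: str) -> bool:
    """Look up a permission; unknown roles or actions are denied."""
    return ACCESS_MATRIX.get(role, {}).get(action, False)


# The four scenarios listed above.
print(is_allowed("Project Manager", "deallocate"))  # False
print(is_allowed("Project Lead", "create_update"))  # False
print(is_allowed("Member", "view"))                 # True
print(is_allowed("Admin / Owner", "allocate"))      # True
```

Denying by default for unknown roles is a design choice of this sketch, so that a typo in a role name never grants access.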

4. Node Allocation

Node allocation controls how GPU resources are distributed across your projects.

Allocation Flow

Cluster Created
      │
      ▼
Nodes: Free ──► Allocate to Project ──► Nodes: Allocated
                                              │
                                              ▼
                          Launch Workload ──► Nodes: Occupied

Key Rules

  • Free nodes can be allocated to any project by users with the appropriate permissions.
  • Allocated nodes can be deallocated as long as no workload is running on them.
  • Occupied nodes cannot be deallocated — stop or delete the workload first.
  • Allocation and deallocation can be done from both the Project Allocation and Cluster Nodes views.
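
The key rules can be sketched as guards on a small in-memory model. The `Cluster` class, its method names, and `AllocationError` below are hypothetical and exist only to illustrate the rules; real allocation is done through the Project Allocation and Cluster Nodes views.

```python
class AllocationError(Exception):
    """Raised when an operation would break one of the key rules."""


class Cluster:
    def __init__(self, node_count: int) -> None:
        # Every node starts Free: no project and no workload.
        self.project = {n: None for n in range(node_count)}
        self.workload = {n: None for n in range(node_count)}

    def state(self, node: int) -> str:
        if self.workload[node] is not None:
            return "Occupied"
        if self.project[node] is not None:
            return "Allocated"
        return "Free"

    def allocate(self, node: int, project: str) -> None:
        # Only Free nodes can be allocated to a project.
        if self.state(node) != "Free":
            raise AllocationError(f"node {node} is not Free")
        self.project[node] = project

    def launch(self, node: int, workload: str) -> None:
        # A workload needs a node that is Allocated and idle.
        if self.state(node) != "Allocated":
            raise AllocationError(f"node {node} is not Allocated")
        self.workload[node] = workload

    def stop(self, node: int) -> None:
        # Stopping or deleting the workload leaves the node Allocated.
        self.workload[node] = None

    def deallocate(self, node: int) -> None:
        # Occupied nodes cannot be deallocated.
        if self.state(node) == "Occupied":
            raise AllocationError("stop or delete the workload first")
        self.project[node] = None


cluster = Cluster(node_count=2)
cluster.allocate(0, "project-a")
cluster.launch(0, "training-job")
print(cluster.state(0))  # Occupied
cluster.stop(0)
cluster.deallocate(0)
print(cluster.state(0))  # Free
```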

5. Multi-Project Sharing

A single Private Cluster can serve multiple projects simultaneously. Each project gets its own slice of the cluster without sharing workload environments.

Example

A 10-node cluster shared across two projects, with the remaining nodes in the Free pool:

Project | Nodes Allocated | Workloads Running
Project A (model training) | 4 | 4 Training Clusters
Project B (inference) | 3 | 3 Inference Endpoints
Free pool | 3 | -

Billing remains fixed regardless of how nodes are distributed. GPUs can be reallocated at any time without redeploying infrastructure.
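
The arithmetic in this example can be checked with a short sketch. The helper names `free_pool` and `allocate_from_free` are hypothetical; the point is only that moving nodes between the Free pool and a project never changes the cluster size that billing is based on.

```python
CLUSTER_SIZE = 10
allocations = {"Project A": 4, "Project B": 3}


def free_pool(cluster_size: int, allocations: dict[str, int]) -> int:
    """Nodes left in the Free pool after all project allocations."""
    allocated = sum(allocations.values())
    if allocated > cluster_size:
        raise ValueError("more nodes allocated than the cluster contains")
    return cluster_size - allocated


def allocate_from_free(
    cluster_size: int, allocations: dict[str, int], project: str, count: int
) -> dict[str, int]:
    """Return a new allocation map with `count` Free nodes given to `project`."""
    if count > free_pool(cluster_size, allocations):
        raise ValueError("not enough Free nodes")
    updated = dict(allocations)
    updated[project] = updated.get(project, 0) + count
    return updated


print(free_pool(CLUSTER_SIZE, allocations))  # 3

# Give two Free nodes to Project B; the cluster is still 10 nodes.
updated = allocate_from_free(CLUSTER_SIZE, allocations, "Project B", 2)
print(updated["Project B"], free_pool(CLUSTER_SIZE, updated))  # 5 1
```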

Best Practices for Private Clusters

Right-size your cluster

Start with the minimum node count you need, then use Update Configuration to add nodes as workloads grow. Node count cannot be reduced without deleting the cluster.

Deallocate idle nodes

Return unneeded nodes to the Free pool so other projects can use them without resizing the cluster.

Use committed plans for predictable loads

Committed pricing offers lower per-node rates for steady-state training and inference workloads.

Monitor before deallocating

Check the Cluster Nodes view to confirm that a node is not Occupied before you deallocate it. Occupied nodes cannot be deallocated until the workload is stopped or deleted.