Features
Node Lifecycle States
Understand Free, Allocated, and Occupied node states to manage cluster capacity.
Node Monitoring
Real-time GPU, memory, uptime, and power metrics for every node in your cluster.
Access Control
Role-based permissions matrix governing cluster creation, allocation, and visibility.
Node Allocation
Distribute GPU nodes across projects and reclaim them dynamically without redeployment.
Multi-Project Sharing
Share a single cluster across multiple teams and projects with isolated billing.
Multi-Instance GPU (MIG)
Partition a single GPU into isolated instances to maximize utilization across workloads.
1. Node Lifecycle States
Every node in a Private Cluster transitions through the following states:
| State | Description |
|---|---|
| Free | Available for allocation to a project |
| Allocated | Assigned to a project but not yet running a workload |
| Occupied | Actively running a workload (Node, Inference Endpoint, Training Cluster, or Vector Database) |
Understanding these states helps you manage cluster capacity without disrupting running services.
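To make the three states concrete, here is a minimal sketch that models them as a Python enum and tallies a cluster's capacity by state. The names `NodeState` and `capacity_summary`, and the sample five-node cluster, are hypothetical and used only for illustration; they are not part of any platform SDK.

```python
from collections import Counter
from enum import Enum


class NodeState(Enum):
    """Lifecycle states from the table above (illustrative names)."""
    FREE = "Free"            # available for allocation to a project
    ALLOCATED = "Allocated"  # assigned to a project, no workload yet
    OCCUPIED = "Occupied"    # actively running a workload


def capacity_summary(states):
    """Count how many nodes are in each lifecycle state."""
    counts = Counter(states)
    return {state.value: counts.get(state, 0) for state in NodeState}


# Hypothetical five-node cluster, for illustration only.
nodes = [
    NodeState.FREE,
    NodeState.FREE,
    NodeState.ALLOCATED,
    NodeState.OCCUPIED,
    NodeState.OCCUPIED,
]
summary = capacity_summary(nodes)
print(summary)  # {'Free': 2, 'Allocated': 1, 'Occupied': 2}
```

A summary like this shows at a glance how much capacity can still be allocated (Free) and how much can be reclaimed without stopping a workload (Allocated).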
2. Node Monitoring
The Node Monitoring view provides real-time visibility into the health and performance of every node in your cluster.
Available Metrics
| Metric | Description |
|---|---|
| GPU Usage | Current GPU utilization as a percentage |
| Memory | Memory consumption and availability |
| Uptime | Node uptime percentage indicating reliability |
| Power | Current power consumption in watts |
How to Access
- Open your Private Cluster.
- Navigate to the Cluster Nodes tab.
- Select a node to view its detailed metrics.
Benefits
- Proactive Management – Identify underutilized or overloaded nodes before issues arise
- Cost Optimization – Make informed allocation decisions based on real usage data
- Performance Tracking – Ensure service reliability through uptime monitoring
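As a rough illustration of how these metrics can drive proactive management, the sketch below classifies nodes as underutilized, overloaded, or healthy. Everything in it is an assumption made for the example: the `NodeMetrics` fields simply mirror the Available Metrics table, memory is assumed to be reported as a percentage, the thresholds are arbitrary, and the sample values are invented. It does not call any platform API.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    """One metrics sample; fields mirror the Available Metrics table."""
    name: str
    gpu_usage_pct: float    # GPU utilization, 0-100
    memory_used_pct: float  # memory consumption, 0-100 (assumed unit)
    uptime_pct: float       # uptime percentage
    power_watts: float      # current power draw in watts


# Assumed thresholds for illustration only; tune them to your workloads.
UNDERUTILIZED_BELOW = 20.0
OVERLOADED_ABOVE = 90.0


def classify(node: NodeMetrics) -> str:
    """Label a node by comparing its usage against the assumed thresholds."""
    if (node.gpu_usage_pct >= OVERLOADED_ABOVE
            or node.memory_used_pct >= OVERLOADED_ABOVE):
        return "overloaded"
    if node.gpu_usage_pct < UNDERUTILIZED_BELOW:
        return "underutilized"
    return "healthy"


# Invented sample readings.
samples = [
    NodeMetrics("node-1", gpu_usage_pct=5.0, memory_used_pct=10.0,
                uptime_pct=99.9, power_watts=120.0),
    NodeMetrics("node-2", gpu_usage_pct=97.0, memory_used_pct=88.0,
                uptime_pct=99.5, power_watts=680.0),
    NodeMetrics("node-3", gpu_usage_pct=60.0, memory_used_pct=55.0,
                uptime_pct=99.7, power_watts=410.0),
]
report = {node.name: classify(node) for node in samples}
print(report)
```

Underutilized nodes are candidates for deallocation, while overloaded nodes suggest a project needs more capacity.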
3. Access Control
Private Cluster access is governed by IAM roles, ensuring only authorized users can manage cluster capacity and node allocation.
Role-Based Access Matrix
| Role | View Cluster | Create / Update Cluster | Allocate Nodes | Deallocate Nodes | Scope |
|---|---|---|---|---|---|
| Admin / Owner | Yes | Yes | Yes | Yes | Cluster (CRN) |
| Project Manager | Yes | No | No | No | Assigned projects |
| Project Lead | Yes | No | No | No | Assigned project only |
| Member | Yes (Read-only) | No | No | No | As per IAM policy |
Common Access Scenarios
- I am a Project Manager and want to free GPUs from a project → Not allowed
- I am a Project Lead and want to resize the cluster → Not allowed
- I am a Member and want to view cluster usage → Allowed
- I am an Admin and want to allocate nodes to a project → Allowed
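The matrix and scenarios above can be expressed as a simple lookup. This is a sketch of the permission logic only: the action names (`view`, `create_update`, `allocate`, `deallocate`) and the `is_allowed` helper are hypothetical, and the Scope column is not modelled, since the real checks are enforced by the platform's IAM.

```python
# Role -> set of permitted actions, transcribed from the access matrix.
# Action names are illustrative, not platform identifiers.
PERMISSIONS = {
    "Admin / Owner": {"view", "create_update", "allocate", "deallocate"},
    "Project Manager": {"view"},
    "Project Lead": {"view"},
    "Member": {"view"},  # read-only
}


def is_allowed(role: str, action: str) -> bool:
    """Return True if the role may perform the action; unknown roles get nothing."""
    return action in PERMISSIONS.get(role, set())


# The four common access scenarios listed above.
scenarios = [
    ("Project Manager", "deallocate"),  # free GPUs from a project
    ("Project Lead", "create_update"),  # resize the cluster
    ("Member", "view"),                 # view cluster usage
    ("Admin / Owner", "allocate"),      # allocate nodes to a project
]
results = [is_allowed(role, action) for role, action in scenarios]
for (role, action), ok in zip(scenarios, results):
    print(f"{role} -> {action}: {'Allowed' if ok else 'Not allowed'}")
```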
4. Node Allocation
Node allocation controls how GPU resources are distributed across your projects.
Allocation Flow
Cluster Created
       │
       ▼
Nodes: Free ──► Allocate to Project ──► Nodes: Allocated
                                               │
                                               ▼
                                Launch Workload ──► Nodes: Occupied
Key Rules
- Free nodes can be allocated to any project by users with the appropriate permissions.
- Allocated nodes can be deallocated as long as no workload is running on them.
- Occupied nodes cannot be deallocated — stop or delete the workload first.
- Allocation and deallocation can be done from both the Project Allocation and Cluster Nodes views.
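The allocation flow and key rules can be sketched as a small state machine. The `Node` class and its method names are hypothetical, and the model assumes that stopping a workload returns the node to Allocated, which the rules imply but do not state outright. Permission checks are omitted.

```python
from enum import Enum


class NodeState(Enum):
    FREE = "Free"
    ALLOCATED = "Allocated"
    OCCUPIED = "Occupied"


class Node:
    """Illustrative model of one cluster node and its allowed transitions."""

    def __init__(self, name: str):
        self.name = name
        self.state = NodeState.FREE
        self.project = None

    def allocate(self, project: str) -> None:
        # Only Free nodes can be allocated to a project.
        if self.state is not NodeState.FREE:
            raise ValueError(f"{self.name} is {self.state.value}, not Free")
        self.state = NodeState.ALLOCATED
        self.project = project

    def launch_workload(self) -> None:
        # A workload runs only on a node already allocated to a project.
        if self.state is not NodeState.ALLOCATED:
            raise ValueError(f"{self.name} is {self.state.value}, not Allocated")
        self.state = NodeState.OCCUPIED

    def stop_workload(self) -> None:
        # Assumption: stopping the workload returns the node to Allocated.
        if self.state is not NodeState.OCCUPIED:
            raise ValueError(f"{self.name} has no running workload")
        self.state = NodeState.ALLOCATED

    def deallocate(self) -> None:
        # Occupied nodes cannot be deallocated; stop the workload first.
        if self.state is NodeState.OCCUPIED:
            raise ValueError(f"{self.name} is Occupied; stop the workload first")
        if self.state is NodeState.FREE:
            raise ValueError(f"{self.name} is already Free")
        self.state = NodeState.FREE
        self.project = None


node = Node("node-1")
node.allocate("project-a")
node.launch_workload()
try:
    node.deallocate()  # rejected while the workload is running
except ValueError as err:
    print(err)
node.stop_workload()
node.deallocate()      # succeeds once the node is back to Allocated
print(node.state.value)  # Free
```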
5. Multi-Project Sharing
A single Private Cluster can serve multiple projects simultaneously. Each project gets its own slice of the cluster without sharing workload environments.
Example
A 10-node cluster shared across two projects, with the remaining nodes in the Free pool:
| Project | Nodes Allocated | Workloads Running |
|---|---|---|
| Project A (model training) | 4 | 4 Training Clusters |
| Project B (inference) | 3 | 3 Inference Endpoints |
| Free pool | 3 | — |
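The arithmetic behind the table is simple: the Free pool is whatever remains after per-project allocations are subtracted from the cluster size. A minimal sketch, using the figures from the table and a hypothetical `free_pool` helper:

```python
TOTAL_NODES = 10

# Nodes allocated per project, taken from the example table.
allocations = {"Project A": 4, "Project B": 3}


def free_pool(total: int, allocated: dict) -> int:
    """Return the number of nodes left in the Free pool."""
    used = sum(allocated.values())
    if used > total:
        raise ValueError(f"{used} nodes allocated but the cluster has only {total}")
    return total - used


remaining = free_pool(TOTAL_NODES, allocations)
print(remaining)  # 3
```

Any new project can draw on the remaining nodes without resizing the cluster, as long as its request does not exceed the Free pool.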
Best Practices for Private Clusters
- Start with the minimum node count you need and use Update Configuration to add nodes as workloads grow. You cannot reduce the node count without deleting the cluster.
- Return unneeded nodes to the Free pool so other projects can use them without resizing the cluster.
- Use committed pricing for steady-state training and inference workloads; it offers lower per-node rates.
- Check the Cluster Nodes view to confirm a node is not Occupied before attempting to deallocate it. Occupied nodes cannot be deallocated until the workload is stopped or deleted.