Troubleshooting and FAQs
FAQs
1. What is the difference between a Private Cluster and an On-Demand Instance?
An On-Demand Instance is a single compute environment billed per hour while running. A Private Cluster is a dedicated pool of GPU nodes billed at a fixed rate regardless of utilization. Private Clusters provide guaranteed availability, multi-project sharing, and no additional charges for services deployed inside them.
2. Can I share a Private Cluster across multiple projects?
Yes. A Private Cluster can be shared across multiple projects. Admins and Owners manage the overall cluster, while Project Managers and Project Leads can access nodes within their scope.
3. Can I stop billing for individual nodes without deleting the cluster?
No. Hourly billing applies to every node in the cluster for as long as the cluster exists. To stop billing, you must delete the cluster entirely. For predictable long-term workloads, consider a committed plan for lower per-node rates.
4. Can I add more nodes to an existing cluster?
Yes. Use Update Configuration from the cluster-level Actions menu to increase the total node count. For committed clusters, newly added nodes are billed at hourly rates regardless of the base commitment.
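As a rough illustration of how the two charges combine, the sketch below models a committed cluster that is scaled up after creation. The amounts, the rate, and the names CommittedCluster and total_cost are hypothetical and are not taken from the platform's pricing or API.

```python
from dataclasses import dataclass


@dataclass
class CommittedCluster:
    committed_amount: float      # deducted upfront for the base nodes (hypothetical value)
    added_nodes: int             # nodes added later through Update Configuration
    hourly_rate_per_node: float  # hypothetical hourly rate for the added nodes


def total_cost(cluster: CommittedCluster, hours_since_scale_up: float) -> float:
    """Committed amount plus hourly charges for nodes added after creation."""
    hourly_part = (
        cluster.added_nodes * cluster.hourly_rate_per_node * hours_since_scale_up
    )
    return cluster.committed_amount + hourly_part


example = CommittedCluster(
    committed_amount=10_000.0, added_nodes=2, hourly_rate_per_node=3.0
)
print(total_cost(example, hours_since_scale_up=100))  # 10000 + 2 * 3 * 100 = 10600.0
```

The point of the model is that the added nodes accrue hourly charges on top of the commitment rather than being absorbed into it.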
5. Can I reduce the node count of a cluster?
No. The cluster node count can only be increased, not decreased. To reduce capacity, you must delete the cluster and recreate it with a lower node count.
6. Why can't I deallocate a node?
A node in the Occupied state cannot be deallocated because it is actively running a workload. Stop or delete the workload (Instance, Inference Endpoint, Training Cluster, or Vector Database) running on that node first, then attempt deallocation again.
7. What happens to running workloads if I delete the cluster?
You cannot delete a cluster if there are running resources inside it. First, stop or release all resources created within the cluster. Storage resources (SFS, PFS, Datasets) are billed independently and are not deleted automatically when the cluster is deleted.
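The deletion rule can be sketched as a simple pre-check over the resources in a cluster. This is a minimal model, not a platform API: the Resource class, its fields, and both function names are assumptions made for illustration.

```python
from dataclasses import dataclass

# Storage kinds named in the answer above; they are billed independently.
STORAGE_KINDS = {"SFS", "PFS", "Dataset"}


@dataclass
class Resource:
    name: str
    kind: str      # e.g. "Instance", "Inference Endpoint", "SFS"
    running: bool


def deletion_blockers(resources: list[Resource]) -> list[str]:
    """Running non-storage resources that must be stopped or released first."""
    return [r.name for r in resources if r.running and r.kind not in STORAGE_KINDS]


def persists_after_deletion(resources: list[Resource]) -> list[str]:
    """Storage resources that are not deleted with the cluster and keep billing."""
    return [r.name for r in resources if r.kind in STORAGE_KINDS]


resources = [
    Resource("llm-endpoint", "Inference Endpoint", running=True),
    Resource("notebook", "Instance", running=False),
    Resource("shared-fs", "SFS", running=True),
]
print(deletion_blockers(resources))        # ['llm-endpoint']
print(persists_after_deletion(resources))  # ['shared-fs']
```

An empty blocker list corresponds to a cluster that can be deleted. Anything in the second list needs separate cleanup if it is no longer required.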
8. Is there a refund if I delete a committed cluster early?
No. Committed clusters are not eligible for partial refunds. The full committed amount is deducted upfront, and no refund is issued for the remaining period if you delete the cluster early.
9. Can I convert a committed cluster to hourly billing before the period ends?
No. The billing model cannot be changed during an active commitment period. The post-commitment policy (Auto-Renew, Switch to Hourly, or Auto-Terminate) takes effect only after the commitment period ends.
10. How long does it take to create a Private Cluster?
Private Cluster creation typically takes 1 to 2 minutes depending on the node count and GPU type requested.
11. Can nodes in my Private Cluster run different service types simultaneously?
Yes. Allocated nodes can run different service types across different projects simultaneously. Each node runs the specific workload it has been assigned within its project.
12. What happens to my Private Cluster if my account is deprovisioned?
If your account is deprovisioned, all Private Clusters and their associated workloads are permanently deleted. Ensure all critical data is backed up before account deprovisioning.
Troubleshooting
1. Cluster Stuck in Creating State
Cause: The requested GPU hardware is not immediately available in inventory, or provisioning is taking longer than usual.
Resolution: Wait up to 10 minutes for the cluster to reach a Running state. If the cluster remains in a Creating state beyond 10 minutes, note the Cluster ID and contact support.
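If you check the cluster status from a script, the wait-then-escalate rule above can be expressed as a polling loop with a deadline. This is a sketch under assumptions: get_status is a hypothetical callable that you would supply (the source does not describe a status API), and the 15-second polling interval is arbitrary.

```python
import time
from typing import Callable


def wait_until_running(
    get_status: Callable[[], str],   # hypothetical: returns "Creating", "Running", ...
    timeout_s: float = 600.0,        # the 10-minute limit from the resolution above
    poll_s: float = 15.0,            # arbitrary polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the status is Running, False if the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "Running":
            return True
        if time.monotonic() >= deadline:
            return False  # still not Running: note the Cluster ID and contact support
        sleep(poll_s)


# Demo with a fake status source, so nothing real is polled and nothing sleeps.
fake_states = iter(["Creating", "Creating", "Running"])
print(wait_until_running(lambda: next(fake_states), poll_s=0, sleep=lambda s: None))
```

A False result corresponds to the escalation step: record the Cluster ID and contact support.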
2. Cannot Deallocate a Node
Cause: The node is in an Occupied state, meaning a workload is actively running on it.
Resolution:
- Navigate to Cluster Nodes and identify the service running on the node.
- Go to the respective service (Instance, Inference Endpoint, etc.) and stop or delete it.
- Wait for the node status to change from Occupied to Allocated.
- Retry the deallocation.
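The rule behind these steps can be modeled as a small state check. The state names come from this page; the NodeState enum, the function, and the handling of Free nodes are illustrative assumptions rather than platform behavior.

```python
from enum import Enum


class NodeState(Enum):
    FREE = "Free"
    ALLOCATED = "Allocated"
    OCCUPIED = "Occupied"


def deallocation_check(state: NodeState) -> str:
    """Describe whether a node in the given state can be deallocated."""
    if state is NodeState.OCCUPIED:
        return "blocked: stop or delete the workload on this node first"
    if state is NodeState.ALLOCATED:
        return "ok: node can be deallocated"
    # Assumption: a Free node is not allocated to a project, so there is nothing to do.
    return "nothing to do: node is not allocated"


for state in NodeState:
    print(state.value, "->", deallocation_check(state))
```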
3. Node Allocation Not Reflected in Project
Cause: Permission mismatch or a brief propagation delay after allocation.
Resolution:
- Verify that you have the required role (Admin or Owner) to allocate nodes to the target project.
- Refresh the Project Allocation view.
- If the issue persists after 2 minutes, contact support with your Cluster ID and Project ID.
4. Workload Cannot Start on Allocated Node
Cause: The node was allocated but may have entered an unhealthy state, or the workload configuration has a resource mismatch.
Resolution:
- Check the node status in Cluster Nodes and confirm that it shows Allocated (not Occupied or Free).
- Review the workload configuration to ensure it matches the node's GPU and memory specifications.
- If the node shows unhealthy metrics in Node Monitoring, deallocate it and allocate a different Free node.
- Contact support if the issue persists.
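The configuration review in the second step amounts to comparing what the workload requests against what the node provides. The sketch below shows one way to make that comparison explicit. The field names, the GPU model string, and the numbers are hypothetical and do not reflect the platform's actual configuration schema.

```python
from dataclasses import dataclass


@dataclass
class NodeSpec:
    gpu_type: str
    gpu_count: int
    memory_gb: int


@dataclass
class WorkloadRequest:
    gpu_type: str
    gpu_count: int
    memory_gb: int


def find_mismatches(node: NodeSpec, request: WorkloadRequest) -> list[str]:
    """List every way the request exceeds or differs from the node's resources."""
    problems = []
    if request.gpu_type != node.gpu_type:
        problems.append(
            f"GPU type: requested {request.gpu_type}, node has {node.gpu_type}"
        )
    if request.gpu_count > node.gpu_count:
        problems.append(
            f"GPU count: requested {request.gpu_count}, node has {node.gpu_count}"
        )
    if request.memory_gb > node.memory_gb:
        problems.append(
            f"memory: requested {request.memory_gb} GB, node has {node.memory_gb} GB"
        )
    return problems


node = NodeSpec(gpu_type="H100", gpu_count=8, memory_gb=640)
request = WorkloadRequest(gpu_type="H100", gpu_count=4, memory_gb=800)
print(find_mismatches(node, request))
```

An empty list suggests the configuration is not the cause, which points toward the node health check in the next step.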
System Health Best Practices
- Monitor Before Deallocating: Always check the Cluster Nodes view to confirm a node is not Occupied before deallocating.
- Stop Workloads Before Deleting the Cluster: Stop or release all running services before deleting the cluster to ensure a clean resource release.
- Contact Support for Stuck States: If your cluster or node remains in a transitional state for more than 15 minutes, note the Cluster ID and contact the support team.