Training Cluster

The Training Cluster provides a dedicated environment for running training workloads with predefined allocations of RAM, CPU, and GPU resources. Pricing is fixed: you pay the same amount regardless of how much of the allocated resources you actually use. Creating and running deployments within a Training Cluster does not incur any extra charges.
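As a rough illustration of what fixed pricing means in practice, the sketch below compares the bill and the effective cost per fully utilized hour at different utilization levels. The hourly billing unit and the price of 100.0 are assumptions made for the example, not actual plan prices.

```python
def total_cost(hourly_price: float, hours: float) -> float:
    """Flat-rate bill: depends only on how long the cluster exists."""
    return hourly_price * hours


def effective_cost_per_utilized_hour(hourly_price: float, utilization: float) -> float:
    """Cost of one fully utilized hour at an average utilization in (0, 1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_price / utilization


# Assumed price in arbitrary currency units per hour (hypothetical).
HYPOTHETICAL_PRICE = 100.0

for u in (0.25, 0.5, 1.0):
    bill = total_cost(HYPOTHETICAL_PRICE, hours=24)
    effective = effective_cost_per_utilized_hour(HYPOTHETICAL_PRICE, u)
    # The bill stays constant; only the effective cost changes with utilization.
    print(f"utilization={u:.0%} bill={bill:.2f} effective={effective:.2f}/h")
```

The bill is identical in every row; lower utilization only raises the effective cost of the work you actually run, which is why a fixed-price cluster suits sustained workloads.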


Create a Training Cluster

To create a new Training Cluster:

  1. From the sidebar, select Training Cluster.
  2. Click the Create Cluster button.
  3. Enter a Cluster Name.
    • A default name (generated from the current timestamp) is filled in, but you can change it as needed.
  4. Select the desired Cluster Configuration plan.
    • You can choose between an E2E-provided Cluster or your own Private Cluster, depending on your requirements.
    • Configuration details include:
      • Number of Nodes
      • GPU Memory
      • RAM
      • CPU
      • GPU
      • Price
  5. After selecting a plan, click Create to launch your Training Cluster.

Once the cluster is created, it will appear in the Training Cluster list, showing its configuration and current status.
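If you keep your own records of the clusters you create, the plan details listed above map naturally onto a small data structure. The sketch below is only an illustration: the field names, the units (GB, cores), and the timestamp-based name format are assumptions, and it does not call any platform API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ClusterPlan:
    """Mirrors the configuration details shown for a plan (units are assumed)."""
    nodes: int
    gpu_memory_gb: int
    ram_gb: int
    cpu_cores: int
    gpus: int
    price: float

    def __post_init__(self) -> None:
        # Every listed quantity must be positive for a plan to make sense.
        for name in ("nodes", "gpu_memory_gb", "ram_gb", "cpu_cores", "gpus", "price"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")


def default_cluster_name(now: Optional[datetime] = None) -> str:
    """Hypothetical timestamp-based default name; the real format may differ."""
    now = now or datetime.now(timezone.utc)
    return "training-cluster-" + now.strftime("%Y%m%d%H%M%S")


# Example values are made up for illustration.
plan = ClusterPlan(nodes=2, gpu_memory_gb=80, ram_gb=256, cpu_cores=32, gpus=8, price=100.0)
print(default_cluster_name(), plan)
```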


Manage Training Cluster

  • Update Plan – Select the Update Plan option under Actions to modify your existing cluster plan or configuration.
  • Terminate Cluster – Select the Terminate Cluster option under Actions to delete your Training Cluster when it is no longer required.

Overview

In the Overview section, you can view detailed information about your Training Cluster, including:

  • Cluster Name
  • Number of Nodes
  • Plan Name
  • Cluster Node Configuration, displaying the allocated GPU, CPU, and RAM resources.

This view provides a quick summary of your cluster’s setup and resource distribution.


Monitoring

The Monitoring section provides real-time insights into your cluster’s performance metrics.
You can track the following parameters for each node within the Training Cluster:

  • Disk Usage
  • Memory Usage
  • GPU Utilization
  • GPU Memory Utilization
  • GPU Temperature
  • GPU Power Usage
  • CPU Utilization
  • Memory Utilization
  • Disk Total Read Bytes
  • Disk Total Write Bytes

These metrics help ensure optimal utilization and health of your Training Cluster resources.
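One way to act on these readings is to compare them against limits you choose yourself. The sketch below assumes you have already collected per-node values by some means (for example, by copying them from the Monitoring view); the metric keys, node names, and thresholds are all hypothetical and are not part of the platform.

```python
from typing import Dict, List

# Hypothetical alert thresholds; tune these for your hardware and workload.
THRESHOLDS: Dict[str, float] = {
    "gpu_temperature_c": 85.0,
    "gpu_memory_utilization_pct": 95.0,
    "disk_usage_pct": 90.0,
    "memory_utilization_pct": 90.0,
}


def find_alerts(node_metrics: Dict[str, Dict[str, float]]) -> List[str]:
    """Return one message per metric that exceeds its threshold, ordered by node."""
    alerts: List[str] = []
    for node, metrics in sorted(node_metrics.items()):
        for name, limit in THRESHOLDS.items():
            value = metrics.get(name)  # metrics missing from a sample are skipped
            if value is not None and value > limit:
                alerts.append(f"{node}: {name}={value} exceeds {limit}")
    return alerts


# Made-up sample readings for two nodes.
sample = {
    "node-0": {"gpu_temperature_c": 71.0, "disk_usage_pct": 93.5},
    "node-1": {"gpu_temperature_c": 88.0, "memory_utilization_pct": 40.0},
}

for line in find_alerts(sample):
    print(line)
```

Keeping the thresholds in a single dictionary makes it easy to adjust limits per workload without touching the checking logic.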