Alert management

Alert management lets you monitor resource metrics and receive notifications when a condition you define is met. Alerts can be configured for Instance, Inference, and Training Cluster services.

Notifications can be delivered via email, Slack, or an external endpoint using webhooks.

Alerts are accessible from two locations:

  • Global view: left sidebar → Alert Management
  • Resource view: resource detail page → Alert tab

How it works

Setting up alerts follows a three-step workflow:

  1. Create a monitoring integration: configure where alerts are sent (email, Slack, or webhook).
  2. Create an alert: define the metric, condition, and severity, then link it to a monitoring integration.
  3. Attach to a resource: bind the alert to a running service to activate monitoring.

Note: Use the Test button when configuring a monitoring integration to confirm notifications are being delivered correctly before going live.


Prerequisites

Before creating an alert, ensure the following are in place:

  • At least one monitoring integration has been configured.
  • An active Instance, Inference, or Training Cluster service is running.
  • For Slack: you have an Incoming Webhook URL, or a Bot Token and Channel ID.
  • For webhooks: your endpoint is reachable and can accept POST requests with a JSON body.

Core concepts

| Term | Description |
| --- | --- |
| Alert | A rule that triggers a notification when a monitored metric meets a defined condition. |
| Monitoring integration | A named collection of destinations where alert notifications are sent. |
| Alert trigger type | The specific resource metric being evaluated (e.g. CPU utilization, GPU temperature). |
| Threshold value | The numeric value evaluated against the operator to determine when the alert fires. |
| Operator | The comparison condition applied to the threshold (e.g. greater than, less than). |
| Severity | The priority level of the alert. Possible values: Critical, Warning, Info. |
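
To make the relationship between operator and threshold concrete, here is a small illustrative sketch of how a rule such as "CPU utilization greater than 85" evaluates. The function and operator names are hypothetical; the platform performs this evaluation server-side:

```python
# Illustrative only: how an operator and a threshold value combine into a rule.
OPERATORS = {
    "greater_than": lambda observed, threshold: observed > threshold,
    "less_than": lambda observed, threshold: observed < threshold,
    "equal_to": lambda observed, threshold: observed == threshold,
}

def alert_fires(operator: str, threshold: float, observed: float) -> bool:
    """Return True when the observed metric value meets the alert condition."""
    return OPERATORS[operator](observed, threshold)

# A Critical alert on CPU utilization with operator "greater than" and threshold 85:
assert alert_fires("greater_than", 85.0, observed=91.2)
```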

Supported metrics

| Metric | Unit | Purpose |
| --- | --- | --- |
| CPU utilization | % | Detect compute saturation |
| Memory utilization | % | Identify memory pressure or OOM risk |
| Disk utilization | % | Prevent storage exhaustion on volumes |
| Ephemeral storage | % | Monitor temporary container storage |
| GPU temperature | °C | Protect against thermal throttling or damage |
| GPU power | W | Detect abnormal power consumption |
| XID error | N/A | GPU error detection |

Availability of each metric depends on the service type (Instance, Inference, or Training Cluster).

Monitoring integrations

A monitoring integration defines where alert notifications are delivered. Each integration can include one or more destinations: email, Slack, or webhook.

1. Email

Email is the default notification channel. When an alert condition is met, a notification is sent to all email addresses configured in the integration.

  • At least one valid email address is required.
  • Multiple addresses can be added to notify all relevant stakeholders.
  • Each email includes the resource name, triggered metric, severity level, and event timestamp.

2. Slack

Slack notifications deliver alerts directly to a channel in your Slack workspace. Two connection methods are supported:

| Method | Credentials required | Best for |
| --- | --- | --- |
| Webhook | Incoming Webhook URL | Single channel, simple setup |
| Token | Bot Token + Channel ID | Multiple channels, programmatic control |

2.1 Webhook method

The webhook method uses Slack's Incoming Webhooks feature: Slack generates a unique URL, and any message POSTed to that URL is delivered to a specific channel. This is the recommended starting point for teams that need simple notification delivery without complex bot permissions.
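
As a quick illustration, delivering a message through an Incoming Webhook is a single HTTP POST. The sketch below uses Python's `requests` library with a placeholder webhook URL; it is a minimal example, not the platform's own delivery code:

```python
import requests

# Placeholder: substitute the Incoming Webhook URL generated for your Slack channel.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXXXXXX"

def send_slack_webhook(text: str) -> None:
    """Post a plain-text message to the channel bound to the webhook URL."""
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)
    resp.raise_for_status()  # Slack answers 200 OK when the message is accepted

send_slack_webhook("Test: prod-gpu-temp-critical fired (GPU temperature above threshold)")
```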

2.2 Token method

The token method uses a Slack app Bot Token to post messages on behalf of a configured app. It requires the chat:write scope to be enabled under OAuth & Permissions in your Slack app settings.

Note: When using the token method, you must invite the bot to the target channel by running /invite @YourAppName in Slack. If the bot is not a member of the channel, notifications will fail silently with no error returned.
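
For comparison, the token method posts through Slack's `chat.postMessage` Web API. A minimal sketch, assuming the `requests` library and placeholder credentials:

```python
import requests

# Placeholders: your app's Bot Token (xoxb-...) and the target Channel ID.
SLACK_BOT_TOKEN = "xoxb-your-bot-token"
CHANNEL_ID = "C0123456789"

def send_slack_bot_message(text: str) -> None:
    """Post a message via chat.postMessage.

    Requires the chat:write scope, and the bot must already have been
    invited to the channel with /invite @YourAppName.
    """
    resp = requests.post(
        "https://slack.com/api/chat.postMessage",
        headers={"Authorization": f"Bearer {SLACK_BOT_TOKEN}"},
        json={"channel": CHANNEL_ID, "text": text},
        timeout=10,
    )
    resp.raise_for_status()
    body = resp.json()
    if not body.get("ok"):
        # Slack reports failures such as "not_in_channel" in the response body.
        raise RuntimeError(f"Slack API error: {body.get('error')}")

send_slack_bot_message("Test alert from monitoring integration")
```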

Choosing a method

Use the webhook method for a straightforward, single-channel setup. Use the token method if you need to send alerts to multiple channels or require stricter control over app permissions and bot identity.

For step-by-step setup instructions, see Slack alerts.

3. Webhook

Webhook alerts send a POST request to an external URL each time an alert is triggered or resolved. Use this to trigger automated workflows in external tools in real time.

Technical requirements for your endpoint:

  • Must accept POST requests with Content-Type: application/json.
  • Must return a 2xx HTTP status code to confirm successful receipt.
  • If authentication is required, validate the Authorization header before processing the request body.
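
For reference, a receiving endpoint that satisfies all three requirements could look like the following sketch. It uses Python with Flask, and the route, port, and bearer token are placeholders; the actual payload fields are documented in Webhook alerts:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
EXPECTED_AUTH = "Bearer my-shared-secret"  # placeholder shared secret

@app.route("/alerts", methods=["POST"])
def receive_alert():
    # Validate the Authorization header before processing the request body.
    if request.headers.get("Authorization") != EXPECTED_AUTH:
        return jsonify({"error": "unauthorized"}), 401

    payload = request.get_json(silent=True)  # requires Content-Type: application/json
    if payload is None:
        return jsonify({"error": "expected a JSON body"}), 400

    # Hand the event off to your automation (ticketing, paging, runbooks, ...).
    print("alert event received:", payload)

    # A 2xx status confirms successful receipt to the sender.
    return jsonify({"status": "received"}), 200

if __name__ == "__main__":
    app.run(port=8080)
```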

For the full payload schema and field reference, see Webhook alerts.


Managing existing monitoring integrations

From Manage Monitoring Integration, all integrations are listed with their description, member count, attached alerts, creator, and creation timestamp. Use the Actions menu to Update or Delete an integration.


Alerts

Create an alert

Go to Alert Management → Create Alert.

| Field | Description | Example |
| --- | --- | --- |
| Alert name | Auto-generated by default. Replace with a descriptive name. | prod-gpu-temp-critical |
| Service type | The service the alert applies to. Cannot be changed after creation. | Instance, Inference, Training Cluster |
| Monitoring integration | The integration that contains your notification destinations. | ops-slack-webhook |
| Alert trigger type | The metric to monitor. | CPU utilization, GPU temperature |
| Severity | Priority level of the alert. | Critical, Warning, Info |
| Operator | Comparison condition applied to the threshold. | Greater than, Less than, Equal to |
| Threshold value | The numeric value that triggers the alert. | 85 for 85% CPU utilization |

Click Create to save.

Manage existing alerts

Go to Alert Management → Manage Alerts. Use the Actions menu to update or delete an alert.

  • Update: modify the alert name, severity, alert trigger type, operator, threshold value, or monitoring integration. Service type cannot be changed after creation.
  • Delete: permanently removes the alert. No further notifications will be sent to the associated destinations.

Attach an alert to a resource

Alerts can only be attached to resources in the Running state.

| Resource | Path |
| --- | --- |
| Instance | Instances → select instance → Alert tab → select alert → Attach |
| Inference | Inference → select service → Alert tab → select alert → Attach |
| Training Cluster | Training Cluster → select job → Alert tab → select alert → Attach |

To create a new alert directly from a resource view, use the Click here link in the Alerts panel. The Create Alert form opens with the service type pre-set to the current resource.


Use cases

  • Capacity planning: set Memory and Disk utilization alerts at 75–80% to get early warnings before resource saturation, giving teams time to scale or optimize.
  • GPU health for ML workloads: configure GPU temperature and GPU power alerts to detect overheating or abnormal power draw during long training runs.
  • Inference container storage: use ephemeral storage alerts to prevent silent failures caused by container-level storage exhaustion during inference workloads.
  • Incident response: align monitoring integrations with on-call rotations so the right engineers are notified immediately, reducing mean time to recovery.

Quick reference

| Task | Path |
| --- | --- |
| Open alert management | Left sidebar → Alert Management |
| View all alerts | Alert Management → Manage Alerts |
| Create a new alert | Alert Management → Create Alert |
| Manage monitoring integrations | Alert Management → Manage Monitoring Integration |
| Attach alert to an instance | Instances → select instance → Alert tab → select alert → Attach |
| Attach alert to an inference service | Inference → select service → Alert tab → select alert → Attach |
| Attach alert to a training cluster | Training Cluster → select job → Alert tab → select alert → Attach |