Alert management
Alert management lets you monitor resource metrics and receive notifications when a condition you define is met. Alerts can be configured for Instance, Inference, and Training Cluster services.
Notifications can be delivered via email, Slack, or an external endpoint using webhooks.
Alerts are accessible from two locations:
- Global view: left sidebar → Alert Management
- Resource view: resource detail page → Alert tab
How it works
Setting up alerts follows a three-step workflow:
1. Create a monitoring integration: configure where alerts are sent (email, Slack, or webhook).
2. Create an alert: define the metric, condition, and severity, then link it to a monitoring integration.
3. Attach to a resource: bind the alert to a running service to activate monitoring.
Note: Use the Test button when configuring a monitoring integration to confirm notifications are being delivered correctly before going live.
Prerequisites
Before creating an alert, ensure the following are in place:
- At least one monitoring integration has been configured.
- An active Instance, Inference, or Training Cluster service is running.
- For Slack: you have an Incoming Webhook URL, or a Bot Token and Channel ID.
- For webhooks: your endpoint is reachable and can accept POST requests with a JSON body.
Core concepts
| Term | Description |
|---|---|
| Alert | A rule that triggers a notification when a monitored metric meets a defined condition. |
| Monitoring integration | A named collection of destinations where alert notifications are sent. |
| Alert trigger type | The specific resource metric being evaluated (e.g. CPU utilization, GPU temperature). |
| Threshold value | The numeric value evaluated against the operator to determine when the alert fires. |
| Operator | The comparison condition applied to the threshold (e.g. greater than, less than). |
| Severity | Indicates priority level. Possible values: Critical, Warning, Info. |
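The operator and threshold concepts above combine into a simple firing rule. As an illustrative model only (not the platform's internal implementation), the evaluation could be sketched as:

```python
# Illustrative sketch: compare a sampled metric value against an alert's
# operator and threshold to decide whether the alert fires.
OPERATORS = {
    "greater_than": lambda value, threshold: value > threshold,
    "less_than": lambda value, threshold: value < threshold,
    "equal_to": lambda value, threshold: value == threshold,
}

def should_fire(metric_value: float, operator: str, threshold: float) -> bool:
    """Return True when the sampled metric meets the alert condition."""
    if operator not in OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    return OPERATORS[operator](metric_value, threshold)

# Example: a CPU utilization alert with operator "greater than", threshold 85
print(should_fire(90.0, "greater_than", 85.0))  # True: 90% exceeds 85%
```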
Supported metrics
| Metric | Unit | Purpose | Instance | Inference | Training Cluster |
|---|---|---|---|---|---|
| CPU utilization | % | Detect compute saturation | ✓ | ✓ | ✓ |
| Memory utilization | % | Identify memory pressure or OOM risk | ✓ | ✓ | ✓ |
| Disk utilization | % | Prevent storage exhaustion on volumes | ✓ | | |
| Ephemeral storage | % | Monitor temporary container storage | | ✓ | |
| GPU temperature | °C | Protect against thermal throttling or damage | ✓ | ✓ | ✓ |
| GPU power | W | Detect abnormal power consumption | ✓ | ✓ | ✓ |
| XID error | — | GPU error detection | ✓ | | |
Monitoring integrations
A monitoring integration defines where alert notifications are delivered. Each integration can include one or more destinations: email, Slack, or webhook.
1. Email
Email is the default notification channel. When an alert condition is met, a notification is sent to all email addresses configured in the integration.
- At least one valid email address is required.
- Multiple addresses can be added to notify all relevant stakeholders.
- Each email includes the resource name, triggered metric, severity level, and event timestamp.
2. Slack
Slack notifications deliver alerts directly to a channel in your Slack workspace. Two connection methods are supported:
| Method | Credentials required | Best for |
|---|---|---|
| Webhook | Incoming Webhook URL | Single channel, simple setup |
| Token | Bot Token + Channel ID | Multiple channels, programmatic control |
2.1 Webhook method
The webhook method uses Slack's Incoming Webhooks feature. It generates a unique URL that posts messages to a specific channel. This is the recommended starting point for teams needing simple notification delivery without complex bot permissions.
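An incoming-webhook delivery is a single JSON POST to the generated URL. The sketch below is a rough illustration of that call, not the platform's actual delivery code; the resource name, timestamp, and webhook URL are placeholders:

```python
import json
import urllib.request

def build_slack_text(resource: str, metric: str, severity: str, timestamp: str) -> dict:
    """Compose a minimal Slack incoming-webhook payload for an alert event."""
    return {"text": f"[{severity}] {metric} alert on {resource} at {timestamp}"}

def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to the Incoming Webhook URL; Slack replies 200 on success."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

payload = build_slack_text("prod-gpu-node-1", "GPU temperature", "Critical",
                           "2024-01-01T12:00:00Z")
# post_to_slack("https://hooks.slack.com/services/T000/B000/XXXX", payload)
```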
2.2 Token method
The token method uses a Slack app Bot Token to post messages on behalf of a configured app. It requires the chat:write scope to be enabled under OAuth & Permissions in your Slack app settings.
Note: When using the token method, you must invite the bot to the target channel by running /invite @YourAppName in Slack. If the bot is not a member of the channel, notifications will fail silently with no error returned.
Choosing a method
Use the webhook method for a straightforward, single-channel setup. Use the token method if you need to send alerts to multiple channels or require stricter control over app permissions and bot identity.
For step-by-step setup instructions, see Slack alerts.
3. Webhook
Webhook alerts send a POST request to an external URL each time an alert is triggered or resolved. Use this to trigger automated workflows in external tools in real time.
Technical requirements for your endpoint:
- Must accept POST requests with `Content-Type: application/json`.
- Must return a `2xx` HTTP status code to confirm successful receipt.
- If authentication is required, validate the `Authorization` header before processing the request body.
For the full payload schema and field reference, see Webhook alerts.
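A minimal endpoint meeting these requirements can be sketched with the standard library alone. The shared-secret token is a placeholder, and the payload fields are not modeled here (see Webhook alerts for the actual schema):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

EXPECTED_TOKEN = "Bearer my-shared-secret"  # placeholder credential

def validate_alert_request(headers: dict, body: bytes) -> int:
    """Return the HTTP status code the endpoint should answer with."""
    if headers.get("Authorization") != EXPECTED_TOKEN:
        return 401                      # reject before processing the body
    if "application/json" not in headers.get("Content-Type", ""):
        return 415                      # alert payloads always arrive as JSON
    try:
        json.loads(body)
    except ValueError:
        return 400                      # malformed body
    return 200                          # any 2xx confirms successful receipt

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        self.send_response(validate_alert_request(dict(self.headers), body))
        self.end_headers()

# HTTPServer(("", 8080), AlertHandler).serve_forever()  # start the listener
```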
Managing existing monitoring integrations
From Manage Monitoring Integration, all integrations are listed with their description, member count, attached alerts, creator, and creation timestamp. Use the Actions menu to Update or Delete an integration.
Alerts
Create an alert
Go to Alert Management → Create Alert.
| Field | Description | Example |
|---|---|---|
| Alert name | Auto-generated by default. Replace with a descriptive name. | prod-gpu-temp-critical |
| Service type | The service the alert applies to. Cannot be changed after creation. | Instance, Inference, Training Cluster |
| Monitoring integration | The integration that contains your notification destinations. | ops-slack-webhook |
| Alert trigger type | The metric to monitor. | CPU utilization, GPU temperature |
| Severity | Priority level of the alert. | Critical, Warning, Info |
| Operator | Comparison condition applied to the threshold. | Greater than, Less than, Equal to |
| Threshold value | The numeric value that triggers the alert. | 85 for 85% CPU utilization |
Click Create to save.
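Taken together, the form fields describe one alert definition. As a purely hypothetical in-memory representation (the platform does not document an API object; field names and the 85°C threshold are illustrative):

```python
# Hypothetical representation of the example alert from the table above.
alert = {
    "name": "prod-gpu-temp-critical",
    "service_type": "Instance",          # cannot be changed after creation
    "monitoring_integration": "ops-slack-webhook",
    "trigger_type": "GPU temperature",
    "severity": "Critical",
    "operator": "greater_than",
    "threshold": 85,                     # degrees Celsius for this metric
}

REQUIRED = {"name", "service_type", "monitoring_integration",
            "trigger_type", "severity", "operator", "threshold"}

def is_complete(definition: dict) -> bool:
    """Check that every form field is present before submitting."""
    return REQUIRED <= definition.keys()

print(is_complete(alert))  # True: all seven fields are filled in
```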
Manage existing alerts
Go to Alert Management → Manage Alerts. Use the Actions menu to update or delete an alert.
- Update: the following fields can be modified: alert name, severity, alert trigger type, operator, threshold value, and monitoring integration. Service type cannot be changed after creation.
- Delete: permanently removes the alert. No further notifications will be sent to the associated destinations.
Attach an alert to a resource
Alerts can only be attached to resources in the Running state.
| Resource | Path |
|---|---|
| Instance | Instances → select instance → Alert tab → select alert → Attach |
| Inference | Inference → select service → Alert tab → select alert → Attach |
| Training Cluster | Training Cluster → select job → Alert tab → select alert → Attach |
To create a new alert directly from a resource view, click Click here in the Alerts panel. The Create Alert form opens with the service type pre-set to the current resource.
Recommended usage
- Capacity planning: set Memory and Disk utilization alerts at 75–80% to get early warnings before resource saturation, giving teams time to scale or optimize.
- GPU health for ML workloads: configure GPU temperature and GPU power alerts to detect overheating or abnormal power draw during long training runs.
- Inference container storage: use ephemeral storage alerts to prevent silent failures caused by container-level storage exhaustion during inference workloads.
- Incident response: align monitoring integrations with on-call rotations so the right engineers are notified immediately, reducing mean time to recovery.
Quick reference
| Task | Path |
|---|---|
| Open alert management | Left sidebar → Alert Management |
| View all alerts | Alert Management → Manage Alerts |
| Create a new alert | Alert Management → Create Alert |
| Manage monitoring integrations | Alert Management → Manage Monitoring Integration |
| Attach alert to an instance | Instances → select instance → Alert tab → select alert → Attach |
| Attach alert to an inference service | Inference → select service → Alert tab → select alert → Attach |
| Attach alert to a training cluster | Training Cluster → select job → Alert tab → select alert → Attach |