---
title: Overview
description: Monitor resource metrics across your cloud infrastructure and receive notifications when defined conditions are met.
---

# Alert management

Alert management lets you monitor resource metrics and receive notifications when a condition you define is met. Alerts can be configured for Instance, Inference, and Training Cluster services. Notifications can be delivered via email, Slack, or an external endpoint using webhooks.

Alerts are accessible from two locations:

- **Global view**: left sidebar → Alert Management
- **Resource view**: resource detail page → Alert tab

---

## How it works

Setting up alerts follows a three-step workflow:

1. Create a monitoring integration: configure where alerts are sent (email, Slack, or webhook).
2. Create an alert: define the metric, condition, and severity, then link it to a monitoring integration.
3. Attach to a resource: bind the alert to a running service to activate monitoring.

> **Note:** Use the Test button when configuring a monitoring integration to confirm that notifications are delivered correctly **before going live**.

---

## Prerequisites

Before creating an alert, ensure the following are in place:

- At least one monitoring integration has been configured.
- An active Instance, Inference, or Training Cluster service is running.
- For Slack: you have an Incoming Webhook URL, or a Bot Token and Channel ID.
- For webhooks: your endpoint is reachable and can accept POST requests with a JSON body.

---

## Core concepts

| Term | Description |
|---|---|
| **Alert** | A rule that triggers a notification when a monitored metric meets a defined condition. |
| **Monitoring integration** | A named collection of destinations where alert notifications are sent. |
| **Alert trigger type** | The specific resource metric being evaluated (e.g. CPU utilization, GPU temperature). |
| **Threshold value** | The numeric value the metric is compared against, using the operator, to determine when the alert fires. |
| **Operator** | The comparison applied between the metric and the threshold (e.g. greater than, less than). |
| **Severity** | The priority level of the alert. Possible values: `Critical`, `Warning`, `Info`. |

---

## Supported metrics

| Metric | Unit | Purpose | Instance | Inference | Training Cluster |
|---|---|---|:---:|:---:|:---:|
| **CPU utilization** | % | Detect compute saturation | ✓ | ✓ | ✓ |
| **Memory utilization** | % | Identify memory pressure or OOM risk | ✓ | ✓ | ✓ |
| **Disk utilization** | % | Prevent storage exhaustion on volumes | ✓ | | |
| **Ephemeral storage** | % | Monitor temporary container storage | ✓ | | |
| **GPU temperature** | °C | Protect against thermal throttling or damage | ✓ | ✓ | ✓ |
| **GPU power** | W | Detect abnormal power consumption | ✓ | ✓ | ✓ |
| **XID error** | N/A | GPU error detection | | | ✓ |

---

## Monitoring integrations

A monitoring integration defines where alert notifications are delivered. Each integration can include one or more destinations: email, Slack, or webhook.

### 1. Email

Email is the default notification channel. When an alert condition is met, a notification is sent to all email addresses configured in the integration.

- At least one valid email address is required.
- Multiple addresses can be added to notify all relevant stakeholders.
- Each email includes the resource name, triggered metric, severity level, and event timestamp.

### 2. Slack

Slack notifications deliver alerts directly to a channel in your Slack workspace. Two connection methods are supported:

| Method | Credentials required | Best for |
|---|---|---|
| Webhook | Incoming Webhook URL | Single channel, simple setup |
| Token | Bot Token + Channel ID | Multiple channels, programmatic control |

#### 2.1 Webhook method

The webhook method uses Slack's Incoming Webhooks feature. It generates a unique URL that posts messages to a specific channel. This is the recommended starting point for teams that need simple notification delivery without complex bot permissions.

#### 2.2 Token method

The token method uses a Slack app Bot Token to post messages on behalf of a configured app. It requires the `chat:write` scope to be enabled under OAuth & Permissions in your Slack app settings.

> **Note:** When using the token method, you must invite the bot to the target channel by running `/invite @YourAppName` in Slack. If the bot is not a member of the channel, **notifications will fail silently with no error returned**.

#### Choosing a method

Use the webhook method for a straightforward, single-channel setup. Use the token method if you need to send alerts to multiple channels or require stricter control over app permissions and bot identity.

For step-by-step setup instructions, see [Slack alerts](/alert-management/slack-alerts).

### 3. Webhook

Webhook alerts send a POST request to an external URL each time an alert is triggered or resolved. Use this to trigger automated workflows in external tools in real time.

Technical requirements for your endpoint:

- Must accept POST requests with `Content-Type: application/json`.
- Must return a `2xx` HTTP status code to confirm successful receipt.
- If authentication is required, validate the `Authorization` header before processing the request body.

For the full payload schema and field reference, see [Webhook alerts](/alert-management/webhook-alerts).

---

### Manage existing monitoring integrations

From **Manage Monitoring Integration**, all integrations are listed with their description, member count, attached alerts, creator, and creation timestamp. Use the **Actions** menu to **Update** or **Delete** an integration.

---

## Alerts

### Create an alert

Go to Alert Management → Create Alert.

| Field | Description | Example |
|---|---|---|
| Alert name | Auto-generated by default. Replace with a descriptive name. | `prod-gpu-temp-critical` |
| Service type | The service the alert applies to. Cannot be changed after creation. | Instance, Inference, Training Cluster |
| Monitoring integration | The integration that contains your notification destinations. | `ops-slack-webhook` |
| Alert trigger type | The metric to monitor. | CPU utilization, GPU temperature |
| Severity | Priority level of the alert. | `Critical`, `Warning`, `Info` |
| Operator | Comparison condition applied to the threshold. | Greater than, Less than, Equal to |
| Threshold value | The numeric value that triggers the alert. | `85` for 85% CPU utilization |

Click Create to save.

### Manage existing alerts

Go to Alert Management → Manage Alerts. Use the Actions menu to update or delete an alert.

- **Update**: the following fields can be modified: alert name, severity, alert trigger type, operator, threshold value, and monitoring integration. **Service type cannot be changed after creation.**
- **Delete**: permanently removes the alert. No further notifications will be sent to the associated destinations.

---

## Attach an alert to a resource

Alerts can only be attached to resources in the **Running** state.

| Resource | Path |
|---|---|
| Instance | Instances → select instance → Alert tab → select alert → Attach |
| Inference | Inference → select service → Alert tab → select alert → Attach |
| Training Cluster | Training Cluster → select job → Alert tab → select alert → Attach |

To create a new alert directly from a resource view, click the **Click here** link in the Alerts panel. The Create Alert form opens with the service type pre-set to the current resource.

---

## Recommended usage

- Capacity planning: set Memory and Disk utilization alerts at 75–80% to get early warnings before resource saturation, giving teams time to scale or optimize.
- GPU health for ML workloads: configure GPU temperature and GPU power alerts to detect overheating or abnormal power draw during long training runs.
- Inference container storage: use ephemeral storage alerts to prevent silent failures caused by container-level storage exhaustion during inference workloads.
- Incident response: align monitoring integrations with on-call rotations so the right engineers are notified immediately, reducing mean time to recovery.

---

## Quick reference

| Task | Path |
|---|---|
| Open alert management | Left sidebar → Alert Management |
| View all alerts | Alert Management → Manage Alerts |
| Create a new alert | Alert Management → Create Alert |
| Manage monitoring integrations | Alert Management → Manage Monitoring Integration |
| Attach alert to an instance | Instances → select instance → Alert tab → select alert → Attach |
| Attach alert to an inference service | Inference → select service → Alert tab → select alert → Attach |
| Attach alert to a training cluster | Training Cluster → select job → Alert tab → select alert → Attach |

---
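
## Example: webhook endpoint sketch

The endpoint requirements listed under the Webhook destination (accept a JSON POST, validate the `Authorization` header before processing the body, return a `2xx` status on success) can be illustrated with a small receiver. This is a minimal sketch using only the Python standard library, not a reference implementation. The `Bearer` token scheme, the `EXPECTED_AUTH` value, the port, and the function and class names are all assumptions made for illustration. The payload is treated as opaque JSON because its fields are defined in the Webhook alerts reference, not here.

```python
import hmac
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical shared secret. The actual value and scheme depend on how
# authentication is configured for your webhook destination.
EXPECTED_AUTH = "Bearer example-secret-token"


def process_alert(auth_header, content_type, raw_body):
    """Return (status_code, parsed_payload_or_None) for one incoming request."""
    # Validate the Authorization header before looking at the body.
    # compare_digest avoids leaking information through timing differences.
    supplied = (auth_header or "").encode()
    if not hmac.compare_digest(supplied, EXPECTED_AUTH.encode()):
        return 401, None
    # Only JSON bodies are accepted.
    if not (content_type or "").lower().startswith("application/json"):
        return 415, None
    try:
        payload = json.loads(raw_body)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return 400, None
    # A 2xx status confirms receipt. Hand the payload to your own workflow here.
    return 200, payload


class AlertHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around process_alert."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length") or 0)
        raw_body = self.rfile.read(length)
        status, payload = process_alert(
            self.headers.get("Authorization"),
            self.headers.get("Content-Type"),
            raw_body,
        )
        if payload is not None:
            print("alert received:", payload)
        self.send_response(status)
        self.end_headers()


def serve(port=8080):
    """Start the receiver on an arbitrary example port. Blocks until interrupted."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the receiver locally. Keeping the checks in `process_alert`, separate from the HTTP handler, makes the accept/reject logic easy to exercise without opening a socket, and the Test button on the monitoring integration can then be used to confirm end-to-end delivery.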