Pricing
Model Endpoint costs depend on the plan type (Hourly or Committed), the machine family (H200, H100, A100, L40S, L4, CPU, or Multi-Node), and how you configure workers. In short, pricing is based on how much capacity you use and how you pay for it.
Two Ways to Pay
| Option | Best for | How you’re charged |
|---|---|---|
| Hourly | Variable or short-term workloads | Per hour of endpoint uptime, per replica. You pay only while the endpoint is running. |
| Committed (reserved) | Steady, long-running workloads | Fixed price for a reserved period (e.g. 30 days) per replica. Often lower effective hourly rate; the endpoint cannot be stopped during the committed term. |
What Affects Your Bill
- Replicas — Each replica is one copy of your model. Cost scales with the number of replicas (e.g. 2 replicas ≈ 2× the per-replica cost).
- Uptime (hourly) — For hourly pricing, you are charged for the time the endpoint is running within the billing period (prorated if you start or stop mid-period).
- Resource type (SKU) — The GPU or compute type you choose (e.g. A100, L4) has its own price per hour or per committed period.
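The factors above combine into a simple linear formula for hourly billing. Here is a minimal sketch; the rate used is a made-up placeholder, not an actual E2E price:

```python
# Illustrative sketch of the hourly billing model described above.
# The rate is a hypothetical placeholder, not an actual E2E price.
def hourly_cost(replicas: int, uptime_hours: float, rate_per_replica_hour: float) -> float:
    """Cost scales linearly with replica count and uptime."""
    return replicas * uptime_hours * rate_per_replica_hour

# 2 replicas for 10 hours at a hypothetical rate of 1.50/hr per replica:
print(hourly_cost(2, 10, 1.50))  # 30.0
```

Doubling the replica count doubles the cost for the same uptime, which matches the "2 replicas ≈ 2× the per-replica cost" rule of thumb above.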
Understanding Your Usage and Bills
- Use the inference dashboard to see:
  - Which endpoints were running
  - How many replicas
  - Hours of usage (for hourly)
  - Unit price and total cost per endpoint
- Usage is prorated: if an endpoint runs for only part of a billing cycle, you’re charged only for that portion.
Plan Types by Machine Family
Each endpoint plan uses one of several underlying machine families, tuned for different workloads:
| Machine Family | Type | Best For | Key Highlights |
|---|---|---|---|
| H200 | GPU | Very large LLMs, high-throughput inference | Next-generation NVIDIA GPU with high memory capacity |
| H100 | GPU | Heavy LLMs, high QPS production workloads | Flagship NVIDIA GPU for enterprise-scale AI |
| A100 | GPU | NLP, vision, recommendation models | Proven, versatile GPU for general-purpose AI inference |
| L40S | GPU | Inference + light training/fine-tuning | Balanced cost and performance for mixed workloads |
| L4 | GPU | AI video, inference, graphics workloads | Energy-efficient, versatile GPU for enterprise AI |
| CPU | vCPU | Applications, APIs, databases, lightweight inference | Dedicated virtual compute with scalable vCPUs (no GPU) |
| Multi-Node | Distributed GPU | Massive models, real-time production AI | Multi-node architecture for scalability, reliability, and low latency |
For detailed pricing, see the E2E Calculator.
What Affects Cost?
- Instance type (CPU/GPU) and size
- Active Workers / Max Workers configuration (serverless scaling)
- Idle timeout and other scaling policy parameters (when applicable)
Pricing Examples
Note: The values below are illustrative. Use the E2E Calculator for current rates.
Example 1: Hourly Plan
Scenario: You run 1 L4 replica for 8 hours per day during development, 5 days a week.
| Factor | Value |
|---|---|
| Plan | Hourly |
| Machine | L4 |
| Replicas | 1 |
| Uptime | 8 hrs/day × 5 days × 4 weeks ≈ 160 hrs/month |
Billing: Cost = 160 replica-hours × (price per hour per L4 replica). You pay only for the 160 hours the endpoint runs; scale to 0 when not in use to avoid charges.
Recommendation: Hourly is ideal for dev/test. Scale to zero when idle to minimize cost.
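The arithmetic for Example 1 can be written out as follows; the L4 rate below is a hypothetical placeholder (use the E2E Calculator for real rates):

```python
# Example 1: 1 L4 replica, 8 hrs/day, 5 days/week, 4 weeks.
hours_per_day = 8
days_per_week = 5
weeks_per_month = 4
replicas = 1

replica_hours = replicas * hours_per_day * days_per_week * weeks_per_month
print(replica_hours)  # 160

hypothetical_l4_rate = 0.90  # placeholder price per replica-hour, not a real rate
print(replica_hours * hypothetical_l4_rate)  # 144.0
```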
Example 2: Committed Plan (30 Days)
Scenario: You commit to 1 H100 replica for 30 days.
| Factor | Value |
|---|---|
| Plan | Committed (30 days) |
| Machine | H100 |
| Replicas | 1 |
| Duration | 30 days |
Billing: Fixed price for 1 replica for the 30-day term. The endpoint cannot be stopped during the commitment; you pay for the full period regardless of actual usage.
After the committed period ends: When you deploy a committed machine, you choose one of three actions that apply automatically once the committed term completes:
| Option | What happens |
|---|---|
| Auto-renew | The committed plan renews for another term at the same configuration. |
| Convert to hourly billing | The endpoint switches to hourly billing so you only pay for actual uptime going forward. |
| Auto-delete | The endpoint is automatically deleted once the committed period ends. |
Recommendation: Committed plans suit predictable, long-running workloads and often offer a lower effective hourly rate than pay-as-you-go.
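One way to decide between the two plans is to compare the committed price's effective hourly rate against the pay-as-you-go rate. A minimal sketch, with both prices assumed purely for illustration:

```python
# Committed vs. hourly comparison; both prices are hypothetical placeholders.
hourly_rate = 4.00          # assumed price per H100 replica-hour
committed_price = 2000.00   # assumed fixed price for a 30-day term

committed_hours = 30 * 24   # 720 hours in the 30-day term
effective_hourly = committed_price / committed_hours
print(round(effective_hourly, 2))  # 2.78

# Uptime at which hourly billing would cost the same as committing:
break_even_hours = committed_price / hourly_rate
print(break_even_hours)  # 500.0
```

Under these assumed prices, an endpoint expected to run more than 500 of the 720 hours in the term would be cheaper on the committed plan.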
Frequently Asked Questions (FAQs)
Is there a minimum billing duration?
For hourly pricing, usage is prorated to the billing period. You are charged for the actual time your endpoint is running (e.g. if it runs 45 minutes, you pay for that portion of an hour within the billing cycle).
For committed pricing, you commit for a fixed period (e.g. 30 or 90 days); that period is the effective "minimum" for that reservation.
If my endpoint scales to 0, am I charged anything?
- Hourly: No. When there are zero replicas running, there is no compute usage to charge. Your cost is zero for that endpoint while it is at 0.
- Committed: You have reserved capacity for a fixed term. You generally continue to pay for the committed replicas for the remainder of that term even if you scale down, because the capacity is reserved for you.
How is billing handled during auto-scaling events?
Billing follows the actual number of replicas over time. When autoscaling adds or removes replicas, your usage (replica-hours) changes accordingly:
- Hourly: You pay for each replica-hour. More replicas = higher cost during that period; fewer replicas = lower cost. The system tracks replica count and uptime, so scaling up or down is reflected in your bill.
- Committed: You pay for the committed replica count for the full term.
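For hourly billing, the replica-hours accounting above amounts to summing (duration × replica count) over each interval between scaling events. A sketch with made-up intervals and a hypothetical rate:

```python
# Hourly billing under autoscaling: total cost is the sum of
# replica-hours across intervals with different replica counts.
# Intervals and the rate below are made-up examples.
intervals = [
    (2.0, 1),  # 2 hours at 1 replica
    (3.0, 3),  # scale-up: 3 hours at 3 replicas
    (1.0, 1),  # scale-down: 1 hour at 1 replica
]
replica_hours = sum(hours * replicas for hours, replicas in intervals)
print(replica_hours)  # 12.0

rate = 1.50  # hypothetical price per replica-hour
print(replica_hours * rate)  # 18.0
```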
Do I pay for cold start time?
Billing is based on when the replica is running (from the platform's perspective). Time spent in "cold start" (loading the model, initializing) is typically part of the replica's uptime, so it is usually included in compute billing.
How quickly does scaling to 0 happen?
Scale-down timing is platform- and configuration-dependent. It can be influenced by:
- Scale-down delay or cooldown (to avoid flapping)
- Graceful shutdown and drain of in-flight requests
- Cluster and scheduler behavior
Can I change from committed to hourly later?
Yes. When you deploy a committed machine, you choose a post-commitment action — one of which is Convert to hourly billing. Once the committed term ends, the endpoint automatically switches to hourly. You can also use the update action on the endpoint and choose the "Convert to hourly" option.
Do I pay for Max Workers or only Active Workers?
You pay for replicas that are running (active) for hourly pricing: cost = replica-hours × unit price. So you pay for active workers, not a "max workers" cap.
For committed pricing, you pay for the active (committed) replica count. For example, if your active workers are 1 and max workers are 2, you are initially charged for 1 active worker only at the committed rate. When demand increases and the endpoint scales up to the 2nd worker, that additional worker is billed on an hourly basis.
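The mixed billing described above (committed base plus hourly overflow) can be sketched as follows; all prices are hypothetical placeholders:

```python
# Mixed billing: the committed worker is billed at the fixed term price,
# and any worker beyond the committed count (up to max workers) is
# billed hourly for the time it actually runs. Prices are hypothetical.
committed_workers = 1
committed_price_per_worker = 2000.00  # assumed 30-day term price
overflow_hours = 40                   # hours the 2nd worker actually ran
hourly_rate = 4.00                    # assumed price per replica-hour

total = (committed_workers * committed_price_per_worker
         + overflow_hours * hourly_rate)
print(total)  # 2160.0
```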
Is pricing based on number of requests or compute usage?
Pricing is based on compute usage (replica time), not on the number of requests. You are charged for:
- Hourly: (Usage in hours) × (price per hour per replica) × (number of replicas).
- Committed: Fixed price per replica for the commitment period.
What happens to billing if the endpoint fails?
When an endpoint fails or is stopped (e.g. error state, you stop it, or it is terminated), the system records the end time for that run. You are charged only for the time the endpoint was actually running up to that end time. You do not continue to be charged for compute after the endpoint has failed or been stopped. If the endpoint is restarted later, a new usage segment starts and billing resumes from that point.