Pricing
Model Endpoint costs depend on the plan type (Hourly or Committed), the machine family (H200, H100, A100, L40S, L4, CPU, or Multi-Node), and how you configure workers. In short, pricing is based on how much capacity you use and how you pay for it.
Two Ways to Pay
| Option | Best for | How you’re charged |
|---|---|---|
| Hourly | Variable or short-term workloads | Per hour of endpoint uptime, per replica. You pay only while the endpoint is running. |
| Committed (reserved) | Steady, long-running workloads | Fixed price for a reserved period (e.g. 30 days) per replica. Often lower effective hourly rate; the endpoint cannot be stopped during the committed term. |
What Affects Your Bill
- Replicas — Each replica is one copy of your model. Cost scales with the number of replicas (e.g. 2 replicas ≈ 2× the per-replica cost).
- Uptime (hourly) — For hourly pricing, you are charged for the time the endpoint is running within the billing period (prorated if you start or stop mid-period).
- Resource type (SKU) — The GPU or compute type you choose (e.g. A100, L4) has its own price per hour or per committed period.
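The factors above combine into a simple linear formula for hourly billing. Here is a minimal sketch; the rate used is a made-up placeholder, not an actual E2E price:

```python
# Illustrative sketch of the hourly billing model described above.
# The rate is a hypothetical placeholder, not an actual E2E price.
def hourly_cost(replicas: int, uptime_hours: float, rate_per_replica_hour: float) -> float:
    """Cost scales linearly with replica count and uptime."""
    return replicas * uptime_hours * rate_per_replica_hour

# 2 replicas for 10 hours at a hypothetical rate of 1.50/hr per replica:
print(hourly_cost(2, 10, 1.50))  # 30.0
```

Doubling the replica count doubles the cost for the same uptime, which matches the "2 replicas ≈ 2× the per-replica cost" rule of thumb above.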
Understanding Your Usage and Bills
- Use the inference dashboard to see:
  - Which endpoints were running
  - How many replicas
  - Hours of usage (for hourly)
  - Unit price and total cost per endpoint
- Usage is prorated: if an endpoint runs for only part of a billing cycle, you’re charged only for that portion.
Plan Types by Machine Family
Each endpoint plan uses one of several underlying machine families, tuned for different workloads:
| Machine Family | Type | Best For | Key Highlights |
|---|---|---|---|
| H200 | GPU | Very large LLMs, high-throughput inference | Next-generation NVIDIA GPU with high memory capacity |
| H100 | GPU | Heavy LLMs, high QPS production workloads | Flagship NVIDIA GPU for enterprise-scale AI |
| A100 | GPU | NLP, vision, recommendation models | Proven, versatile GPU for general-purpose AI inference |
| L40S | GPU | Inference + light training/fine-tuning | Balanced cost and performance for mixed workloads |
| L4 | GPU | AI video, inference, graphics workloads | Energy-efficient, versatile GPU for enterprise AI |
| CPU | vCPU | Applications, APIs, databases, lightweight inference | Dedicated virtual compute with scalable vCPUs (no GPU) |
| Multi-Node | Distributed GPU | Massive models, real-time production AI | Multi-node architecture for scalability, reliability, and low latency |
For detailed pricing, see the E2E Calculator.
What Affects Cost?
- Instance type (CPU/GPU) and size
- Active Workers / Max Workers configuration (serverless scaling)
- Idle timeout and other scaling policy parameters (when applicable)
Pricing Examples
Note: The values below are illustrative. Use the E2E Calculator for current rates.
Example 1: Hourly Plan
Scenario: You run 1 L4 replica for 8 hours per day during development, 5 days a week.
| Factor | Value |
|---|---|
| Plan | Hourly |
| Machine | L4 |
| Replicas | 1 |
| Uptime | 8 hrs/day × 5 days × 4 weeks ≈ 160 hrs/month |
Billing: Cost = 160 replica-hours × (price per hour per L4 replica). You pay only for the 160 hours the endpoint runs; scale to 0 when not in use to avoid charges.
Recommendation: Hourly is ideal for dev/test. Scale to zero when idle to minimize cost.
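The arithmetic for Example 1 can be written out as follows; the L4 rate below is a hypothetical placeholder (use the E2E Calculator for real rates):

```python
# Example 1: 1 L4 replica, 8 hrs/day, 5 days/week, 4 weeks.
hours_per_day = 8
days_per_week = 5
weeks_per_month = 4
replicas = 1

replica_hours = replicas * hours_per_day * days_per_week * weeks_per_month
print(replica_hours)  # 160

hypothetical_l4_rate = 0.90  # placeholder price per replica-hour, not a real rate
print(replica_hours * hypothetical_l4_rate)  # 144.0
```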
Example 2: Committed Plan (30 Days)
Scenario: You commit to 1 H100 replica for 30 days.
| Factor | Value |
|---|---|
| Plan | Committed (30 days) |
| Machine | H100 |
| Replicas | 1 |
| Duration | 30 days |
Billing: Fixed price for 1 replica for the 30-day term. The endpoint cannot be stopped during the commitment; you pay for the full period regardless of actual usage.
After the committed period ends: When you deploy a committed machine, you choose one of three actions that apply automatically once the committed term completes:
| Option | What happens |
|---|---|
| Auto-renew | The committed plan renews for another term at the same configuration. |
| Convert to hourly billing | The endpoint switches to hourly billing so you only pay for actual uptime going forward. |
| Auto-delete | The endpoint is automatically deleted once the committed period ends. |
Recommendation: Committed plans suit predictable, long-running workloads and often offer a lower effective hourly rate than pay-as-you-go.
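One way to decide between the two plans is to compare the committed price's effective hourly rate against the pay-as-you-go rate. A minimal sketch, with both prices assumed purely for illustration:

```python
# Committed vs. hourly comparison; both prices are hypothetical placeholders.
hourly_rate = 4.00          # assumed price per H100 replica-hour
committed_price = 2000.00   # assumed fixed price for a 30-day term

committed_hours = 30 * 24   # 720 hours in the 30-day term
effective_hourly = committed_price / committed_hours
print(round(effective_hourly, 2))  # 2.78

# Uptime at which hourly billing would cost the same as committing:
break_even_hours = committed_price / hourly_rate
print(break_even_hours)  # 500.0
```

Under these assumed prices, an endpoint expected to run more than 500 of the 720 hours in the term would be cheaper on the committed plan.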
Frequently Asked Questions (FAQs)
Is there a minimum billing duration?
For hourly pricing, usage is prorated to the billing period. You are charged for the actual time your endpoint is running (e.g. if it runs 45 minutes, you pay for that portion of an hour within the billing cycle).
For committed pricing, you commit for a fixed period (e.g. 30 or 90 days); that period is the effective "minimum" for that reservation.
If my endpoint scales to 0, am I charged anything?
- Hourly: No. When there are zero replicas running, there is no compute usage to charge. Your cost is zero for that endpoint while it is at 0.
- Committed: You have reserved capacity for a fixed term. You generally continue to pay for the committed replicas for the remainder of that term even if you scale down, because the capacity is reserved for you.
How is billing handled during auto-scaling events?
Billing follows the actual number of replicas over time. When autoscaling adds or removes replicas, your usage (replica-hours) changes accordingly:
- Hourly: You pay for each replica-hour. More replicas = higher cost during that period; fewer replicas = lower cost. The system tracks replica count and uptime, so scaling up or down is reflected in your bill.
- Committed: You pay for the committed replica count for the full term.
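For hourly billing, the replica-hours accounting above amounts to summing (duration × replica count) over each interval between scaling events. A sketch with made-up intervals and a hypothetical rate:

```python
# Hourly billing under autoscaling: total cost is the sum of
# replica-hours across intervals with different replica counts.
# Intervals and the rate below are made-up examples.
intervals = [
    (2.0, 1),  # 2 hours at 1 replica
    (3.0, 3),  # scale-up: 3 hours at 3 replicas
    (1.0, 1),  # scale-down: 1 hour at 1 replica
]
replica_hours = sum(hours * replicas for hours, replicas in intervals)
print(replica_hours)  # 12.0

rate = 1.50  # hypothetical price per replica-hour
print(replica_hours * rate)  # 18.0
```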
Do I pay for cold start time?
Billing is based on when the replica is running (from the platform's perspective). Time spent in "cold start" (loading the model, initializing) is typically part of the replica's uptime, so it is usually included in compute billing.
How quickly does scaling to 0 happen?
Scale-down timing is platform- and configuration-dependent. It can be influenced by:
- Scale-down delay or cooldown (to avoid flapping)
- Graceful shutdown and drain of in-flight requests
- Cluster and scheduler behavior
Can I change from committed to hourly later?
Yes. When you deploy a committed machine, you choose a post-commitment action — one of which is Convert to hourly billing. Once the committed term ends, the endpoint automatically switches to hourly. You can also use the update action on the endpoint and choose the "Convert to hourly" option.
Do I pay for Max Workers or only Active Workers?
You pay for replicas that are running (active) for hourly pricing: cost = replica-hours × unit price. So you pay for active workers, not a "max workers" cap.
For committed pricing, you pay for the active (committed) replica count. For example, if your active workers are 1 and max workers are 2, you are initially charged for 1 active worker only at the committed rate. When demand increases and the endpoint scales up to the 2nd worker, that additional worker is billed on an hourly basis.
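The mixed billing described above (committed base plus hourly overflow) can be sketched as follows; all prices are hypothetical placeholders:

```python
# Mixed billing: the committed worker is billed at the fixed term price,
# and any worker beyond the committed count (up to max workers) is
# billed hourly for the time it actually runs. Prices are hypothetical.
committed_workers = 1
committed_price_per_worker = 2000.00  # assumed 30-day term price
overflow_hours = 40                   # hours the 2nd worker actually ran
hourly_rate = 4.00                    # assumed price per replica-hour

total = (committed_workers * committed_price_per_worker
         + overflow_hours * hourly_rate)
print(total)  # 2160.0
```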
Is pricing based on number of requests or compute usage?
Pricing is based on compute usage (replica time), not on the number of requests. You are charged for:
- Hourly: (Usage in hours) × (price per hour per replica) × (number of replicas).
- Committed: Fixed price per replica for the commitment period.
What happens to billing if the endpoint fails?
When an endpoint fails or is stopped (e.g. error state, you stop it, or it is terminated), the system records the end time for that run. You are charged only for the time the endpoint was actually running up to that end time. You do not continue to be charged for compute after the endpoint has failed or been stopped. If the endpoint is restarted later, a new usage segment starts and billing resumes from that point.