
Pricing

Model Endpoint costs depend on the plan type (Hourly or Committed), the machine family (H200, H100, A100, L40S, L4, CPU, Multi-Node), and how you configure workers.

Pricing is based on how much capacity you use and how you pay (hourly or committed).

Two Ways to Pay

Option | Best for | How you’re charged
Hourly | Variable or short-term workloads | Per hour of endpoint uptime, per replica. You pay only while the endpoint is running.
Committed (reserved) | Steady, long-running workloads | Fixed price for a reserved period (e.g. 30 days) per replica. Often lower effective hourly rate; the endpoint cannot be stopped during the committed term.

What Affects Your Bill

  • Replicas — Each replica is one copy of your model. Cost scales with the number of replicas (e.g. 2 replicas ≈ 2× the per-replica cost).
  • Uptime (hourly) — For hourly pricing, you are charged for the time the endpoint is running within the billing period (prorated if you start or stop mid-period).
  • Resource type (SKU) — The GPU or compute type you choose (e.g. A100, L4) has a different price per hour or per committed period.

Understanding Your Usage and Bills

  • Use the inference dashboard to see:
    • Which endpoints were running
    • How many replicas
    • Hours of usage (for hourly)
    • Unit price and total cost per endpoint
  • Usage is prorated: if an endpoint runs for only part of a billing cycle, you’re charged only for that portion.

Plan Types by Machine Family

Each endpoint plan uses one of several underlying machine families, tuned for different workloads:

Machine Family | Type | Best For | Key Highlights
H200 | GPU | Very large LLMs, high-throughput inference | Next-generation NVIDIA GPU with high memory capacity
H100 | GPU | Heavy LLMs, high QPS production workloads | Flagship NVIDIA GPU for enterprise-scale AI
A100 | GPU | NLP, vision, recommendation models | Proven, versatile GPU for general-purpose AI inference
L40S | GPU | Inference + light training/fine-tuning | Balanced cost and performance for mixed workloads
L4 | GPU | AI video, inference, graphics workloads | Energy-efficient, versatile GPU for enterprise AI
CPU | vCPU | Applications, APIs, databases, lightweight inference | Dedicated virtual compute with scalable vCPUs (no GPU)
Multi-Node | Distributed GPU | Massive models, real-time production AI | Multi-node architecture for scalability, reliability, and low latency

For detailed pricing, see the E2E Calculator.

What affects cost?

  • Instance type (CPU/GPU) and size
  • Active Workers / Max Workers configuration (serverless scaling)
  • Idle timeout and other scaling policy parameters (when applicable)

Pricing Examples

Note: The values below are illustrative. Use the E2E Calculator for current rates.

Example 1: Hourly Plan

Scenario: You run 1 L4 replica for 8 hours per day during development, 5 days a week.

Factor | Value
Plan | Hourly
Machine | L4
Replicas | 1
Uptime | 8 hrs/day × 5 days/week × 4 weeks ≈ 160 hrs/month

Billing: Cost = 160 replica-hours × (price per hour per L4 replica). You pay only for the 160 hours the endpoint runs; scale to 0 when not in use to avoid charges.
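The arithmetic for this scenario can be sketched in a few lines of Python. This is only an illustration of the replica-hours formula above: the function name `hourly_cost` and the rate `PRICE_PER_REPLICA_HOUR` are assumptions made up for the example, not real E2E prices or APIs. Substitute the current L4 rate from the E2E Calculator.

```python
# Hypothetical rate for illustration only; NOT a real E2E price.
PRICE_PER_REPLICA_HOUR = 50.0


def hourly_cost(replicas: int, uptime_hours: float, price_per_replica_hour: float) -> float:
    """Hourly plan: replica-hours multiplied by the per-replica hourly price."""
    replica_hours = replicas * uptime_hours
    return replica_hours * price_per_replica_hour


# 8 hrs/day x 5 days/week x 4 weeks = 160 hours of uptime
uptime_hours = 8 * 5 * 4
monthly_cost = hourly_cost(
    replicas=1,
    uptime_hours=uptime_hours,
    price_per_replica_hour=PRICE_PER_REPLICA_HOUR,
)
print(uptime_hours, monthly_cost)  # 160 8000.0
```

Doubling `replicas` doubles the result, which matches the "2 replicas ≈ 2× the per-replica cost" rule of thumb.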

Recommendation: Hourly is ideal for dev/test. Scale to zero when idle to minimize cost.


Example 2: Committed Plan (30 Days)

Scenario: You commit to 1 H100 replica for 30 days.

Factor | Value
Plan | Committed (30 days)
Machine | H100
Replicas | 1
Duration | 30 days

Billing: Fixed price for 1 replica for the 30-day term. The endpoint cannot be stopped during the commitment; you pay for the full period regardless of actual usage.

After the committed period ends: At the time of deploying a committed machine, you choose one of three actions that will apply automatically once the committed term completes:

Option | What happens
Auto-renew | The committed plan renews for another term at the same configuration.
Convert to hourly billing | The endpoint switches to hourly billing so you only pay for actual uptime going forward.
Auto-delete | The endpoint is automatically deleted once the committed period ends.

Recommendation: Committed plans suit predictable, long-running workloads and often offer a lower effective hourly rate than pay-as-you-go.
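One way to check whether a committed plan is actually cheaper for your workload is to compare its effective hourly rate with the hourly price and to compute the break-even uptime. The sketch below assumes the committed price is spread over every hour of the term (24 hours per day). Both prices and both function names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def effective_hourly_rate(committed_price: float, term_days: int) -> float:
    """Committed price spread over every hour of the term (assumes 24 h/day)."""
    return committed_price / (term_days * 24)


def break_even_hours(committed_price: float, hourly_price: float) -> float:
    """Per-replica uptime above which the committed plan costs less than hourly."""
    return committed_price / hourly_price


# Hypothetical prices for illustration only; NOT real E2E rates.
HOURLY_PRICE = 200.0           # per replica-hour on the hourly plan
COMMITTED_PRICE_30D = 108000.0  # per replica for a 30-day term

rate = effective_hourly_rate(COMMITTED_PRICE_30D, term_days=30)
threshold = break_even_hours(COMMITTED_PRICE_30D, HOURLY_PRICE)
print(rate, threshold)  # 150.0 540.0
```

With these made-up numbers, the committed plan only pays off if the replica would otherwise run more than 540 of the 720 hours in the term.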


Frequently Asked Questions (FAQs)

Is there a minimum billing duration?

For hourly pricing, usage is prorated to the billing period. You are charged for the actual time your endpoint is running (e.g. if it runs 45 minutes, you pay for that portion of an hour within the billing cycle).

For committed pricing, you commit for a fixed period (e.g. 30 or 90 days); that period is the effective "minimum" for that reservation.
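The 45-minute hourly case above can be worked through as follows. This sketch assumes simple linear proration by the minute; the exact billing granularity is not stated here, so treat that as an assumption. The function name and the rate are hypothetical.

```python
def prorated_cost(minutes_running: float, price_per_replica_hour: float, replicas: int = 1) -> float:
    """Charge only for the fraction of an hour the endpoint actually ran.

    Assumes linear, per-minute proration (an assumption, not a documented rule).
    """
    return (minutes_running / 60) * price_per_replica_hour * replicas


# Hypothetical rate of 80.0 per replica-hour; 45 minutes is 0.75 of an hour.
print(prorated_cost(45, 80.0))  # 60.0
```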


If my endpoint scales to 0, am I charged anything?

  • Hourly: No. When there are zero replicas running, there is no compute usage to charge. Your cost is zero for that endpoint while it is at 0.
  • Committed: You have reserved capacity for a fixed term. You generally continue to pay for the committed replicas for the remainder of that term even if you scale down, because the capacity is reserved for you.

How is billing handled during auto-scaling events?

Billing follows the actual number of replicas over time. When autoscaling adds or removes replicas, your usage (replica-hours) changes accordingly:

  • Hourly: You pay for each replica-hour. More replicas = higher cost during that period; fewer replicas = lower cost. The system tracks replica count and uptime, so scaling up or down is reflected in your bill.
  • Committed: You pay for the committed replica count for the full term, regardless of scaling. Workers added beyond the committed count are billed hourly.
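For the hourly case, replica-hours can be thought of as the replica count integrated over time. The sketch below computes that from a list of scaling events. The event format, the function name `replica_hours`, and the timestamps are all assumptions for illustration; they do not describe how the platform stores or exposes usage data.

```python
from datetime import datetime
from typing import List, Tuple


def replica_hours(events: List[Tuple[datetime, int]], end: datetime) -> float:
    """Sum replica-hours from time-sorted (timestamp, replica_count) events.

    Each replica count is assumed to hold until the next event (or until `end`).
    """
    total = 0.0
    boundaries = events[1:] + [(end, 0)]
    for (start, count), (next_start, _) in zip(events, boundaries):
        hours = (next_start - start).total_seconds() / 3600
        total += count * hours
    return total


# Hypothetical timeline: scale up at 11:00, back down at 12:00, to zero at 15:00.
events = [
    (datetime(2025, 1, 1, 9, 0), 1),
    (datetime(2025, 1, 1, 11, 0), 3),
    (datetime(2025, 1, 1, 12, 0), 1),
    (datetime(2025, 1, 1, 15, 0), 0),
]
usage = replica_hours(events, end=datetime(2025, 1, 1, 17, 0))
print(usage)  # 8.0  (1x2h + 3x1h + 1x3h + 0x2h)
```

Multiplying `usage` by the per-replica hourly price gives the hourly-plan charge for the window.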

Do I pay for cold start time?

Billing is based on when the replica is running (from the platform's perspective). Time spent in "cold start" (loading the model, initializing) is typically part of the replica's uptime, so it is usually included in compute billing.


How quickly does scaling to 0 happen?

Scale-down timing is platform- and configuration-dependent. It can be influenced by:

  • Scale-down delay or cooldown (to avoid flapping)
  • Graceful shutdown and drain of in-flight requests
  • Cluster and scheduler behavior

Can I change from committed to hourly later?

Yes. When you deploy a committed machine, you choose a post-commitment action, one of which is Convert to hourly billing. Once the committed term ends, the endpoint automatically switches to hourly. You can also use the "update" action on the endpoint and choose the "convert to hourly" option.


Do I pay for Max Workers or only Active Workers?

You pay for replicas that are running (active) for hourly pricing: cost = replica-hours × unit price. So you pay for active workers, not a "max workers" cap.

For committed pricing, you pay for the active (committed) replica count. For example, if your active workers are 1 and max workers are 2, you are initially charged for 1 active worker only at the committed rate. When demand increases and the endpoint scales up to the 2nd worker, that additional worker is billed on an hourly basis.
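The mixed case above (1 committed active worker, with a 2nd worker that scales up on demand) can be sketched as a committed base charge plus hourly burst usage. The function name and all prices are hypothetical and serve only to illustrate the split.

```python
def committed_with_burst_cost(
    committed_workers: int,
    committed_price_per_worker: float,
    burst_replica_hours: float,
    hourly_price: float,
) -> float:
    """Committed workers at the fixed term price, plus extra workers billed hourly."""
    base = committed_workers * committed_price_per_worker
    burst = burst_replica_hours * hourly_price
    return base + burst


# Hypothetical numbers: 1 committed worker, and a 2nd worker that ran for
# 20 hours in total during the term.
total = committed_with_burst_cost(
    committed_workers=1,
    committed_price_per_worker=100000.0,
    burst_replica_hours=20,
    hourly_price=200.0,
)
print(total)  # 104000.0
```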


Is pricing based on number of requests or compute usage?

Pricing is based on compute usage (replica time), not on the number of requests. You are charged for:

  • Hourly: (Usage in hours) × (price per hour per replica) × (number of replicas).
  • Committed: Fixed price per replica for the commitment period.

What happens to billing if the endpoint fails?

When an endpoint fails or is stopped (e.g. error state, you stop it, or it is terminated), the system records the end time for that run. You are charged only for the time the endpoint was actually running up to that end time. You do not continue to be charged for compute after the endpoint has failed or been stopped. If the endpoint is restarted later, a new usage segment starts and billing resumes from that point.
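The segment-based behaviour described above can be illustrated by summing the duration of each run. The segment representation, the function name `billable_hours`, and the timestamps are assumptions for the example, not the platform's actual data model.

```python
from datetime import datetime
from typing import List, Tuple


def billable_hours(segments: List[Tuple[datetime, datetime]]) -> float:
    """Total hours across (start, end) usage segments; gaps between runs are not billed."""
    seconds = sum((end - start).total_seconds() for start, end in segments)
    return seconds / 3600


# Hypothetical timeline: the first run fails at 12:30, the endpoint is restarted at 14:00.
segments = [
    (datetime(2025, 1, 1, 10, 0), datetime(2025, 1, 1, 12, 30)),  # 2.5 h, ends on failure
    (datetime(2025, 1, 1, 14, 0), datetime(2025, 1, 1, 15, 0)),   # 1.0 h after restart
]
print(billable_hours(segments))  # 3.5
```

The 90 minutes between the failure and the restart contribute nothing to the total.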