Model Endpoint costs depend on the **plan type** (Hourly or Committed), the **machine family** (H200, H100, A100, L40S, L4, CPU, Multi-Node), and how you configure **workers**. In short, pricing reflects **how much capacity you use** and **how you pay** for it.

### Two Ways to Pay

| Option | Best for | How you're charged |
|--------|----------|---------------------|
| **Hourly** | Variable or short-term workloads | Per hour of endpoint uptime, per replica. You pay only while the endpoint is running. |
| **Committed (reserved)** | Steady, long-running workloads | Fixed price for a reserved period (e.g. 30 days) per replica. Often lower effective hourly rate; the endpoint cannot be stopped during the committed term. |

### What Affects Your Bill

- **Replicas**: Each replica is one copy of your model. Cost scales with the number of replicas (e.g. 2 replicas ≈ 2× the per-replica cost).
- **Uptime (hourly)**: For hourly pricing, you are charged for the time the endpoint is running within the billing period (prorated if you start or stop mid-period).
- **Resource type (SKU)**: Each GPU or compute type (e.g. A100, L4) has its own price per hour or per committed period.

### Understanding Your Usage and Bills

- Use the **inference dashboard** to see:
  - Which endpoints were running
  - How many replicas each endpoint used
  - Hours of usage (for hourly plans)
  - Unit price and total cost per endpoint
- Usage is prorated: if an endpoint runs for only part of a billing cycle, you're charged only for that portion.
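
As a rough sketch of how the two payment options translate into a charge, the snippet below computes an hourly bill and a committed bill. The function names and both rates are hypothetical placeholders chosen for illustration; they are not E2E APIs or real prices.

```python
def hourly_cost(price_per_replica_hour: float, replicas: int, hours_running: float) -> float:
    """Hourly plan: pay per replica for the (prorated) time the endpoint is up."""
    return price_per_replica_hour * replicas * hours_running


def committed_cost(price_per_replica_term: float, replicas: int) -> float:
    """Committed plan: fixed price per replica for the whole term, regardless of uptime."""
    return price_per_replica_term * replicas


# Hypothetical rates, for illustration only (not real prices).
HOURLY_RATE = 2.0     # currency units per replica-hour
TERM_PRICE = 1000.0   # currency units per replica per 30-day term

# Two replicas running for 45 minutes on the hourly plan.
print(hourly_cost(HOURLY_RATE, replicas=2, hours_running=0.75))  # 3.0

# Two replicas on a committed term.
print(committed_cost(TERM_PRICE, replicas=2))  # 2000.0
```

Note that uptime appears only in the hourly function: under a committed plan the charge does not depend on how long the endpoint actually ran.
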
### Plan Types by Machine Family

Each endpoint plan uses one of several underlying machine families, tuned for different workloads:

| Machine Family | Type | Best For | Key Highlights |
|---------------|------|----------|----------------|
| **H200** | GPU | Very large LLMs, high-throughput inference | Next-generation NVIDIA GPU with high memory capacity |
| **H100** | GPU | Heavy LLMs, high QPS production workloads | Flagship NVIDIA GPU for enterprise-scale AI |
| **A100** | GPU | NLP, vision, recommendation models | Proven, versatile GPU for general-purpose AI inference |
| **L40S** | GPU | Inference + light training/fine-tuning | Balanced cost and performance for mixed workloads |
| **L4** | GPU | AI video, inference, graphics workloads | Energy-efficient, versatile GPU for enterprise AI |
| **CPU** | vCPU | Applications, APIs, databases, lightweight inference | Dedicated virtual compute with scalable vCPUs (no GPU) |
| **Multi-Node** | Distributed GPU | Massive models, real-time production AI | Multi-node architecture for scalability, reliability, and low latency |

---

For detailed pricing, see the [E2E Calculator](https://calculator.e2enetworks.com/estimate-pricing).

### What affects cost?

- **Instance type** (CPU/GPU) and size
- **Active Workers / Max Workers** configuration (serverless scaling)
- **Idle timeout** and other scaling policy parameters (when applicable)

---

## Pricing Examples

*Note: The values below are illustrative. Use the [E2E Calculator](https://calculator.e2enetworks.com/estimate-pricing) for current rates.*

### Example 1: Hourly Plan

**Scenario:** You run 1 L4 replica for 8 hours per day during development, 5 days a week.

| Factor | Value |
|--------|-------|
| Plan | Hourly |
| Machine | L4 |
| Replicas | 1 |
| Uptime | 8 hrs/day × 5 days × 4 weeks ≈ 160 hrs/month |

**Billing:** Cost = 160 replica-hours × (price per hour per L4 replica). You pay only for the 160 hours the endpoint runs; scale to 0 when not in use to avoid charges.
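
The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: `monthly_replica_hours` is a hypothetical helper and `PLACEHOLDER_L4_RATE` is an invented figure, not the actual L4 price.

```python
def monthly_replica_hours(hours_per_day: int, days_per_week: int, weeks: int, replicas: int = 1) -> int:
    """Replica-hours accumulated over a working schedule."""
    return hours_per_day * days_per_week * weeks * replicas


# Example 1 schedule: 8 hrs/day, 5 days/week, 4 weeks, 1 replica.
replica_hours = monthly_replica_hours(8, 5, 4)
print(replica_hours)  # 160

# Placeholder rate (assumption); substitute the current L4 rate from the calculator.
PLACEHOLDER_L4_RATE = 1.0
estimated_cost = replica_hours * PLACEHOLDER_L4_RATE
print(estimated_cost)  # 160.0
```
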

**Recommendation:** Hourly is ideal for dev/test. Scale to zero when idle to minimize cost.

---

### Example 2: Committed Plan (30 Days)

**Scenario:** You commit to 1 H100 replica for 30 days.

| Factor | Value |
|--------|-------|
| Plan | Committed (30 days) |
| Machine | H100 |
| Replicas | 1 |
| Duration | 30 days |

**Billing:** Fixed price for 1 replica for the 30-day term. The endpoint cannot be stopped during the commitment; you pay for the full period regardless of actual usage.

**After the committed period ends:** When you deploy a committed machine, you choose one of three actions that applies automatically once the committed term completes:

| Option | What happens |
|--------|-------------|
| **Auto-renew** | The committed plan renews for another term at the same configuration. |
| **Convert to hourly billing** | The endpoint switches to hourly billing so you only pay for actual uptime going forward. |
| **Auto-delete** | The endpoint is automatically deleted once the committed period ends. |

**Recommendation:** Committed plans suit predictable, long-running workloads and often offer a lower effective hourly rate than pay-as-you-go.

---

## Frequently Asked Questions (FAQs)

#### Is there a minimum billing duration?

For **hourly** pricing, usage is prorated to the billing period. You are charged for the actual time your endpoint is running (e.g. if it runs 45 minutes, you pay for that portion of an hour within the billing cycle). For **committed** pricing, you commit for a fixed period (e.g. 30 or 90 days); that period is the effective "minimum" for that reservation.

---

#### If my endpoint scales to 0, am I charged anything?

- **Hourly:** No. When there are zero replicas running, there is no compute usage to charge. Your cost is zero for that endpoint while it is at 0.
- **Committed:** You have reserved capacity for a fixed term.
  You generally continue to pay for the **committed replicas** for the remainder of that term even if you scale down, because the capacity is reserved for you.

---

#### How is billing handled during auto-scaling events?

Billing follows the **actual number of replicas** over time. When autoscaling adds or removes replicas, your usage (replica-hours) changes accordingly:

- **Hourly:** You pay for each replica-hour. More replicas = higher cost during that period; fewer replicas = lower cost. The system tracks replica count and uptime, so scaling up or down is reflected in your bill.
- **Committed:** You pay for the **committed replica count** for the full term.

---

#### Do I pay for cold start time?

Billing is based on when the replica is **running** (from the platform's perspective). Time spent in "cold start" (loading the model, initializing) is typically part of the replica's uptime, so it is usually included in compute billing.

---

#### How quickly does scaling to 0 happen?

Scale-down timing is **platform- and configuration-dependent**. It can be influenced by:

- Scale-down delay or cooldown (to avoid flapping)
- Graceful shutdown and drain of in-flight requests
- Cluster and scheduler behavior

---

#### Can I change from committed to hourly later?

Yes. When you deploy a committed machine, you choose a post-commitment action, and one of the options is **Convert to hourly billing**. Once the committed term ends, the endpoint automatically switches to hourly. You can also use the "update" action on the endpoint and choose the "convert to hourly" option.

---

#### Do I pay for Max Workers or only Active Workers?

For **hourly** pricing, you pay for **replicas that are running (active)**: cost = replica-hours × unit price. So you pay for **active** workers, not a "max workers" cap. For **committed** pricing, you pay for the **active (committed) replica count**.
For example, if your active workers are 1 and max workers are 2, you are initially charged for 1 active worker only at the committed rate. When demand increases and the endpoint scales up to the 2nd worker, that additional worker is billed on an **hourly** basis.

---

#### Is pricing based on number of requests or compute usage?

Pricing is based on **compute usage** (replica time), **not** on the number of requests. You are charged for:

- **Hourly:** (Usage in hours) × (price per hour per replica) × (number of replicas).
- **Committed:** Fixed price per replica for the commitment period.

---

#### What happens to billing if the endpoint fails?

When an endpoint **fails** or is **stopped** (e.g. error state, you stop it, or it is terminated), the system records the **end time** for that run. You are charged only for the time the endpoint was **actually running** up to that end time. You do not continue to be charged for compute after the endpoint has failed or been stopped. If the endpoint is restarted later, a new usage segment starts and billing resumes from that point.
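
One way to picture usage segments is as a list of (start, end) pairs, one per run, with billable time being the sum of their durations. The sketch below assumes that record shape purely for illustration; the function name and data layout are hypothetical and do not describe the platform's actual data model.

```python
from datetime import datetime


def billable_hours(segments: list[tuple[datetime, datetime]]) -> float:
    """Sum the running time across usage segments, in hours.

    Time between segments (after a failure or stop, before a restart) is not counted.
    """
    total_seconds = sum((end - start).total_seconds() for start, end in segments)
    return total_seconds / 3600


# Hypothetical runs on the same day.
segments = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 11, 30)),  # ran 2.5 h, then failed
    (datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 15, 0)),  # restarted, ran 1 h
]

print(billable_hours(segments))  # 3.5
```

The 2.5-hour gap between the failure at 11:30 and the restart at 14:00 contributes nothing to the total, which mirrors the behavior described above.
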