
Features

TIR Model Endpoints provide the building blocks you need to deploy and operate inference APIs reliably.



1. Deployment options​

  • Deploy Using Pre-built Containers (Provided by TIR) – Before launching a service with TIR pre-built containers, you must first create a TIR Model and upload the required model files. These containers automatically:

    • Download model files from an EOS (E2E Object Storage) bucket.
    • Initialize the containerized API server for inference.
      Once the endpoint is ready, you can make synchronous API requests for inference. ➡️ Learn more in this tutorial.
  • Deploy Using Your Own Container – You can also deploy using your own Docker image (public or private). Once deployed, inference requests can be made synchronously. Optionally, attach a TIR Model to automate downloading model files from EOS. ➡️ Learn more in this tutorial.
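The synchronous request flow above can be sketched in Python. The /v1/completions route, payload fields, and URL here are illustrative assumptions; copy the exact URL and auth token from your endpoint's API Request tab.

```python
def build_inference_request(base_url: str, token: str, prompt: str) -> dict:
    """Assemble a synchronous completion request for a deployed endpoint.

    The /v1/completions route and payload fields are assumptions for an
    OpenAI-compatible LLM endpoint; your endpoint's API Request tab shows
    the exact URL, route, and token to use.
    """
    return {
        "url": f"{base_url}/v1/completions",
        "headers": {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        "json": {"model": "your-model-name", "prompt": prompt, "max_tokens": 100},
    }

req = build_inference_request(
    "https://infer.e2enetworks.net/project/p-5520/endpoint/is-8404",
    "YOUR_AUTH_TOKEN",
    "Tell me about AI",
)
# Send it with any HTTP client, e.g. (pip install requests):
# response = requests.post(req["url"], headers=req["headers"], json=req["json"])
# print(response.json())
```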

  • Model sources – Attach a Model Repository (e.g. S3, object storage) or pull directly from Hugging Face so the platform can load the model automatically when the endpoint starts.

  • Validation before deploy – Validate your model and configuration before creating an endpoint to catch issues early.

  • Multiple frameworks – Support for LLMs, embeddings, speech (ASR/TTS), diffusion, and other model types across regions and GPU types.


2. Serverless Configuration​

Serverless configuration controls how your model endpoint automatically scales based on incoming traffic and runtime load. It allows you to define the minimum and maximum number of workers, ensuring your service remains responsive during peak usage while scaling down during idle periods to reduce cost.

  • Set Active Workers to define the minimum number of running workers.
  • Set Max Workers to cap the maximum scale during high traffic.
  • When Active Workers and Max Workers are different, autoscaling is enabled.
  • Scale to zero when idle for true pay-only-when-used billing.
  • Manual scaling – Scale up or down on demand when you need an immediate change.

Serverless configuration works together with the Scaling Policy to determine when new workers are added or removed based on real-time metrics.

Metric Types​

The Metric Type defines what signal drives autoscaling. Three options are supported:

1. Concurrent Request Count

Scales based on the number of requests being actively processed by a single worker at the same time. When the number of in-flight requests exceeds the configured Target Value, a new worker is added.

  • Best for: latency-sensitive LLM workloads where each request holds GPU resources for the full generation duration.
  • Example: set Target Value to 10 – a new worker spins up once a worker is handling 10 concurrent requests.

2. Request Rate per Second

Scales based on the rate of incoming requests (requests per second) arriving at a single worker. When throughput exceeds the Target Value, new workers are added to distribute load.

  • Best for: high-throughput APIs where requests are short-lived and arrival rate is the primary bottleneck.
  • Example: set Target Value to 50 – a new worker is added once a worker is receiving more than 50 requests per second.

3. Custom

Scales based on a runtime-specific metric exposed by the serving framework (e.g. vLLM, SGLang, Triton, PyTorch). You specify the Metric Name and Target Value directly.

  • Best for: workloads where internal model state (queue depth, GPU cache usage, token throughput) is a better signal than raw request counts.
  • See Supported Custom Metrics by Runtime below for available metric names per framework.
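The target-value rule behind the first two metric types can be sketched as a ceiling division. This is an illustrative approximation only; the platform's actual controller also applies cooldowns and idle timeouts before changing the replica count.

```python
import math

def desired_workers(observed_per_worker_load: float, target_value: float,
                    current_workers: int, min_workers: int, max_workers: int) -> int:
    """Rough sketch of target-value autoscaling (illustrative only).

    observed_per_worker_load is the metric averaged per worker, e.g.
    concurrent requests or requests per second. The result is clamped
    to the configured Active Workers / Max Workers range.
    """
    total_load = observed_per_worker_load * current_workers
    needed = math.ceil(total_load / target_value) if target_value > 0 else min_workers
    return max(min_workers, min(needed, max_workers))

# 12 concurrent requests per worker against a target of 10, with 2 workers:
# the total in-flight load is 24, so 3 workers are needed.
print(desired_workers(12, 10, current_workers=2, min_workers=1, max_workers=5))  # 3
```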

Scaling Policy & Custom Metrics​

The Scaling Policy manages the elasticity of your model endpoint. By defining specific metrics and thresholds, you can ensure your service handles traffic spikes while minimizing costs during idle periods.

Core Scaling Parameters​

To fine-tune how and when your endpoint scales, you must configure these four key parameters:

  • Metric Type – The data source used to trigger scaling (Concurrent Count, Request Rate, or Custom).
  • Target Value – The desired load handled by a single worker. When the workload exceeds this value, additional workers are automatically added.
  • Idle Timeout – Amount of time, in seconds, that idle workers stay running and ready for new requests. Workers remain billable during this period.
  • Initial Cooldown Period – Amount of time the service waits after a trigger before scaling down to zero replicas, helping prevent rapid fluctuations in replica count.

Custom Metrics​

Metrics that reflect the model's internal state (such as running requests, queue size, or GPU usage) and allow scaling based on actual workload.

Supported Custom Metrics by Runtime​

1. vLLM

  • vllm:num_requests_running – Number of requests currently being processed on the GPU.
  • vllm:num_requests_waiting – Number of requests queued and waiting to be scheduled.
  • vllm:num_requests_swapped – Number of requests swapped from GPU to CPU due to memory pressure.
  • vllm:gpu_cache_usage_perc – GPU KV-cache usage (1 = 100%), useful for preventing GPU out-of-memory.
  • vllm:cpu_cache_usage_perc – CPU KV-cache usage (1 = 100%).

2. SGLang

  • sglang:num_running_reqs – Number of requests currently running in the decoding engine.
  • sglang:num_queue_reqs – Number of requests waiting in the scheduler queue.
  • sglang:num_used_tokens – Total number of tokens currently in use.
  • sglang:gen_throughput – Token generation throughput in tokens per second.

3. Triton

  • nv_inference_pending_request_count – Number of inference requests waiting to be executed by the backend.
  • nv_inference_exec_count – Number of inference batch executions currently running.
  • nv_inference_request_success – Count of successfully processed inference requests.
  • nv_inference_request_failure – Count of failed inference requests.
  • nv_inference_count – Total number of inferences performed, where batch size is counted individually.

4. PyTorch

  • GPUUtilization – Percentage of GPU compute capacity currently in use.
  • GPUMemoryUtilization – Percentage of GPU memory currently allocated.
  • GPUMemoryUsed – Amount of GPU memory used, measured in megabytes.
  • CPUUtilization – Percentage of CPU capacity currently in use on the host.
  • MemoryUtilization – Percentage of system memory currently in use.
  • DiskUtilization – Percentage of disk space currently in use on the host.

Example Configuration:

To scale a vLLM-based LLM when the number of active running requests becomes high:

  • Metric Type: Custom
  • Metric Name: vllm:num_requests_running
  • Target Value: 32

This configuration scales the endpoint when a worker is actively processing around 32 requests at the same time, helping maintain stable latency as load increases.
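Runtimes such as vLLM expose these metrics in Prometheus text exposition format, commonly on a /metrics route (the exact route and port depend on your deployment). A minimal parser sketch for reading one value from that text:

```python
def read_metric(prometheus_text: str, metric_name: str) -> float:
    """Extract a gauge/counter value from Prometheus text exposition.

    Handles simple lines like 'vllm:num_requests_running 3.0'; labeled
    series ('metric{label="x"} 3.0') are matched on the metric name prefix.
    """
    for line in prometheus_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        if line.startswith(metric_name):
            return float(line.rsplit(" ", 1)[-1])
    raise KeyError(metric_name)

# Sample scrape output (shortened, illustrative):
sample = """\
# HELP vllm:num_requests_running Number of requests on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running 3.0
vllm:num_requests_waiting 7.0
"""
print(read_metric(sample, "vllm:num_requests_running"))  # 3.0
```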


3. Operations​

  • Centralized logs – View inference logs and per-replica logs (e.g. last N lines or last N seconds) from one place to debug and audit requests.

  • Deployment and lifecycle events – Track when endpoints are created, updated, started, stopped, or restarted so you know the state of your deployments.

  • Monitoring – Use hardware metrics (CPU, GPU, memory) and service metrics (requests, latency, errors) to monitor health and performance and set alerts.

  • Access logs – See who called your endpoint and when (including per-hour breakdowns) for compliance and usage analysis.

  • Async Invocation – For long-running or batch workloads, use async inference: submit a request, get a request ID, and poll for status or results so your app doesn't block. Queues and status tracking are built in.

  • Lifecycle actions – Start, stop, update, or delete endpoints via API or UI. (Committed/reserved endpoints cannot be stopped during the commitment period.)


4. Network & Security​

Use Security Groups to control which IPs or CIDR ranges can reach your endpoint on specific ports (for example, only allowing traffic from your application nodes or VPC CIDR on port 8080) and block all other sources.

Use Attach VPC to give your inference endpoint a private IP address inside your E2E VPC, so it is reachable only from nodes and services on that private network (no public internet exposure).

What VPC Connect does​

  • Scope – Connects a TIR Inference endpoint to an E2E MyAccount VPC using a private IP from your VPC CIDR.
  • Traffic path – Requests flow from your VPC nodes → private VPC IP → inference endpoint, entirely inside E2E's network.
  • Access model – The endpoint is reachable only from nodes/services attached to the same VPC (and allowed by security groups).
  • Use cases – Secure, private access from app backends, internal microservices, and compliance-sensitive workloads.

Prerequisites​

You need the following before you can attach VPC Connect to an endpoint:

  • MyAccount VPC with E2E CIDR (MyAccount → Networking → VPC) – The VPC must be created using E2E-provided CIDR blocks. Custom CIDR blocks are not supported for VPC Connect.
  • Node attached to the VPC (MyAccount → Compute → Nodes) – At least one node/instance attached to the VPC. You will SSH into this node to call the private endpoint.
  • TIR account (TIR portal) – An active TIR account with access to the Inference service.
  • Running inference endpoint (TIR → Inference → Model Endpoints) – A deployed endpoint (e.g. LLM, embeddings, vision) in the same region as your VPC.
info

Important: VPC Connect only works with VPCs created using E2E-provided CIDR blocks. If you created a VPC with a custom CIDR, create a new VPC using E2E's default CIDR allocation before using VPC Connect.

How it works:​

  1. Create/verify VPC and node (MyAccount)

    • Create a VPC using E2E-provided CIDR blocks and attach at least one node to it.
    • This node will be used to initiate private API calls.
  2. Reserve and attach VPC IP

    • In Network → Reserve IP, click Reserve from VPC and select your VPC to get a Reserved VPC IP.
    • On your inference endpoint (Inference → Model Endpoints → Network & Security → VPC), choose Attach VPC, select the reserved IP, and attach it.
  3. Configure security groups

    • On the endpoint's Network tab, create or select a Security Group.
    • Add an inbound rule: TCP, port 8080 (or your endpoint port), with Source set to your MyAccount VPC node IP (e.g. 10.8.220.6), not the Reserved VPC IP; alternatively, select Any Network.
    • Save the security group configuration.
  4. Get the Private Endpoint URL

    • Open your endpoint and go to the API Request tab.
    • Switch to the Private Endpoint view.
    • Copy the sample curl command that uses the private VPC IP (for example, http://10.8.220.7:8080/v1/completions).
  5. Call the endpoint from inside the VPC (MyAccount node)

    • From your machine, SSH into a node attached to the VPC:

      ssh root@<your-myaccount-node-ip>
    • From that SSH session, call the private endpoint URL (example):

      curl --location 'http://10.8.220.7:8080/v1/completions' \
      --header 'Content-Type: application/json' \
      --header "Authorization: Bearer YOUR_AUTH_TOKEN" \
      --data '{
      "model": "your-model-name",
      "prompt": "Tell me about AI",
      "max_tokens": 100
      }'
    • The exact and most up-to-date curl example is always available in the endpoint's API Request → Private Endpoint section.


5. Alerts​

Alerts are threshold-based triggers that enable proactive resource management for your inference endpoints. When a monitored metric exceeds a defined limit, TIR automatically sends an email notification, allowing you to respond before performance or availability is affected.

For detailed configuration, see Alert Management.


6. Playground​

Endpoint playground – Use the built-in playground to quickly test your deployed endpoints without writing code. You can configure generation parameters such as output length, temperature, top p, top k, and repetition penalty, then send sample prompts to observe how the model responds.

  • Output length (max tokens) – Controls how long the model's response can be. Higher values allow more detailed answers but consume more tokens and may be slower.

  • Temperature – Controls randomness. Lower values (e.g. 0.1–0.3) make outputs more deterministic; higher values (e.g. 0.7–1.0) make them more diverse and creative.

  • Top p – Enables nucleus sampling by limiting choices to the smallest set of tokens whose cumulative probability is ≥ p (e.g. 0.9). Lower values reduce randomness by shrinking this set.

  • Top k – Limits each generation step to the k most likely tokens (e.g. k=50). Smaller k makes outputs more focused; larger k allows more variety.

  • Repetition penalty – Penalizes tokens that have already appeared in the context to reduce repeated phrases or loops. Higher values make the model avoid repeating itself more aggressively.
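To see how top k and top p interact, here is an illustrative sketch of both filters applied to a toy token distribution. Real serving frameworks apply these to logits during sampling, not to a plain dictionary; this only shows how the candidate set shrinks.

```python
def filter_candidates(probs: dict, top_k: int, top_p: float) -> list:
    """Illustrative sketch of top-k then top-p (nucleus) filtering.

    Keep the k most likely tokens, then keep the smallest prefix of that
    ranking whose cumulative probability reaches top_p.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break  # nucleus reached; drop the remaining tail
    return kept

dist = {"the": 0.5, "a": 0.3, "this": 0.15, "that": 0.05}
print(filter_candidates(dist, top_k=3, top_p=0.9))  # ['the', 'a', 'this']
```

Lowering either parameter shrinks the set the sampler chooses from, which is why both reduce randomness.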

info

The playground currently supports chat models only. To test non-chat endpoints, use your own client or direct API calls.


7. Advanced Runtime Settings​

Configure how your model loads and runs at inference time. These settings are passed as engine arguments (framework-specific) or resource details (environment variables, disk, mount path). Framework support varies (e.g. vLLM, SGLang). Configure them before launch, or update an existing endpoint.

  • LLM settings – Control model loading, data type (dtype), quantization, context length, and memory allocation. Use these to optimize for speed vs. quality (e.g. FP16 vs. INT8), set max context length, and tune GPU memory usage.

  • System settings – Configure the inference runtime: scheduler type, batching behavior, concurrency limits, and logging level. Tune these to balance throughput, latency, and resource utilization under load.

  • Tokenizer settings – Specify tokenizer path, mode (e.g. auto vs. manual), pool size, and chat template. Essential for correct tokenization and chat formatting when using custom models or non-standard tokenizers.

  • Environment variables – Pass runtime env vars such as API keys, secrets, file paths, or framework-specific options. Use for credentials, config overrides, and any values your model or container expects at startup.

8. Async invocation​

Use async invocation when you want to submit an inference request, get an immediate acknowledgment, and have the platform process the request in the background and store the result in a destination you choose. Your application does not need to keep the connection open or poll in a tight loop until the result is ready.

What you get when you enable async​

  • Queued processing – Requests are queued and processed in the background. You get a quick response with a request ID and response location instead of waiting for the full inference result.
  • Stored results – When processing is complete, the response is written to the destination you configured (e.g. a dataset in object storage). You can then read the result from that location or use the async status API with the request ID to check status and get the result.
  • Serverless worker lifecycle – A serverless worker is launched to process async invocations. The worker is active and billed only while there are requests in the queue; once all requests are processed and the queue is empty, the worker automatically scales down to zero.

When you make an async call, the API response includes:

  • request_id – Use this to check status or locate the result (e.g. via the async status API or the response file path).
  • response_location – Indicates where the response will be (or has been) stored once the request is complete.
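A polling loop around these fields might look like the following sketch. The status values ("queued", "running", "completed", "failed") and the shape of the status response are assumptions here; check the async status API documentation for the real field names, and swap the callable for an HTTP call to that API.

```python
import time

def poll_until_done(get_status, request_id: str, interval_s: float = 2.0,
                    timeout_s: float = 300.0) -> dict:
    """Poll an async request until it reaches a terminal state.

    get_status is any callable taking a request ID and returning a dict
    like {"status": ...}; in practice this would wrap an HTTP call to
    the async status API (field names here are assumptions).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = get_status(request_id)
        if result.get("status") in ("completed", "failed"):
            return result
        time.sleep(interval_s)  # back off between status checks
    raise TimeoutError(f"async request {request_id} still pending")

# Hypothetical status responses, standing in for real API calls:
responses = iter([
    {"status": "queued"},
    {"status": "running"},
    {"status": "completed", "response_location": "api/is-8404/request/abc.json"},
])
final = poll_until_done(lambda rid: next(responses), "abc", interval_s=0.01)
print(final["status"])  # completed
```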

Setup: enable async and configure queues​

  1. Enable async for your endpoint (e.g. in the endpoint configuration or Async config section in the dashboard or API).
  2. Configure where responses go (Target/destination) and which routes use async (Application endpoint/routes), as described below.

Application endpoint: which routes are async​

You specify the list of routes (API paths) for which async behavior is enabled. For example, you might enable async only for a long-running route such as /v1/chat/completions or a custom batch route.

What to configureDetails for you
RoutesEnter the routes (paths) for which you want async invocation. Only these routes will accept async requests and return request_id and response_location instead of the full response in the same call.
HTTP methodsThese routes must only accept POST and OPTIONS. Ensure your endpoint or API is configured so that the async routes do not accept other methods (e.g. GET) for the async flow.

Make sure the routes you list are the ones your client will call when sending async inference requests.

Target: where async responses are stored​

You choose a destination where the platform will store the response when an async request completes.

  • Dataset (object storage / EOS) – Currently supported: a dataset in object storage (EOS). When an async request completes, the platform writes a JSON file containing the response.
  • File path – The file is created at api/<id>/request/<request_id>.json relative to the root of the dataset, where <id> is a platform-generated identifier (e.g. is-8404).

For example, if your dataset root is s3://my-bucket/async-results/, the response for request ID 1fa135a4-a1cd-4ee3-ac0a-320c4a025c64 might be at .../async-results/api/is-8404/request/1fa135a4-a1cd-4ee3-ac0a-320c4a025c64.json. Use the response_location or request_id from the async response to construct or look up this path.

You must have a dataset (EOS/object storage) created and linked to your project; then select it as the async target when configuring the endpoint. Ensure your application has read access to that dataset so it can fetch the result file once the request is complete.
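Given the documented pattern, the object key for a result can be built from the platform-generated identifier and the request ID. A small sketch (the is-8404 identifier below is the one from the example above):

```python
def async_result_key(deployment_id: str, request_id: str) -> str:
    """Build the object key where an async response JSON is written,
    following the documented pattern api/<id>/request/<request_id>.json.

    deployment_id is the platform-generated identifier (e.g. "is-8404");
    prefer the response_location returned by the API when available.
    """
    return f"api/{deployment_id}/request/{request_id}.json"

key = async_result_key("is-8404", "1fa135a4-a1cd-4ee3-ac0a-320c4a025c64")
print(key)  # api/is-8404/request/1fa135a4-a1cd-4ee3-ac0a-320c4a025c64.json
# The full object path is the dataset root plus this key.
```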

WebSocket API for Model Endpoints​

This explains how to call model endpoints over WebSocket on the platform – for example, streaming speech-to-text (e.g. Nemotron speech streaming), real-time transcription, or other bidirectional streaming use cases.


Overview​

  • When to use WebSocket – Use WebSockets when the model or service supports streaming (e.g. send audio chunks, receive partial/final transcripts) or when you need bidirectional communication over a single connection.
  • Base URL – Find the base URL in the Overview section of the model endpoint under Endpoint URL, or go to Sample API Request and use the Root Endpoint. Example: https://infer.e2enetworks.net/project/p-5520/endpoint/is-8458.
  • WebSocket path – Append the WebSocket route to your base URL and change the base URL protocol to wss. For example, if the WebSocket route is /ws, the final WebSocket URL will be wss://infer.e2enetworks.net/project/.../endpoint/.../ws.
  • Authentication – WebSocket connections on our platform work only with headers; send the Authorization header when connecting (e.g. Authorization: Bearer <token>).

To get or generate a token: go to API Tokens in your project (sidebar), click Create Token and enter a name (or use an existing token), then copy the Auth Token from the list. See API Tokens for details. You use this same token in the Sample API Request section when testing the endpoint.

Generic WebSocket connection​

The examples below show how to form the WebSocket URL and pass the Authorization header when connecting. Implement your own send/receive logic in the indicated places.

Python

Uses the websocket-client library. Install: pip install websocket-client.


import websocket

# ── Configuration ─────────────────────────────────────────────────────────────
# Base URL from your endpoint (e.g. from the endpoint detail page in the dashboard)
API_BASE_URL = "wss://<BASE_URL>"  # e.g. "wss://infer.e2enetworks.net/project/p-5520/endpoint/is-8482"
AUTH_TOKEN = "<AUTH_TOKEN>"  # copy the Auth Token from API Tokens (same token as in Sample API Request)
# WebSocket route (e.g. "/ws") – use the route your endpoint expects
WS_ROUTE = "/ws"
# ─────────────────────────────────────────────────────────────────────────────

# WebSocket URL = base URL + WebSocket route
WS_URL = f"{API_BASE_URL}{WS_ROUTE}"

# Authentication must be sent in headers (required on this platform)
headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}

ws = websocket.WebSocket()
ws.connect(WS_URL, header=headers)

# Optional: read initial "ready" message if your endpoint sends one
# ready_msg = ws.recv()
# ... parse and use as needed

# --- Implement your logic here: send data ---
# e.g. ws.send(binary_or_text_payload)
# ...

# --- Implement your logic here: receive messages ---
# msg = ws.recv()
# ... process message ...
# ...

ws.close()
Node.js

Uses the ws library. Install: npm install ws (or yarn add ws). Authentication must be sent in headers when creating the WebSocket; query parameters are not used for auth.


const WebSocket = require("ws");

// ── Configuration ─────────────────────────────────────────────────────────────
// Base URL from your endpoint (e.g. from the endpoint detail page in the dashboard)
const API_BASE_URL = "wss://<BASE_URL>"; // e.g. "wss://infer.e2enetworks.net/project/p-5520/endpoint/is-8482"
const AUTH_TOKEN = "<AUTH_TOKEN>"; // copy the Auth Token from API Tokens (same token as in Sample API Request)
// WebSocket route (e.g. "/ws") – use the route your endpoint expects
const WS_ROUTE = "/ws";
// ─────────────────────────────────────────────────────────────────────────────

// WebSocket URL = base URL + WebSocket route
const WS_URL = `${API_BASE_URL}${WS_ROUTE}`;

// Authentication must be sent in headers (required on this platform)
const ws = new WebSocket(WS_URL, {
  headers: {
    Authorization: `Bearer ${AUTH_TOKEN}`,
  },
});

ws.on("open", function () {
  // Optional: read initial "ready" message if your endpoint sends one
  // ws.once("message", (data) => { ... });

  // --- Implement your logic here: send data ---
  // e.g. ws.send(payload);
  // ...
});

ws.on("message", function (data) {
  // --- Implement your logic here: handle incoming messages ---
  // ...
});

ws.on("close", function () {
  // --- Implement your logic here: cleanup or final output ---
  // ...
});

Flow in short​

  1. Configure – Enable async for the endpoint, set the target (dataset), and list the routes that should be async (POST/OPTIONS only).
  2. Call – Send a POST request to one of those routes. The API returns quickly with request_id and response_location.
  3. Process – The platform queues the request and processes it in the background.
  4. Result – When done, the response is written to the dataset at api/<id>/request/<request_id>.json (where <id> is a platform-generated identifier). You can poll the async status API with the request ID to see when it is complete, then read the JSON file from the dataset (or use the response location) to get the result.