Features
TIR Model Endpoints provide the building blocks you need to deploy and operate inference APIs reliably.
Deployment Options
Deploy with TIR prebuilt containers or your own image.
Serverless Configuration
Configure workers, autoscaling policies, and metric-based scaling.
Operations
Monitor logs, metrics, lifecycle events, and endpoint actions.
Network and Security
Use security groups and private VPC access for endpoints.
Playground
Test chat endpoints with configurable generation controls.
Advanced Runtime Settings
Tune LLM, tokenizer, system, and environment settings.
Async Invocation
Run long requests in the background with queued processing.
WebSocket API
Connect to streaming and bidirectional inference routes.
Alerts Management
Configure threshold-based triggers and automated email notifications for proactive resource monitoring.
1. Deployment options

- Deploy Using Pre-built Containers (Provided by TIR): Before launching a service with TIR pre-built containers, you must first create a TIR Model and upload the required model files. These containers automatically:
  - Download model files from an EOS (E2E Object Storage) bucket.
  - Initialize the containerized API server for inference.

  Once the endpoint is ready, you can make synchronous API requests for inference. ➡️ Learn more in this tutorial.
- Deploy Using Your Own Container: You can also deploy using your own Docker image (public or private). Once deployed, inference requests can be made synchronously. Optionally, attach a TIR Model to automate downloading model files from EOS. ➡️ Learn more in this tutorial.
- Model sources: Attach a Model Repository (e.g. S3, object storage) or pull directly from Hugging Face so the platform can load the model automatically when the endpoint starts.
- Validation before deploy: Validate your model and configuration before creating an endpoint to catch issues early.
- Multiple frameworks: Support for LLMs, embeddings, speech (ASR/TTS), diffusion, and other model types across regions and GPU types.
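Once an endpoint is deployed, a synchronous inference call is a plain HTTPS POST. The sketch below builds such a request for an OpenAI-compatible completions route using only the standard library; the URL, token, and model name are placeholders you would replace with values from your endpoint's Sample API Request page.

```python
import json
import urllib.request

# Placeholders: substitute the real values from the TIR dashboard.
ENDPOINT_URL = "https://infer.e2enetworks.net/project/<project-id>/endpoint/<endpoint-id>/v1/completions"
AUTH_TOKEN = "<AUTH_TOKEN>"

def build_request(prompt: str, model: str, max_tokens: int = 100) -> urllib.request.Request:
    """Build a synchronous completion request for an OpenAI-compatible route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        ENDPOINT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {AUTH_TOKEN}",
        },
        method="POST",
    )

req = build_request("Tell me about AI", model="your-model-name")
# To actually send it (requires a live endpoint and a valid token):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

The request body matches the `curl` example shown in the dashboard; only the transport differs.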
2. Serverless Configuration
Serverless configuration controls how your model endpoint automatically scales based on incoming traffic and runtime load. It allows you to define the minimum and maximum number of workers, ensuring your service remains responsive during peak usage while scaling down during idle periods to reduce cost.
- Set Active Workers to define the minimum number of running workers.
- Set Max Workers to cap the maximum scale during high traffic.
- When Active Workers and Max Workers are different, autoscaling is enabled.
- Scale to zero when idle for true pay-only-when-used billing.
- Manual scaling: Scale up or down on demand when you need an immediate change.
Serverless configuration works together with the Scaling Policy to determine when new workers are added or removed based on real-time metrics.
Metric Types
The Metric Type defines what signal drives autoscaling. Three options are supported:
1. Concurrent Request Count
Scales based on the number of requests being actively processed by a single worker at the same time. When the number of in-flight requests exceeds the configured Target Value, a new worker is added.
- Best for: latency-sensitive LLM workloads where each request holds GPU resources for the full generation duration.
- Example: set Target Value to 10; a new worker spins up once a worker is handling 10 concurrent requests.
2. Request Rate per Second
Scales based on the rate of incoming requests (requests per second) arriving at a single worker. When throughput exceeds the Target Value, new workers are added to distribute load.
- Best for: high-throughput APIs where requests are short-lived and arrival rate is the primary bottleneck.
- Example: set Target Value to 50; a new worker is added once a worker is receiving more than 50 requests per second.
3. Custom
Scales based on a runtime-specific metric exposed by the serving framework (e.g. vLLM, SGLang, Triton, PyTorch). You specify the Metric Name and Target Value directly.
- Best for: workloads where internal model state (queue depth, GPU cache usage, token throughput) is a better signal than raw request counts.
- See Supported Custom Metrics by Runtime below for available metric names per framework.
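All three metric types reduce to the same target-tracking idea: add workers until no single worker carries more than the Target Value. The sketch below encodes that rule, with `total_load` being the metric summed across workers (in-flight requests, requests/sec, or a custom metric value). This is a simplified model for intuition; the platform's exact scaling algorithm may differ.

```python
import math

def workers_needed(total_load: float, target_per_worker: float,
                   min_workers: int, max_workers: int) -> int:
    """Workers required so that no worker exceeds the Target Value,
    clamped between Active Workers (min) and Max Workers."""
    if total_load <= 0:
        return min_workers
    desired = math.ceil(total_load / target_per_worker)
    return max(min_workers, min(desired, max_workers))

# 35 concurrent requests with Target Value 10 -> 4 workers
print(workers_needed(35, 10, min_workers=1, max_workers=8))   # 4
# Demand exceeds the cap -> pinned at Max Workers
print(workers_needed(500, 50, min_workers=1, max_workers=6))  # 6
# No load with Active Workers = 0 -> scale to zero
print(workers_needed(0, 10, min_workers=0, max_workers=6))    # 0
```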
Scaling Policy & Custom Metrics
The Scaling Policy manages the elasticity of your model endpoint. By defining specific metrics and thresholds, you can ensure your service handles traffic spikes while minimizing costs during idle periods.
Core Scaling Parameters
To fine-tune how and when your endpoint scales, you must configure these four key parameters:
| Parameter | Description |
|---|---|
| Metric Type | The data source used to trigger scaling (Concurrent Count, Request Rate, or Custom). |
| Target Value | The desired load handled by a single worker. When the workload exceeds this value, additional workers are automatically added. |
| Idle Timeout | Time in seconds that idle workers stay running, ready to serve new requests. You are billed for this idle time. |
| Initial Cooldown Period | Time the service waits after a scaling trigger before scaling down to zero replicas, which helps prevent rapid fluctuations in replica count. |
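As a rough illustration of how the two timers interact, the sketch below encodes one plausible scale-to-zero rule: a worker may stop only after the initial cooldown has elapsed since the last scaling trigger and the worker has been idle past the idle timeout. This is a simplification for intuition, not the platform's exact algorithm.

```python
def can_scale_to_zero(idle_seconds: float, seconds_since_trigger: float,
                      idle_timeout: float, initial_cooldown: float) -> bool:
    """Illustrative rule: scale down only after the cooldown has passed
    AND the worker has been idle longer than the idle timeout."""
    return seconds_since_trigger >= initial_cooldown and idle_seconds >= idle_timeout

print(can_scale_to_zero(idle_seconds=120, seconds_since_trigger=600,
                        idle_timeout=60, initial_cooldown=300))  # True
print(can_scale_to_zero(idle_seconds=120, seconds_since_trigger=200,
                        idle_timeout=60, initial_cooldown=300))  # False: still in cooldown
```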
Custom Metrics

Custom metrics reflect the model's internal state (such as running requests, queue size, or GPU usage) and allow scaling based on the actual workload.

Supported Custom Metrics by Runtime

1. vLLM

- `vllm:num_requests_running`: Number of requests currently being processed on the GPU.
- `vllm:num_requests_waiting`: Number of requests queued and waiting to be scheduled.
- `vllm:num_requests_swapped`: Number of requests swapped from GPU to CPU due to memory pressure.
- `vllm:gpu_cache_usage_perc`: GPU KV-cache usage (1 = 100%), useful for preventing GPU out-of-memory.
- `vllm:cpu_cache_usage_perc`: CPU KV-cache usage (1 = 100%).

2. SGLang

- `sglang:num_running_reqs`: Number of requests currently running in the decoding engine.
- `sglang:num_queue_reqs`: Number of requests waiting in the scheduler queue.
- `sglang:num_used_tokens`: Total number of tokens currently in use.
- `sglang:gen_throughput`: Token generation throughput in tokens per second.

3. Triton

- `nv_inference_pending_request_count`: Number of inference requests waiting to be executed by the backend.
- `nv_inference_exec_count`: Number of inference batch executions currently running.
- `nv_inference_request_success`: Count of successfully processed inference requests.
- `nv_inference_request_failure`: Count of failed inference requests.
- `nv_inference_count`: Total number of inferences performed, where batch size is counted individually.

4. PyTorch

- `GPUUtilization`: Percentage of GPU compute capacity currently in use.
- `GPUMemoryUtilization`: Percentage of GPU memory currently allocated.
- `GPUMemoryUsed`: Amount of GPU memory used, measured in megabytes.
- `CPUUtilization`: Percentage of CPU capacity currently in use on the host.
- `MemoryUtilization`: Percentage of system memory currently in use.
- `DiskUtilization`: Percentage of disk space currently in use on the host.
Example Configuration:
To scale a vLLM-based LLM when the number of active running requests becomes high:
- Metric Type: Custom
- Metric Name: `vllm:num_requests_running`
- Target Value: 32
This configuration scales the endpoint when a worker is actively processing around 32 requests at the same time, helping maintain stable latency as load increases.
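Runtime metrics like these are typically exposed in Prometheus text format on the serving framework's `/metrics` route. The sketch below parses a sample of that format to read `vllm:num_requests_running`; the sample text and label names are illustrative, not captured from a live endpoint.

```python
import re

# Illustrative sample imitating vLLM's Prometheus exposition format.
SAMPLE_METRICS = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="your-model"} 32.0
vllm:num_requests_waiting{model_name="your-model"} 5.0
"""

def read_metric(metrics_text: str, name: str) -> float:
    """Return the first sample value for a Prometheus metric name,
    tolerating an optional {label="..."} block after the name."""
    pattern = re.compile(
        rf"^{re.escape(name)}(?:{{[^}}]*}})?\s+([0-9.eE+-]+)\s*$",
        re.MULTILINE,
    )
    match = pattern.search(metrics_text)
    if match is None:
        raise KeyError(name)
    return float(match.group(1))

print(read_metric(SAMPLE_METRICS, "vllm:num_requests_running"))  # 32.0
```

In practice you would fetch the text from the runtime's metrics endpoint instead of a hard-coded sample.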
3. Operations
- Centralized logs: View inference logs and per-replica logs (e.g. last N lines or last N seconds) from one place to debug and audit requests.
- Deployment and lifecycle events: Track when endpoints are created, updated, started, stopped, or restarted so you know the state of your deployments.
- Monitoring: Use hardware metrics (CPU, GPU, memory) and service metrics (requests, latency, errors) to monitor health and performance and set alerts.
- Access logs: See who called your endpoint and when (including per-hour breakdowns) for compliance and usage analysis.
- Async Invocation: For long-running or batch workloads, use async inference: submit a request, get a request ID, and poll for status or results so your app doesn't block. Queues and status tracking are built in.
- Lifecycle actions: Start, stop, update, or delete endpoints via API or UI. (Committed/reserved endpoints cannot be stopped during the commitment period.)
4. Network & Security
Use Security Groups to control which IPs or CIDR ranges can reach your endpoint on specific ports (for example, only allowing traffic from your application nodes or VPC CIDR on port 8080) and block all other sources.
Use Attach VPC to give your inference endpoint a private IP address inside your E2E VPC, so it is reachable only from nodes and services on that private network (no public internet exposure).
What VPC Connect does
| Aspect | Description |
|---|---|
| Scope | Connects a TIR Inference endpoint to an E2E MyAccount VPC using a private IP from your VPC CIDR. |
| Traffic path | Requests flow from your VPC nodes → private VPC IP → inference endpoint, entirely inside E2E's network. |
| Access model | Endpoint is reachable only from nodes/services attached to the same VPC (and allowed by security groups). |
| Use cases | Secure, private access from app backends, internal microservices, and compliance-sensitive workloads. |
Prerequisites
You need the following before you can attach VPC Connect to an endpoint:
| Requirement | Where to configure | Details |
|---|---|---|
| MyAccount VPC with E2E CIDR | MyAccount → Networking → VPC | VPC must be created using E2E-provided CIDR blocks. Custom CIDR blocks are not supported for VPC Connect. |
| Node attached to the VPC | MyAccount → Compute → Nodes | At least one node/instance attached to the VPC. You will SSH into this node to call the private endpoint. |
| TIR account | TIR portal | Active TIR account with access to the Inference service. |
| Running inference endpoint | TIR → Inference → Model Endpoints | A deployed endpoint (e.g. LLM, embeddings, vision) in the same region as your VPC. |

Important: VPC Connect only works with VPCs created using E2E-provided CIDR blocks. If you created a VPC with a custom CIDR, create a new VPC using E2E's default CIDR allocation before using VPC Connect.
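A security-group inbound rule effectively asks whether a caller's source IP falls inside an allowed range. You can sanity-check a planned rule with Python's standard `ipaddress` module; the addresses and CIDR below are illustrative, matching the example IPs used in this guide.

```python
import ipaddress

def source_allowed(source_ip: str, allowed_cidr: str) -> bool:
    """Check whether a caller's IP falls inside an allowed CIDR range,
    mirroring what a security-group inbound rule evaluates."""
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(allowed_cidr)

print(source_allowed("10.8.220.6", "10.8.220.0/24"))   # True: inside the VPC CIDR
print(source_allowed("203.0.113.9", "10.8.220.0/24"))  # False: public source blocked
```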
How it works

- Create/verify VPC and node (MyAccount)
  - Create a VPC using E2E-provided CIDR blocks and attach at least one node to it.
  - This node will be used to initiate private API calls.
- Reserve and attach a VPC IP
  - In Network → Reserve IP, click Reserve from VPC and select your VPC to get a Reserved VPC IP.
  - On your inference endpoint (Inference → Model Endpoints → Network & Security → VPC), choose Attach VPC, select the reserved IP, and attach it.
- Configure security groups
  - On the endpoint's Network tab, create or select a Security Group.
  - Add an inbound rule: TCP, port 8080 (or your endpoint port), with Source set to your MyAccount VPC node IP (e.g. 10.8.220.6), not the Reserved VPC IP; alternatively, select Any Network.
  - Save the security group configuration.
- Get the Private Endpoint URL
  - Open your endpoint and go to the API Request tab.
  - Switch to the Private Endpoint view.
  - Copy the sample `curl` command that uses the private VPC IP (for example, `http://10.8.220.7:8080/v1/completions`).
- Call the endpoint from inside the VPC (MyAccount node)
  - From your machine, SSH into a node attached to the VPC:

    ```bash
    ssh root@<your-myaccount-node-ip>
    ```

  - From that SSH session, call the private endpoint URL (example):

    ```bash
    curl --location 'http://10.8.220.7:8080/v1/completions' \
    --header 'Content-Type: application/json' \
    --header "Authorization: Bearer YOUR_AUTH_TOKEN" \
    --data '{
        "model": "your-model-name",
        "prompt": "Tell me about AI",
        "max_tokens": 100
    }'
    ```

  - The exact and most up-to-date `curl` example is always available in the endpoint's API Request → Private Endpoint section.
5. Alerts
Alerts are threshold-based triggers that enable proactive resource management for your inference endpoints. When a monitored metric exceeds a defined limit, TIR automatically sends an email notification, allowing you to respond before performance or availability is affected.
For detailed configuration, see Alert Management.
6. Playground
Endpoint playground: Use the built-in playground to quickly test your deployed endpoints without writing code. You can configure generation parameters such as output length, temperature, top p, top k, and repetition penalty, and send sample prompts to observe how the model responds.
- Output length (max tokens): Controls how long the model's response can be. Higher values allow more detailed answers but consume more tokens and may be slower.
- Temperature: Controls randomness. Lower values (e.g. 0.1–0.3) make outputs more deterministic; higher values (e.g. 0.7–1.0) make them more diverse and creative.
- Top p: Enables nucleus sampling by limiting choices to the smallest set of tokens whose cumulative probability is ≥ p (e.g. 0.9). Lower values reduce randomness by shrinking this set.
- Top k: Limits each generation step to the k most likely tokens (e.g. k=50). Smaller k makes outputs more focused; larger k allows more variety.
- Repetition penalty: Penalizes tokens that have already appeared in the context to reduce repeated phrases or loops. Higher values make the model avoid repeating itself more aggressively.
The playground currently supports chat models only. To test non-chat endpoints, use your own client or direct API calls.
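To build intuition for how these controls interact, here is a toy implementation of temperature scaling plus top-k and top-p filtering over a hand-written logit table. Real runtimes apply the same idea to tensors, and the token names and logit values below are invented for illustration.

```python
import math

def filter_logits(logits: dict, temperature: float = 1.0,
                  top_k: int = 0, top_p: float = 1.0) -> dict:
    """Return the renormalized probabilities of the tokens that survive
    top-k and top-p filtering after temperature scaling."""
    # Temperature: divide logits, then softmax. Lower T sharpens the distribution.
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exp = {t: math.exp(l - m) for t, l in scaled.items()}
    z = sum(exp.values())
    probs = sorted(((t, e / z) for t, e in exp.items()),
                   key=lambda kv: kv[1], reverse=True)
    # Top-k: keep only the k most likely tokens (0 means "no limit").
    if top_k > 0:
        probs = probs[:top_k]
    # Top-p: keep the smallest head of the list whose cumulative mass >= p.
    kept, cum = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    z = sum(p for _, p in kept)
    return {tok: p / z for tok, p in kept}

toy = {"the": 2.0, "a": 1.0, "cat": 0.5, "zebra": -2.0}
print(filter_logits(toy, temperature=0.7, top_k=3, top_p=0.9))
```

With these settings, low-probability tokens like "zebra" are filtered out and the remaining mass is renormalized, which is exactly the effect you observe when tightening top p or top k in the playground.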
7. Advanced Runtime Settings
Configure how your model loads and runs at inference time. These settings are passed as engine args (framework-specific) or resource details (env variables, disk, mount path). Framework support varies (e.g. vLLM, SGLang). Configure these settings before launch, or update them on an existing endpoint.
- LLM settings: Control model loading, data type (dtype), quantization, context length, and memory allocation. Use these to optimize for speed vs. quality (e.g. FP16 vs. INT8), set max context length, and tune GPU memory usage.
- System settings: Configure the inference runtime: scheduler type, batching behavior, concurrency limits, and logging level. Tune these to balance throughput, latency, and resource utilization under load.
- Tokenizer settings: Specify tokenizer path, mode (e.g. auto vs. manual), pool size, and chat template. Essential for correct tokenization and chat formatting when using custom models or non-standard tokenizers.
- Environment variables: Pass runtime env vars such as API keys, secrets, file paths, or framework-specific options. Use for credentials, config overrides, and any values your model or container expects at startup.
8. Async Invocation
Use async invocation when you want to submit an inference request, get an immediate acknowledgment, and have the platform process the request in the background and store the result in a destination you choose. Your application does not need to keep the connection open or poll in a tight loop until the result is ready.
What you get when you enable async
- Queued processing: Requests are queued and processed in the background. You get a quick response with a request ID and response location instead of waiting for the full inference result.
- Stored results: When processing is complete, the response is written to the destination you configured (e.g. a dataset in object storage). You can then read the result from that location or use the async status API with the request ID to check status and get the result.
- Serverless worker lifecycle: A serverless worker is launched to process async invocations. The worker is active and billed only while there are requests in the queue; once all requests are processed and the queue is empty, the worker automatically scales down to zero.
When you make an async call, the API response includes:
- `request_id`: Use this to check status or locate the result (e.g. via the async status API or the response file path).
- `response_location`: Indicates where the response will be (or has been) stored once the request is complete.
Setup: enable async and configure queues
- Enable async for your endpoint (e.g. in the endpoint configuration or Async config section in the dashboard or API).
- Configure where responses go (Target/destination) and which routes use async (Application endpoint/routes), as described below.
Application endpoint: which routes are async
You specify the list of routes (API paths) for which async behavior is enabled. For example, you might enable async only for a long-running route such as /v1/chat/completions or a custom batch route.
| What to configure | Details for you |
|---|---|
| Routes | Enter the routes (paths) for which you want async invocation. Only these routes will accept async requests and return request_id and response_location instead of the full response in the same call. |
| HTTP methods | These routes must only accept POST and OPTIONS. Ensure your endpoint or API is configured so that the async routes do not accept other methods (e.g. GET) for the async flow. |
Make sure the routes you list are the ones your client will call when sending async inference requests.
Target: where async responses are stored
You choose a destination where the platform will store the response when an async request completes.
| What is supported | Details for you |
|---|---|
| Dataset (object storage / EOS) | Currently supported: a dataset in object storage (EOS). When an async request completes, the platform writes a JSON file containing the response. |
| File path | The file is created at `api/<id>/request/<request_id>.json` relative to the root of the dataset, where `<id>` is a platform-generated identifier (e.g. `is-8404`). For example, if your dataset root is `s3://my-bucket/async-results/`, the response for request ID `1fa135a4-a1cd-4ee3-ac0a-320c4a025c64` might be at `.../async-results/api/is-8404/request/1fa135a4-a1cd-4ee3-ac0a-320c4a025c64.json`. Use the `response_location` or `request_id` from the async response to construct or look up this path. |
You must have a dataset (EOS/object storage) created and linked to your project; then select it as the async target when configuring the endpoint. Ensure your application has read access to that dataset so it can fetch the result file once the request is complete.
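Putting the path convention together: given the platform-generated identifier and the `request_id` from the async response, a client can compute where the result file will appear and poll until it exists. In the sketch below, `fetch_result` is a placeholder for your object-storage or async-status client; it is assumed to return `None` until the response file is available.

```python
import time

def async_result_key(endpoint_id: str, request_id: str) -> str:
    """Object key of the stored async response, relative to the dataset root,
    following the api/<id>/request/<request_id>.json layout."""
    return f"api/{endpoint_id}/request/{request_id}.json"

print(async_result_key("is-8404", "1fa135a4-a1cd-4ee3-ac0a-320c4a025c64"))
# api/is-8404/request/1fa135a4-a1cd-4ee3-ac0a-320c4a025c64.json

def wait_for_result(fetch_result, request_id: str,
                    interval: float = 5.0, timeout: float = 600.0):
    """Poll fetch_result(request_id) until it returns a value or we time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch_result(request_id)
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError(f"async request {request_id} not finished in {timeout}s")
```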
WebSocket API for Model Endpoints
This page explains how to call model endpoints over WebSocket on the platform, for example streaming speech-to-text (e.g. Nemotron speech streaming), real-time transcription, or other bidirectional streaming use cases.
Overview
| Topic | Details |
|---|---|
| When to use WebSocket | Use WebSockets when the model or service supports streaming (e.g. send audio chunks, receive partial/final transcripts) or when you need bidirectional communication over a single connection. |
| Base URL | Find the base URL in the Overview section of the model endpoint under Endpoint URL, or go to Sample API Request and use the Root Endpoint. Example: https://infer.e2enetworks.net/project/p-5520/endpoint/is-8458. |
| WebSocket path | Append the WebSocket route to your base URL and change the base URL's protocol to wss. For example, if the WebSocket route is /ws, the final WebSocket URL will be wss://infer.e2enetworks.net/project/.../endpoint/.../ws. |
| Authentication | WebSocket connections on our platform work only with headers; send the Authorization header when connecting (e.g. Authorization: Bearer <token>). To get or generate a token: go to API Tokens in your project (sidebar), click Create Token and enter a name (or use an existing token), then copy the Auth Token from the list. See API Tokens for details. You use this same token in the Sample API Request section when testing the endpoint. |
Generic WebSocket connection
The examples below show how to form the WebSocket URL and pass the Authorization header when connecting. Implement your own send/receive logic in the indicated places.
Python

Uses the `websocket-client` library. Install: `pip install websocket-client`.

```python
import websocket

# --- Configuration ----------------------------------------------------------
# Base URL from your endpoint (endpoint detail page in the dashboard)
API_BASE_URL = "wss://<BASE_URL>"  # e.g. "wss://infer.e2enetworks.net/project/p-5520/endpoint/is-8482"
AUTH_TOKEN = "<AUTH_TOKEN>"  # From endpoint detail page -> Sample API Request -> copy the token

# WebSocket route (e.g. "/ws"); use the route your endpoint expects
WS_ROUTE = "/ws"
# ----------------------------------------------------------------------------

# WebSocket URL = base URL + WebSocket route
WS_URL = f"{API_BASE_URL}{WS_ROUTE}"

# Authentication must be sent in headers (required on this platform)
headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}

ws = websocket.WebSocket()
ws.connect(WS_URL, header=headers)

# Optional: read the initial "ready" message if your endpoint sends one
# ready_msg = ws.recv()

# --- Implement your logic here: send data ---
# e.g. ws.send(binary_or_text_payload)

# --- Implement your logic here: receive messages ---
# msg = ws.recv()
# ... process message ...

ws.close()
```
Node.js

Uses the `ws` library. Install: `npm install ws` (or `yarn add ws`). Authentication must be sent in headers when creating the WebSocket; query parameters are not used for auth.

```javascript
const WebSocket = require("ws");

// --- Configuration ----------------------------------------------------------
// Base URL from your endpoint (endpoint detail page in the dashboard)
const API_BASE_URL = "wss://<BASE_URL>"; // e.g. "wss://infer.e2enetworks.net/project/p-5520/endpoint/is-8482"
const AUTH_TOKEN = "<AUTH_TOKEN>"; // From endpoint detail page -> Sample API Request -> copy the token

// WebSocket route (e.g. "/ws"); use the route your endpoint expects
const WS_ROUTE = "/ws";
// ----------------------------------------------------------------------------

// WebSocket URL = base URL + WebSocket route
const WS_URL = `${API_BASE_URL}${WS_ROUTE}`;

// Authentication must be sent in headers (required on this platform)
const ws = new WebSocket(WS_URL, {
  headers: {
    Authorization: `Bearer ${AUTH_TOKEN}`,
  },
});

ws.on("open", function () {
  // Optional: read the initial "ready" message if your endpoint sends one
  // ws.once("message", (data) => { ... });

  // --- Implement your logic here: send data ---
  // e.g. ws.send(payload);
});

ws.on("message", function (data) {
  // --- Implement your logic here: handle incoming messages ---
});

ws.on("close", function () {
  // --- Implement your logic here: cleanup or final output ---
});
```
Flow in short

- Configure: Enable async for the endpoint, set the target (dataset), and list the routes that should be async (POST/OPTIONS only).
- Call: Send a POST request to one of those routes. The API returns quickly with `request_id` and `response_location`.
- Process: The platform queues the request and processes it in the background.
- Result: When done, the response is written to the dataset at `api/<id>/request/<request_id>.json` (where `<id>` is a platform-generated identifier). You can poll the async status API with the request ID to see when it is complete, then read the JSON file from the dataset (or use the response location) to get the result.