
Model Endpoints

Model Endpoints let you deploy your AI/ML models and serve them as ready-to-use APIs — without managing any infrastructure. Just pick a model, deploy it, and start making inference requests.

Serving Frameworks · Autoscaling · Async Invocation · OpenAI-Compatible APIs

What are Model Endpoints?

Model Endpoints are deployed services that expose your AI/ML models as inference APIs. They allow you to:

Serve models via synchronous and asynchronous API requests

Choose serving frameworks like vLLM, SGLang, Triton, or PyTorch

Use custom containers for custom runtimes

Scale automatically with configurable serverless workers and scaling policies

Monitor and manage endpoints through logs, metrics, and security controls
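As a sketch of the first point, a synchronous request to an endpoint serving an OpenAI-compatible API can be built with the standard library. The endpoint URL, model name, and API key below are placeholders, not real values; check your endpoint's details for the actual URL and authentication scheme.

```python
import json
import urllib.request

def build_chat_request(endpoint_url, api_key, model, prompt):
    """Assemble an OpenAI-compatible chat-completion request.

    All concrete values (URL, model name, key) are illustrative
    placeholders, not real endpoint details.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{endpoint_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = build_chat_request(
        "https://my-endpoint.example.com", "MY_API_KEY",
        "my-model", "Hello!",
    )
    # Sending requires a live endpoint:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```

The same request shape works with any HTTP client; only the endpoint URL and credentials change per deployment.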

Key Characteristics

Deployment: Flexible Serving Options

Choose a framework (vLLM, SGLang, Triton, PyTorch), pick a model source (Hugging Face or Model Repository), or use a custom container.

Serverless: Worker Configuration

Control Active Workers (the minimum) and Max Workers (the maximum). Autoscaling is enabled when the two values differ.

Scaling: Traffic-Based Policies

Scale by concurrent request count, request rate per second, or a custom metric for workload-specific behavior.
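A concurrency-based policy can be sketched as a small calculation: divide current load by a per-worker target, then clamp to the configured worker range. This is a hypothetical illustration of the idea, not the platform's actual scaling algorithm.

```python
import math

def desired_workers(concurrent_requests, target_per_worker,
                    min_workers, max_workers):
    """One hypothetical scale-by-concurrency policy.

    Divides current load by the per-worker target, then clamps the
    result to [min_workers, max_workers]. Note that autoscaling only
    has an effect when min_workers != max_workers.
    """
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    needed = math.ceil(concurrent_requests / target_per_worker)
    return max(min_workers, min(needed, max_workers))
```

For example, with a target of 10 concurrent requests per worker and a 1–8 worker range, 45 in-flight requests would call for 5 workers, while 200 would be capped at 8.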

Billing: Hourly or Committed

Choose Hourly Billed (can scale to zero when idle) or Committed Inference for always-on reserved capacity.

LLM Runtime: Advanced LLM Settings

Tune quantization, GPU utilization, tokenizer options, and runtime behavior for large language model deployments.

Async: Optional Async Invocation

Run asynchronous requests with queue-based execution and status tracking for long-running inference workloads.
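The queue-and-poll pattern described above can be sketched as a generic polling loop. The status names and the `get_status` callable are assumptions for illustration; the real API's response fields and status values may differ.

```python
import time

def wait_for_result(get_status, poll_interval=2.0, timeout=600.0,
                    sleep=time.sleep):
    """Poll an async invocation until it finishes or times out.

    `get_status` is a placeholder callable returning (status, result),
    where status is assumed to be one of "QUEUED", "RUNNING",
    "COMPLETED", or "FAILED". The actual endpoint API may use
    different field names and values.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status, result = get_status()
        if status == "COMPLETED":
            return result
        if status == "FAILED":
            raise RuntimeError("async invocation failed")
        sleep(poll_interval)
    raise TimeoutError("async invocation did not finish in time")
```

In practice `get_status` would issue a GET against the invocation's status URL; injecting it (and `sleep`) keeps the loop easy to test.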


Why Use Model Endpoints?

No server management: the platform manages GPUs, scaling, and availability

Scale with demand using one or many replicas with autoscaling

Pay for what you use with hourly or committed billing models

Built-in observability with logs, metrics, and access logs

Security and control with security groups and private IP (VPC) options


What You Can Do

01. Manage Your Models

  • Register or create models from storage sources
  • Validate model readiness before deployment
  • Upload or sync model artifacts for endpoint serving

02. Deploy and Run Endpoints

  • Create endpoints with framework, resource, and replica configuration
  • List and inspect endpoint status and details
  • Start, stop, restart, and update endpoint configuration

03. Scaling and Performance

  • Scale replicas manually when needed
  • Enable autoscaling with min and max worker limits

04. Security and Networking

  • Attach security groups to control endpoint access
  • Use VPC private IP integration for internal connectivity

05. Observability and Operations

  • View inference logs and per-replica logs
  • Track access logs and request activity
  • Monitor hardware and service metrics for health and latency

API Reference

REST API

Model Endpoints API Reference

Programmatically create, list, update, and delete model endpoints, and manage endpoint actions such as start, stop, and restart.

Base URL: tir.e2enetworks.com/api/v1
  • GET /teams/{Team_Id}/projects/{Project_Id}/serving/inference: List model endpoints
  • POST /teams/{Team_Id}/projects/{Project_Id}/serving/inference/: Create model endpoint
  • PUT /teams/{Team_Id}/projects/{Project_Id}/serving/inference/{inference_id}/: Update model endpoint
  • PUT /teams/{Team_Id}/projects/{Project_Id}/serving/inference/{Endpoint_Id}/: Start, stop, or restart endpoint
  • DELETE /teams/{Team_Id}/projects/{Project_Id}/serving/inference/{Endpoint_Id}/: Delete model endpoint
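As a sketch, the list call above could be constructed as follows. The bearer-token header is an assumption for illustration; consult the API reference for the exact authentication scheme and response schema.

```python
import urllib.request

BASE = "https://tir.e2enetworks.com/api/v1"

def list_endpoints_request(team_id, project_id, api_key):
    """Build a GET request for the list-model-endpoints call.

    The Authorization header format is an assumption; the API
    reference documents the actual authentication requirements.
    """
    url = f"{BASE}/teams/{team_id}/projects/{project_id}/serving/inference"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )

if __name__ == "__main__":
    req = list_endpoints_request(42, 7, "MY_API_KEY")
    # Sending requires valid credentials:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read())
```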