Model Endpoints
Model Endpoints let you deploy your AI/ML models and serve them as ready-to-use APIs — without managing any infrastructure. Just pick a model, deploy it, and start making inference requests.
Quick Start Guide
Deploy your first model endpoint and start serving requests.
Features
Explore deployment, scaling, networking, and runtime features.
Plans & Pricing
Understand endpoint plans, pricing, and billing behavior.
FAQs
Find troubleshooting help and answers for setup, deployment, scaling, and operations.
What are Model Endpoints?
Model Endpoints are deployed services that expose your AI/ML models as inference APIs. They allow you to:
Serve models via synchronous and asynchronous API requests
Choose serving frameworks like vLLM, SGLang, Triton, or PyTorch
Use custom containers for custom runtimes
Scale automatically with configurable serverless workers and scaling policies
Monitor and manage endpoints through logs, metrics, and security controls
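A synchronous inference request is an ordinary HTTP POST with a JSON body. As a minimal sketch, assuming an OpenAI-compatible completions route (which frameworks like vLLM and SGLang commonly expose) and a made-up endpoint URL:

```python
import json

# ENDPOINT_URL is a placeholder; the real URL comes from your deployed
# endpoint. The payload schema depends on the serving framework you chose.
ENDPOINT_URL = "https://example-endpoint.invalid/v1/completions"

def build_inference_request(prompt: str, max_tokens: int = 64) -> str:
    """Build a JSON body for a synchronous inference request."""
    return json.dumps({"prompt": prompt, "max_tokens": max_tokens})

body = build_inference_request("Hello, world")
# Send with any HTTP client: POST ENDPOINT_URL with
# Content-Type: application/json and `body` as the request body.
```

The same body works from curl or any HTTP library; only the URL and auth header change per deployment.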
Key Characteristics
Deployment
Flexible Serving Options
Choose a framework (vLLM, SGLang, Triton, PyTorch) and a model source (Hugging Face or the Model Repository), or bring a custom container for a custom runtime.
Serverless
Worker Configuration
Control Active Workers (minimum) and Max Workers (maximum). Autoscaling is enabled when minimum and maximum differ.
Scaling
Traffic-Based Policies
Scale by concurrent request count, request rate per second, or a custom metric for workload-specific behavior.
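Workers and scaling policy can be pictured as one small configuration. A minimal sketch, with field names (`min_workers`, `max_workers`, `policy`, `target`) that are illustrative rather than the platform's actual schema:

```python
# Illustrative scaling configuration; field names are assumptions, not the
# platform's real API schema.
scaling_config = {
    "min_workers": 1,   # Active Workers: replicas always kept warm
    "max_workers": 4,   # Max Workers: upper bound the autoscaler may reach
    "policy": "concurrent_requests",  # or "requests_per_second" / custom metric
    "target": 8,        # scale out when per-worker concurrency exceeds this
}

def autoscaling_enabled(cfg: dict) -> bool:
    # Per the docs: autoscaling is enabled only when min and max differ.
    return cfg["min_workers"] != cfg["max_workers"]
```

Setting `min_workers == max_workers` pins the endpoint at a fixed replica count instead.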
Billing
Hourly or Committed
Choose Hourly Billed (can scale to zero when idle) or Committed Inference for always-on reserved capacity.
LLM Runtime
Advanced LLM Settings
Tune quantization, GPU utilization, tokenizer options, and runtime behavior for large language model deployments.
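The LLM runtime knobs above map to a handful of settings. A sketch loosely modeled on common vLLM option names (`quantization`, `gpu_memory_utilization`); the exact names in the endpoint configuration form may differ:

```python
# Illustrative LLM runtime settings; names follow vLLM conventions but are
# assumptions about this platform's form fields.
llm_runtime = {
    "quantization": "awq",            # lower memory use at some accuracy cost
    "gpu_memory_utilization": 0.90,   # fraction of GPU memory the engine may claim
    "trust_remote_code": False,       # whether custom tokenizer/model code may run
}
```

Raising `gpu_memory_utilization` leaves more room for the KV cache; lowering it leaves headroom for other processes on the GPU.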
Async
Optional Async Invocation
Run asynchronous requests with queue-based execution and status tracking for long-running inference workloads.
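Queue-based async invocation boils down to submit-then-poll. A self-contained sketch where `submit_job` and `poll_until_done` are hypothetical stand-ins for the platform's async API, simulated in memory here:

```python
import time

# In-memory stand-in for the async job queue; a real client would POST the
# payload and then GET a status URL instead.
_jobs = {}

def submit_job(payload: dict) -> str:
    """Enqueue a job and return its ID (simulated: completes immediately)."""
    job_id = f"job-{len(_jobs) + 1}"
    _jobs[job_id] = {"status": "COMPLETED", "output": payload}
    return job_id

def poll_until_done(job_id: str, interval_s: float = 0.0) -> dict:
    """Poll job status until it reaches a terminal state."""
    while True:
        job = _jobs[job_id]
        if job["status"] in ("COMPLETED", "FAILED"):
            return job
        time.sleep(interval_s)

result = poll_until_done(submit_job({"prompt": "summarize this document"}))
```

In a real deployment the poll interval and a timeout matter; long-running inference is exactly the case this pattern exists for.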
Why Use Model Endpoints?
No server management: the platform manages GPUs, scaling, and availability
Scale with demand using one or many replicas with autoscaling
Pay for what you use with hourly or committed billing models
Built-in observability with logs, metrics, and access logs
Security and control with security groups and private IP (VPC) options
What You Can Do
Manage Your Models
- Register or create models from storage sources
- Validate model readiness before deployment
- Upload or sync model artifacts for endpoint serving
Deploy and Run Endpoints
- Create endpoints with framework, resource, and replica configuration
- List and inspect endpoint status and details
- Start, stop, restart, and update endpoint configuration
Scale and Performance
- Scale replicas manually when needed
- Enable autoscaling with min and max worker limits
Security and Networking
- Attach security groups to control endpoint access
- Use VPC private IP integration for internal connectivity
Observability and Operations
- View inference logs and per-replica logs
- Track access logs and request activity
- Monitor hardware and service metrics for health and latency
API Reference
Model Endpoints API Reference
Programmatically create, list, update, and delete model endpoints, and trigger lifecycle actions such as start, stop, and restart.
- List model endpoints: /teams/{Team_Id}/projects/{Project_Id}/serving/inference
- Create model endpoint: /teams/{Team_Id}/projects/{Project_Id}/serving/inference
- Update model endpoint: /teams/{Team_Id}/projects/{Project_Id}/serving/inference/{inference_id}
- Start, stop, or restart endpoint: /teams/{Team_Id}/projects/{Project_Id}/serving/inference/{Endpoint_Id}
- Delete model endpoint: /teams/{Team_Id}/projects/{Project_Id}/serving/inference/{Endpoint_Id}
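The management paths share one prefix, so a small helper can fill in the placeholders. The ID values below are made up for illustration:

```python
# Path template taken from the API reference; the placeholder values passed
# in ("t-123", "p-456", "inf-789") are invented examples.
BASE = "/teams/{team_id}/projects/{project_id}/serving/inference"

def endpoint_path(team_id: str, project_id: str, inference_id: str = "") -> str:
    """Build a collection path, or a single-endpoint path when an ID is given."""
    path = BASE.format(team_id=team_id, project_id=project_id)
    return f"{path}/{inference_id}" if inference_id else path

endpoint_path("t-123", "p-456")             # list / create (collection path)
endpoint_path("t-123", "p-456", "inf-789")  # update / lifecycle actions / delete
```

The collection path serves both list and create; the per-endpoint path serves update, lifecycle actions, and delete, distinguished by HTTP method.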