Model Endpoints
TIR provides two methods for deploying containers that serve model API endpoints for AI inference services.
Deployment Methods
1. Deploy Using Pre-built Containers (Provided by TIR)
Before launching a service with TIR’s pre-built containers, you must first create a TIR Model and upload the required model files. These containers automatically:
- Download model files from an EOS (E2E Object Storage) bucket.
- Initialize the containerized API server for inference.
Once the endpoint is ready, you can make synchronous API requests for inference. ➡️ Learn more in this tutorial
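As a rough illustration, a synchronous request from Python might look like the sketch below. The endpoint URL, authorization header, and payload are placeholders; the actual route and request format depend on the framework you select, so copy the real values from the endpoint's API details in the TIR dashboard.

```python
# Minimal sketch of a synchronous inference request (placeholder URL,
# token, and payload; the real values depend on your chosen framework
# and are shown on the endpoint's details page in TIR).
import requests

ENDPOINT_URL = "https://<your-endpoint-url>/predict"  # placeholder route
API_TOKEN = "<your-tir-api-token>"                     # placeholder token

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"inputs": "What is AI inference?"},          # placeholder payload
    timeout=60,
)
response.raise_for_status()
print(response.json())
```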
2. Deploy Using Your Own Container
You can also deploy using your own Docker image (public or private). Once deployed, inference requests can be made synchronously. Optionally, attach a TIR Model to automate downloading model files from EOS. ➡️ Learn more in this tutorial
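When you bring your own image, the container needs to run an HTTP server that exposes your inference route. The sketch below shows one minimal way to do this, using FastAPI purely for illustration; the route name, port, and payload shape are assumptions rather than TIR requirements, so follow the linked tutorial for the exact contract (ports, health checks, model paths) that TIR expects.

```python
# Minimal sketch of a custom inference server you might package in your
# own image (FastAPI chosen for illustration; the route, port, and payload
# shape are assumptions, not TIR requirements).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    inputs: str

@app.post("/predict")
def predict(request: PredictRequest):
    # Replace with real model loading and inference logic.
    return {"outputs": f"echo: {request.inputs}"}

# Run inside the container, for example:
#   uvicorn server:app --host 0.0.0.0 --port 8080
```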
Create Model Endpoints
Step-by-Step
- Go to Inference → Model Endpoints.
- Click CREATE ENDPOINT.
- Choose a Framework and its version. Use the search bar to find specific frameworks.
- Click Link with Model Repository and select from Model Repository.
- Or click Download from Hugging Face and select a token from HF Token Integration. If no token exists, use the Click Here link to create one: enter an Integration Name and Token, then click Create.
Plan Details
Machine Selection
Select a machine type (GPU or CPU). Choose between Committed or Hourly Billed options. Apply filters to refine results.
LLM Settings (Optional)
LLM settings control the behavior, efficiency, and performance of a large language model during inference; an illustrative sketch of typical values follows this list. These settings include:
- LLM Settings – Load Format, Data Type, Quantization, etc.
- System Settings – GPU utilization, Swap Space, and parallel workers.
- Tokenizer Settings – Tokenizer type and text-processing configurations.
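As an illustration only, the kinds of values these settings take might look like the sketch below. The field names and values are assumptions chosen for readability; refer to the form in the TIR console for the authoritative option names and ranges.

```python
# Illustrative (not authoritative) example of LLM inference settings;
# the names and values below are assumptions for readability only.
llm_settings = {
    "load_format": "auto",           # how model weights are loaded
    "dtype": "float16",              # numeric precision for inference
    "quantization": None,            # e.g. "awq", or None for full precision
}
system_settings = {
    "gpu_memory_utilization": 0.90,  # fraction of GPU memory to use
    "swap_space_gb": 4,              # CPU swap space for overflow
    "parallel_workers": 1,           # number of parallel workers
}
tokenizer_settings = {
    "tokenizer": "auto",             # tokenizer type or path
    "truncation": True,              # basic text-processing behaviour
}
```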
Serverless Configuration
Hourly Billed Inference
- Active Workers can range from 0 to the configured maximum.
- When the endpoint is idle, workers scale down to 0 automatically, so no billing occurs.
Committed Inference
- Active Workers must always be >0.
Environment Variables
Add environment variables by clicking ADD VARIABLE.
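Variables added here are exported into the serving container at startup. A short sketch of how a custom container might read one is shown below; the variable name is a hypothetical example.

```python
# Reading an environment variable inside the serving container.
# MODEL_TEMPERATURE is a hypothetical example name, not a TIR default.
import os

temperature = float(os.environ.get("MODEL_TEMPERATURE", "0.7"))
print(f"Using temperature={temperature}")
```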
Summary
Review configuration details before creation. You can edit any section.
- The Hourly Billed Inference and Committed Inference summaries show plan details and billing information. If Committed Inference is selected, confirm the commitment cycle preferences. You can change the serving model or model version later.
Model Endpoint Dashboard
Overview
Displays endpoint and plan details.
Serverless Configuration
Modify Active and Max Workers; setting Max Workers higher than Active Workers enables autoscaling.
Async Invocations
Enable asynchronous inference to allow non-blocking requests, and set up queues to collect async outputs. Async requests can be made with cURL or the OpenAI SDK.
View the status of async requests under Async Request Status. You can modify queues or disable async invocation at any time.
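For example, a request through the OpenAI Python SDK might look like the sketch below, assuming the endpoint exposes an OpenAI-compatible route (which depends on the framework you deployed). The base URL, API key, and model name are placeholders; copy the real values from your endpoint's details page, and track queued results under Async Request Status.

```python
# Minimal sketch of a request via the OpenAI Python SDK, assuming the
# endpoint exposes an OpenAI-compatible route. The base URL, API key, and
# model name are placeholders; the async queueing behaviour is handled by
# TIR once async invocations are enabled for the endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-endpoint-url>/v1",  # placeholder base URL
    api_key="<your-tir-api-token>",             # placeholder token
)

response = client.chat.completions.create(
    model="<served-model-name>",                # placeholder model name
    messages=[{"role": "user", "content": "Hello from an async client."}],
)
print(response.choices[0].message.content)
```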
Logs and Events
- Logs – Real-time logs.
- Deployment Events – Lifecycle updates.
- Request Logs – Request-level logs.
Monitoring
Switch between Hardware and Service metrics:
- GPU Utilization, Memory, Temperature
- CPU Utilization, Memory
Network & Security
Attach or update Security Groups to control inbound/outbound access.
Actions
- Stop Endpoint – Pause service.
- Start Endpoint – Resume service.
- Update Endpoint – Modify configurations.
- Delete Endpoint – Permanently remove service.
Search Model Endpoints
Search by name or use advanced filters.
Your Model Endpoints are now fully manageable through the TIR Dashboard with real-time monitoring, scaling, async processing, and version control.