Model Endpoints

TIR provides two methods for deploying containers that serve model API endpoints for AI inference services.

Deployment Methods

1. Deploy Using Pre-built Containers (Provided by TIR)

Before launching a service with TIR’s pre-built containers, you must first create a TIR Model and upload the required model files. These containers automatically:

  • Download model files from an EOS (E2E Object Storage) bucket.
  • Initialize the containerized API server for inference.

Once the endpoint is ready, you can make synchronous API requests for inference. ➡️ Learn more in this tutorial
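
As a rough illustration, a synchronous request might look like the sketch below. The endpoint URL, authorization header, and payload shape are placeholders; the actual values depend on the chosen framework and are shown on the endpoint's dashboard.

```python
import requests

# Placeholder values: copy the real URL and token from your endpoint's dashboard.
ENDPOINT_URL = "https://<your-endpoint-url>/v1/predict"
API_TOKEN = "<your-tir-api-token>"

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"inputs": "What is the capital of France?"},  # payload shape varies by framework
    timeout=60,
)
response.raise_for_status()
print(response.json())
```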

2. Deploy Using Your Own Container

You can also deploy using your own Docker image (public or private). Once deployed, inference requests can be made synchronously. Optionally, attach a TIR Model to automate downloading model files from EOS. ➡️ Learn more in this tutorial
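
For reference, the sketch below shows the kind of API server such a container might run. The framework (FastAPI), route name, and port are illustrative assumptions rather than TIR requirements; any HTTP server exposing an inference route can play the same role.

```python
# Minimal sketch of an inference server packaged in a custom container.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    inputs: str

@app.post("/predict")  # hypothetical route, not a TIR requirement
def predict(req: PredictRequest):
    # Replace with real model loading and inference logic.
    return {"outputs": req.inputs.upper()}

# Inside the container, run e.g.:
#   uvicorn server:app --host 0.0.0.0 --port 8080
```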

Create Model Endpoints

Step-by-Step

  1. Go to Inference → Model Endpoints.
  2. Click CREATE ENDPOINT.
  3. Choose a Framework and its version. Use the search bar to find specific frameworks.
  4. Click Link with Model Repository and select a model from the Model Repository.
  5. Alternatively, click Download from Hugging Face and select a token from HF Token Integration. If no token exists, click Click Here to create a new one, enter an Integration Name and Token, then click Create.

Plan Details

Machine Selection

Select a machine type (GPU or CPU). Choose between Committed or Hourly Billed options. Apply filters to refine results.

LLM Settings (Optional)

LLM settings define the behavior, efficiency, and performance of a large language model during inference. They fall into three groups; an illustrative sketch of typical values follows the list:

  • LLM Settings – Load Format, Data Type, Quantization, etc.
  • System Settings – GPU utilization, Swap Space, and parallel workers.
  • Tokenizer Settings – Tokenizer type and text-processing configurations.
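
The exact options depend on the selected framework and appear in the TIR console; the sketch below only illustrates the kinds of values these settings take. The field names and values are illustrative, not TIR's exact parameter names.

```python
# Illustrative values only; consult the console for the options your framework exposes.
llm_settings = {
    "load_format": "safetensors",    # LLM Settings: Load Format
    "dtype": "float16",              # LLM Settings: Data Type
    "quantization": "awq",           # LLM Settings: Quantization
    "gpu_memory_utilization": 0.90,  # System Settings: GPU utilization
    "swap_space_gb": 4,              # System Settings: Swap Space
    "num_parallel_workers": 1,       # System Settings: parallel workers
    "tokenizer_mode": "auto",        # Tokenizer Settings: Tokenizer type
}
```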

Serverless Configuration

Hourly Billed Inference

  • Active Workers can range from 0 up to the configured maximum.
  • When the endpoint is idle, workers automatically scale down to 0 and no billing occurs.

Committed Inference

  • Active Workers must always be >0.

Environment Variables

Add environment variables by clicking ADD VARIABLE.
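
Variables added here are typically exposed to the serving container at runtime. A minimal sketch of reading one from application code follows; the variable name is an illustrative assumption, not a TIR-defined variable.

```python
import os

# "MODEL_PRECISION" is a hypothetical variable added via ADD VARIABLE.
precision = os.environ.get("MODEL_PRECISION", "float16")
print(f"Serving with precision: {precision}")
```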

Summary

Review configuration details before creation. You can edit any section.

  • Hourly Billed Inference and Committed Inference summaries display plan details and billing information. If Committed Inference is selected, confirm your commitment cycle preferences. You can later change the serving model or model version.

Model Endpoint Dashboard

Overview

Displays endpoint and plan details.

Serverless Configuration

Modify Active Workers and Max Workers. Setting Max Workers higher than Active Workers enables autoscaling.

Async Invocations

Enable asynchronous inference to allow non-blocking requests. You can set up queues to collect async outputs. Async requests can be made using cURL or the OpenAI SDK.

View async request status under Async Request Status. Modify queues or disable async invocation anytime.
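
The exact request path and headers for your endpoint are shown in the console; as a rough sketch, an OpenAI-SDK-style call against a placeholder endpoint URL looks like this (base URL, token, and model name are placeholders, not real values):

```python
from openai import OpenAI

# Placeholder values: use the URL, token, and model name from your endpoint's dashboard.
client = OpenAI(
    base_url="https://<your-endpoint-url>/v1",
    api_key="<your-tir-api-token>",
)

response = client.chat.completions.create(
    model="<deployed-model-name>",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in one line."}],
)
print(response.choices[0].message.content)
```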

Logs and Events

  • Logs – Real-time logs.
  • Deployment Events – Lifecycle updates.
  • Request Logs – Request-level logs.

Monitoring

Switch between Hardware and Service metrics. Hardware metrics include:

  • GPU Utilization, Memory, Temperature
  • CPU Utilization, Memory

Network & Security

Attach or update Security Groups to control inbound/outbound access.

Actions

  • Stop Endpoint – Pause service.
  • Start Endpoint – Resume service.
  • Update Endpoint – Modify configurations.
  • Delete Endpoint – Permanently remove service.

Search Model Endpoints

Search by name or use advanced filters.

Your Model Endpoints are now fully manageable through the TIR Dashboard with real-time monitoring, scaling, async processing, and version control.