Model Endpoints

TIR makes it easy to deploy containers that serve a model API.

TIR offers two methods to create an inference service (API endpoint) for your AI model:

  • Deploy using pre-built (TIR provided) Containers

    Before you launch a service with pre-built containers, you must first create a TIR Model and upload model files to it. The pre-built containers are designed to auto-download the model files from an EOS (E2E Object Storage) bucket and launch the API server with them. Once an endpoint is ready, you can send synchronous requests to the endpoint for inference.
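Once the endpoint is up, a synchronous inference call is a plain HTTP POST. The sketch below shows what such a request could look like from Python; the endpoint URL, bearer-token auth, and the `inputs` payload shape are all assumptions for illustration — substitute the values and request schema from your own endpoint's details page.

```python
import json
import urllib.request


def infer(endpoint_url, payload, token=None):
    """Send a synchronous inference request and return the parsed JSON response."""
    data = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        # Bearer auth is an assumption; use whatever scheme your endpoint requires.
        headers["Authorization"] = f"Bearer {token}"
    req = urllib.request.Request(endpoint_url, data=data, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical endpoint URL and payload shape; substitute your own.
    result = infer(
        "https://infer.example.com/v1/models/my-model:predict",
        {"inputs": [[1.0, 2.0, 3.0]]},
        token="YOUR_API_TOKEN",
    )
    print(result)
```

Because the call blocks until the model responds, this pattern suits quick predictions; long-running jobs are better served by batching or queueing.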

  • Deploy using your own container

    You can provide a public or private Docker image and launch an inference service with it. Once the endpoint is ready, you can make synchronous requests to the endpoint for inference. You may also choose to attach a TIR Model to your service to automate the download of model files from an EOS bucket to the container.

Pre-built Containers

TIR provides Docker container images that you can run as pre-built containers. These containers provide inference servers (HTTP) that can serve inference requests with minimal configuration. They can also connect to E2E Object Storage and download model files into the container at startup.
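The startup download step amounts to fetching the model files from object storage into a local path before the server boots. A minimal sketch of that idea, assuming a presigned object URL (the URL and destination path below are hypothetical; the actual pre-built containers authenticate against your EOS bucket directly):

```python
import os
import urllib.request


def download_model(url, dest_path):
    """Download a model file to dest_path, creating parent directories as needed."""
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    # Fetch the object and write it to the local destination.
    urllib.request.urlretrieve(url, dest_path)
    return dest_path


if __name__ == "__main__":
    # Hypothetical presigned EOS object URL and mount path; substitute your own.
    download_model(
        "https://objectstore.example.com/my-bucket/model.bin?X-Amz-Signature=abc",
        "/mnt/models/model.bin",
    )
```

In practice the pre-built containers handle this for you; the sketch only illustrates why a TIR Model (and its bucket) must exist before the endpoint launches.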

This section lists deployment guides for all the integrated frameworks that TIR supports.


TorchServe

Go through this complete guide for deploying a TorchServe service.


Triton

Go through this detailed guide to deploy a Triton service.


LLAMA v2

Go through this tutorial to deploy LLAMA v2.


CodeLlama

Go through this tutorial to deploy a CodeLlama service.

Stable Diffusion

Go through this tutorial to deploy Stable Diffusion Inference Service.


Gemma

Go through this tutorial to deploy a Gemma Inference Service.


LLAMA 3 8B-IT

Go through this tutorial to deploy a LLAMA 3 8B-IT Inference Service.


Custom Containers

Go through this detailed tutorial on building a custom image for model deployments.
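At its core, a custom image only needs to run an HTTP server that answers inference requests. A minimal sketch using Python's standard library is below; the `/healthz` route, the `inputs`/`outputs` payload shape, and port 8080 are assumptions for illustration — the tutorial covers the route and port contract TIR actually expects.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Health-check route (path name is an assumption; match your platform's contract).
        if self.path == "/healthz":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def do_POST(self):
        # Toy "model": sum the input numbers. Replace with real model inference.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"outputs": sum(payload.get("inputs", []))}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 8080 is an assumption; use the port your endpoint configuration expects.
    HTTPServer(("0.0.0.0", 8080), InferenceHandler).serve_forever()
```

Packaging this script into a Docker image with your model weights (or an attached TIR Model) is essentially what the tutorial walks through in full.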

Create Model Endpoints

  • To create a Model Endpoint, click on Model Endpoints under Inference.

  • Click on the CREATE ENDPOINT button.

  • Choose a framework.

  • Configure the model download, then click on Next.


Resource Details

  • Machine

    Here you can select a machine type, either GPU or CPU.

  • Replicas

    The number of replicas you specify in the input field below will always be kept ready, and you will be charged continuously for them.

  • Enable Autoscaling

    Here you have the option to enable autoscaling.


Endpoint Details

  • Endpoint Name

    Here you define the name of your endpoint, then click on the Next button.


Environment Variables

  • Add Variable

    You can add a variable by clicking the ADD VARIABLE button.


    After adding Variables


    To delete a variable, click on the delete icon.

  • Click on the FINISH button.

  • Click on the CREATE button.
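Variables added this way are injected into the container's environment, where your serving code can read them at startup. A small sketch (the variable names `MODEL_NAME` and `LOG_LEVEL` are hypothetical examples, not variables TIR sets for you):

```python
import os

# MODEL_NAME and LOG_LEVEL are hypothetical names added via ADD VARIABLE;
# inside the container they appear as ordinary environment variables.
model_name = os.environ.get("MODEL_NAME", "default-model")
log_level = os.environ.get("LOG_LEVEL", "INFO")
print(f"serving {model_name} with log level {log_level}")
```

Providing defaults with `os.environ.get` keeps the container usable even when a variable is left unset.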


After successfully creating model endpoints, you will see the following screen.

Overview: In the Overview tab, you can see the Endpoint Details and Plan Details.


Logs: In the Logs tab, you can view the logs generated by your endpoint.


Monitoring: In the Monitoring tab, you can choose between Hardware and Service as metric types. Under Hardware, you'll find metrics such as GPU Utilization, GPU Memory Utilization, GPU Temperature, GPU Power Usage, CPU Utilization, and Memory Usage.


Auto Scaling: In the Auto Scaling tab, you can view the current number of replicas and the desired replicas, increase the number of desired replicas, or disable auto-scaling.


Replica Management: In the Replica Management tab, you can manage the replicas. You'll see the current replicas, and you can delete one by clicking the delete icon.