Inference Tutorials

Custom Containers

Go through this detailed tutorial to build custom images for model deployments.

Pre-built Containers

TIR provides Docker container images that you can run as pre-built containers. These containers run inference servers (HTTP) that can serve inference requests with minimal configuration. They can also connect to E2E Object Storage and download models onto the container at startup.
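For illustration only, here is a minimal sketch of sending an inference request to a deployed endpoint. The endpoint URL, token, request path, and payload shape below are placeholders, not TIR specifics; the exact format depends on the framework you deploy (for example, TorchServe serves predictions under /predictions/<model-name>, while Triton follows the KServe v2 protocol under /v2/models/<model-name>/infer).

```python
# Minimal sketch of calling a deployed inference endpoint over HTTP.
# The URL, auth token, and payload shape are placeholders -- the exact
# values depend on the framework (TorchServe, Triton, etc.) and your endpoint.
import requests

ENDPOINT_URL = "https://<your-tir-endpoint>/predictions/<model-name>"  # placeholder
API_TOKEN = "<your-api-token>"  # placeholder

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"inputs": "sample input"},  # payload format is framework-specific
    timeout=30,
)
response.raise_for_status()
print(response.json())
```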

This section lists deployment guides for all the integrated frameworks that TIR supports.

TorchServe

Go through this complete guide for deploying a TorchServe service.

NVIDIA Triton

Go through this detailed guide to deploy a Triton service.

LLAMA2

Go through the LLAMA2 tutorial to deploy LLAMA v2.

CodeLlama

Go through the CodeLlama tutorial to deploy the CodeLlama Inference Service.

Stable Diffusion

Go through the Stable Diffusion tutorial to deploy the Stable Diffusion Inference Service.

Stable Video Diffusion XT

Go through this tutorial to deploy the Stable Video Diffusion XT Inference Service.

Gemma

Go through the Gemma tutorial to deploy the Gemma Inference Service.

LLAMA 3 8B-IT

Go through the LLAMA 3 8B-IT tutorial to deploy the LLAMA 3 8B-IT Inference Service.

vLLM

Go through this tutorial to deploy the LLAMA 3 8B-IT Inference Service using vLLM.

YOLOv8

Go through this tutorial to deploy the YOLOv8 Inference Service.

MPT-7B-CHAT

Go through this tutorial to deploy the MPT-7B-CHAT Inference Service.

LLAMA 3 Inference Using TensorRT-LLM

Go through this tutorial to deploy LLAMA 3 inference using TensorRT-LLM.

Natural Language Queries to SQL with Code-Llama

Go through this tutorial to generate SQL queries from natural language prompts using Code-Llama.

vLLM with OpenAI Client

vLLM provides an HTTP server that implements OpenAI’s Completions and Chat APIs. Go through this tutorial to learn more.
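As a minimal sketch, the official openai Python client can be pointed at a vLLM server's OpenAI-compatible /v1 endpoint. The base URL, API key, and model name below are placeholders for your own deployment.

```python
# Minimal sketch: using the OpenAI Python client against a vLLM server.
# The base_url, api_key, and model name are placeholders for your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-vllm-endpoint>/v1",  # vLLM exposes an OpenAI-compatible /v1 API
    api_key="<your-api-token>",                  # token expected by your endpoint, if any
)

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder: whichever model vLLM is serving
    messages=[{"role": "user", "content": "Write a one-line haiku about GPUs."}],
)
print(completion.choices[0].message.content)
```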