Inference Tutorials
Custom Containers
Go through this detailed tutorial to build custom images for model deployments.
Pre-built Containers
TIR provides Docker container images that you can run as pre-built containers. These containers run HTTP inference servers that can serve inference requests with minimal configuration. They can also connect to E2E Object Storage and download models into the container at startup. A minimal request example is sketched below.
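For example, once a pre-built container endpoint is running, you can send it an HTTP inference request. The sketch below uses Python with the requests library; the endpoint URL, authorization header, and payload schema are hypothetical placeholders and will vary with the framework (TorchServe, Triton, etc.) and the details shown for your endpoint in the TIR dashboard.

```python
# Minimal sketch: calling a pre-built container's HTTP inference endpoint.
# The endpoint URL, token, and payload schema below are placeholders; the
# actual request format depends on the framework (e.g. TorchServe vs. Triton)
# and on the endpoint details shown in the TIR dashboard.
import requests

ENDPOINT_URL = "https://<your-endpoint>/predict"  # placeholder
API_TOKEN = "<your-api-token>"                    # placeholder

payload = {"inputs": ["example input"]}           # placeholder payload schema

response = requests.post(
    ENDPOINT_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```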
This section lists deployment guides for all the integrated frameworks that TIR supports.
TorchServe
Go through this complete guide for deploying a TorchServe service.
NVIDIA Triton
Go through this detailed guide to deploy a Triton service.
LLAMA2
Go through the LLAMA2 tutorial to deploy a LLAMA 2 Inference Service.
CodeLlama
Go through the CodeLlama tutorial to deploy a CodeLlama Inference Service.
Stable Diffusion
Go through the Stable Diffusion tutorial to deploy a Stable Diffusion Inference Service.
Stable Video Diffusion XT
Go through this tutorial to deploy a Stable Video Diffusion XT Inference Service.
Gemma
Go through the Gemma tutorial to deploy a Gemma Inference Service.
LLAMA 3 8B-IT
Go through the LLAMA 3 8B-IT tutorial to deploy a LLAMA 3 8B-IT Inference Service.
vLLM
Go through this tutorial to deploy a LLAMA 3 8B-IT Inference Service using vLLM.
YOLOv8
Go through this tutorial to deploy a YOLOv8 Inference Service.
MPT-7B-CHAT
Go through this tutorial to deploy an MPT-7B-CHAT Inference Service.
LLAMA 3 Inference Using TensorRT-LLM
Go through this tutorial to deploy LLAMA 3 inference using TensorRT-LLM.
Natural Language Queries to SQL with Code-Llama
Go through this tutorial to generate SQL queries from natural language queries using Code-Llama.
vLLM with OpenAI Client
vLLM provides an HTTP server that implements OpenAI’s Completions and Chat APIs. Go through this tutorial to learn more; a minimal client example is sketched below.
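As a minimal sketch, the standard openai Python client can be pointed at a vLLM deployment by overriding its base URL. The base_url, api_key, and model name below are placeholders; substitute the endpoint URL and the model actually served by your deployment.

```python
# Minimal sketch: querying a vLLM OpenAI-compatible server with the official
# openai Python client. The base_url, api_key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder: your vLLM endpoint URL
    api_key="EMPTY",                      # vLLM ignores the key unless one is configured
)

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder: the model your server hosts
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(completion.choices[0].message.content)
```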