
Launching Llama 3 Inference Using TensorRT-LLM

Deploy the Llama 3 8B Instruct model efficiently using NVIDIA’s TensorRT-LLM backend on Triton Inference Server. This guide walks you through every step — from building the engine to deploying and testing the model.


Overview

This tutorial covers:

  1. Preparing the environment and model access
  2. Building a TensorRT engine
  3. Configuring the inference backend
  4. Uploading the model repository
  5. Creating an endpoint for inference
  6. Making test API calls

You’ll need:

  • A GPU-enabled compute instance
  • Access to the Meta Llama 3 model
  • A Hugging Face access token

Step 1: Get Model Access

  1. Log in or sign up on Hugging Face.
  2. Accept the Meta Llama 3 Community License on the model card.
  3. Create a read token under Hugging Face Settings → Access Tokens.
  4. Save the token for later; you can verify it with the quick check below.
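
If you want to confirm the token works before moving on, a minimal check with the Hugging Face CLI is enough (assuming the huggingface_hub CLI is installed; Step 3 installs it as well):

pip install -U "huggingface_hub[cli]"    # skip if already installed
huggingface-cli login --token <your-hf-token>
huggingface-cli whoami    # should print your Hugging Face username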

Step 2: Create GPU Node

Launch a node with the TensorRT-LLM environment:

  • Recommended image: TensorRT-LLM Engine Builder v0.10.0
  • Choose a GPU plan (A100 preferred)
  • Set disk size ≥ 100 GB
  • Launch Node → Open JupyterLab terminal once ready

Tip: Run all commands in the JupyterLab terminal.
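
Once the terminal is open, it is worth confirming the GPU is visible and that the image ships the expected TensorRT-LLM version (a quick sanity check, assuming the recommended image above):

nvidia-smi    # the A100 (or whichever GPU you picked) should be listed
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"    # expect 0.10.0 to match the image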


Step 3: Download Model from Hugging Face

mkdir -p $CWD/model
pip install -U "huggingface_hub[cli]"
export HF_TOKEN=<your-hf-token>
export HF_HOME=$CWD
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
--local-dir=$CWD/model --local-dir-use-symlinks=False
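
The Instruct weights are roughly 16 GB in bfloat16, so the download can take a few minutes. A quick listing confirms the files landed where the later steps expect them:

ls -lh $CWD/model    # expect *.safetensors shards plus config.json and the tokenizer files
du -sh $CWD/model    # roughly 15-16 GB for the 8B Instruct checkpoint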

Step 4: Prepare Directories and Variables

export MODEL_DIR=$CWD/model
export UNIFIED_CKPT=$CWD/unified_ckpt
export ENGINE_DIR=$CWD/engine_dir
export TOKENIZER_DIR=$CWD/tokenizer_dir
export MODEL_REPO=$CWD/model_repository

mkdir -p $UNIFIED_CKPT $ENGINE_DIR $TOKENIZER_DIR $MODEL_REPO

Step 5: Convert Model to Unified Checkpoint

pip install -r /app/tensorrt_llm/examples/llama/requirements.txt --no-cache-dir
python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
--model_dir ${MODEL_DIR} --output_dir ${UNIFIED_CKPT} --dtype float16
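
The command above produces a single-GPU (TP=1) checkpoint. If you plan to shard the model across multiple GPUs, the upstream Llama conversion script accepts a tensor-parallel size; the flag below reflects that script, so confirm it with --help on your image before relying on it:

# Example only: 2-way tensor parallelism (the engine must then be built and served on 2 GPUs)
python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
--model_dir ${MODEL_DIR} --output_dir ${UNIFIED_CKPT} --dtype float16 --tp_size 2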

Step 6: Build TensorRT Engine

trtllm-build --checkpoint_dir ${UNIFIED_CKPT} \
--remove_input_padding enable --gpt_attention_plugin float16 \
--context_fmha enable --gemm_plugin float16 \
--output_dir ${ENGINE_DIR} --paged_kv_cache enable \
--max_batch_size 64
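
The build takes several minutes for the 8B model. When it finishes, the output directory should contain the serialized engine plus its build configuration; checking now saves a failed deployment later:

ls -lh ${ENGINE_DIR}    # expect rank0.engine and config.json for a single-GPU build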

Step 7: Copy Tokenizer Files

cp $MODEL_DIR/{tokenizer.json,tokenizer_config.json,special_tokens_map.json,config.json} $TOKENIZER_DIR

Step 8: (Optional) Test Locally

Run inference directly to verify the engine:

python /app/tensorrt_llm/examples/run.py --max_output_len 500 \
--tokenizer_dir ${TOKENIZER_DIR} --engine_dir ${ENGINE_DIR} \
--input_text "Explain quantum computing in simple terms."

Step 9: Configure Backend Repository

Clone the backend repository and set up the inflight batcher model templates:

git clone https://github.com/triton-inference-server/tensorrtllm_backend.git $CWD/tensorrtllm_backend
cd $CWD/tensorrtllm_backend
git checkout v0.10.0
cp -r all_models/inflight_batcher_llm/* $MODEL_REPO

Update configs:

export MOUNT_PATH=/mnt/models
export ENGINE_PATH=$MOUNT_PATH/tensorrt_llm/1/engine
export TOKENIZER_PATH=$MOUNT_PATH/tensorrt_llm/1/tokenizer_dir

python tools/fill_template.py -i ${MODEL_REPO}/tensorrt_llm/config.pbtxt \
triton_backend:tensorrtllm,engine_dir:${ENGINE_PATH},tokenizer_dir:${TOKENIZER_PATH},max_batch_size:64
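
The config above points ENGINE_PATH and TOKENIZER_PATH at locations inside the mounted model repository, so the built engine and the tokenizer files have to sit inside MODEL_REPO before it is uploaded. A minimal sketch, assuming the repository is mounted at /mnt/models at serving time and that your platform uses the upstream v0.10.0 inflight_batcher_llm templates (adjust the parameter names if your templates differ):

# Place the engine and tokenizer where the filled config expects them at runtime
mkdir -p ${MODEL_REPO}/tensorrt_llm/1
cp -r ${ENGINE_DIR} ${MODEL_REPO}/tensorrt_llm/1/engine
cp -r ${TOKENIZER_DIR} ${MODEL_REPO}/tensorrt_llm/1/tokenizer_dir

# Fill the remaining templates (parameter names follow the upstream templates; verify against your copy)
python tools/fill_template.py -i ${MODEL_REPO}/preprocessing/config.pbtxt \
tokenizer_dir:${TOKENIZER_PATH},triton_max_batch_size:64,preprocessing_instance_count:1
python tools/fill_template.py -i ${MODEL_REPO}/postprocessing/config.pbtxt \
tokenizer_dir:${TOKENIZER_PATH},triton_max_batch_size:64,postprocessing_instance_count:1
python tools/fill_template.py -i ${MODEL_REPO}/ensemble/config.pbtxt triton_max_batch_size:64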

Step 10: Upload Model Repository

mc cp -r ${MODEL_REPO}/* my-llama3-model-repo/
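
Here mc is the MinIO client, and my-llama3-model-repo is assumed to be an alias already pointing at your model-repository bucket. If the client has not been configured on this node yet, set the alias first; the endpoint URL and credentials below are placeholders for your own object-storage values:

mc alias set my-llama3-model-repo <object-storage-endpoint> <access-key> <secret-key>
mc ls my-llama3-model-repo    # sanity check: the bucket should be reachable before copying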

Step 11: Create Inference Endpoint

  1. Create Model Endpoint
  2. Choose TensorRT-LLM Framework
  3. Select your model repository
  4. Choose runtime version v0.10.0 (it must match the TensorRT-LLM version used to build the engine)
  5. Pick GPU plan and launch

The endpoint will enter the Running state once engine initialization completes.


Step 12: Test the Endpoint

Health Check

curl -H "Authorization: Bearer <auth-token>" \
<endpoint-url>/v2/health/ready

Inference

curl -X POST <endpoint-url>/v2/models/ensemble/generate \
-H "Authorization: Bearer <auth-token>" \
-H "Content-Type: application/json" \
-d '{
"text_input": "Explain AI alignment.",
"max_tokens": 500,
"temperature": 0.7
}'

Example Response

{
  "model_name": "ensemble",
  "sequence_id": 0,
  "text_output": "AI alignment refers to designing systems whose goals match human values and intentions."
}
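
If you only want the generated text during quick smoke tests, the same call can be piped through jq (assuming jq is installed on the machine you are testing from):

curl -s -X POST <endpoint-url>/v2/models/ensemble/generate \
-H "Authorization: Bearer <auth-token>" \
-H "Content-Type: application/json" \
-d '{"text_input": "Explain AI alignment.", "max_tokens": 100, "temperature": 0.7}' \
| jq -r .text_output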

Notes & Tips

  • ✅ Use an A100 or H100 GPU for best performance
  • ✅ Upload the model repository to object storage for fast startup
  • ✅ Check the endpoint logs to confirm the engine loaded successfully
  • ✅ Use a smaller max_tokens value for quick tests

Llama 3 inference using TensorRT-LLM is now live and ready for production deployment!