Launching Llama 3 Inference Using TensorRT-LLM
Deploy the Llama 3 8B Instruct model efficiently using NVIDIA’s TensorRT-LLM backend on Triton Inference Server. This guide walks you through every step — from building the engine to deploying and testing the model.
Overview
This tutorial covers:
- Preparing the environment and model access
- Building a TensorRT engine
- Configuring the inference backend
- Uploading the model repository to object storage
- Creating an endpoint for inference
- Making test API calls
You’ll need:
- A GPU-enabled compute instance
- Access to the Meta Llama 3 model
- A Hugging Face access token
Step 1: Get Model Access
- Log in or sign up on Hugging Face.
- Accept the Meta Llama 3 Community License on the model card.
- Create a read token under Hugging Face Settings → Access Tokens.
- Save the token for later (a quick way to verify it is shown below).
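If you want to confirm the token works before provisioning anything, the Hugging Face CLI (installed in Step 3 if you don't have it locally) can check it:
huggingface-cli login --token <your-hf-token>
huggingface-cli whoami   # should print your Hugging Face username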
Step 2: Create GPU Node
Launch a node with TensorRT-LLM environment:
# Recommended image
TensorRT-LLM Engine Builder v0.10.0
- Choose GPU plan (A100 preferred)
- Set Disk size ≥ 100 GB
- Launch Node → Open JupyterLab terminal once ready
Tip: Run all commands in the JupyterLab terminal.
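Before moving on, it is worth confirming that the GPU and the TensorRT-LLM tooling are visible from the terminal (assuming the recommended Engine Builder image):
nvidia-smi                       # GPU model and driver should be listed
trtllm-build --help | head -n 5  # sanity check that the build tool is on the PATH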
Step 3: Download Model from Hugging Face
# Define CWD if your environment does not already set it
export CWD=${CWD:-$(pwd)}
mkdir -p $CWD/model
pip install -U "huggingface_hub[cli]"
export HF_TOKEN=<your-hf-token>
export HF_HOME=$CWD
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
  --local-dir=$CWD/model --local-dir-use-symlinks=False
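The download contains the safetensors weight shards, tokenizer files, and config. A quick listing confirms everything landed where the later steps expect it:
ls -lh $CWD/model
du -sh $CWD/model   # the fp16 weights alone are roughly 15 GB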
Step 4: Prepare Directories and Variables
export MODEL_DIR=$CWD/model
export UNIFIED_CKPT=$CWD/unified_ckpt
export ENGINE_DIR=$CWD/engine_dir
export TOKENIZER_DIR=$CWD/tokenizer_dir
export MODEL_REPO=$CWD/model_repository
mkdir -p $UNIFIED_CKPT $ENGINE_DIR $TOKENIZER_DIR $MODEL_REPO
Step 5: Convert Model to Unified Checkpoint
pip install -r /app/tensorrt_llm/examples/llama/requirements.txt --no-cache-dir
python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
--model_dir ${MODEL_DIR} --output_dir ${UNIFIED_CKPT} --dtype float16
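If the conversion succeeds, the output directory should contain a config.json plus per-rank weight files (typically rank0.safetensors for a single-GPU build). Listing it is a cheap check before the longer engine build:
ls -lh ${UNIFIED_CKPT}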
Step 6: Build TensorRT Engine
trtllm-build --checkpoint_dir ${UNIFIED_CKPT} \
--remove_input_padding enable --gpt_attention_plugin float16 \
--context_fmha enable --gemm_plugin float16 \
--output_dir ${ENGINE_DIR} --paged_kv_cache enable \
--max_batch_size 64
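The build can take several minutes. When it finishes, the engine directory should contain the serialized engine (typically rank0.engine for a single-GPU build) alongside its config.json:
ls -lh ${ENGINE_DIR}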
Step 7: Copy Tokenizer Files
cp $MODEL_DIR/{tokenizer.json,tokenizer_config.json,special_tokens_map.json,config.json} $TOKENIZER_DIR
Step 8: (Optional) Test Locally
Run inference directly to verify the engine:
python /app/tensorrt_llm/examples/run.py --max_output_len 500 \
--tokenizer_dir ${TOKENIZER_DIR} --engine_dir ${ENGINE_DIR} \
--input_text "Explain quantum computing in simple terms."
Step 9: Configure Backend Repository
Clone the backend repository and copy the inflight batcher model templates:
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git $CWD/tensorrtllm_backend
cd $CWD/tensorrtllm_backend
git checkout v0.10.0
cp -r all_models/inflight_batcher_llm/* $MODEL_REPO
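After the copy, the repository should contain the standard inflight-batcher models, each with a templated config.pbtxt: ensemble, preprocessing, postprocessing, tensorrt_llm, and tensorrt_llm_bls.
ls ${MODEL_REPO}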
Set the mount paths and fill in the tensorrt_llm config template:
export MOUNT_PATH=/mnt/models
export ENGINE_PATH=$MOUNT_PATH/tensorrt_llm/1/engine
export TOKENIZER_PATH=$MOUNT_PATH/tensorrt_llm/1/tokenizer_dir
python tools/fill_template.py -i ${MODEL_REPO}/tensorrt_llm/config.pbtxt \
triton_backend:tensorrtllm,engine_dir:${ENGINE_PATH},tokenizer_dir:${TOKENIZER_PATH},max_batch_size:64
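The paths above assume the engine and tokenizer are staged inside the repository under tensorrt_llm/1/. If your platform does not place them there for you, copy them in before uploading (a sketch that follows the layout implied by ENGINE_PATH and TOKENIZER_PATH):
mkdir -p ${MODEL_REPO}/tensorrt_llm/1/engine ${MODEL_REPO}/tensorrt_llm/1/tokenizer_dir
cp -r ${ENGINE_DIR}/* ${MODEL_REPO}/tensorrt_llm/1/engine/
cp -r ${TOKENIZER_DIR}/* ${MODEL_REPO}/tensorrt_llm/1/tokenizer_dir/
The preprocessing and postprocessing models also reference a tokenizer path through their own config.pbtxt templates; check their placeholders if your platform does not fill them for you.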
Step 10: Upload Model Repository
mc cp -r ${MODEL_REPO}/* my-llama3-model-repo/
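A recursive listing is a quick way to confirm the upload matches the local repository before creating the endpoint (assuming the my-llama3-model-repo alias/bucket used above):
mc ls --recursive my-llama3-model-repo/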
Step 11: Create Inference Endpoint
- Create Model Endpoint
- Choose TensorRT-LLM Framework
- Select your model repository
- Choose runtime version v0.10.0
- Pick GPU plan and launch
The endpoint will enter the Running state once engine initialization completes.
Step 12: Test the Endpoint
Health Check
curl -H "Authorization: Bearer <auth-token>" \
<endpoint-url>/v2/health/ready
Inference
curl -X POST <endpoint-url>/v2/models/ensemble/generate \
-H "Authorization: Bearer <auth-token>" \
-H "Content-Type: application/json" \
-d '{
"text_input": "Explain AI alignment.",
"max_tokens": 500,
"temperature": 0.7
}'
Example Response
{
"model_name": "ensemble",
"sequence_id": 0,
"text_output": "AI alignment refers to designing systems whose goals match human values and intentions."
}
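For scripted tests it is handy to pull just the generated text out of the response, for example with jq (assuming jq is available on the client):
curl -s -X POST <endpoint-url>/v2/models/ensemble/generate \
  -H "Authorization: Bearer <auth-token>" \
  -H "Content-Type: application/json" \
  -d '{"text_input": "Explain AI alignment.", "max_tokens": 100, "temperature": 0.7}' \
  | jq -r '.text_output'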
Notes & Tips
- ✅ Use an A100 or H100 GPU for best performance
- ✅ Upload the model repository to object storage for fast startup
- ✅ Check the logs for the "engine loaded successfully" message
- ✅ Use a smaller max_tokens for quick tests
Llama 3 inference using TensorRT-LLM is now live and ready for production deployment!