# Launching Llama 3 Inference Using TensorRT-LLM

Deploy the **Llama 3 8B Instruct** model efficiently using NVIDIA’s **TensorRT-LLM backend** on **Triton Inference Server**. This guide walks you through every step, from building the engine to deploying and testing the model.

---

## Overview

This tutorial covers:

1. Preparing the environment and model access
2. Building a TensorRT engine
3. Configuring the inference backend
4. Uploading to the model repository
5. Creating an endpoint for inference
6. Making test API calls

You’ll need:

* A GPU-enabled compute instance
* Access to the [Meta Llama 3 model](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
* A Hugging Face access token

---

## Step 1: Get Model Access

1. Log in or sign up on [Hugging Face](https://huggingface.co/join).
2. Accept the **Meta Llama 3 Community License** on the model card.
3. Create a **read** token under [Hugging Face Settings → Access Tokens](https://huggingface.co/settings/tokens).
4. Save the token for later.

---

## Step 2: Create GPU Node

Launch a node with the TensorRT-LLM environment:

```bash
# Recommended image: TensorRT-LLM Engine Builder v0.10.0
```

* Choose a GPU plan (A100 preferred)
* Set the disk size to ≥ 100 GB
* Launch the node, then open a JupyterLab terminal once it is ready

> **Tip:** Run all commands in the JupyterLab terminal.
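Before moving on, it can be worth confirming that the node matches the prerequisites listed above. The following is a minimal sketch, not part of the official workflow: the helper names (`free_gb`, `token_is_set`, `preflight`) are hypothetical, and it assumes `nvidia-smi` is available on the image.

```bash
# Preflight sketch (hypothetical helpers, not shipped with TensorRT-LLM).

# Print the free space, in whole GB, on the filesystem holding the given path.
free_gb() {
  local path="${1:-.}"
  local avail_kb
  # -P forces one line per filesystem; column 4 is the available 1K-blocks.
  avail_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
  echo $(( avail_kb / 1024 / 1024 ))
}

# Succeed only if HF_TOKEN is set to a non-empty value.
token_is_set() {
  [ -n "${HF_TOKEN:-}" ]
}

# Report GPU, disk, and token status; return non-zero if something is missing.
preflight() {
  local workdir="${1:-.}"
  local status=0
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
  else
    echo "nvidia-smi not found; is this a GPU node?" >&2
    status=1
  fi
  echo "free space under ${workdir}: $(free_gb "$workdir") GB"
  if ! token_is_set; then
    echo "HF_TOKEN is empty; export it before downloading the model" >&2
    status=1
  fi
  return "$status"
}
```

Calling `preflight "$PWD"` in the JupyterLab terminal prints the GPU name and free disk space. The token check only passes once `HF_TOKEN` has been exported, so it is expected to fail on a fresh node.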
---

## Step 3: Download Model from Hugging Face

```bash
# $CWD must point to your working directory; define it first if it is not already set.
mkdir -p $CWD/model
pip install -U "huggingface_hub[cli]"
# Paste the read token from Step 1 after the equals sign.
export HF_TOKEN=
export HF_HOME=$CWD
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
  --local-dir=$CWD/model --local-dir-use-symlinks=False
```

---

## Step 4: Prepare Directories and Variables

```bash
export MODEL_DIR=$CWD/model
export UNIFIED_CKPT=$CWD/unified_ckpt
export ENGINE_DIR=$CWD/engine_dir
export TOKENIZER_DIR=$CWD/tokenizer_dir
export MODEL_REPO=$CWD/model_repository
mkdir -p $UNIFIED_CKPT $ENGINE_DIR $TOKENIZER_DIR $MODEL_REPO
```

---

## Step 5: Convert Model to Unified Checkpoint

```bash
pip install -r /app/tensorrt_llm/examples/llama/requirements.txt --no-cache-dir
python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
  --model_dir ${MODEL_DIR} --output_dir ${UNIFIED_CKPT} --dtype float16
```

---

## Step 6: Build TensorRT Engine

```bash
trtllm-build --checkpoint_dir ${UNIFIED_CKPT} \
  --remove_input_padding enable --gpt_attention_plugin float16 \
  --context_fmha enable --gemm_plugin float16 \
  --output_dir ${ENGINE_DIR} --paged_kv_cache enable \
  --max_batch_size 64
```

---

## Step 7: Copy Tokenizer Files

```bash
cp $MODEL_DIR/{tokenizer.json,tokenizer_config.json,special_tokens_map.json,config.json} $TOKENIZER_DIR
```

---

## Step 8: (Optional) Test Locally

Run inference directly to verify the engine:

```bash
python /app/tensorrt_llm/examples/run.py --max_output_len 500 \
  --tokenizer_dir ${TOKENIZER_DIR} --engine_dir ${ENGINE_DIR} \
  --input_text "Explain quantum computing in simple terms."
```

---

## Step 9: Configure Backend Repository

Clone the backend and set up the in-flight batcher:

```bash
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git $CWD/tensorrtllm_backend
cd $CWD/tensorrtllm_backend
git checkout v0.10.0
cp -r all_models/inflight_batcher_llm/* $MODEL_REPO
```

Update the configs. The engine and tokenizer paths below are resolved under `MOUNT_PATH` at serving time, so the engine and tokenizer files need to be present at `tensorrt_llm/1/engine` and `tensorrt_llm/1/tokenizer_dir` in the repository you upload:

```bash
export MOUNT_PATH=/mnt/models
export ENGINE_PATH=$MOUNT_PATH/tensorrt_llm/1/engine
export TOKENIZER_PATH=$MOUNT_PATH/tensorrt_llm/1/tokenizer_dir
python tools/fill_template.py -i ${MODEL_REPO}/tensorrt_llm/config.pbtxt \
  triton_backend:tensorrtllm,engine_dir:${ENGINE_PATH},tokenizer_dir:${TOKENIZER_PATH},max_batch_size:64
```

---

## Step 10: Upload Model Repository

```bash
mc cp -r ${MODEL_REPO}/* my-llama3-model-repo/
```

---

## Step 11: Create Inference Endpoint

1. Create a **Model Endpoint**
2. Choose the **TensorRT-LLM Framework**
3. Select your model repository
4. Choose runtime version **v0.10.0**
5. Pick a GPU plan and launch

> The endpoint will enter the *Running* state once engine initialization completes.

---

## Step 12: Test the Endpoint

In both commands below, add your auth token after `Bearer` and prefix the request path with your endpoint’s base URL.

### Health Check

```bash
curl -H "Authorization: Bearer " \
  /v2/health/ready
```

### Inference

```bash
curl -X POST /v2/models/ensemble/generate \
  -H "Authorization: Bearer " \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "Explain AI alignment.",
    "max_tokens": 500,
    "temperature": 0.7
  }'
```

---

## Example Response

```json
{
  "model_name": "ensemble",
  "sequence_id": 0,
  "text_output": "AI alignment refers to designing systems whose goals match human values and intentions."
}
```

---

## Notes & Tips

* ✅ Use an A100 or H100 GPU for best performance
* ✅ Upload the model repository to object storage for fast startup
* ✅ Check the logs for the `engine loaded successfully` message
* ✅ Use a smaller `max_tokens` value for quick tests

---

Llama 3 inference using TensorRT-LLM is now live and ready for production deployment!

---
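Because the endpoint only becomes usable after engine initialization, it can be handy to poll the health route from Step 12 instead of retrying by hand. The sketch below is one way to do that under stated assumptions: `wait_until_ready` and `probe_ready` are hypothetical helper names, and `ENDPOINT_URL` and `AUTH_TOKEN` are hypothetical environment variables standing in for your endpoint’s base URL and auth token.

```bash
# Retry a probe command until it succeeds or the attempts run out.
# usage: wait_until_ready <max_attempts> <delay_seconds> <command...>
wait_until_ready() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "not ready (attempt ${attempt}/${max_attempts})" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
  return 1
}

# Probe the health route used in Step 12; succeeds only on HTTP 200.
# ENDPOINT_URL and AUTH_TOKEN are assumed to be exported by you beforehand.
probe_ready() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer ${AUTH_TOKEN}" \
    "${ENDPOINT_URL}/v2/health/ready")
  [ "$code" = "200" ]
}
```

For example, `wait_until_ready 30 10 probe_ready && echo "endpoint ready"` checks every 10 seconds for up to 30 attempts. Keeping the probe as a separate command makes the retry loop reusable for other checks.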