# Launching Llama 3 Inference Using TensorRT-LLM

Deploy the **Llama 3 8B Instruct** model efficiently using NVIDIA’s **TensorRT-LLM backend** on **Triton Inference Server**. This guide walks through each step, from building the engine to deploying and testing the model.

---

## Overview

This tutorial covers:

1. Preparing the environment and model access
2. Building a TensorRT engine
3. Configuring the inference backend
4. Uploading to the model repository
5. Creating an endpoint for inference
6. Making test API calls

You’ll need:

* A GPU-enabled compute instance
* Access to the [Meta Llama 3 model](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
* A Hugging Face access token

---

## Step 1: Get Model Access

1. Log in or sign up on [Hugging Face](https://huggingface.co/join).
2. Accept the **Meta Llama 3 Community License** on the model card.
3. Create a **read** token under [Hugging Face Settings → Access Tokens](https://huggingface.co/settings/tokens).
4. Save the token for later.

---

## Step 2: Create GPU Node

Launch a node with the TensorRT-LLM environment, using the recommended image **TensorRT-LLM Engine Builder v0.10.0**:

* Choose a GPU plan (A100 preferred).
* Set the disk size to at least 100 GB.
* Launch the node, then open a JupyterLab terminal once it is ready.

> **Tip:** Run all commands in the JupyterLab terminal.
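
Before downloading anything, it can help to confirm that the node actually has the disk space requested above. The following is a minimal sketch; `free_gb` and `has_min_free_gb` are hypothetical helper names introduced here for illustration, not part of any tool.

```bash
# Sketch: check free disk space on the filesystem that holds a given path.
# free_gb and has_min_free_gb are hypothetical helpers, not platform commands.
free_gb() {
  # df -Pk prints POSIX-format output in 1 KiB blocks; column 4 is available space.
  df -Pk "$1" | awk 'NR==2 {print int($4 / 1024 / 1024)}'
}

has_min_free_gb() {
  # $1 = path to check, $2 = minimum free space in GiB
  [ "$(free_gb "$1")" -ge "$2" ]
}
```

For example, `has_min_free_gb "$PWD" 100 || echo "less than 100 GB free"` flags an undersized disk, and `nvidia-smi --query-gpu=name,memory.total --format=csv` lists the GPUs attached to the node.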

---

## Step 3: Download Model from Hugging Face

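```bash
# $CWD is used throughout this guide. If the environment does not already
# define it, this assumes the current working directory is the workspace.
export CWD=${CWD:-$(pwd)}
mkdir -p $CWD/model
pip install -U "huggingface_hub[cli]"
export HF_TOKEN=<your-hf-token>
export HF_HOME=$CWD
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
  --local-dir=$CWD/model --local-dir-use-symlinks=False
```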
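
A download that is interrupted or rejected (for example, because the license has not been accepted) can leave the directory incomplete. The sketch below checks for the files the later steps rely on. `check_model_dir` is a hypothetical helper, and the expectation that weights arrive as `*.safetensors` shards is an assumption about the repository layout.

```bash
# Sketch: verify that a downloaded model directory looks complete.
# check_model_dir is a hypothetical helper name.
check_model_dir() {
  local dir="$1" f
  for f in config.json tokenizer.json tokenizer_config.json; do
    [ -f "$dir/$f" ] || { echo "missing: $dir/$f" >&2; return 1; }
  done
  # Assumption: weights are stored as one or more *.safetensors shards.
  ls "$dir"/*.safetensors >/dev/null 2>&1 || {
    echo "no .safetensors weights found in $dir" >&2
    return 1
  }
}
```

Calling `check_model_dir "$CWD/model"` returns a non-zero status and names the first missing file if the download is incomplete.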

---

## Step 4: Prepare Directories and Variables

```bash
export MODEL_DIR=$CWD/model
export UNIFIED_CKPT=$CWD/unified_ckpt
export ENGINE_DIR=$CWD/engine_dir
export TOKENIZER_DIR=$CWD/tokenizer_dir
export MODEL_REPO=$CWD/model_repository

mkdir -p $UNIFIED_CKPT $ENGINE_DIR $TOKENIZER_DIR $MODEL_REPO
```
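
Every later command depends on these variables, and an unset one silently expands to an empty string (so `$CWD/model` becomes `/model`). A small guard, sketched here with the hypothetical helper `require_vars`, can catch that early. It relies on `printenv`, so it only sees exported variables, which matches the `export` statements above.

```bash
# Sketch: fail fast when a required environment variable is unset or empty.
# require_vars is a hypothetical helper name.
require_vars() {
  local name missing=0
  for name in "$@"; do
    # printenv prints nothing for variables that are not exported.
    if [ -z "$(printenv "$name")" ]; then
      echo "unset or empty: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical call would be `require_vars CWD MODEL_DIR UNIFIED_CKPT ENGINE_DIR TOKENIZER_DIR MODEL_REPO`, run once after opening a new terminal.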

---

## Step 5: Convert Model to Unified Checkpoint

```bash
pip install -r /app/tensorrt_llm/examples/llama/requirements.txt --no-cache-dir
python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
  --model_dir ${MODEL_DIR} --output_dir ${UNIFIED_CKPT} --dtype float16
```

---

## Step 6: Build TensorRT Engine

```bash
trtllm-build --checkpoint_dir ${UNIFIED_CKPT} \
  --remove_input_padding enable --gpt_attention_plugin float16 \
  --context_fmha enable --gemm_plugin float16 \
  --output_dir ${ENGINE_DIR} --paged_kv_cache enable \
  --max_batch_size 64
```
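
To get a feel for what `--max_batch_size 64` implies for GPU memory, the KV cache can be estimated with simple arithmetic. The sketch below uses the Llama 3 8B architecture values from the model's `config.json` (32 layers, 8 key/value heads, head size 128) and float16 storage. The sequence length of 2048 tokens is an illustrative assumption, not a value taken from the build command, and because the paged KV cache allocates blocks on demand, the result is an upper bound at full occupancy rather than a fixed reservation.

```bash
# Sketch: rough KV-cache sizing for Llama 3 8B in float16 (illustrative only).
kv_bytes_per_token() {
  local layers=32 kv_heads=8 head_dim=128 bytes_per_value=2
  # The factor 2 accounts for storing both keys and values.
  echo $(( 2 * layers * kv_heads * head_dim * bytes_per_value ))
}

kv_cache_gib() {
  # $1 = concurrent sequences, $2 = tokens per sequence (assumed inputs)
  echo $(( $(kv_bytes_per_token) * $1 * $2 / 1024 / 1024 / 1024 ))
}

kv_bytes_per_token      # prints 131072 (128 KiB per token)
kv_cache_gib 64 2048    # prints 16 (GiB for 64 sequences of 2048 tokens)
```

This estimate comes on top of the model weights themselves, so a lower batch size may be needed on GPUs with less memory.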

---

## Step 7: Copy Tokenizer Files

```bash
cp $MODEL_DIR/{tokenizer.json,tokenizer_config.json,special_tokens_map.json,config.json} $TOKENIZER_DIR
```

---

## Step 8: (Optional) Test Locally

Run inference directly to verify the engine:

```bash
python /app/tensorrt_llm/examples/run.py --max_output_len 500 \
  --tokenizer_dir ${TOKENIZER_DIR} --engine_dir ${ENGINE_DIR} \
  --input_text "Explain quantum computing in simple terms."
```

---

## Step 9: Configure Backend Repository

Clone the backend repository and copy the in-flight batcher templates into the model repository:

```bash
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git $CWD/tensorrtllm_backend
cd $CWD/tensorrtllm_backend
git checkout v0.10.0
cp -r all_models/inflight_batcher_llm/* $MODEL_REPO
```

Fill in the `tensorrt_llm` config template:

```bash
export MOUNT_PATH=/mnt/models
export ENGINE_PATH=$MOUNT_PATH/tensorrt_llm/1/engine
export TOKENIZER_PATH=$MOUNT_PATH/tensorrt_llm/1/tokenizer_dir

python tools/fill_template.py -i ${MODEL_REPO}/tensorrt_llm/config.pbtxt \
  triton_backend:tensorrtllm,engine_dir:${ENGINE_PATH},tokenizer_dir:${TOKENIZER_PATH},max_batch_size:64
```
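
The `ENGINE_PATH` and `TOKENIZER_PATH` values above point inside `tensorrt_llm/1/` under the mount path, which suggests that the engine and tokenizer files are expected to sit inside the model repository when it is uploaded. The guide does not show that copy step, so the following is a sketch under that assumption; `stage_model_files` is a hypothetical helper name.

```bash
# Sketch: copy the built engine and tokenizer files into the model repository
# so their location matches ENGINE_PATH and TOKENIZER_PATH after mounting.
# stage_model_files is a hypothetical helper; the layout is an assumption.
stage_model_files() {
  # $1 = engine dir, $2 = tokenizer dir, $3 = model repository root
  local engine_src="$1" tok_src="$2" repo="$3"
  mkdir -p "$repo/tensorrt_llm/1/engine" "$repo/tensorrt_llm/1/tokenizer_dir"
  # The trailing "/." copies the directory contents rather than the directory itself.
  cp -r "$engine_src"/. "$repo/tensorrt_llm/1/engine/"
  cp -r "$tok_src"/. "$repo/tensorrt_llm/1/tokenizer_dir/"
}
```

With the variables from Step 4, the call would be `stage_model_files "$ENGINE_DIR" "$TOKENIZER_DIR" "$MODEL_REPO"`.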

---

## Step 10: Upload Model Repository

```bash
mc cp -r ${MODEL_REPO}/* my-llama3-model-repo/
```

---

## Step 11: Create Inference Endpoint

1. Create a **Model Endpoint**.
2. Choose the **TensorRT-LLM** framework.
3. Select your model repository.
4. Choose runtime version **v0.10.0**.
5. Pick a GPU plan and launch.

> The endpoint enters the *Running* state once engine initialization completes.

---

## Step 12: Test the Endpoint

### Health Check

```bash
curl -H "Authorization: Bearer <auth-token>" \
  <endpoint-url>/v2/health/ready
```
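
Engine initialization can take a while, so a single health check may fail simply because the endpoint is not ready yet. The sketch below polls the same `/v2/health/ready` route until it returns HTTP 200. `wait_until_ready`, the retry count, and the delay are hypothetical choices for illustration.

```bash
# Sketch: poll the readiness route until it returns HTTP 200 or attempts run out.
# wait_until_ready and its defaults are hypothetical, not part of Triton.
wait_until_ready() {
  # $1 = endpoint URL, $2 = auth token, $3 = max attempts, $4 = delay in seconds
  local url="$1" token="$2" attempts="${3:-30}" delay="${4:-10}" i=1 code
  while [ "$i" -le "$attempts" ]; do
    # -o /dev/null discards the body; -w prints only the status code.
    code=$(curl -s -o /dev/null -w '%{http_code}' \
      -H "Authorization: Bearer $token" "$url/v2/health/ready")
    [ "$code" = "200" ] && return 0
    echo "attempt $i/$attempts: HTTP $code, retrying in ${delay}s" >&2
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

Calling `wait_until_ready "<endpoint-url>" "<auth-token>"` returns 0 as soon as the server reports ready, and 1 if it never does within the allotted attempts.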

### Inference

```bash
curl -X POST <endpoint-url>/v2/models/ensemble/generate \
  -H "Authorization: Bearer <auth-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "Explain AI alignment.",
    "max_tokens": 500,
    "temperature": 0.7
  }'
```

---

## Example Response

```json
{
  "model_name": "ensemble",
  "sequence_id": 0,
  "text_output": "AI alignment refers to designing systems whose goals match human values and intentions."
}
```
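
When scripting against the endpoint, prompts containing quotes or newlines are easy to break if the JSON body is written by hand. The sketch below assumes `jq` is installed and uses it both to build the request body and to pull `text_output` out of the response. `build_payload` and `extract_text` are hypothetical helper names.

```bash
# Sketch: build the request body and read the reply with jq (assumed installed).
# build_payload and extract_text are hypothetical helpers.
build_payload() {
  # $1 = prompt, $2 = max tokens (default 500), $3 = temperature (default 0.7)
  # --arg passes the prompt as a properly escaped JSON string;
  # --argjson passes the numbers as JSON numbers.
  jq -n --arg text "$1" --argjson max "${2:-500}" --argjson temp "${3:-0.7}" \
    '{text_input: $text, max_tokens: $max, temperature: $temp}'
}

extract_text() {
  # Reads a generate response on stdin and prints only the text_output field.
  jq -r '.text_output'
}
```

The two helpers combine with the request above as `build_payload "Explain AI alignment." | curl -s -X POST <endpoint-url>/v2/models/ensemble/generate -H "Authorization: Bearer <auth-token>" -H "Content-Type: application/json" -d @- | extract_text`.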

---

## Notes & Tips

* ✅ Use an A100 or H100 GPU for best performance
* ✅ Upload the model repository to object storage for fast startup
* ✅ Check the logs for the `engine loaded successfully` message
* ✅ Use a smaller `max_tokens` value for quick tests

---

Llama 3 inference using TensorRT-LLM is now live and ready for production deployment!
