# Deploy Inference for Meta Llama 3 8B-IT

Deploy the **Meta Llama 3 8B Instruct** model using a prebuilt container, with weights either downloaded from Hugging Face or linked from your own Model Repository. This guide walks through creating an endpoint, downloading and uploading model weights, configuring environment variables, and running inference.

---

## Overview

This tutorial covers:

1. Creating a model endpoint using prebuilt containers
2. Deploying with custom model weights
3. Understanding supported parameters for inference

Requirements:

* A GPU-enabled compute plan
* Access to [Meta Llama 3 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
* A Hugging Face read token

---

## Step 1: Create a Model Endpoint

1. Log in to the [AI Platform](https://tir.e2enetworks.com) and open your **project**.
2. Go to the **Model Endpoints** section.
3. Click **Create Endpoint** and choose **Llama 3 8B-IT** from model cards.
4. Under **Download Source**, select **Hugging Face**.
5. Choose a GPU plan, set replicas, and name your endpoint (e.g., `llama3-infer`).

> **Tip:** Choose **Link with Model Repository** if using custom weights.

---

## Step 2: Set Environment Variables

Add the following environment variables:

| Variable   | Description                  |
| ---------- | ---------------------------- |
| `HF_TOKEN` | Your Hugging Face read token |

> **Note:** Llama 3 models are gated. Ensure you’ve requested access and been approved on [Hugging Face](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).

---

## Step 3: Generate API Token

1. Go to the **API Tokens** section.
2. Click **Create Token** (or use an existing one).
3. Copy your **Auth Token**; this will be used in your inference request.

---

## Step 4: Test Inference Endpoint

Once your endpoint is ready, test it using a sample `curl` request:

```bash
curl -X POST <endpoint-url>/v2/models/llama-3-8b-it/generate \
  -H "Authorization: Bearer <auth-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "Write a short Python function to reverse a string.",
    "max_tokens": 200,
    "temperature": 0.7
  }'
```
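
The same request can also be sent from Python. The following is a minimal sketch using only the standard library. It assumes the endpoint responds with a JSON body, which this guide does not specify, and the helper names `build_generate_request` and `send` are hypothetical rather than part of any platform SDK.

```python
import json
import urllib.request


def build_generate_request(endpoint_url, auth_token, prompt, max_tokens=200, temperature=0.7):
    """Build (but do not send) a POST request that mirrors the curl example."""
    url = endpoint_url.rstrip("/") + "/v2/models/llama-3-8b-it/generate"
    body = json.dumps({
        "text_input": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {auth_token}",
            "Content-Type": "application/json",
        },
    )


def send(request, timeout=60):
    """Send the request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

To use it, pass your endpoint URL and Auth Token to `build_generate_request`, then hand the result to `send`. Keeping request construction separate from sending makes it possible to inspect the URL, headers, and body before any network call is made.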

---

## Creating Endpoint with Custom Model Weights

To deploy custom or fine-tuned weights:

1. Download `meta-llama/Meta-Llama-3-8B-Instruct` from Hugging Face.
2. Upload the model to the **Model Repository** (EOS bucket).
3. Create an endpoint and **Link with Model Repository**.

### Define Model Repository

1. Go to **Model Repository → Create Model**.
2. Select **Model Type:** Custom.
3. Copy the **MinIO Setup Host** command.
4. Use MinIO CLI to connect and upload your model weights.

> **Tip:** You can return to the setup instructions at any time from the **Model Details → Setup MinIO CLI** tab.

### Download and Upload Weights

Run the following commands inside your GPU instance:

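```bash
# Authenticate with your Hugging Face read token (prompts interactively)
huggingface-cli login

# Download the weights into the local Hugging Face cache and run a quick generation test
python3 - <<'EOF'
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Print the result so the output is visible when run as a script
print(pipe("def factorial(num: int):", max_length=100))
EOF
```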

Upload the model to your repository, replacing `my-llama3-model-bucket` with the target shown in your **Setup MinIO CLI** instructions:

```bash
cd $HOME/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/
mc cp -r * my-llama3-model-bucket/
```
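
The Hugging Face cache stores each downloaded revision in its own subdirectory under `snapshots/`, so it can help to confirm which snapshot directories exist before uploading. The sketch below lists them; the helper name `list_snapshot_dirs` is hypothetical, and the cache location is assumed to be the default one used in the command above.

```python
from pathlib import Path


def list_snapshot_dirs(cache_root, repo_id="meta-llama/Meta-Llama-3-8B-Instruct"):
    """Return the snapshot directories for a cached model, sorted by name.

    `cache_root` is the Hugging Face hub cache directory, assumed here to be
    ~/.cache/huggingface/hub unless it was changed on your instance.
    """
    repo_dir = Path(cache_root) / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    if not snapshots.is_dir():
        return []
    # Each subdirectory corresponds to one downloaded revision of the model.
    return sorted(p for p in snapshots.iterdir() if p.is_dir())
```

For example, `list_snapshot_dirs(Path.home() / ".cache" / "huggingface" / "hub")` returns an empty list if the model has not been downloaded yet, which is a quick way to catch a failed download before running `mc cp`.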

---

## Step 5: Create Endpoint with Repository Link

1. Go to **Model Endpoints → Create Endpoint**.
2. Choose the **Llama 3 8B-IT** model card.
3. Select **Link with Model Repository** and choose your uploaded model.
4. Set the `HF_TOKEN` environment variable.
5. Launch the endpoint and monitor the logs until it is ready.

---

## Step 6: Run Inference on Custom Model

After the endpoint is active, make requests using your API token:

```bash
curl -X POST <endpoint-url>/v2/models/llama-3-8b-it/generate \
  -H "Authorization: Bearer <auth-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "Summarize quantum computing in 50 words.",
    "max_tokens": 250,
    "temperature": 0.6
  }'
```

---

## Supported Parameters

| Field                | Description                    | Type   |
| -------------------- | ------------------------------ | ------ |
| `text_input`         | Input prompt text              | string |
| `max_tokens`         | Maximum tokens in output       | int    |
| `temperature`        | Controls randomness            | float  |
| `top_k`              | Top-k sampling                 | int    |
| `top_p`              | Nucleus sampling               | float  |
| `repetition_penalty` | Penalize repetition            | float  |
| `stop_words`         | Stop generation tokens         | list   |
| `return_log_probs`   | Return token log probabilities | bool   |
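
As an illustration of how these fields fit together, the sketch below builds a request payload and checks each value against the type listed in the table. The field names and types come from the table; the helper `build_payload` and its validation rules are assumptions made for this example, not behavior of the endpoint itself.

```python
# Field names and types mirror the Supported Parameters table.
SUPPORTED_FIELDS = {
    "text_input": str,
    "max_tokens": int,
    "temperature": float,
    "top_k": int,
    "top_p": float,
    "repetition_penalty": float,
    "stop_words": list,
    "return_log_probs": bool,
}


def build_payload(text_input, **options):
    """Return a dict of request fields after a basic client-side type check."""
    payload = {}
    for name, value in {"text_input": text_input, **options}.items():
        expected = SUPPORTED_FIELDS.get(name)
        if expected is None:
            raise ValueError(f"Unsupported field: {name}")
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and expected is not bool:
            raise TypeError(f"{name} must be {expected.__name__}, got bool")
        # Accept whole numbers where a float is expected, e.g. temperature=1.
        if expected is float and isinstance(value, int):
            value = float(value)
        if not isinstance(value, expected):
            raise TypeError(f"{name} must be {expected.__name__}")
        payload[name] = value
    return payload
```

Checking types on the client side catches mistakes such as passing `max_tokens` as a string before the request reaches the endpoint.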

---

## Notes & Tips

* ✅ Ensure your Hugging Face token has access to the gated Llama 3 model
* ✅ Use an A100 or H100 GPU for faster inference
* ✅ Monitor the endpoint logs for readiness
* ✅ Use a lower `max_tokens` value for quicker test responses

---

Your **Meta Llama 3 8B-IT** endpoint is now ready to serve production inference requests!

---

## Troubleshooting

### Common Issues

| Issue                              | Cause                                                | Solution                                                                                      |
| ---------------------------------- | ---------------------------------------------------- | --------------------------------------------------------------------------------------------- |
| `403 Forbidden` or `Access Denied` | Hugging Face token doesn’t have access               | Ensure you have requested and received access to the Llama 3 model. Recreate token if needed. |
| Endpoint stuck in *Pending*        | GPU unavailable or model too large for selected plan | Use a higher GPU configuration such as A100 or H100, or reduce replicas.                      |
| Slow inference                     | High `max_tokens` or small GPU plan                  | Lower `max_tokens` or upgrade GPU plan.                                                       |
| Empty or truncated responses       | Token limits or temperature settings                 | Increase `max_tokens` or adjust `temperature` to 0.7–1.0.                                     |

### Useful Commands

* View endpoint logs:

  ```bash
  tail -f /var/log/endpoint.log
  ```
* Validate endpoint status:

  ```bash
  curl -H "Authorization: Bearer <auth-token>" <endpoint-url>/v2/health/ready
  ```
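
The readiness check can also be polled from a script while the endpoint starts up. The sketch below is one possible approach using the standard library; it assumes that an HTTP 200 from `/v2/health/ready` means the endpoint is ready, and the helper names `is_ready` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def is_ready(endpoint_url, auth_token, timeout=10):
    """Return True if the readiness URL answers with HTTP 200 (assumed to mean ready)."""
    request = urllib.request.Request(
        endpoint_url.rstrip("/") + "/v2/health/ready",
        headers={"Authorization": f"Bearer {auth_token}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection errors and non-2xx responses are treated as "not ready yet".
        return False


def wait_until_ready(check, attempts=30, delay_seconds=10.0, sleep=time.sleep):
    """Call `check()` until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A typical call would be `wait_until_ready(lambda: is_ready("<endpoint-url>", "<auth-token>"))`. The attempt count and delay are arbitrary defaults and should be tuned to how long your endpoint takes to load the model.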

---

## Best Practices

* ✅ Always validate your environment variables before deployment.
* ✅ Store your Hugging Face token securely and never hardcode it in scripts.
* ✅ Keep batch sizes small for initial testing.
* ✅ Use **Link with Model Repository** for faster loading and version control.
* ✅ Clean up unused nodes/endpoints to save GPU credits.
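
The first two practices can be combined in a small pre-flight check that reads the token from the environment instead of embedding it in a script. This is a minimal sketch that assumes the token is exported locally as `HF_TOKEN`; the helper name `read_hf_token` is hypothetical and not part of any platform SDK.

```python
import os


def read_hf_token(env=None):
    """Return the Hugging Face token from the environment, or raise if it is missing.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export your Hugging Face read token first.")
    return token
```

Calling `read_hf_token()` at the top of a deployment script fails fast with a clear message when the variable is missing, rather than surfacing later as an access error.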

---

With the access token, GPU plan, and request parameters configured as described above, your **Meta Llama 3 8B-IT** endpoint is set up to serve production inference workloads.


---