# Deploy Inference for Meta Llama 2 7B

Deploy Meta’s **Llama 2 (7B)** model seamlessly using prebuilt containers or your custom model weights. This guide covers defining the model, setting up your environment, downloading weights, uploading to storage, and creating an inference endpoint.

---

## Overview

This tutorial includes:

1. Defining a model in the dashboard
2. Downloading Meta’s Llama 2-7B model from Hugging Face
3. Uploading the model to Object Storage (EOS)
4. Creating an inference endpoint for API access

Requirements:

* GPU-enabled compute plan (recommended: A100 80GB)
* Access to [Meta Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf)
* Hugging Face token

---

## Step 1: Define Model in Dashboard

1. Log in to the [AI Platform](https://tir.e2enetworks.com).
2. Select your **project**.
3. Go to the **Models** section and click **Create Model**.
4. Provide a model name (e.g., `meta-llama2-7b-chat`).
5. Choose **Model Type** as *Custom* or *PyTorch*.
6. Click **Create**.
7. The system will generate an **Object Storage (EOS)** bucket for your model.

> **Note:** EOS offers an S3-compatible interface. You’ll use **MinIO CLI** to upload content.

8. Copy the **Setup Host** command from the *Setup MinIO CLI* tab — you’ll need it later to configure your CLI tool.

> **Tip:** If you forget to copy it, revisit the model details page anytime to retrieve it.

---

## Step 2: Launch a Instance

1. In the **Dashboard**, navigate to **Instance(Nodes)**.
2. Launch a new Instance(Node) using the **Transformers** or **PyTorch** image.
3. Choose a GPU plan (A100 80GB recommended).
4. Click **Launch Instance(Node)** to open **JupyterLab**.
5. Open a new terminal within JupyterLab.
6. Paste and run the copied **MinIO setup command** from Step 1.
7. Once configured, your MinIO CLI (`mc`) will be ready for uploads.

---

## Step 3: Download Llama 2 7B Model from Hugging Face

1. Create a new Instance(Node) (`llama2-setup.ipynb`).
2. Add your Hugging Face token:

   ```bash
   export HUGGING_FACE_HUB_TOKEN=hf_XXXXXXXXXXXXXXXXXXXXXXXX
   ```
3. Run the following Python code to download the model:

   ```python
   from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
   import torch, transformers

   model_id = "meta-llama/Llama-2-7b-chat-hf"
   tokenizer = AutoTokenizer.from_pretrained(model_id)
   model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map='auto')

   generator = pipeline(
       "text-generation",
       model=model,
       tokenizer=tokenizer,
       torch_dtype=torch.float16,
       device_map="auto"
   )

   prompt = "It is said that life is beautiful when"
   result = generator(prompt, do_sample=True, top_k=10, max_length=200)
   print(result[0]['generated_text'])
   ```

> **Note:** Install dependencies if not preinstalled:
>
> ```bash
> pip install transformers torch
> ```

> **Tip:** Llama 2 base models are completion-based. Use sentence prompts instead of direct questions for meaningful outputs.

---

## Step 4: Upload Model to EOS

1. Locate your downloaded model path:

   ```bash
   cd $HOME/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/
   ```
2. Copy the upload command from the *Setup MinIO CLI* tab in your dashboard.
3. Replace `<MODEL_NAME>` with `*` to upload all contents:

   ```bash
   mc cp -r * meta-llama2-7b/meta-llama2-7b-weights
   ```

> **Note:** Directory names may vary slightly. Use `ls $HOME/.cache/huggingface/hub/` to confirm the correct folder.

---

## Step 5: Create the Endpoint

Once the model is uploaded, create an inference endpoint to serve API requests.

### Using Prebuilt Container

1. Go to the **Model Endpoints** section.
2. Click **Create Endpoint**.
3. Choose **Llama 2 7B** model card.
4. Select a GPU plan (e.g., `A100 80GB`, disk: 20GB+).
5. Link the EOS model you just uploaded.
6. Click **Create** and monitor logs until deployment completes.

> **Tip:** Use prebuilt containers to skip API handler creation — they’re optimized for inference-ready execution.

---

## Step 6: Test the Endpoint

Once the endpoint status is *Ready*, run a test request:

```bash
curl -X POST <endpoint-url>/v2/models/llama2-7b/generate \
  -H "Authorization: Bearer <auth-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "Explain the theory of relativity in simple words.",
    "max_tokens": 250,
    "temperature": 0.7
  }'
```

---

## Troubleshooting

| Issue                       | Cause                             | Solution                                                  |
| --------------------------- | --------------------------------- | --------------------------------------------------------- |
| `403 Forbidden`             | Hugging Face token missing access | Ensure access granted and re-login with token.            |
| Endpoint stuck in *Pending* | Insufficient GPU or large model   | Use larger GPU plan like A100 or H100.                    |
| Slow responses              | Large token size                  | Lower `max_tokens` or switch to high-end GPU.             |
| Upload fails                | MinIO misconfiguration            | Re-run host setup from *Model Details → Setup MinIO CLI*. |

---

## Best Practices

* ✅ Validate `HF_TOKEN` before launching.
* ✅ Use **Link with Model Repository** for version control.
* ✅ Prefer GPU A100 or H100 for efficiency.
* ✅ Keep batch size small for initial runs.
* ✅ Clean up unused endpoints to optimize GPU usage.

---

With these steps, your **Meta Llama 2 7B** inference endpoint will be ready for real-time text generation and model serving!


---