# Deploy Inference for Meta Llama 2 7B

Deploy Meta’s **Llama 2 (7B)** model using prebuilt containers or your own model weights. This guide covers defining the model, setting up your environment, downloading the weights, uploading them to storage, and creating an inference endpoint.

---

## Overview

This tutorial includes:

1. Defining a model in the dashboard
2. Downloading Meta’s Llama 2 7B model from Hugging Face
3. Uploading the model to Object Storage (EOS)
4. Creating an inference endpoint for API access

Requirements:

* GPU-enabled compute plan (recommended: A100 80GB)
* Access to [Meta Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf) on Hugging Face, including the `meta-llama/Llama-2-7b-chat-hf` repository that Step 3 downloads
* Hugging Face token

---

## Step 1: Define Model in Dashboard

1. Log in to the [AI Platform](https://tir.e2enetworks.com).
2. Select your **project**.
3. Go to the **Models** section and click **Create Model**.
4. Provide a model name (e.g., `meta-llama2-7b-chat`).
5. Choose **Model Type** as *Custom* or *PyTorch*.
6. Click **Create**.
7. The system will generate an **Object Storage (EOS)** bucket for your model.

   > **Note:** EOS offers an S3-compatible interface. You’ll use the **MinIO CLI** to upload content.

8. Copy the **Setup Host** command from the *Setup MinIO CLI* tab. You’ll need it later to configure your CLI tool.

   > **Tip:** If you forget to copy it, you can retrieve it from the model details page at any time.

---

## Step 2: Launch an Instance

1. In the **Dashboard**, navigate to **Instance(Nodes)**.
2. Launch a new Instance(Node) using the **Transformers** or **PyTorch** image.
3. Choose a GPU plan (A100 80GB recommended).
4. Click **Launch Instance(Node)** to open **JupyterLab**.
5. Open a new terminal within JupyterLab.
6. Paste and run the **MinIO setup command** you copied in Step 1.
7. Once configured, your MinIO CLI (`mc`) will be ready for uploads.

---

## Step 3: Download Llama 2 7B Model from Hugging Face

1. Create a new notebook (`llama2-setup.ipynb`).
2. 
   Add your Hugging Face token:

   ```bash
   export HUGGING_FACE_HUB_TOKEN=hf_XXXXXXXXXXXXXXXXXXXXXXXX
   ```

   If you run the download from a notebook, set the variable inside the notebook process instead (for example through `os.environ`), because an `export` in a separate terminal does not reach a kernel that is already running.

3. Run the following Python code to download the model:

   ```python
   import torch
   from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

   model_id = "meta-llama/Llama-2-7b-chat-hf"

   # Download (and cache) the tokenizer and weights; the token is read from the environment.
   tokenizer = AutoTokenizer.from_pretrained(model_id)
   model = AutoModelForCausalLM.from_pretrained(
       model_id,
       torch_dtype=torch.float16,  # load the weights in half precision
       device_map="auto",          # requires the accelerate package
   )

   # The model is already loaded and placed, so the pipeline only wraps it.
   generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

   prompt = "It is said that life is beautiful when"
   result = generator(prompt, do_sample=True, top_k=10, max_length=200)
   print(result[0]["generated_text"])
   ```

> **Note:** Install dependencies if they are not preinstalled (`accelerate` is needed for `device_map="auto"`):
>
> ```bash
> pip install transformers torch accelerate
> ```

> **Tip:** Llama 2 base models are completion-based, so sentence-style prompts that the model can continue tend to produce more meaningful output than direct questions.

---

## Step 4: Upload Model to EOS

1. Locate your downloaded model path:

   ```bash
   cd $HOME/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/
   ```

   The `snapshots/` directory holds one subdirectory per downloaded revision, named after its commit hash. `cd` into that subdirectory so that `*` in the upload command matches the model files themselves.

2. Copy the upload command from the *Setup MinIO CLI* tab in your dashboard.
3. Replace the source-path placeholder in that command with `*` to upload all contents:

   ```bash
   mc cp -r * meta-llama2-7b/meta-llama2-7b-weights
   ```

> **Note:** Directory names may vary slightly. Use `ls $HOME/.cache/huggingface/hub/` to confirm the correct folder, and keep the destination path exactly as your dashboard shows it.

---

## Step 5: Create the Endpoint

Once the model is uploaded, create an inference endpoint to serve API requests.

### Using Prebuilt Container

1. Go to the **Model Endpoints** section.
2. Click **Create Endpoint**.
3. Choose the **Llama 2 7B** model card.
4. Select a GPU plan (e.g., `A100 80GB`) with at least 20 GB of disk.
5. Link the EOS model you just uploaded.
6. Click **Create** and monitor the logs until deployment completes.

> **Tip:** Prebuilt containers let you skip writing an API handler; they come ready to serve inference requests.

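
### Optional: Check the Local Snapshot

If the endpoint cannot load the model, one thing worth ruling out is an incomplete download or upload in Steps 3 and 4. The following is a minimal sketch, not part of the platform tooling, that locates the snapshot folder in the Hugging Face cache and reports files that appear to be missing. The helper names are hypothetical, and the list of required files (`config.json`, `tokenizer_config.json`) is an assumption; adjust it to what your snapshot should contain.

```python
from __future__ import annotations

from pathlib import Path


def cache_folder_name(repo_id: str) -> str:
    """Folder name the Hugging Face cache uses for a model repo."""
    return "models--" + repo_id.replace("/", "--")


def list_snapshots(hub_cache: Path, repo_id: str) -> list[Path]:
    """Return the snapshot directories (one per revision) for a repo."""
    snapshots = hub_cache / cache_folder_name(repo_id) / "snapshots"
    if not snapshots.is_dir():
        return []
    return sorted(p for p in snapshots.iterdir() if p.is_dir())


def missing_files(snapshot: Path, required=("config.json", "tokenizer_config.json")) -> list[str]:
    """Names from `required` (an assumed list) that are absent from the snapshot."""
    return [name for name in required if not (snapshot / name).exists()]


def has_weights(snapshot: Path) -> bool:
    """True if at least one .safetensors or .bin file is present."""
    return any(snapshot.glob("*.safetensors")) or any(snapshot.glob("*.bin"))


if __name__ == "__main__":
    hub_cache = Path.home() / ".cache" / "huggingface" / "hub"
    snapshots = list_snapshots(hub_cache, "meta-llama/Llama-2-7b-chat-hf")
    if not snapshots:
        print("No snapshot found; re-run the download in Step 3.")
    for snapshot in snapshots:
        print(snapshot)
        print("  missing:", missing_files(snapshot) or "none")
        print("  weights present:", has_weights(snapshot))
```

The snapshot path it prints is the directory to `cd` into before running `mc cp` in Step 4.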
---

## Step 6: Test the Endpoint

Once the endpoint status is *Ready*, run a test request. `ENDPOINT_URL` and `API_TOKEN` are placeholder variable names; set them to your endpoint's base URL and your API token before running the command:

```bash
curl -X POST "$ENDPOINT_URL/v2/models/llama2-7b/generate" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "Explain the theory of relativity in simple words.",
    "max_tokens": 250,
    "temperature": 0.7
  }'
```

---

## Troubleshooting

| Issue | Cause | Solution |
| --------------------------- | ------------------------------------------------ | ------------------------------------------------------------------ |
| `403 Forbidden` | Hugging Face token lacks access to the gated model | Request access on the model page, then log in again with the token. |
| Endpoint stuck in *Pending* | Insufficient GPU or large model | Use a larger GPU plan such as A100 or H100. |
| Slow responses | Long generations (high `max_tokens`) | Lower `max_tokens` or switch to a higher-end GPU. |
| Upload fails | MinIO misconfiguration | Re-run the host setup from *Model Details → Setup MinIO CLI*. |

---

## Best Practices

* ✅ Validate your Hugging Face token (`HF_TOKEN`, or the older `HUGGING_FACE_HUB_TOKEN` name used in Step 3) before launching.
* ✅ Use **Link with Model Repository** for version control.
* ✅ Prefer an A100 or H100 GPU for efficiency.
* ✅ Keep batch size small for initial runs.
* ✅ Delete unused endpoints to free up GPU resources.

---

With these steps, your **Meta Llama 2 7B** inference endpoint will be ready for real-time text generation and model serving!

---
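
### Optional: Python Version of the Test Request

The test request from Step 6 can also be sent from Python. The following is a minimal sketch using only the standard library; it mirrors the path, headers, and JSON fields of the `curl` example. The function names and the `ENDPOINT_URL` and `API_TOKEN` environment variable names are hypothetical, and because this guide does not show the response schema, the sketch returns the decoded JSON without interpreting it.

```python
import json
import os
import urllib.request


def build_generate_request(base_url, token, prompt,
                           model_name="llama2-7b",
                           max_tokens=250, temperature=0.7):
    """Build (but do not send) the POST request shown in Step 6."""
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/generate"
    payload = {
        "text_input": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def generate(request, timeout=60):
    """Send the request and return the decoded JSON body unchanged."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical variable names; set them to your endpoint URL and token.
    base_url = os.environ.get("ENDPOINT_URL")
    token = os.environ.get("API_TOKEN")
    if base_url and token:
        req = build_generate_request(
            base_url, token,
            "Explain the theory of relativity in simple words.",
        )
        print(generate(req))
    else:
        print("Set ENDPOINT_URL and API_TOKEN to send a request.")
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before any network call is made.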