# Deploy Inference for Meta Llama 3 8B-IT

Deploy the **Meta Llama 3 8B Instruct** model using prebuilt containers or custom model weights. This guide walks you through setting up endpoints, downloading models, configuring environments, and running inference.

---

## Overview

This tutorial covers:

1. Creating a model endpoint using prebuilt containers
2. Deploying with custom model weights
3. Understanding supported parameters for inference

Requirements:

* A GPU-enabled compute plan
* Access to [Meta Llama 3 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
* A Hugging Face read token

---

## Step 1: Create a Model Endpoint

1. Log in to the [AI Platform](https://tir.e2enetworks.com) and open your **project**.
2. Go to the **Model Endpoints** section.
3. Click **Create Endpoint** and choose **Llama 3 8B-IT** from the model cards.
4. Under **Download Source**, select **Hugging Face**.
5. Choose a GPU plan, set the number of replicas, and name your endpoint (e.g., `llama3-infer`).

> **Tip:** Choose **Link with Model Repository** if you are using custom weights.

---

## Step 2: Set Environment Variables

Add the following environment variable:

| Variable   | Description                  |
| ---------- | ---------------------------- |
| `HF_TOKEN` | Your Hugging Face read token |

> **Note:** Llama 3 models are gated. Ensure you’ve requested access and been approved on [Hugging Face](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).

---

## Step 3: Generate API Token

1. Go to the **API Tokens** section.
2. Click **Create Token** (or use an existing one).
3. Copy your **Auth Token**; you will use it in your inference requests.

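The auth token from Step 3 is sent as a bearer token on every request. As a minimal sketch (not platform code), the helper below reads the token from an environment variable instead of hardcoding it, and builds the two headers used by the `curl` examples in this guide. The variable name `TIR_AUTH_TOKEN` and the function name `build_auth_headers` are assumptions chosen for illustration.

```python
import os


def build_auth_headers(env_var: str = "TIR_AUTH_TOKEN") -> dict:
    """Build request headers from a token stored in an environment variable.

    `TIR_AUTH_TOKEN` is a hypothetical name; use whichever variable you export.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early rather than sending a request with an empty bearer token.
        raise RuntimeError(f"Environment variable {env_var} is not set or is empty")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```
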

---

## Step 4: Test Inference Endpoint

Once your endpoint is ready, test it using a sample `curl` request:

```bash
# Set ENDPOINT_URL to your endpoint's base URL and AUTH_TOKEN to the token from Step 3.
curl -X POST "$ENDPOINT_URL/v2/models/llama-3-8b-it/generate" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "Write a short Python function to reverse a string.",
    "max_tokens": 200,
    "temperature": 0.7
  }'
```

---

## Creating Endpoint with Custom Model Weights

To deploy custom or fine-tuned weights:

1. Download `meta-llama/Meta-Llama-3-8B-Instruct` from Hugging Face.
2. Upload the model to the **Model Repository** (EOS bucket).
3. Create an endpoint and select **Link with Model Repository**.

### Define Model Repository

1. Go to **Model Repository → Create Model**.
2. Select **Model Type:** Custom.
3. Copy the **MinIO Setup Host** command.
4. Use the MinIO CLI to connect and upload your model weights.

> **Tip:** You can revisit the setup instructions at any time from the **Model Details → Setup MinIO CLI** tab.

### Download and Upload Weights

Run the following commands inside your GPU instance:

```bash
# Log in with your Hugging Face read token (the model is gated).
huggingface-cli login

# Download the weights into the local cache and run a quick smoke test.
python3 - <<'EOF'
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("def factorial(num: int):", max_length=100))
EOF
```

Upload the model to your repository:

```bash
# The snapshots/ directory contains one subdirectory per downloaded revision.
cd "$HOME/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/"
# Replace the destination with the bucket path from your MinIO setup command.
mc cp -r * my-llama3-model-bucket/
```

---

## Step 5: Create Endpoint with Repository Link

1. Go to **Model Endpoints → Create Endpoint**.
2. Choose the **Llama 3 8B-IT** model card.
3. Select **Link with Model Repository** and choose your uploaded model.
4. Set the environment variables (`HF_TOKEN`).
5. Launch the endpoint and monitor the logs until it is ready.

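Step 5 ends with waiting for the endpoint to become ready. Besides watching the logs, readiness can also be polled from a script. The sketch below assumes the endpoint exposes the `/v2/health/ready` route listed under Useful Commands and that an HTTP 200 response means the model is loaded; that behaviour is an assumption to verify against your endpoint. The names `is_ready` and `wait_until_ready`, and the `base_url` and `auth_token` arguments, are hypothetical.

```python
import time
import urllib.request


def is_ready(base_url: str, auth_token: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/v2/health/ready answers with HTTP 200.

    Assumption: a 200 response means the model is loaded and serving.
    """
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/v2/health/ready",
        headers={"Authorization": f"Bearer {auth_token}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError (non-2xx), and timeouts all derive from OSError.
        return False


def wait_until_ready(check, max_wait: float = 600.0, interval: float = 10.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call check() every `interval` seconds until it returns True or `max_wait` elapses."""
    deadline = clock() + max_wait
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example call (placeholder URL and token, shown as a comment so nothing runs on import):
# ready = wait_until_ready(lambda: is_ready("https://example.invalid", "my-token"))
```
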

---

## Step 6: Run Inference on Custom Model

After the endpoint is active, make requests using your API token:

```bash
# Set ENDPOINT_URL to your endpoint's base URL and AUTH_TOKEN to your API auth token.
curl -X POST "$ENDPOINT_URL/v2/models/llama-3-8b-it/generate" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "Summarize quantum computing in 50 words.",
    "max_tokens": 250,
    "temperature": 0.6
  }'
```

---

## Supported Parameters

| Field                | Description                         | Type   |
| -------------------- | ----------------------------------- | ------ |
| `text_input`         | Input prompt text                   | string |
| `max_tokens`         | Maximum tokens in output            | int    |
| `temperature`        | Controls randomness                 | float  |
| `top_k`              | Top-k sampling                      | int    |
| `top_p`              | Nucleus sampling                    | float  |
| `repetition_penalty` | Penalize repetition                 | float  |
| `stop_words`         | Tokens that stop generation         | list   |
| `return_log_probs`   | Return token log probabilities      | bool   |

---

## Notes & Tips

* ✅ Ensure your Hugging Face token has access to the gated Llama 3 model
* ✅ Use an A100 or H100 GPU for faster inference
* ✅ Monitor the logs for readiness
* ✅ Use a lower `max_tokens` for quicker test responses

---

Your **Meta Llama 3 8B-IT** endpoint is now ready to serve production inference requests!

---

## Troubleshooting

### Common Issues

| Issue                              | Cause                                                | Solution                                                                                          |
| ---------------------------------- | ---------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| `403 Forbidden` or `Access Denied` | Hugging Face token doesn’t have access               | Ensure you have requested and received access to the Llama 3 model. Recreate the token if needed. |
| Endpoint stuck in *Pending*        | GPU unavailable or model too large for selected plan | Use a higher GPU configuration such as A100 or H100, or reduce replicas.                          |
| Slow inference                     | High `max_tokens` or small GPU plan                  | Lower `max_tokens` or upgrade the GPU plan.                                                       |
| Empty or truncated responses       | Token limits or temperature settings                 | Increase `max_tokens` or adjust `temperature` to 0.7–1.0.                                         |

### Useful Commands

* View endpoint logs:

  ```bash
  tail -f /var/log/endpoint.log
  ```

* Validate endpoint status:

  ```bash
  # ENDPOINT_URL is your endpoint's base URL; AUTH_TOKEN is your API auth token.
  curl -H "Authorization: Bearer $AUTH_TOKEN" "$ENDPOINT_URL/v2/health/ready"
  ```

---

## Best Practices

* ✅ Always validate your environment variables before deployment.
* ✅ Store your Hugging Face token securely; never hardcode it in scripts.
* ✅ Keep batch sizes small for initial testing.
* ✅ Use **Link with Model Repository** for faster loading and version control.
* ✅ Clean up unused nodes and endpoints to save GPU credits.

---

With proper configuration, your **Meta Llama 3 8B-IT** inference setup can deliver stable, scalable performance for production workloads.

---
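
As a companion to the Supported Parameters table, the following sketch assembles a request body for the `generate` route and rejects field names that are not in that table. It is illustrative only: `build_generate_payload` is a hypothetical helper, and its type checks mirror the table's Type column rather than any validation documented by the platform.

```python
import json

# Field names and types copied from the Supported Parameters table.
SUPPORTED_FIELDS = {
    "text_input": str,
    "max_tokens": int,
    "temperature": float,
    "top_k": int,
    "top_p": float,
    "repetition_penalty": float,
    "stop_words": list,
    "return_log_probs": bool,
}


def build_generate_payload(text_input: str, **options) -> str:
    """Return a JSON request body after checking names and types against the table."""
    fields = {"text_input": text_input, **options}
    for name, value in fields.items():
        expected = SUPPORTED_FIELDS.get(name)
        if expected is None:
            raise ValueError(f"Unsupported field: {name}")
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        if isinstance(value, bool) and expected is not bool:
            raise TypeError(f"{name} must be {expected.__name__}, got bool")
        # Accept ints where a float is expected (e.g. temperature=1).
        accepted = (int, float) if expected is float else expected
        if not isinstance(value, accepted):
            raise TypeError(f"{name} must be {expected.__name__}")
    return json.dumps(fields)
```

The returned string can be used as the request body in place of the inline JSON in the `curl` examples, for instance `build_generate_payload("Summarize quantum computing in 50 words.", max_tokens=250, temperature=0.6)`.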