
Deploy Inference for Meta Llama 3 8B-IT

Deploy the Meta Llama 3 8B Instruct model easily using prebuilt containers or custom model weights. This guide walks you through setting up endpoints, downloading models, configuring environments, and running inference.


Overview

This tutorial covers:

  1. Creating a model endpoint using prebuilt containers
  2. Deploying with custom model weights
  3. Understanding supported parameters for inference

Requirements:

  • A GPU-enabled compute plan
  • Access to Meta Llama 3-8B
  • Hugging Face read token

Step 1: Create a Model Endpoint

  1. Log in to the AI Platform and open your project.
  2. Go to the Model Endpoints section.
  3. Click Create Endpoint and choose Llama 3 8B-IT from model cards.
  4. Under Download Source, select Hugging Face.
  5. Choose a GPU plan, set replicas, and name your endpoint (e.g., llama3-infer).

Tip: Choose Link with Model Repository if using custom weights.


Step 2: Set Environment Variables

Add the following environment variables:

Variable    Description
HF_TOKEN    Your Hugging Face read token

Note: Llama 3 models are gated. Ensure you’ve requested access and been approved on Hugging Face.
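
Before launching the endpoint, you can verify that your token actually has access to the gated repository. This is an optional sketch that assumes the huggingface_hub Python package is installed and that HF_TOKEN is set in your environment:

# Sanity check: confirm the HF_TOKEN can see the gated Llama 3 repository.
# Assumes `huggingface_hub` is installed and HF_TOKEN is exported in the environment.
import os
from huggingface_hub import model_info
from huggingface_hub.utils import GatedRepoError, HfHubHTTPError

token = os.environ["HF_TOKEN"]

try:
    info = model_info("meta-llama/Meta-Llama-3-8B-Instruct", token=token)
    print("Access OK, latest revision:", info.sha)
except GatedRepoError:
    print("Token is valid, but access to the gated repo has not been granted yet.")
except HfHubHTTPError as err:
    print("Request failed:", err)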


Step 3: Generate API Token

  1. Go to API Tokens section.
  2. Click Create Token (or use an existing one).
  3. Copy your Auth Token; this will be used in your inference request.

Step 4: Test Inference Endpoint

Once your endpoint is ready, test it using a sample curl request:

curl -X POST <endpoint-url>/v2/models/llama-3-8b-it/generate \
  -H "Authorization: Bearer <auth-token>" \
  -H "Content-Type: application/json" \
  -d '{
        "text_input": "Write a short Python function to reverse a string.",
        "max_tokens": 200,
        "temperature": 0.7
      }'
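
The same request can be issued from Python. This is a minimal sketch using the requests library; the endpoint URL is a placeholder taken from the curl example, and the AUTH_TOKEN environment variable is an assumption for keeping the token out of the script:

# Minimal Python equivalent of the curl request above (sketch, not platform-specific).
import os
import requests

ENDPOINT_URL = "<endpoint-url>"          # URL shown on the endpoint details page
AUTH_TOKEN = os.environ["AUTH_TOKEN"]    # API token created in Step 3

response = requests.post(
    f"{ENDPOINT_URL}/v2/models/llama-3-8b-it/generate",
    headers={
        "Authorization": f"Bearer {AUTH_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "text_input": "Write a short Python function to reverse a string.",
        "max_tokens": 200,
        "temperature": 0.7,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json())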

Creating Endpoint with Custom Model Weights

If you wish to deploy fine-tuned weights:

  1. Download meta-llama/Meta-Llama-3-8B-Instruct from Hugging Face.
  2. Upload the model to the Model Repository (EOS bucket).
  3. Create an endpoint and Link with Model Repository.

Define Model Repository

  1. Go to Model Repository → Create Model.
  2. Select Model Type: Custom.
  3. Copy the MinIO Setup Host command.
  4. Use MinIO CLI to connect and upload your model weights.

Tip: You can revisit the setup instructions at any time from the Model Details → Setup MinIO CLI tab.

Download and Upload Weights

Run the following commands inside your GPU instance. First authenticate with Hugging Face from the shell:

huggingface-cli login

Then download and smoke-test the model with a short Python script:

# Downloads the gated model into the local Hugging Face cache and runs a quick generation test.
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Loading the model pulls the weights into ~/.cache/huggingface/hub on first use.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quick sanity check that the weights load and generate text.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("def factorial(num: int):", max_length=100))

Upload the model to your repository (the destination should match the alias and bucket configured by the MinIO setup command):

cd $HOME/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/
mc cp -r * my-llama3-model-bucket/

Step 5: Create Endpoint from the Model Repository

  1. Go to Model Endpoints → Create Endpoint.
  2. Choose Llama 3 8B-IT model card.
  3. Select Link with Model Repository and choose your uploaded model.
  4. Set environment variables (HF_TOKEN).
  5. Launch endpoint and monitor logs until it’s ready.

Step 6: Run Inference on Custom Model

After the endpoint is active, make requests using your API token:

curl -X POST <endpoint-url>/v2/models/llama-3-8b-it/generate \
  -H "Authorization: Bearer <auth-token>" \
  -H "Content-Type: application/json" \
  -d '{
        "text_input": "Summarize quantum computing in 50 words.",
        "max_tokens": 250,
        "temperature": 0.6
      }'
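
Continuing from the Python example in Step 4, you can pull the generated text out of the JSON response. The field name depends on the serving backend; on Triton-style generate endpoints it is commonly text_output, so treat the key below as an assumption and check your endpoint's response:

# Extract the completion from the JSON response (field name is an assumption; verify for your platform).
result = response.json()
print(result.get("text_output", result))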

Supported Parameters

Field                 Description                         Type
text_input            Input prompt text                   string
max_tokens            Maximum tokens in output            int
temperature           Controls randomness                 float
top_k                 Top-k sampling                      int
top_p                 Nucleus sampling                    float
repetition_penalty    Penalize repetition                 float
stop_words            Stop generation tokens              list
return_log_probs      Return token log probabilities      bool
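
For reference, here is a request body exercising most of these fields. The values are illustrative only, and whether a given field is honored depends on the deployed backend:

# Example request body using the supported parameters (illustrative values).
payload = {
    "text_input": "List three use cases for large language models.",
    "max_tokens": 150,
    "temperature": 0.8,
    "top_k": 50,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
    "stop_words": ["\n\n"],
    "return_log_probs": False,
}
# Send it with the same requests.post call shown in Step 4.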

Notes & Tips

  • ✅ Ensure your Hugging Face token has access to the gated Llama 3 model
  • ✅ Use an A100 or H100 GPU for faster inference
  • ✅ Monitor the endpoint logs until it reports ready
  • ✅ Use a lower max_tokens value for quicker test responses

Your Meta Llama 3 8B-IT endpoint is now ready to serve production inference requests!


Troubleshooting

Common Issues

Issue: 403 Forbidden or Access Denied
Cause: Hugging Face token does not have access to the gated model.
Solution: Ensure you have requested and received access to the Llama 3 model; recreate the token if needed.

Issue: Endpoint stuck in Pending
Cause: GPU unavailable or model too large for the selected plan.
Solution: Use a higher GPU configuration such as A100 or H100, or reduce replicas.

Issue: Slow inference
Cause: High max_tokens or a small GPU plan.
Solution: Lower max_tokens or upgrade the GPU plan.

Issue: Empty or truncated responses
Cause: Token limits or temperature settings.
Solution: Increase max_tokens or adjust temperature to 0.7–1.0.

Useful Commands

  • View endpoint logs:

    tail -f /var/log/endpoint.log
  • Validate endpoint status:

    curl -H "Authorization: Bearer <auth-token>" <endpoint-url>/v2/health/ready
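
If you would rather wait for readiness from a script, a small polling loop against the same health endpoint looks like this. It is a sketch: replace the URL and token placeholders, and adjust the timeout to your plan's typical startup time:

# Poll the readiness endpoint until it returns HTTP 200 or the timeout is reached.
import time
import requests

ENDPOINT_URL = "<endpoint-url>"
AUTH_TOKEN = "<auth-token>"

def wait_until_ready(timeout_s: int = 600, interval_s: int = 15) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            resp = requests.get(
                f"{ENDPOINT_URL}/v2/health/ready",
                headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
                timeout=10,
            )
            if resp.status_code == 200:
                return True
        except requests.RequestException:
            pass  # endpoint may not be reachable yet
        time.sleep(interval_s)
    return False

print("Endpoint ready:", wait_until_ready())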

Best Practices

  • ✅ Always validate your environment variables before deployment.
  • ✅ Store your Hugging Face token securely; never hardcode it in scripts.
  • ✅ Keep batch sizes small for initial testing.
  • ✅ Use Link with Model Repository for faster loading and version control.
  • ✅ Clean up unused nodes/endpoints to save GPU credits.

With proper configuration, your Meta Llama 3 8B-IT inference setup will deliver optimized performance, stability, and scalability across production workloads.