Deploy Inference for Meta Llama 3 8B-IT
Deploy the Meta Llama 3 8B Instruct model easily using prebuilt containers or custom model weights. This guide walks you through setting up endpoints, downloading models, configuring environments, and running inference.
Overview
This tutorial covers:
- Creating a model endpoint using prebuilt containers
- Deploying with custom model weights
- Understanding supported parameters for inference
Requirements:
- A GPU-enabled compute plan
- Access to the gated Meta Llama 3 8B Instruct model on Hugging Face
- Hugging Face read token
Step 1: Create a Model Endpoint
- Log in to the AI Platform and open your project.
- Go to the Model Endpoints section.
- Click Create Endpoint and choose Llama 3 8B-IT from model cards.
- Under Download Source, select Hugging Face.
- Choose a GPU plan, set replicas, and name your endpoint (e.g., llama3-infer).
Tip: Choose Link with Model Repository if using custom weights.
Step 2: Set Environment Variables
Add the following environment variables:
| Variable | Description |
|---|---|
| HF_TOKEN | Your Hugging Face read token |
Note: Llama 3 models are gated. Ensure you’ve requested access and been approved on Hugging Face.
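Before creating the endpoint, you can optionally confirm that your token can actually reach the gated repository. The following is a minimal sketch, assuming the huggingface_hub Python package is installed; it reads the same value you set as HF_TOKEN:

```python
import os

from huggingface_hub import HfApi

# Read the same token you configured as the HF_TOKEN environment variable.
token = os.environ["HF_TOKEN"]

# model_info raises an error (e.g. gated repo / 403) if the token
# has not been granted access to the Llama 3 repository.
info = HfApi().model_info("meta-llama/Meta-Llama-3-8B-Instruct", token=token)
print("Token OK, model revision:", info.sha)
```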
Step 3: Generate API Token
- Go to API Tokens section.
- Click Create Token (or use an existing one).
- Copy your Auth Token; this will be used in your inference request.
Step 4: Test Inference Endpoint
Once your endpoint is ready, test it using a sample curl request:
```bash
curl -X POST <endpoint-url>/v2/models/llama-3-8b-it/generate \
  -H "Authorization: Bearer <auth-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "Write a short Python function to reverse a string.",
    "max_tokens": 200,
    "temperature": 0.7
  }'
```
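The same request can also be sent from Python. A minimal sketch using the requests library, with the endpoint URL and auth token left as placeholders to fill in:

```python
import requests

ENDPOINT_URL = "<endpoint-url>"   # replace with your endpoint URL
AUTH_TOKEN = "<auth-token>"       # replace with your API token

payload = {
    "text_input": "Write a short Python function to reverse a string.",
    "max_tokens": 200,
    "temperature": 0.7,
}

response = requests.post(
    f"{ENDPOINT_URL}/v2/models/llama-3-8b-it/generate",
    headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
    json=payload,
    timeout=120,
)
response.raise_for_status()
print(response.json())
```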
Creating Endpoint with Custom Model Weights
If you wish to deploy fine-tuned weights:
- Download meta-llama/Meta-Llama-3-8B-Instruct from Hugging Face.
- Upload the model to the Model Repository (EOS bucket).
- Create an endpoint and Link with Model Repository.
Define Model Repository
- Go to Model Repository → Create Model.
- Select Model Type: Custom.
- Copy the MinIO Setup Host command.
- Use MinIO CLI to connect and upload your model weights.
Tip: You can revisit the setup instructions anytime from Model Details → Setup MinIO CLI tab.
Download and Upload Weights
Run the following inside your GPU instance. First, authenticate with Hugging Face using your read token:

```bash
huggingface-cli login
```

Then download the model and run a quick generation test in Python:

```python
# Downloads the gated Llama 3 8B Instruct weights into the local HF cache
# and runs a short generation to confirm they load correctly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("def factorial(num: int):", max_length=100))
```
Upload the model to your repository:
```bash
cd $HOME/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/
mc cp -r * my-llama3-model-bucket/
```
Step 5: Create Endpoint with Repository Link
- Go to Model Endpoints → Create Endpoint.
- Choose Llama 3 8B-IT model card.
- Select Link with Model Repository and choose your uploaded model.
- Set environment variables (HF_TOKEN).
- Launch endpoint and monitor logs until it’s ready.
Step 6: Run Inference on Custom Model
After the endpoint is active, make requests using your API token:
```bash
curl -X POST <endpoint-url>/v2/models/llama-3-8b-it/generate \
  -H "Authorization: Bearer <auth-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "Summarize quantum computing in 50 words.",
    "max_tokens": 250,
    "temperature": 0.6
  }'
```
Supported Parameters
| Field | Description | Type |
|---|---|---|
| text_input | Input prompt text | string |
| max_tokens | Maximum number of tokens to generate | int |
| temperature | Controls randomness of sampling | float |
| top_k | Top-k sampling | int |
| top_p | Nucleus (top-p) sampling | float |
| repetition_penalty | Penalizes repeated tokens | float |
| stop_words | Strings that stop generation | list |
| return_log_probs | Return token log probabilities | bool |
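As an illustration, a request body exercising the optional sampling controls might look like the following; the values are arbitrary examples, and the payload is sent exactly like the requests.post() sketch shown earlier:

```python
# Illustrative payload using the optional parameters from the table above.
payload = {
    "text_input": "List three uses of graph databases.",
    "max_tokens": 150,
    "temperature": 0.8,
    "top_k": 40,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
    "stop_words": ["\n\n"],
    "return_log_probs": True,
}
```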
Notes & Tips
- ✅ Ensure Hugging Face token has access to the Llama 3 gated model
- ✅ Use an A100 or H100 GPU for faster inference
- ✅ Monitor logs for readiness
- ✅ Use lower max_tokens for quicker test responses
Your Meta Llama 3 8B-IT endpoint is now ready to serve production inference requests!
Troubleshooting
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| 403 Forbidden or Access Denied | Hugging Face token doesn’t have access | Ensure you have requested and received access to the Llama 3 model. Recreate the token if needed. |
| Endpoint stuck in Pending | GPU unavailable or model too large for selected plan | Use a higher GPU configuration such as A100 or H100, or reduce replicas. |
| Slow inference | High max_tokens or small GPU plan | Lower max_tokens or upgrade GPU plan. |
| Empty or truncated responses | Token limits or temperature settings | Increase max_tokens or adjust temperature to 0.7–1.0. |
Useful Commands
- View endpoint logs:
  tail -f /var/log/endpoint.log
- Validate endpoint status:
  curl -H "Authorization: Bearer <auth-token>" <endpoint-url>/v2/health/ready
Best Practices
- ✅ Always validate your environment variables before deployment.
- ✅ Store your Hugging Face token securely — never hardcode in scripts.
- ✅ Keep batch sizes small for initial testing.
- ✅ Use Link with Model Repository for faster loading and version control.
- ✅ Clean up unused nodes/endpoints to save GPU credits.
With proper configuration, your Meta Llama 3 8B-IT inference setup will deliver optimized performance, stability, and scalability across production workloads.