
Deploy Inference for Meta Llama 2 7B

Deploy Meta’s Llama 2 (7B) model seamlessly using prebuilt containers or your custom model weights. This guide covers defining the model, setting up your environment, downloading weights, uploading to storage, and creating an inference endpoint.


Overview

This tutorial includes:

  1. Defining a model in the dashboard
  2. Downloading Meta’s Llama 2-7B model from Hugging Face
  3. Uploading the model to Object Storage (EOS)
  4. Creating an inference endpoint for API access

Requirements:

  • GPU-enabled compute plan (recommended: A100 80GB)
  • Access to Meta Llama 2
  • Hugging Face token

Step 1: Define Model in Dashboard

  1. Log in to the AI Platform.
  2. Select your project.
  3. Go to the Models section and click Create Model.
  4. Provide a model name (e.g., meta-llama2-7b-chat).
  5. Choose Model Type as Custom or PyTorch.
  6. Click Create.
  7. The system will generate an Object Storage (EOS) bucket for your model.

Note: EOS offers an S3-compatible interface. You’ll use MinIO CLI to upload content.

  8. Copy the Setup Host command from the Setup MinIO CLI tab — you’ll need it later to configure your CLI tool.

Tip: If you forget to copy it, revisit the model details page anytime to retrieve it.


Step 2: Launch an Instance

  1. In the Dashboard, navigate to Instance(Nodes).
  2. Launch a new Instance(Node) using the Transformers or PyTorch image.
  3. Choose a GPU plan (A100 80GB recommended).
  4. Click Launch Instance(Node) to open JupyterLab.
  5. Open a new terminal within JupyterLab.
  6. Paste and run the copied MinIO setup command from Step 1.
  7. Once configured, your MinIO CLI (mc) will be ready for uploads.
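The setup command copied in Step 1 typically configures an mc alias for the EOS bucket. A sketch of what it looks like — the alias name, endpoint URL, and keys below are placeholders, so use the exact command from your dashboard:

```shell
# Hypothetical values -- run the exact Setup Host command copied from your dashboard.
mc alias set meta-llama2-7b https://eos.example.com <ACCESS_KEY> <SECRET_KEY>

# Verify the alias works by listing the model bucket:
mc ls meta-llama2-7b
```

If `mc ls` returns without an authentication error, the CLI is ready for the upload in Step 4.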

Step 3: Download Llama 2 7B Model from Hugging Face

  1. Create a new notebook (llama2-setup.ipynb).

  2. Add your Hugging Face token:

    export HUGGING_FACE_HUB_TOKEN=hf_XXXXXXXXXXXXXXXXXXXXXXXX
  3. Run the following Python code to download the model:

    from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
    import torch

    model_id = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        device_map="auto",
    )

    generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        device_map="auto",
    )

    prompt = "It is said that life is beautiful when"
    result = generator(prompt, do_sample=True, top_k=10, max_length=200)
    print(result[0]["generated_text"])

Note: Install dependencies if not preinstalled (accelerate is required for device_map="auto"):

pip install transformers torch accelerate

Tip: Llama 2 base models are completion-based. Use sentence prompts instead of direct questions for meaningful outputs.
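The chat variant used here (Llama-2-7b-chat-hf) was fine-tuned on the [INST] prompt template, so wrapping your input in that format can noticeably improve responses to direct questions. A minimal formatter — the helper name is illustrative, not part of the platform API:

```python
def format_llama2_chat(user_message: str, system_prompt: str = "") -> str:
    """Wrap a message in the [INST] template that Llama 2 chat models were tuned on."""
    if system_prompt:
        return (
            f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"
        )
    return f"[INST] {user_message} [/INST]"

# Example: build a prompt you can pass to the generator from Step 3.
prompt = format_llama2_chat(
    "Explain the theory of relativity in simple words.",
    system_prompt="You are a concise, helpful assistant.",
)
print(prompt)
```

Pass the formatted string as the `prompt` argument to the pipeline from Step 3 in place of the raw sentence.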


Step 4: Upload Model to EOS

  1. Locate your downloaded model path:

    cd $HOME/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/
  2. Copy the upload command from the Setup MinIO CLI tab in your dashboard.

  3. From inside the snapshots directory, replace <MODEL_NAME> in the copied command with * to upload all contents:

    mc cp -r * meta-llama2-7b/meta-llama2-7b-weights

Note: Directory names may vary slightly. Use ls $HOME/.cache/huggingface/hub/ to confirm the correct folder.
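Putting the steps above together, the upload can be scripted, assuming a single hash-named snapshot folder (the alias and target path mirror the command shown above):

```shell
# Resolve the snapshot directory (there is usually exactly one hash-named folder).
SNAP_DIR=$(ls -d "$HOME"/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/*/ | head -n 1)
echo "Uploading from: $SNAP_DIR"

# Recursively copy every file in the snapshot to the EOS bucket.
mc cp -r "$SNAP_DIR" meta-llama2-7b/meta-llama2-7b-weights/
```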


Step 5: Create the Endpoint

Once the model is uploaded, create an inference endpoint to serve API requests.

Using Prebuilt Container

  1. Go to the Model Endpoints section.
  2. Click Create Endpoint.
  3. Choose the Llama 2 7B model card.
  4. Select a GPU plan (e.g., A100 80GB, disk: 20GB+).
  5. Link the EOS model you just uploaded.
  6. Click Create and monitor logs until deployment completes.

Tip: Use prebuilt containers to skip API handler creation — they’re optimized for inference-ready execution.


Step 6: Test the Endpoint

Once the endpoint status is Ready, run a test request:

curl -X POST <endpoint-url>/v2/models/llama2-7b/generate \
-H "Authorization: Bearer <auth-token>" \
-H "Content-Type: application/json" \
-d '{
"text_input": "Explain the theory of relativity in simple words.",
"max_tokens": 250,
"temperature": 0.7
}'
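The same request can be issued from Python. The endpoint URL and token below are placeholders to fill in from your dashboard; the live call is left commented out so the snippet runs standalone:

```python
import json

ENDPOINT_URL = "https://<endpoint-url>"  # placeholder -- copy from your dashboard
AUTH_TOKEN = "<auth-token>"              # placeholder -- copy from your dashboard

def build_request(prompt, max_tokens=250, temperature=0.7):
    """Assemble the JSON body expected by the /generate route shown above."""
    return {
        "text_input": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_request("Explain the theory of relativity in simple words.")
print(json.dumps(payload))

# Once the endpoint is Ready, send the request (requires the requests package):
# import requests
# resp = requests.post(
#     f"{ENDPOINT_URL}/v2/models/llama2-7b/generate",
#     headers={"Authorization": f"Bearer {AUTH_TOKEN}",
#              "Content-Type": "application/json"},
#     json=payload,
#     timeout=60,
# )
# print(resp.json())
```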

Troubleshooting

Issue | Cause | Solution
403 Forbidden | Hugging Face token missing access | Ensure access is granted and re-login with the token.
Endpoint stuck in Pending | Insufficient GPU or large model | Use a larger GPU plan such as A100 or H100.
Slow responses | Large token size | Lower max_tokens or switch to a high-end GPU.
Upload fails | MinIO misconfiguration | Re-run host setup from Model Details → Setup MinIO CLI.

Best Practices

  • ✅ Validate HF_TOKEN before launching.
  • ✅ Use Link with Model Repository for version control.
  • ✅ Prefer GPU A100 or H100 for efficiency.
  • ✅ Keep batch size small for initial runs.
  • ✅ Clean up unused endpoints to optimize GPU usage.

With these steps, your Meta Llama 2 7B inference endpoint will be ready for real-time text generation and model serving!