Deploy Inference Endpoint for Meta Llama 2

In this tutorial, we will download Meta's Llama 2 (7B) chat model and create an inference endpoint for it.

Steps Overview:

  • Download the Llama-2-7b-chat model (by Meta) from Hugging Face.
  • Upload the model to the Model Bucket (EOS).
  • Create an inference endpoint (model endpoint) in TIR to serve API requests.

Step 1: Define a Model in TIR Dashboard

Before we proceed with downloading or fine-tuning (optional) the model weights, let's first define a model in the TIR dashboard.

  1. Go to the TIR AI Platform.

  2. Choose a project.

  3. Navigate to the Model section.

  4. Click on Create Model.

  5. Enter a model name of your choosing (e.g., Meta-Llama2-7b-chat).

  6. Select Model Type as Custom or PyTorch.

  7. Click on CREATE.

  8. You will now see the details of the EOS (E2E Object Storage) bucket created for this model.

    EOS provides an S3-compatible API to upload or download content. We will be using the MinIO CLI in this tutorial.

  9. Copy the Setup Host command from the Setup MinIO CLI tab to a notepad or leave it in the clipboard. We will soon use it to set up the MinIO CLI.


With the model defined in the TIR dashboard and the Setup Host command at hand, we can now prepare the environment to download the Llama 2 weights and upload them to the model's EOS bucket.

Note

In case you forget to copy the Setup Host command for the MinIO CLI, don't worry. You can always go back to the model details and copy it again.

Step 2: Start a New Notebook

To work with the model weights, we will first need to download them to a local machine or a notebook instance.

  1. In the TIR Dashboard, go to Notebooks.
  2. Launch a new notebook with a Transformers (or PyTorch) image and a suitable hardware plan (e.g., A100 80GB). We recommend a GPU plan if you intend to test or fine-tune the model.
  3. Click on the notebook name or the Launch Notebook option to start the JupyterLab environment.
  4. In JupyterLab, click New Launcher and select Terminal.
  5. Paste and run the Setup Host command for the MinIO CLI that you copied in Step 1.
  6. If the command succeeds, the mc CLI is ready for uploading the model; a quick sanity check is sketched below.
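
The exact Setup Host command comes from the Setup MinIO CLI tab of your model; the alias, endpoint, and keys below are placeholders, so treat this only as a sketch of what the configuration and a quick verification look like:

# Placeholder values -- run the exact command copied from the Setup MinIO CLI tab
mc config host add llma-7b <EOS_ENDPOINT_URL> <ACCESS_KEY> <SECRET_KEY>

# List the bucket created for the model to confirm the CLI is configured
mc ls llma-7b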

Step 3: Authenticate with Hugging Face from the Notebook

Now that our EOS bucket is ready to store the model weights, let's set up access to Hugging Face so we can download them. The Llama 2 weights are gated on Hugging Face, so you will need an access token from an account that has been granted access to the model.

  1. Start a new notebook named untitled.ipynb in JupyterLab.

  2. Set your Hugging Face access token as an environment variable by running the following command in a notebook cell. You can find the token under Access Tokens in your Hugging Face Account Settings. If you prefer not to handle the token this way, you can instead log in interactively with notebook_login(), as shown below.

    %env HUGGING_FACE_HUB_TOKEN=hf_xxxx.......
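
    The interactive alternative uses notebook_login() from the huggingface_hub library (installed alongside transformers); it prompts for the token and stores it for subsequent downloads:

    from huggingface_hub import notebook_login

    # Opens a prompt in the notebook; paste your hf_... access token there
    notebook_login()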

Step 4: Download the Llama-2-7b-chat (by Meta) Model from the Notebook

With the token in place, let's download the model from Hugging Face.

Run the following code in a notebook cell to download the model. The Hugging Face SDK downloads the weights into the $HOME/.cache/huggingface/hub folder.

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

# Hugging Face repository ID of the Llama-2-7b-chat model
model_id = "meta-llama/Llama-2-7b-chat-hf"

# Download and load the model weights (cached under $HOME/.cache/huggingface/hub)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
)

# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Create a text-generation pipeline around the loaded model and tokenizer
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
Note

If you face any issues running the above code in the notebook cell, you may be missing required libraries. This can happen if you did not launch the notebook with the Transformers image. In that case, install the required libraries as shown below:

Install Required Libraries and Run Simple Inference

Before running inference, make sure the required libraries are installed. You can install the transformers, torch, and accelerate libraries (accelerate is required for device_map="auto") by running the following command:

!pip install transformers torch accelerate
# Run a text-generation inference with the model
sequences = pipeline(
    'It is said that life is beautiful when',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
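
The pipeline returns a list of dictionaries, one per generated sequence. A minimal sketch to inspect the output (using the sequences variable from the cell above):

# Print the generated text of each returned sequence
for seq in sequences:
    print(seq["generated_text"])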
Note

If you work with the base model (Llama-2-7b-hf) instead of the chat variant, keep in mind that it is not trained on instructions or chat, so it is not capable of answering questions directly. It is, however, trained for sentence completion, so instead of asking "What is life?", an appropriate prompt would be "It is said that life is ...".

Step 5: Upload the Model to Model Bucket (EOS)

Now that the model works as expected, you can fine-tune it with your own data or serve it as-is. This tutorial assumes you are uploading the model as-is to create an inference endpoint. If you fine-tune the model, you can follow similar steps to upload the fine-tuned weights to the EOS bucket.

Go to the directory that contains the downloaded Hugging Face model snapshot:

cd $HOME/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots

Step 6: Push the Model to EOS Bucket

Now that you have navigated to the folder containing the model, you can upload its contents to the EOS bucket. Follow the steps below:

  1. Go to TIR Dashboard >> Models >> Select your model >> Copy the cp command from the Setup MinIO CLI tab.

  2. The copy command would look like this:

    mc cp -r <MODEL_NAME> llma-7b/llma-7b-323f3
  3. Replace <MODEL_NAME> with * to upload all contents of the snapshots folder:

    mc cp -r * llma-7b/llma-7b-323f3

This command will push all the contents from the local folder to the EOS bucket for your model.
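
To confirm the upload, you can list the bucket contents with mc (the alias and bucket name below follow the example command above; yours may differ):

mc ls llma-7b/llma-7b-323f3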

Note

The model directory name may be slightly different (we assume it is models--meta-llama--Llama-2-7b-chat-hf, following the Hugging Face cache layout). If the cd command above does not work, list the directories in $HOME/.cache/huggingface/hub to identify the model directory, as sketched below.
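
A quick way to locate the snapshot folder (the directory names are illustrative and depend on the model and revision you downloaded):

ls $HOME/.cache/huggingface/hub
ls $HOME/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots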

Step 7: Create an Endpoint for Our Model

When a model endpoint is created in the TIR dashboard, a model server is launched in the background to serve the inference requests.

The TIR platform supports a variety of model formats through pre-built containers (e.g., PyTorch, Triton, Llama 2). For this tutorial, we will use the pre-built Llama-2-7B container for the model endpoint, but you may choose to create your own custom container by following this tutorial.

Why Use Pre-built Containers?

Using pre-built containers makes things easier as you won’t have to worry about building an API handler. When you use pre-built containers, all you need to do is load your model weights (fine-tuned or not) to the TIR Model's EOS bucket, and the endpoint will be automatically created with an API handler for you.

Steps to Create the Endpoint:

  1. Go to TIR Dashboard
  2. Go to Model Endpoints
  3. Create a New Endpoint
  4. Choose the Llama-2-7B Option
  5. Pick a Suitable GPU Plan: We recommend A100 (80GB) with a disk size of 20GB
  6. Select the Appropriate Model: This should be the EOS bucket that contains your model weights
  7. Complete the Endpoint Creation
  8. Once the endpoint is ready, visit the Sample API Request section to test your endpoint using curl; an illustrative request is sketched after this list.

Now you can start making inference requests through the API.
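
The exact endpoint URL, authentication header, and request payload schema come from the Sample API Request section of your endpoint, so treat the following only as an illustrative sketch with placeholder values:

# Placeholder values -- copy the real URL, token, and payload format from the
# Sample API Request section in the TIR dashboard
curl -X POST "<ENDPOINT_URL>" \
  -H "Authorization: Bearer <API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "It is said that life is beautiful when", "max_new_tokens": 200}'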