Deploy Inference Endpoint for Meta Llama 2

In this tutorial, we will download Meta’s Llama 2 (7B) model and create an inference endpoint against it.

  • Download the Llama-2-7b-chat model (by Meta) from Hugging Face

  • Upload the model to Model Bucket (EOS)

  • Create an inference endpoint (model endpoint) in TIR to serve API requests

Step 1: Define a model in TIR Dashboard

Before we proceed with downloading or fine-tuning (optional) the model weights, let us first define a model in the TIR dashboard.

  • Go to TIR Dashboard

  • Choose a project

  • Go to Model section

  • Click on Create Model

  • Enter a model name of your choosing (e.g. meta-llama2-7b-chat)

  • Select Model Type as Custom or PyTorch

  • Click on CREATE

  • You will now see details of the EOS (E2E Object Storage) bucket created for this model.

  • EOS provides an S3-compatible API to upload or download content. We will be using the Minio CLI in this tutorial.

  • Copy the Setup Host command from the Setup Minio CLI tab to a notepad or leave it in the clipboard. We will soon use it to set up the Minio CLI.

Note: In case you forget to copy the setup host command for Minio CLI, don’t worry. You can always go back to model details and get it again.
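The Setup Host command registers the bucket and your access credentials with the Minio client. It looks roughly like the sketch below; the alias name, endpoint URL, and keys are placeholders, so always run the exact command copied from the dashboard:

```shell
# Placeholder values - use the exact command from the Setup Minio CLI tab.
mc config host add <ALIAS> https://<EOS_ENDPOINT> <ACCESS_KEY> <SECRET_KEY>

# Verify the alias by listing the bucket (it will be empty for a new model).
mc ls <ALIAS>
```

Newer versions of the Minio client use `mc alias set` instead of `mc config host add`; both register the same alias.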

Step 2: Start a new Notebook

To work with the model weights, we first need to download them to a local machine or a notebook instance.

  • In the TIR Dashboard, go to Notebooks

  • Launch a new Notebook with the Transformers (or PyTorch) image and a hardware plan (e.g. A10080). We recommend a GPU plan if you plan to test or fine-tune the model.

  • Click on the Notebook name or the Launch Notebook option to start the JupyterLab environment

  • In JupyterLab, click New Launcher and select Terminal

  • Now, paste and run the command for setting up the Minio CLI host from Step 1

  • If the command works, you will have the mc CLI ready for uploading our model

Step 3: Download the Llama-2-7b-chat model (by Meta) from the notebook

Now that our EOS bucket is ready to store the model weights, let us download the weights from Hugging Face.

  • Start a new notebook untitled.ipynb in JupyterLab

  • Add your Hugging Face API token and run the following command from a notebook cell. You will find the API token in your account settings. If you prefer not to expose the API token this way, you may alternatively run notebook_login() (from huggingface_hub) in a notebook cell.

    %env HUGGING_FACE_HUB_TOKEN=hf_xxxx.......
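Alternatively, you can set the variable from Python, which guarantees it persists for the whole notebook process. A minimal sketch (the token value is a placeholder; use your own):

```python
import os

# Placeholder token - replace with your own from Hugging Face account settings.
os.environ["HUGGING_FACE_HUB_TOKEN"] = "hf_xxxx"

# The transformers / huggingface_hub libraries read this variable when
# downloading gated models such as Llama 2.
print(os.environ["HUGGING_FACE_HUB_TOKEN"])
```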
  • Run the following commands to download the model. The model will be downloaded by the Hugging Face SDK into the $HOME/.cache folder

    from transformers import AutoTokenizer, AutoModelForCausalLM
    import transformers
    import torch

    model_id = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )


If you face any issues running the above code in the notebook cell, you may be missing required libraries. This can happen if you did not launch the notebook with the Transformers image. In such a situation, you can install the required libraries as below:

!pip install transformers torch
  • Let us run a simple inference to test the model.

    pipeline('It is said that life is beautiful when',
             max_length=64,
             do_sample=True,
             num_return_sequences=1)


Note: Llama-2-7b-hf is a base model; unlike the chat variant, it is not trained on instructions or chat, so it is not capable of answering questions. However, the model is trained for sentence completion. So instead of asking - What is life? - an appropriate input would be - It is said that life is.

Step 4: Upload the model to Model Bucket (EOS)

Now that the model works as expected, you can fine-tune it with your own data or choose to serve the model as-is. This tutorial assumes you are uploading the model as-is to create inference endpoint. In case you fine-tune the model, you can follow similar steps to upload the model to EOS bucket.

# go to the directory that has the huggingface model code.
cd $HOME/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots
# push the contents of the folder to EOS bucket.
# Go to TIR Dashboard >> Models >> Select your model >> Copy the cp command from Setup Minio CLI tab.

# The copy command would look like this:
# mc cp -r <MODEL_NAME> llma-7b/llma-7b-323f3

# here we replace <MODEL_NAME> with '*' to upload all contents of snapshots folder

mc cp -r * llma-7b/llma-7b-323f3


The model directory name may be a little different (we assume it is models--meta-llama--Llama-2-7b-chat-hf). In case this command does not work, list the directories in $HOME/.cache/huggingface/hub to identify the model directory.
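If you are unsure of the exact directory name, a short helper like the following prints the candidates. This is a sketch; the default Hugging Face cache path is assumed, and hub cache directories follow the models--&lt;org&gt;--&lt;name&gt; naming pattern:

```python
import os

def list_model_dirs(cache_dir="~/.cache/huggingface/hub"):
    """Return the model directory names under the Hugging Face hub cache."""
    path = os.path.expanduser(cache_dir)
    if not os.path.isdir(path):
        return []
    return sorted(
        name for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name)) and name.startswith("models--")
    )

print(list_model_dirs())
```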

Step 5: Create an endpoint for our model

When a model endpoint is created in the TIR dashboard, a model server is launched in the background to serve the inference requests.

The TIR platform supports a variety of model formats through pre-built containers (e.g. PyTorch, Triton, meta/llama2).

For the scope of this tutorial, we will use the pre-built container (Llama-2-7B) for the model endpoint, but you may choose to create your own custom container by following this tutorial.

In most cases, the pre-built container will work for your use case. The advantage is that you won’t have to worry about building an API handler.

When you use pre-built containers, all you need to do is load your model weights (fine-tuned or not) into the TIR Model’s EOS bucket and launch the endpoint. The API handler will be created automatically for you.

Steps to create endpoint:

  • Go to TIR Dashboard

  • Go to Model Endpoints

  • Create a new Endpoint

  • Choose the Llama-2-7B option

  • Pick a suitable GPU plan. We recommend A10080 and a disk size of 20GB

  • Select appropriate model (should be the EOS bucket that has your model weights)

  • Complete the endpoint creation

  • When your endpoint is ready, visit the Sample API request section to test your endpoint using curl.
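Besides curl, you can call the endpoint from Python. The URL, token, and JSON body schema below are placeholder assumptions; copy the real values and schema from the Sample API request section of your endpoint:

```python
import json

# Placeholder values - take the real URL and token from the TIR dashboard.
ENDPOINT_URL = "https://<your-endpoint-host>/predict"
API_TOKEN = "<your-api-token>"

def build_request(prompt, max_new_tokens=64):
    """Build the headers and JSON body for an inference call.

    The body schema here is an assumption; match it to the Sample API
    request shown for your endpoint.
    """
    headers = {
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"prompt": prompt, "max_new_tokens": max_new_tokens})
    return headers, body

headers, body = build_request("It is said that life is beautiful when")
print(body)
```

Pass the headers and body to any HTTP client (e.g. requests.post) to send the inference request.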