Deploy Inference for Meta LLAMA 3 8B-IT

In this tutorial, we will create a model endpoint for the LLAMA-3-8B-IT model.

The tutorial will mainly focus on the following:

Model Endpoint creation for LLAMA-3-8B-IT using prebuilt container

When a model endpoint is created in the TIR dashboard, a model server is launched in the background to serve the inference requests.

The TIR platform supports a variety of model formats through pre-built containers (e.g., PyTorch, Triton, LLAMA-2-7B-chat, GEMMA-7B-IT).

For the scope of this tutorial, we will use the pre-built container LLAMA-3-8B-IT for the model endpoint. However, you can also create your own custom container by following this tutorial.

In most cases, the pre-built container will work for your use case. The advantage is that you won't have to worry about building an API handler—the API handler will be automatically created for you.

Step 1: Create a Model Endpoint

  • Go to TIR AI Platform

  • Choose a project

  • Go to Model Endpoints section

    Model Endpoints

  • Create a new Endpoint

    Create Endpoint

  • Choose LLAMA-3-8B-IT model card

    LLAMA-3-8B-IT Model Card

  • Choose Download from Hugging Face

    Note: Use Link with Model Repository if you want to use custom model weights or fine-tune the model. Refer to the Creating Model Endpoint with Custom Model Weights section below for more details.

    Download from Hugging Face

  • Pick a suitable GPU plan of your choice and set the replicas.

  • Add Endpoint Name (e.g., llama-v3-endpoint).

Setting Environment Variables

Environment Variables

Compulsory Environment Variables

Note: LLAMA-3-8B-IT is a gated model; you will need permission to access it.

Follow these steps to get access to the model:

  • Visit the Meta-Llama-3-8B-Instruct model page on Hugging Face.

  • Log in, complete the form, and submit it to request access.

    Request Access

  • Go to Account Settings > Access Tokens > Create New API Token once approved.

    Access Tokens

  • Copy the API token.

  • HF_TOKEN: Paste the API token key as its value.

  • Complete the endpoint creation.

  • Model endpoint creation might take a few minutes. You can monitor progress through the Logs section.

    Logs


Step 2: Generate Your API Token

The model endpoint API requires a valid auth token, so let's generate one.

  • Go to the API Tokens section under the project and click on Create Token. You can also use an existing token if one has already been created.

    Create Token

  • Once created, you'll be able to see the list of API Tokens containing the API Key and Auth Token. You will need this Auth Token in the next step.

    API Token


Step 3: Inference Request

  • When your endpoint is ready, visit the Sample API request section to test your endpoint using curl (replace $AUTH_TOKEN with the token generated in the previous step). A scripted example follows this list.

    Sample API Request

  • You can review the Supported Parameters section for the inference request parameters that can be adjusted.
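
If you prefer to script the request rather than use curl, the sketch below shows the general shape of a call from Python. The endpoint URL, header, and payload fields here are illustrative assumptions; copy the exact URL and request body from the Sample API Request section of your endpoint.

import requests

# Placeholder values -- copy the real endpoint URL and auth token from the TIR dashboard.
ENDPOINT_URL = "https://<copy-from-sample-api-request>"
AUTH_TOKEN = "<your-auth-token>"

# Assumed field names; see the Supported Parameters section below.
payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 128,
    "temperature": 0.7,
}

response = requests.post(
    ENDPOINT_URL,
    json=payload,
    headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
)
print(response.json())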

Creating Model Endpoint with Custom Model Weights

To create an inference request against the LLAMA-3-8B-IT model with custom model weights, follow these steps:

  1. Download the model:

    • Download the model meta-llama/Meta-Llama-3-8B-Instruct from Hugging Face.
  2. Upload the model to the Model Repository (EOS bucket).

  3. Create an Inference Endpoint (Model Endpoint) in TIR to serve API requests.

Step 1: Create a Model Repository

Before proceeding with downloading or fine-tuning (optional) the model weights, define the model in the TIR dashboard.

  1. Go to TIR AI Platform.

  2. Choose a project.

  3. Navigate to Model Repository.

    Model Repository

  4. Click on Create Model.

    Create Model

  5. Enter a model name of your choosing (e.g., tutorial-llama-3-8b-it).

  6. Select Model Type as Custom.

  7. Click on CREATE.

    Model Repo Details

  • After creation, you will see details of the EOS (E2E Object Storage) bucket for the model.
  • EOS provides an S3-compatible API to upload or download content. We will be using the MinIO CLI for this tutorial.
  • Copy the Setup Host command from the Setup MinIO CLI tab to a notepad or leave it in the clipboard. You will use this command to set up the MinIO CLI later.
Note

In case you forget to copy the setup host command for the MinIO CLI, don't worry. You can always go back to the model details and get it again.

Deploy Inference for LLAMA 3 8B-IT - Download Model Weights

Step 2: Start a New Notebook

To work with the model weights, we will need to first download them to a local machine or a notebook instance.

  1. In the TIR Dashboard, go to Nodes.

    Nodes

  2. Launch a new Node with a Transformers (or PyTorch) image and a suitable hardware plan (e.g., A100-80GB). Use a GPU plan if you intend to test or fine-tune the model.

    Launch Node

  3. Click on the Lab URL to start the Jupyter Labs environment.

    Jupyter Labs

  4. In the Jupyter Labs environment, click Terminal.

    Jupyter Terminal

  5. Paste and run the command to set up the MinIO CLI Host from Step 1.

    MinIO CLI Setup

  6. The mc CLI is now ready for uploading the model.

    mc CLI Setup

Step 3: Log in to the Hugging Face CLI to Access the LLAMA-3-8B-IT (Gated) Model

  1. Copy the Hugging Face access token (the HF_TOKEN value) you generated earlier.

  2. Inside the Terminal, run the following command to log in:

    huggingface-cli login
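
The command prompts you to paste the Hugging Face access token. Alternatively, you can log in from a notebook cell using the Hugging Face Hub SDK (a minimal sketch; replace the placeholder with your own token):

from huggingface_hub import login

# Placeholder token value -- use the access token copied above.
login(token="hf_xxxxxxxxxxxxxxxx")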

Step 4: Download the LLAMA-3-8B-IT Model from Hugging Face

The EOS bucket will eventually store the model weights; first, let's download them from Hugging Face.

  1. Start a new notebook, tutorial-llama-3-8b-it.ipynb, in Jupyter Labs.

    Start Jupyter Notebook

  2. Run the following commands to download the model. The Hugging Face SDK downloads the model to the $HOME/.cache folder.

    from transformers import AutoTokenizer
    import transformers
    import torch

    model = "meta-llama/Meta-Llama-3-8B-Instruct"

    tokenizer = AutoTokenizer.from_pretrained(model)

    pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        torch_dtype=torch.float16,
        device_map="auto",
        tokenizer=tokenizer,
    )
Note

If you face any issues running the above code in the notebook cell, you may be missing the required libraries. This may happen if you did not launch the notebook with the Transformers image.
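
If the imports fail, installing the missing libraries from a notebook cell usually resolves it. The package list below is an assumption based on the code in this tutorial (accelerate is required for device_map="auto"); exact versions depend on your image.

# Run in a notebook cell; the leading "!" executes the command in a shell.
!pip install transformers torch accelerate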

Run a Simple Inference to Test the Model

Let us run a simple inference to test the model.

prompt = "def factorial(num: int):"

sequences = pipeline(prompt,
do_sample=True,
top_k=10,
temperature=0.1,
top_p=0.95,
num_return_sequences=1,
eos_token_id=tokenizer.eos_token_id,
max_length=200
)
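
The pipeline returns a list of dictionaries, one per generated sequence. To inspect the output, print the generated text of each entry (a small sketch based on the standard transformers text-generation pipeline output format):

# Each element holds the prompt followed by the generated continuation.
for seq in sequences:
    print(seq["generated_text"])
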
Note

All the supported parameters are listed in the Supported Parameters section below.

Step 5: Upload the Model to Model Bucket (EOS)

Now that the model works as expected, you can fine-tune it with your data or choose to serve the model as-is. This tutorial assumes you are uploading the model as-is to create an inference endpoint. If you fine-tune the model, you can follow similar steps to upload it to the model repository's EOS bucket.

# Go to the model cache directory
cd $HOME/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots

# List all snapshots
ls

# Change directory to the snapshot
cd <snapshot-dir-name>

Upload Model to EOS Bucket

To push the contents of the folder to the EOS bucket, follow these steps:

  1. Go to TIR Dashboard >> Models >> Select your model.
  2. Copy the cp command from the Setup MinIO CLI tab.

The copy command would look like this:

# mc cp -r $FOLDER_NAME tutorial-llama-3-8b-it/tutorial-llama-3-8b-it-cb855d
Note

The model directory name may be slightly different (we assume it is models--meta-llama--Meta-Llama-3-8B-Instruct). If this command does not work, list the directories under $HOME/.cache/huggingface/hub to identify the model directory.

Step 6: Create an Endpoint for Our Model

With model weights uploaded to the TIR Model Repository, the next step is to launch the endpoint and serve API requests.

  1. Go to Model Endpoints section.

    Model Endpoints

  2. Create a new Endpoint.

    Create New Endpoint

  3. Choose LLAMA-3-8B-IT model card.

    LLAMA-3-8B-IT Model Card

  4. Choose Link with Model Repository and select the model repository created in Step 1.

    Link with Model Repository

  5. Leave the Model Path empty and set the Disk size.

  6. Pick a suitable GPU plan of your choice and set the replicas.

  7. Add an Endpoint Name (e.g., llama-3-custom).

  8. Set Environment Variables.

    Environment Variables

    Compulsory Environment Variables:

    Note: LLAMA-3-8B-IT is a gated model; you will need permission to access it.

    Follow these steps to get access to the model:

    • Visit the Meta-Llama-3-8B-Instruct model page on Hugging Face, log in, complete the form, and submit a request for access.

      Request Access

    • Go to Account Settings > Access Tokens and create a new API token once approved.

    Create New API Token

    • Copy the API token.

    • HF_TOKEN: Paste the API token key.

  9. Complete the endpoint creation.

    Endpoint creation might take a few minutes. You can monitor progress through the Logs section.

    Logs


Step 7: Generate Your API Token

The model endpoint API requires a valid auth token, which you'll need to generate. Follow the steps below:

  1. Go to the API Tokens section under the project and click Create Token. You can also use an existing token if one has already been created.

    Create API Token

  2. Once created, you'll be able to see the list of API Tokens containing the API Key and Auth Token. You will need this Auth Token in the next step.

    API Tokens


Step 8: Inference Request

  1. When your endpoint is ready, visit the Sample API request section to test your endpoint using curl (replace $AUTH_TOKEN with the token generated in the previous step).

    Sample Request

  2. You can review the Supported Parameters section below for the inference request parameters that can be adjusted.

Supported Parameters

Field | Description | Shape | Data Type
text_input | Input text to be used as a prompt for text generation. | [-1] | TYPE_STRING
max_tokens | The maximum number of tokens to generate in the output text. | [-1] | TYPE_INT32
bad_words | A list of words or phrases that should not appear in the generated text. | [-1] | TYPE_STRING
stop_words | A list of words that are considered stop words and are excluded from the generation. | [-1] | TYPE_STRING
end_id | The token ID marking the end of a sequence. | [1] | TYPE_INT32
pad_id | The token ID used for padding sequences. | [1] | TYPE_INT32
top_k | The number of highest-probability vocabulary tokens to consider for generation. | [1] | TYPE_INT32
top_p | Nucleus sampling parameter, limiting the cumulative probability of tokens. | [1] | TYPE_FP32
temperature | Controls the randomness of token selection during generation. | [1] | TYPE_FP32
length_penalty | Penalty applied to the length of the generated text. | [1] | TYPE_FP32
repetition_penalty | Penalty applied to repeated sequences in the generated text. | [1] | TYPE_FP32
min_length | The minimum number of tokens in the generated text. | [1] | TYPE_INT32
presence_penalty | Penalty applied based on the presence of specific tokens in the generated text. | [1] | TYPE_FP32
frequency_penalty | Penalty applied based on the frequency of tokens in the generated text. | [1] | TYPE_FP32
random_seed | Seed for controlling the randomness of generation. | [1] | TYPE_UINT64
return_log_probs | Whether to return log probabilities for each token. | [1] | TYPE_BOOL
return_context_logits | Whether to return logits for each token in the context. | [1] | TYPE_BOOL
return_generation_logits | Whether to return logits for each token in the generated text. | [1] | TYPE_BOOL
prompt_embedding_table | Table of embeddings for words in the prompt. | [-1, -1] | TYPE_FP16
prompt_vocab_size | Size of the vocabulary for prompt embeddings. | [1] | TYPE_INT32
embedding_bias_words | Words to bias the word embeddings. | [-1] | TYPE_STRING
embedding_bias_weights | Weights for the biasing of word embeddings. | [-1] | TYPE_FP32
cum_log_probs | Cumulative log probabilities of generated tokens. | [-1] | TYPE_FP32
output_log_probs | Log probabilities of each token in the generated text. | [-1, -1] | TYPE_FP32
context_logits | Logits for each token in the context. | [-1, -1] | TYPE_FP32
generation_logits | Logits for each token in the generated text. | [-1, -1, -1] | TYPE_FP32
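
As an illustration, a request body combining a few of these parameters might look like the sketch below. This is a hedged example; the exact request format and field names accepted by your endpoint are shown in the Sample API Request section of the TIR dashboard.

# Hypothetical request body using a subset of the supported parameters.
payload = {
    "text_input": "Write a haiku about the ocean.",
    "max_tokens": 64,
    "temperature": 0.7,
    "top_k": 10,
    "top_p": 0.95,
    "repetition_penalty": 1.1,
    "stop_words": ["</s>"],
}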