Deploy Model Endpoint for Codellama-7b

In this tutorial, we will create a model endpoint for the Codellama-7b model.

The tutorial will mainly focus on the following:

Model Endpoint creation for Codellama-7b using prebuilt container

When a model endpoint is created in the TIR dashboard, a model server is launched in the background to serve inference requests. The TIR platform supports various model formats through pre-built containers (e.g., PyTorch, Triton, LLaMA, MPT).

For this tutorial, we will use the pre-built container for Codellama-7b. You can also create a custom container by following this tutorial.

In most cases, the pre-built container will work for your use case, and you won’t need to build an API handler manually. The API handler will be automatically created for you.

Steps to create an inference endpoint for the Codellama-7b model:

Step 1: Create a Model Endpoint

  • Go to TIR AI Platform
  • Choose a project
  • Go to Model Endpoints section
  • Create a new Endpoint
  • Choose Codellama-7b model card
  • Pick a suitable CPU/GPU plan of your choice and set the replicas and disk size (CPU plans may take more time)
  • If you wish to load your custom model weights (fine-tuned or not), select the appropriate model, i.e., the EOS bucket containing your weights. (See Creating Model Endpoint with custom model weights section below)
  • If not, you can skip the model details and proceed further
  • Complete the endpoint creation
  • Model creation might take a few minutes, and you can always see logs in the log section.

Codellama-7b Logs

Step 2: Generate your API_TOKEN

The model endpoint API requires a valid auth token for further steps. Let’s generate one:

  • Go to API Tokens section under the project.
  • Create a new API Token by clicking the Create Token button in the top-right corner. You can also use an existing token if you have already created one.
  • Once created, you'll see a list of API Tokens containing the API Key and Auth Token. You will need this Auth Token in the next step.

AuthToken

Step 3: Inferring Request

  • When your endpoint is ready, visit the Sample API request section to test your endpoint using curl.

Codellama-7b Requests
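
The curl request from the Sample API request section can also be prepared in Python. The sketch below only builds the request object and does not send it. The endpoint URL, the Bearer authorization scheme, and the `prompt` field in the JSON body are assumptions made for illustration, so copy the real URL, header, and body shape from the Sample API request section of your own endpoint.

```python
import json
import os
import urllib.request

# Hypothetical placeholder: copy the real URL from the Sample API request section.
ENDPOINT_URL = "https://example.invalid/your-endpoint-path"


def build_inference_request(url: str, auth_token: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request that carries the auth token."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")  # assumed body shape
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed scheme: use whatever header the sample curl command shows.
            "Authorization": f"Bearer {auth_token}",
        },
    )


if __name__ == "__main__":
    # Read the Auth Token from the environment instead of hard-coding it.
    token = os.environ.get("API_TOKEN", "<your-auth-token>")
    request = build_inference_request(ENDPOINT_URL, token, "def factorial(num: int):")
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request, timeout=60)
```

Reading the token from an environment variable keeps it out of notebooks and shell history.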

Creating Model Endpoint with Custom Model Weights

To create an inference endpoint for the Codellama-7b model with custom model weights:

  • Download the Codellama-7b model from Hugging Face
  • Upload the model to the Model Bucket (EOS)
  • Create an inference endpoint (model endpoint) in TIR to serve API requests

Step 1: Define a Model in TIR Dashboard

Before downloading or fine-tuning (optional) the model weights, we need to define a model in the TIR dashboard.

  • Go to TIR AI Platform
  • Choose a project
  • Go to the Model section
  • Click on Create Model
  • Enter a model name of your choice (e.g., tir-model-34)
  • Select Model Type as Custom
  • Click on CREATE
  • You will now see the details of the EOS (E2E Object Storage) bucket created for this model.
  • EOS provides an S3-compatible API to upload or download content. We will use MinIO CLI in this tutorial.
  • Copy the Setup Host command from the Setup Minio CLI tab to a notepad or clipboard for use in the next step.
Note

In case you forget to copy the setup host command for MinIO CLI, don't worry. You can always go back to model details and get it again.

Step 2: Start a new Notebook

To work with the model weights, we first need to download them to a local machine or a notebook instance.

  • In the TIR Dashboard, go to Notebooks
  • Launch a new Notebook with the Transformers Image and a hardware plan (e.g., A10080). We recommend a GPU plan if you plan to test or fine-tune the model.
  • Click on the Notebook name or Launch Notebook option to start the Jupyter Lab environment
  • In Jupyter Lab, click New Launcher and select Terminal
  • Now, paste and run the command for setting up MinIO CLI Host from Step 1
  • If the command works, the MinIO CLI (mc) will be ready for uploading your model

Step 3: Download the Codellama-7b Model from Notebook

Now that we have an EOS bucket to store the model weights, let us download the weights from Hugging Face.

  • Start a new notebook (untitled.ipynb) in Jupyter Lab
  • Run the following commands to download the model. The model will be downloaded by the Hugging Face SDK into the $HOME/.cache folder:
    
from transformers import AutoTokenizer
import transformers
import torch

# Hugging Face repository ID of the base model.
model = "codellama/CodeLlama-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model)

# Creating the pipeline downloads the weights on first use and loads them in half precision.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
    tokenizer=tokenizer,
)

Note

If you face any issues running the above code in the notebook cell, you may be missing required libraries. This can happen if you did not launch the notebook with the Transformers image. In that case, install the required libraries (transformers and torch, plus accelerate, which device_map="auto" relies on) before running the cell again.

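  • Let us run a simple inference to test the model.
prompt = "def factorial(num: int):"

# Sample one completion; max_length caps prompt tokens plus generated tokens.
sequences = pipeline(
    prompt,
    do_sample=True,
    top_k=10,
    temperature=0.1,
    top_p=0.95,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)

# Print the generated text of each returned sequence.
for seq in sequences:
    print(seq["generated_text"])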
Note

All the supported parameters are listed in the Supported Parameters section below.

Step 4: Upload the model to Model Bucket (EOS)

Now that the model works as expected, you can fine-tune it with your own data or choose to serve the model as-is. This tutorial assumes you are uploading the model as-is to create an inference endpoint. In case you fine-tune the model, you can follow similar steps to upload the model to the EOS bucket.

# Go to the directory that has the Hugging Face model code.
cd $HOME/.cache/huggingface/hub/models--codellama--CodeLlama-7b-hf/snapshots
# Push the contents of the folder to EOS bucket. 
# Go to TIR Dashboard >> Models >> Select your model >> Copy the cp command from Setup MinIO CLI tab.

# The copy command would look like this:
# mc cp -r <MODEL_NAME> codellama-7b/codellama-7b-hf

# Here we replace <MODEL_NAME> with '*' to upload all contents of the snapshots folder

mc cp -r * codellama-7b/codellama-7b-hf
Note

The model directory name may be slightly different (we assume it is models--codellama--CodeLlama-7b-hf). If this command does not work, list the directories in the path below to identify the model directory.

$HOME/.cache/huggingface/hub
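
If you prefer to locate the directory from a notebook cell, the lookup can be sketched as below. It assumes the default cache location named above and the `models--<org>--<name>` folder naming that the path in this step follows. The helper names are hypothetical, and a cache directory customized through environment variables would need a different `hub_dir`.

```python
from pathlib import Path


def cache_folder_name(repo_id: str) -> str:
    """Map 'org/name' to the 'models--org--name' folder used in the hub cache."""
    return "--".join(["models", *repo_id.split("/")])


def find_snapshots(repo_id: str, hub_dir: Path) -> list[Path]:
    """Return the snapshot directories for repo_id, or an empty list if none exist."""
    snapshots = hub_dir / cache_folder_name(repo_id) / "snapshots"
    if not snapshots.is_dir():
        return []
    return sorted(p for p in snapshots.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Assumed default cache location, as used elsewhere in this tutorial.
    hub = Path.home() / ".cache" / "huggingface" / "hub"
    for path in find_snapshots("codellama/CodeLlama-7b-hf", hub):
        print(path)
```

An empty result means the model has not been downloaded to that cache yet, or that it is stored under a different directory.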

Step 5: Create an endpoint for our model

With model weights uploaded to TIR Model's EOS Bucket, what remains is to just launch the endpoint and serve API requests.

Head back to the Model Endpoint creation for Codellama-7b using prebuilt container section above and follow the steps to create the endpoint for your model.

While creating the endpoint, make sure you select the appropriate model in the model details sub-section, i.e., the EOS bucket containing your model weights. If your model is not in the root directory of the bucket, make sure to specify the path where the model is saved in the bucket.

Follow the steps below to find the Model path in the bucket:

  • Go to MyAccount Object Storage (https://myaccount.e2enetworks.com/storage/object-storage)

  • Find your Model bucket (in this case: codellama-22ec1d) & click on its Objects tab

  • If the model_index.json file is present in the list of objects, then your model is present in the root directory & you need not give any Model Path

  • Otherwise, navigate to the folder, and find the model_index.json file, copy its path and paste the same in the Model Path field

  • You can click on the Validate button to validate the existence of the model at the given path
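
The rule in the bullets above (an empty Model Path when the file sits in the bucket root, otherwise the folder that contains it) can be sketched against a listing of object keys. The keys in the usage example are hypothetical. In practice the listing would come from the Objects tab or from a recursive listing of the bucket.

```python
import posixpath


def find_model_path(object_keys, marker="model_index.json"):
    """Return the folder holding `marker`: '' for the bucket root, None if absent."""
    # Check shallow keys first so a copy in the root wins over nested copies.
    for key in sorted(object_keys, key=lambda k: k.count("/")):
        if posixpath.basename(key) == marker:
            return posixpath.dirname(key)
    return None


if __name__ == "__main__":
    # Hypothetical object keys, for illustration only.
    keys = [
        "codellama-7b-hf/abc123/model_index.json",
        "codellama-7b-hf/abc123/tokenizer.json",
    ]
    print(find_model_path(keys))
```

A return value of `None` signals that the marker file is missing from the bucket, in which case the Validate button would also be expected to report a problem.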

Step 6: Inferring Request

Head back to the Model Endpoint creation for Codellama-7b using prebuilt container section above and follow Step 3 to send inference requests to your endpoint.

To control the output, you can pass any of the parameters listed below in the request body when running inference as described in Step 3.
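
As a minimal sketch, a small helper can collect generation parameters and reject misspelled names before a request is sent. The `SUPPORTED` set below is only a subset of the names listed under Supported Parameters, the helper name is hypothetical, and how the resulting dictionary is nested inside the request body is not shown here, so follow the Sample API request section for that.

```python
# Subset of the parameter names listed under "Supported Parameters".
SUPPORTED = {
    "max_length", "max_new_tokens", "min_length", "do_sample", "num_beams",
    "temperature", "top_k", "top_p", "repetition_penalty", "num_return_sequences",
}


def generation_params(**params):
    """Return the given parameters, dropping None values and rejecting unknown names."""
    unknown = sorted(set(params) - SUPPORTED)
    if unknown:
        raise ValueError(f"Unsupported generation parameters: {unknown}")
    return {name: value for name, value in params.items() if value is not None}


if __name__ == "__main__":
    print(generation_params(do_sample=True, top_k=10, temperature=0.1, max_new_tokens=None))
```

Failing early on an unknown name avoids sending a request whose setting would have no effect.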

Supported Parameters

Parameters that control the length of the output

  • max_length (int, optional, defaults to 20) — The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. Its effect is overridden by max_new_tokens, if also set.
  • max_new_tokens (int, optional) — The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
  • min_length (int, optional, defaults to 0) — The minimum length of the sequence to be generated. Corresponds to the length of the input prompt + min_new_tokens. Its effect is overridden by min_new_tokens, if also set.
  • min_new_tokens (int, optional) — The minimum number of tokens to generate, ignoring the number of tokens in the prompt.
  • early_stopping (bool or str, optional, defaults to False) — Controls the stopping condition for beam-based methods, like beam-search. It accepts the following values: True, where the generation stops as soon as there are num_beams complete candidates; False, where a heuristic is applied and the generation stops when it is very unlikely to find better candidates; "never", where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm).
  • max_time (float, optional) — The maximum amount of time you allow the computation to run for in seconds. Generation will still finish the current pass after the allocated time has passed.
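
The interaction between max_length and max_new_tokens described above comes down to simple arithmetic. The sketch below is a simplified model of that rule and not the library's own code. The prompt length of 8 tokens in the usage example is hypothetical.

```python
def new_token_budget(prompt_tokens, max_length=20, max_new_tokens=None):
    """Upper bound on newly generated tokens under the rules described above."""
    # max_new_tokens, when set, overrides max_length.
    if max_new_tokens is not None:
        return max_new_tokens
    # Otherwise max_length covers the prompt plus the generated tokens.
    return max(max_length - prompt_tokens, 0)


if __name__ == "__main__":
    print(new_token_budget(8, max_length=200))                     # 192
    print(new_token_budget(8, max_length=200, max_new_tokens=64))  # 64
```

With the default max_length of 20, a long prompt leaves little or no room for new tokens, which is why setting max_new_tokens explicitly is often more predictable.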

Parameters that control the generation strategy used

  • do_sample (bool, optional, defaults to False) — Whether or not to use sampling; use greedy decoding otherwise.
  • num_beams (int, optional, defaults to 1) — Number of beams for beam search. 1 means no beam search.
  • num_beam_groups (int, optional, defaults to 1) — Number of groups to divide num_beams into in order to ensure diversity among different groups of beams.
  • penalty_alpha (float, optional) — The values balance the model confidence and the degeneration penalty in contrastive search decoding.
  • use_cache (bool, optional, defaults to True) — Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
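
The way these flags combine into a decoding strategy can be sketched as a small decision function. This is a simplified reading of the parameter descriptions, not the library's dispatch logic. The function name and the returned labels are illustrative, and the order of precedence between conflicting settings is an assumption.

```python
def decoding_strategy(do_sample=False, num_beams=1, num_beam_groups=1,
                      penalty_alpha=None, top_k=50):
    """Name the decoding strategy implied by a combination of settings."""
    if num_beams > 1:
        # More than one beam group means diversity is enforced across groups.
        if num_beam_groups > 1:
            return "diverse beam search"
        return "beam-search sampling" if do_sample else "beam search"
    # Contrastive search needs a positive penalty_alpha and top_k > 1.
    if penalty_alpha is not None and penalty_alpha > 0 and top_k > 1 and not do_sample:
        return "contrastive search"
    return "sampling" if do_sample else "greedy decoding"


if __name__ == "__main__":
    print(decoding_strategy())                            # greedy decoding
    print(decoding_strategy(do_sample=True))              # sampling
    print(decoding_strategy(num_beams=4))                 # beam search
    print(decoding_strategy(penalty_alpha=0.6, top_k=4))  # contrastive search
```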

Parameters for manipulation of the model output logits

  • temperature (float, optional, defaults to 1.0) — The value used to modulate the next token probabilities.
  • top_k (int, optional, defaults to 50) — The number of highest probability vocabulary tokens to keep for top-k-filtering.
  • top_p (float, optional, defaults to 1.0) — If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
  • typical_p (float, optional, defaults to 1.0) — Local typicality measures how similar the conditional probability of predicting a target token next is to the expected conditional probability of predicting a random token next, given the partial text already generated.
  • epsilon_cutoff (float, optional, defaults to 0.0) — If set to float strictly between 0 and 1, only tokens with a conditional probability greater than epsilon_cutoff will be sampled.
  • eta_cutoff (float, optional, defaults to 0.0) — Eta sampling is a hybrid of locally typical sampling and epsilon sampling.
  • diversity_penalty (float, optional, defaults to 0.0) — This value is subtracted from a beam’s score if it generates a token same as any beam from another group at a particular time.
  • repetition_penalty (float, optional, defaults to 1.0) — The parameter for repetition penalty. 1.0 means no penalty.
  • encoder_repetition_penalty (float, optional, defaults to 1.0) — The parameter for encoder repetition penalty. An exponential penalty on sequences that are not in the original input. 1.0 means no penalty.
  • length_penalty (float, optional, defaults to 1.0) — Exponential penalty to the length that is used with beam-based generation.
  • no_repeat_ngram_size (int, optional, defaults to 0) — If set to int > 0, all ngrams of that size can only occur once.
  • bad_words_ids (List[List[int]], optional) — List of lists of token ids that are not allowed to be generated.
  • force_words_ids (List[List[int]] or List[List[List[int]]], optional) — List of token ids that must be generated.
  • renormalize_logits (bool, optional, defaults to False) — Whether to renormalize the logits after applying all the logits processors or warpers.
  • constraints (List[Constraint], optional) — Custom constraints that can be added to the generation to ensure that the output will contain the use of certain tokens as defined by Constraint objects.
  • forced_bos_token_id (int, optional, defaults to model.config.forced_bos_token_id) — The id of the token to force as the first generated token after the decoder_start_token_id.
  • forced_eos_token_id (Union[int, List[int]], optional, defaults to model.config.forced_eos_token_id) — The id of the token to force as the last generated token when max_length is reached.
  • remove_invalid_values (bool, optional, defaults to model.config.remove_invalid_values) — Whether to remove possible NaN and Inf outputs of the model to prevent the generation method from crashing.
  • exponential_decay_length_penalty (tuple(int, float), optional) — This Tuple adds an exponentially increasing length penalty.
  • suppress_tokens (List[int], optional) — A list of tokens that will be suppressed at generation.
  • begin_suppress_tokens (List[int], optional) — A list of tokens that will be suppressed at the beginning of the generation.
  • forced_decoder_ids (List[List[int]], optional) — A list of pairs of integers indicating a mapping from generation indices to token indices that will be forced before sampling.
  • sequence_bias (Dict[Tuple[int], float], optional) — Dictionary that maps a sequence of tokens to its bias term.
  • guidance_scale (float, optional) — The guidance scale for classifier-free guidance (CFG).
  • low_memory (bool, optional) — Switch to sequential topk for contrastive search to reduce peak memory.
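
To make temperature, top_k, and top_p more concrete, the sketch below applies them to a toy set of next-token logits in pure Python. It is an illustration under the assumption that the three steps run in that order, not the library's implementation. The token names and logit values are made up.

```python
import math


def filter_next_token_probs(logits, temperature=1.0, top_k=50, top_p=1.0):
    """Return {token: probability} after temperature scaling, top-k, then top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    def softmax(scores):
        peak = max(scores.values())  # subtract the max for numerical stability
        weights = {tok: math.exp(value - peak) for tok, value in scores.items()}
        total = sum(weights.values())
        return {tok: w / total for tok, w in weights.items()}

    # Temperature: divide the logits; values below 1 sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}

    # top_k: keep only the k highest-scoring tokens (0 disables the filter here).
    ranked = sorted(scaled.items(), key=lambda item: item[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    probs = softmax(dict(ranked))

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, _ in ranked:
        kept[tok] = probs[tok]
        cumulative += probs[tok]
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    total = sum(kept.values())
    return {tok: prob / total for tok, prob in kept.items()}


if __name__ == "__main__":
    toy_logits = {"return": 2.0, "if": 1.0, "print": 0.0}  # made-up values
    print(filter_next_token_probs(toy_logits, top_k=2, top_p=1.0))
    print(filter_next_token_probs(toy_logits, temperature=0.1, top_k=10, top_p=0.95))
```

A low temperature such as 0.1 concentrates almost all probability on the top token, so sampling with it behaves close to greedy decoding.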

Parameters that define the output variables of generate

  • num_return_sequences (int, optional, defaults to 1) — The number of independently computed returned sequences for each element in the batch.
  • output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers.
  • output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers.
  • output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores.
  • return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.

Special tokens that can be used at generation time

  • pad_token_id (int, optional) — The id of the padding token.
  • bos_token_id (int, optional) — The id of the beginning-of-sequence token.
  • eos_token_id (Union[int, List[int]], optional) — The id of the end-of-sequence token.

Generation parameters exclusive to encoder-decoder models

  • encoder_no_repeat_ngram_size (int, optional, defaults to 0) — If set to int > 0, all ngrams of that size that occur in the encoder_input_ids cannot occur in the decoder_input_ids.
  • decoder_start_token_id (int, optional) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token.