Deploy inference endpoint for MPT-7B-CHAT
In this tutorial, we will create a model endpoint for the MPT-7B-CHAT model.
The tutorial focuses on:
- Creating a model endpoint using a pre-built container.
- Creating a model endpoint with custom model weights.
- Supported parameters for text generation.
Creating a ready-to-use model endpoint
Steps to create the endpoint
- Go to the Dashboard.
- Navigate to Model Endpoints.
- Click Create New Endpoint.
- Choose the MPT-7B-CHAT option.
- Select a suitable GPU plan.
- Complete endpoint creation.
- After setup, go to the Sample API Request section to test your endpoint using curl.
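For reference, a minimal test request typically looks like the sketch below; the endpoint URL, request path, and token here are placeholders, so use the exact command shown in your endpoint's Sample API Request section:

curl -X POST "<your-endpoint-url>" \
  -H "Authorization: Bearer <your-api-token>" \
  -H "Content-Type: application/json" \
  -d '{"instances": [{"text": "Write a short poem about the sea."}], "params": {"max_length": 100}}'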
Creating a model endpoint with custom model weights
To create an inference endpoint with custom model weights:
- Download the MPT-7B-CHAT model from Hugging Face.
- Upload the model to the Model Bucket (EOS).
- Create an inference endpoint to serve API requests.
Step 1: Define a model in the dashboard
- Go to the AI Platform.
- Choose a project.
- Navigate to the Model section.
- Click Create Model.
- Enter a model name (e.g., mpt-7b-chat-1).
- Set Model Type as Custom.
- Click CREATE.
- Note the EOS bucket details created for your model.
- EOS provides an S3-compatible API; we will use the MinIO CLI (mc) to upload the model weights.
- Copy the Setup Host command from the Setup MinIO CLI tab.
Step 2: Start a new Instance
- In the dashboard, go to Instance.
- Launch an Instance with the mpt-7b-chat image and a GPU plan (e.g., A100-80GB).
- Open the JupyterLab environment.
- Create a new terminal and run the MinIO setup command copied earlier.
- Verify that the mc CLI is configured correctly.
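The copied command registers the EOS object store with the mc CLI. A minimal sketch of the setup and a quick verification is shown below; the alias name, endpoint URL, and keys are placeholders taken from the Setup MinIO CLI tab:

# Register the EOS object store with the mc CLI (placeholder values)
mc alias set mpt-7b-chat-1 <eos-endpoint-url> <access-key> <secret-key>

# Verify the alias by listing the bucket
mc ls mpt-7b-chat-1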
Step 3: Download the MPT-7B-CHAT model
Install the required Python packages:

pip install transformers accelerate xformers einops

Then download the model and build a text-generation pipeline:

import torch
import transformers
from transformers import AutoTokenizer

# Download the MPT-7B-CHAT weights from Hugging Face
model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b-chat',
    trust_remote_code=True,
)

# MPT-7B-CHAT uses the EleutherAI GPT-NeoX-20B tokenizer
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

# Build a text-generation pipeline that runs on the available GPU
pipe = transformers.pipeline(
    "text-generation",
    model='mosaicml/mpt-7b-chat',
    tokenizer=tokenizer,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
Run a sample inference:
output = pipe(
'Here is a recipe for vegan banana bread:\n',
do_sample=True,
max_new_tokens=100,
use_cache=True
)
print(output)
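The text-generation pipeline returns a list with one dictionary per generated sequence; the text itself is stored under the generated_text key:

# Print only the generated text of the first result
print(output[0]['generated_text'])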
Step 4: Upload the model to EOS
After confirming the model works, upload it to EOS.
# List the Hugging Face cache to locate the downloaded model
ls $HOME/.cache/huggingface/hub

# Navigate to the snapshot directory of the downloaded model
cd $HOME/.cache/huggingface/hub/models--mosaicml--mpt-7b-chat/snapshots

# Upload all contents to the EOS bucket created for the model
mc cp -r * mpt-7b-chat-1/mpt-b-chat-1-b9402a
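To confirm the upload completed, you can list the bucket recursively (same alias and bucket path as in the copy command above):

# Recursively list the files uploaded to the EOS bucket
mc ls -r mpt-7b-chat-1/mpt-b-chat-1-b9402a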
Step 5: Create the model endpoint
Once uploaded, return to the dashboard and create a new endpoint. Select the uploaded model under Model Details.
Refer to the earlier section on Creating a ready-to-use model endpoint.
Inference
After launching the endpoint, wait a few minutes for the model to load. Then send an API request for prediction.
Inference request example
{
"instances": [
{ "text": "here's the Recipe of pancakes" }
],
"params": { "max_length": 100 }
}
Inference response example
{
"predictions": [
[
"here's the Recipe of pancakes with strawberries and cream..."
]
]
}
You can also pass multiple prompts:
{
"instances": [
{ "text": "here's the tea recipe" },
{ "text": "here's the pancakes recipe" }
],
"params": { "max_length": 100 }
}
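The same request body can also be sent programmatically. The sketch below uses the Python requests library; the endpoint URL, token, and Authorization header format are placeholders, so follow your endpoint's Sample API Request section for the exact values:

import requests

# Placeholder endpoint URL and token; copy the real values from the dashboard
ENDPOINT_URL = "<your-endpoint-url>"
API_TOKEN = "<your-api-token>"

payload = {
    "instances": [
        {"text": "here's the tea recipe"},
        {"text": "here's the pancakes recipe"}
    ],
    "params": {"max_length": 100}
}

# Send the inference request and print the predictions
response = requests.post(
    ENDPOINT_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
print(response.json())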
Supported parameters for output control
Length control
- max_length: Maximum total length (prompt plus generated tokens).
- max_new_tokens: Maximum number of new tokens to generate, independent of prompt length.
- min_length / min_new_tokens: Minimum tokens.
- early_stopping: Boolean or string controlling beam-search stopping behavior.
- max_time: Maximum compute time (seconds).
Generation strategy
- do_sample: Enable sampling.
- num_beams / num_beam_groups: Beam search and diversity settings.
- penalty_alpha: Balances model confidence against the degeneration penalty (contrastive search).
- use_cache: Uses cached key/values to speed decoding.
Logit manipulation
- temperature, top_k, top_p, typical_p: Control randomness.
- epsilon_cutoff, eta_cutoff: Token filtering thresholds.
- diversity_penalty: Penalizes beams that repeat tokens generated by other beam groups.
- repetition_penalty: Penalizes previously generated tokens.
- length_penalty: Length scaling for beams.
- no_repeat_ngram_size: Prevent repeated n-grams.
- bad_words_ids / force_words_ids: Restrict token generation.
Output configuration
- num_return_sequences: Number of outputs.
- output_attentions, output_hidden_states, output_scores: Return model internals.
- return_dict_in_generate: Return structured output.
Special tokens
- pad_token_id, bos_token_id, eos_token_id: Token control options.
Encoder-decoder specific
- encoder_no_repeat_ngram_size: Prevent encoder n-gram repetition.
- decoder_start_token_id: Define decoder start token.
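As an illustration, several of these parameters can be combined in the params object of an inference request; which parameters a given endpoint accepts may depend on its configuration, so treat this as a sketch:

{
  "instances": [
    { "text": "here's the tea recipe" }
  ],
  "params": {
    "do_sample": true,
    "temperature": 0.7,
    "top_p": 0.9,
    "max_new_tokens": 100,
    "no_repeat_ngram_size": 3,
    "num_return_sequences": 2
  }
}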
Troubleshooting and best practices
- Ensure correct MinIO credentials before upload.
- Verify GPU availability before model deployment.
- If inference fails, check logs in the dashboard.
- Use a lower max_new_tokens for faster testing.
- Always test endpoints with smaller inputs first.