# Deploy inference endpoint for MPT-7B-CHAT
In this tutorial, we will create a model endpoint for the **MPT-7B-CHAT** model.
The tutorial focuses on:
* Creating a model endpoint using a pre-built container.
* Creating a model endpoint with custom model weights.
* Supported parameters for text generation.
---
## Creating a ready-to-use model endpoint
### Steps to create the endpoint
1. Go to the **Dashboard**.
2. Navigate to **Model Endpoints**.
3. Click **Create New Endpoint**.
4. Choose the **MPT-7B-CHAT** option.
5. Select a suitable GPU plan.
6. Complete endpoint creation.
7. After setup, go to the **Sample API Request** section to test your endpoint using `curl`.
---
## Creating a model endpoint with custom model weights
To create an inference endpoint with custom model weights:
1. Download the **MPT-7B-CHAT** model from [Hugging Face](https://huggingface.co/mosaicml/mpt-7b-chat).
2. Upload the model to the **Model Bucket (EOS)**.
3. Create an inference endpoint to serve API requests.
---
## Step 1: Define a model in the dashboard
1. Go to the [AI Platform](https://tir.e2enetworks.com).
2. Choose a project.
3. Navigate to the **Model** section.
4. Click **Create Model**.
5. Enter a model name (e.g., `mpt-7b-chat-1`).
6. Set **Model Type** as **Custom**.
7. Click **CREATE**.
8. Note the EOS bucket details created for your model.
9. EOS provides an S3-compatible API, so we will use the **MinIO CLI** (`mc`) to upload the model.
10. Copy the **Setup Host** command from the **Setup MinIO CLI** tab.
---
## Step 2: Start a new Instance
1. In the dashboard, go to **Instance**.
2. Launch an Instance with the **mpt-7b-chat** image and a GPU plan (e.g., A10080).
3. Open the JupyterLab environment.
4. Create a new terminal and run the **MinIO setup command** copied earlier.
5. Verify that the **mc CLI** is configured correctly.
---
## Step 3: Download the MPT-7B-CHAT model
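Install the required libraries first:
```bash
pip install transformers accelerate xformers einops
```
Then load the model and build a text-generation pipeline:
```python
import torch
import transformers
from transformers import AutoTokenizer

# Fetch the model weights into the local Hugging Face cache.
model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b-chat',
    trust_remote_code=True,  # MPT ships custom modeling code
)
# MPT-7B-CHAT uses the GPT-NeoX-20B tokenizer.
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
pipe = transformers.pipeline(
    "text-generation",
    model='mosaicml/mpt-7b-chat',
    tokenizer=tokenizer,
    device_map="auto",           # place layers on the available GPU(s)
    torch_dtype=torch.bfloat16,  # half the memory of float32
    trust_remote_code=True,
)
```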
Run a sample inference:
```python
output = pipe(
    'Here is a recipe for vegan banana bread:\n',
    do_sample=True,
    max_new_tokens=100,
    use_cache=True,
)
print(output)
```
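The printed `output` is a list with one dictionary per returned sequence, each holding a `generated_text` string that, by default, begins with the prompt. The sketch below shows one way to pull out just the completion. The helper name `extract_completions` and the sample data are illustrative only and are not part of the tutorial's environment.

```python
def extract_completions(output, prompt=None):
    """Return the generated strings from a text-generation pipeline result.

    If `prompt` is given and a string starts with it, the prompt is stripped
    so that only the newly generated text remains.
    """
    texts = [item["generated_text"] for item in output]
    if prompt is not None:
        texts = [t[len(prompt):] if t.startswith(prompt) else t for t in texts]
    return texts


# Hand-written stand-in for `output`; real generations differ on every run.
prompt = "Here is a recipe for vegan banana bread:\n"
sample_output = [{"generated_text": prompt + "Mash three ripe bananas."}]

completions = extract_completions(sample_output, prompt=prompt)
print(completions)
```

With the real pipeline, pass the `output` variable from the call above instead of `sample_output`.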
---
## Step 4: Upload the model to EOS
After confirming the model works, upload it to EOS.
```bash
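# Navigate to the snapshot directory of the downloaded model in the Hugging Face cache
cd $HOME/.cache/huggingface/hub/models--mosaicml--mpt-7b-chat/snapshots
# Upload all contents to EOS (replace the destination with the alias and bucket shown in the Setup MinIO CLI tab)
mc cp -r * mpt-7b-chat-1/mpt-b-chat-1-b9402a
```
If the model directory is not found at that path, list the cache to locate it:
```bash
ls $HOME/.cache/huggingface/hub
```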
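If you are unsure which directory to upload, the cache location can also be worked out from the repository ID. The Hugging Face cache stores each model under `models--<org>--<name>`, with one subdirectory per revision inside `snapshots`. The sketch below assumes the default cache root (`~/.cache/huggingface/hub`), which will differ if `HF_HOME` or a similar variable is set. The function names are illustrative.

```python
from pathlib import Path


def hub_cache_dir(repo_id, cache_root=None):
    """Return the cache directory the Hugging Face hub uses for `repo_id`."""
    # Assumption: default cache root; pass cache_root if yours is elsewhere.
    root = Path(cache_root) if cache_root else Path.home() / ".cache" / "huggingface" / "hub"
    return root / ("models--" + repo_id.replace("/", "--"))


def snapshot_dirs(repo_id, cache_root=None):
    """List the downloaded revision directories, or [] if nothing is cached."""
    snapshots = hub_cache_dir(repo_id, cache_root) / "snapshots"
    if not snapshots.is_dir():
        return []
    return sorted(p for p in snapshots.iterdir() if p.is_dir())


print(hub_cache_dir("mosaicml/mpt-7b-chat"))
print(snapshot_dirs("mosaicml/mpt-7b-chat"))
```

Each path returned by `snapshot_dirs` contains the model files for one revision; that is the content to copy to the bucket.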
---
## Step 5: Create the model endpoint
Once uploaded, return to the dashboard and create a new endpoint. Select the uploaded model under **Model Details**.
Refer to the earlier section on [Creating a ready-to-use model endpoint](#creating-a-ready-to-use-model-endpoint).
---
## Inference
After launching the endpoint, wait a few minutes for the model to load. Then send an API request for prediction.
### Inference request example
```json
{
  "instances": [
    { "text": "here's the Recipe of pancakes" }
  ],
  "params": { "max_length": 100 }
}
```
### Inference response example
```json
{
  "predictions": [
    [
      "here's the Recipe of pancakes with strawberries and cream..."
    ]
  ]
}
```
You can also pass multiple prompts:
```json
{
  "instances": [
    { "text": "here's the tea recipe" },
    { "text": "here's the pancakes recipe" }
  ],
  "params": { "max_length": 100 }
}
```
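The request bodies above can also be built and sent from Python. This is a minimal sketch using only the standard library. The `endpoint_url` and `auth_token` arguments are placeholders, and the `Authorization: Bearer` header is an assumption; copy the exact URL and headers from the **Sample API Request** section of your endpoint.

```python
import json
import urllib.request


def build_payload(prompts, **params):
    """Build a request body in the instances/params shape shown above."""
    return {"instances": [{"text": p} for p in prompts], "params": params}


def predict(endpoint_url, auth_token, prompts, **params):
    """POST the prompts and return the `predictions` field of the response.

    endpoint_url and auth_token are placeholders; the header format is an
    assumption and may differ for your endpoint.
    """
    body = json.dumps(build_payload(prompts, **params)).encode("utf-8")
    request = urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {auth_token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["predictions"]


payload = build_payload(
    ["here's the tea recipe", "here's the pancakes recipe"],
    max_length=100,
)
print(json.dumps(payload, indent=2))
```

Only `build_payload` runs here; call `predict` once you have the real endpoint URL and token.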
---
## Supported parameters for output control
### Length control
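* **max_length**: Maximum total length in tokens, counting the prompt plus the generated tokens.
* **max_new_tokens**: Maximum number of tokens to generate, not counting the prompt.
* **min_length / min_new_tokens**: Minimum total length and minimum number of newly generated tokens, respectively.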
* **early_stopping**: Boolean or string controlling beam-search stopping behavior.
* **max_time**: Maximum compute time (seconds).
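
The difference between `max_length` and `max_new_tokens` is easiest to see as arithmetic. The sketch below assumes the Hugging Face `generate` behavior for decoder-only models, where `max_length` includes the prompt and `max_new_tokens` takes precedence when both are set. The helper `new_token_budget` is illustrative and not part of any library.

```python
def new_token_budget(prompt_tokens, max_length=None, max_new_tokens=None):
    """Upper bound on the number of tokens that can be generated."""
    if max_new_tokens is not None:
        # Takes precedence over max_length; the prompt is not counted.
        return max_new_tokens
    if max_length is None:
        raise ValueError("set max_length or max_new_tokens")
    # max_length covers prompt + generation, so the prompt uses up part of it.
    return max(0, max_length - prompt_tokens)


print(new_token_budget(30, max_length=100))
print(new_token_budget(30, max_length=100, max_new_tokens=50))
print(new_token_budget(120, max_length=100))
```

A long prompt can therefore leave little or no room under `max_length`, which is why `max_new_tokens` is usually the safer choice for endpoints that receive prompts of varying size.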
### Generation strategy
* **do_sample**: Enable sampling.
* **num_beams / num_beam_groups**: Beam search and diversity settings.
* **penalty_alpha**: Balances model confidence against the degeneration penalty in contrastive search.
* **use_cache**: Uses cached key/values to speed decoding.
### Logit manipulation
* **temperature, top_k, top_p, typical_p**: Control randomness.
* **epsilon_cutoff, eta_cutoff**: Token filtering thresholds.
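* **diversity_penalty, repetition_penalty**: `diversity_penalty` penalizes beam groups that pick the same token as other groups; `repetition_penalty` penalizes tokens that have already appeared.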
* **length_penalty**: Length scaling for beams.
* **no_repeat_ngram_size**: Prevent repeated n-grams.
* **bad_words_ids / force_words_ids**: Lists of token IDs that must not appear (`bad_words_ids`) or must appear (`force_words_ids`) in the output.
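
To give a feel for how `temperature`, `top_k`, and `top_p` reshape the next-token distribution, here is a simplified pure-Python sketch. It assumes the filters are applied in that order with renormalization after each step, and it is not the library's actual implementation. The function name and the toy logits are illustrative.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept_mass = sum(probs[i] for i in order)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filter_distribution(toy_logits, top_k=3, top_p=0.9))
print(filter_distribution(toy_logits, temperature=0.1, top_p=0.9))
```

Sampling then draws from the returned probabilities, so tighter settings leave fewer candidate tokens and produce more predictable text.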
### Output configuration
* **num_return_sequences**: Number of outputs.
* **output_attentions, output_hidden_states, output_scores**: Return model internals.
* **return_dict_in_generate**: Return structured output.
### Special tokens
* **pad_token_id, bos_token_id, eos_token_id**: Token control options.
### Encoder-decoder specific
* **encoder_no_repeat_ngram_size**: Prevents n-grams of this size that occur in the encoder input from appearing in the decoder output.
* **decoder_start_token_id**: Define decoder start token.
---
## Troubleshooting and best practices
* Ensure correct MinIO credentials before upload.
* Verify GPU availability before model deployment.
* If inference fails, check logs in the dashboard.
* Use lower `max_new_tokens` for faster testing.
* Always test endpoints with smaller inputs first.
---