# Deploy inference endpoint for MPT-7B-CHAT

In this tutorial, we will create a model endpoint for the **MPT-7B-CHAT** model.

The tutorial focuses on:

* Creating a model endpoint using a pre-built container.
* Creating a model endpoint with custom model weights.
* Supported parameters for text generation.

---

## Creating a ready-to-use model endpoint

### Steps to create the endpoint

1. Go to the **Dashboard**.
2. Navigate to **Model Endpoints**.
3. Click **Create New Endpoint**.
4. Choose the **MPT-7B-CHAT** option.
5. Select a suitable GPU plan.
6. Complete endpoint creation.
7. After setup, go to the **Sample API Request** section to test your endpoint using `curl`.

---

## Creating a model endpoint with custom model weights

To create an inference endpoint with custom model weights:

1. Download the **MPT-7B-CHAT** model from [Hugging Face](https://huggingface.co/mosaicml/mpt-7b-chat).
2. Upload the model to the **Model Bucket (EOS)**.
3. Create an inference endpoint to serve API requests.

---

## Step 1: Define a model in the dashboard

1. Go to the [AI Platform](https://tir.e2enetworks.com).
2. Choose a project.
3. Navigate to the **Model** section.
4. Click **Create Model**.
5. Enter a model name (e.g., `mpt-7b-chat-1`).
6. Set **Model Type** as **Custom**.
7. Click **CREATE**.
8. Note the EOS bucket details created for your model.
9. EOS provides an S3-compatible API; this tutorial uses the **MinIO CLI** (`mc`) to upload the weights.
10. Copy the **Setup Host** command from the **Setup MinIO CLI** tab.

---

## Step 2: Start a new Instance

1. In the dashboard, go to **Instance**.
2. Launch an Instance with the **mpt-7b-chat** image and a GPU plan (e.g., A10080).
3. Open the JupyterLab environment.
4. Create a new terminal and run the **MinIO setup command** copied earlier.
5. Verify that the **mc CLI** is configured correctly, for example by running `mc ls` against the alias created by the setup command.
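
The check in step 5 can also be scripted from the notebook. The following is a minimal sketch, not part of the platform tooling: the helper names are hypothetical, and the alias passed in must be the one created by the Setup Host command.

```python
import shutil
import subprocess


def mc_ls_command(alias: str, bucket: str = "") -> list[str]:
    """Build the `mc ls` command for an alias, optionally scoped to a bucket."""
    target = f"{alias}/{bucket}" if bucket else alias
    return ["mc", "ls", target]


def mc_is_configured(alias: str, bucket: str = "") -> bool:
    """Return True if `mc ls` succeeds for the alias, False otherwise."""
    if shutil.which("mc") is None:
        # The MinIO client is not installed or not on PATH.
        return False
    result = subprocess.run(
        mc_ls_command(alias, bucket),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Calling `mc_is_configured("mpt-7b-chat-1")` (a placeholder alias) returns `False` when the client is missing or the credentials are rejected.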

---

## Step 3: Download the MPT-7B-CHAT model

Install the required libraries first:

```bash
pip install transformers accelerate xformers einops
```

Then download the model and build a text-generation pipeline:

<details>
<summary>Click to expand code</summary>

```python
import torch
import transformers
from transformers import AutoTokenizer

# MPT models use the EleutherAI/gpt-neox-20b tokenizer.
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

# Creating the pipeline downloads the weights into the Hugging Face cache
# (~/.cache/huggingface/hub by default) and places them on the available devices.
pipe = transformers.pipeline(
    "text-generation",
    model='mosaicml/mpt-7b-chat',
    tokenizer=tokenizer,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # MPT ships custom model code in its repository
)
```

</details>

Run a sample inference:

<details>
<summary>Click to expand code</summary>

```python
output = pipe(
    'Here is a recipe for vegan banana bread:\n',
    do_sample=True,
    max_new_tokens=100,
    use_cache=True
)
print(output)
```

</details>
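
The `text-generation` pipeline returns a list of dictionaries, each with a `generated_text` key that by default contains the prompt followed by the completion. As a small sketch (the helper name `completions` is hypothetical, and `sample_output` is a stand-in for a real pipeline result), the completion can be separated from the prompt like this:

```python
def completions(output: list[dict], prompt: str) -> list[str]:
    """Return the generated text of each result without the leading prompt."""
    texts = []
    for item in output:
        text = item["generated_text"]
        if text.startswith(prompt):
            text = text[len(prompt):]
        texts.append(text)
    return texts


# Stand-in for the value returned by `pipe(...)`; real output will differ.
sample_output = [
    {"generated_text": "Here is a recipe for vegan banana bread:\nMash three bananas."}
]
print(completions(sample_output, "Here is a recipe for vegan banana bread:\n"))
```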

---

## Step 4: Upload the model to EOS

After confirming the model works, upload it to EOS.

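<details>
<summary>Click to expand code</summary>

```bash
# Navigate to the Hugging Face cache directory for the model
cd $HOME/.cache/huggingface/hub/models--mosaicml--mpt-7b-chat/snapshots

# Upload all contents to EOS (replace the alias and bucket with the values from the Setup MinIO CLI tab)
mc cp -r * mpt-7b-chat-1/mpt-7b-chat-1-b9402a
```

</details>

If the model directory is not found, list the cache to find its exact name:

```bash
ls $HOME/.cache/huggingface/hub
```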

---

## Step 5: Create the model endpoint

Once uploaded, return to the dashboard and create a new endpoint. Select the uploaded model under **Model Details**.

Refer to the earlier section on [Creating a ready-to-use model endpoint](#creating-a-ready-to-use-model-endpoint).

---

## Inference

After launching the endpoint, wait a few minutes for the model to load. Then send an API request for prediction.

### Inference request example

```json
{
  "instances": [
    { "text": "here's the Recipe of pancakes" }
  ],
  "params": { "max_length": 100 }
}
```
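
A request body of this shape can be assembled programmatically. The sketch below uses a hypothetical helper, `build_payload`, and assumes only the `instances` and `params` fields shown in the example above.

```python
import json


def build_payload(prompts: list[str], **params) -> dict:
    """Build the request body: one instance per prompt plus shared params."""
    return {
        "instances": [{"text": prompt} for prompt in prompts],
        "params": params,
    }


body = build_payload(["here's the Recipe of pancakes"], max_length=100)
print(json.dumps(body, indent=2))
```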

### Inference response example

```json
{
  "predictions": [
    [
      "here's the Recipe of pancakes with strawberries and cream..."
    ]
  ]
}
```
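
The `predictions` field in this example is a list of lists of strings. As an illustrative sketch (the helper name is hypothetical, and the layout for multi-prompt responses is assumed to follow the same nesting), the generated strings can be flattened like this:

```python
def extract_texts(response: dict) -> list[str]:
    """Flatten `predictions` (a list of lists of strings) into one list."""
    texts = []
    for group in response.get("predictions", []):
        if isinstance(group, str):
            # Tolerate a flat list of strings as well.
            texts.append(group)
        else:
            texts.extend(group)
    return texts


example_response = {
    "predictions": [
        ["here's the Recipe of pancakes with strawberries and cream..."]
    ]
}
print(extract_texts(example_response))
```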

You can also pass multiple prompts:

```json
{
  "instances": [
    { "text": "here's the tea recipe" },
    { "text": "here's the pancakes recipe" }
  ],
  "params": { "max_length": 100 }
}
```
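
For sending such a request from Python instead of `curl`, the following sketch uses only the standard library. The endpoint URL and the bearer-token header are assumptions; copy the real URL and authentication header from the **Sample API Request** section of your endpoint.

```python
import json
import urllib.request


def make_request(url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the endpoint without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; check the Sample API Request section.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def predict(url: str, token: str, payload: dict, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response."""
    request = make_request(url, token, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = {
    "instances": [
        {"text": "here's the tea recipe"},
        {"text": "here's the pancakes recipe"},
    ],
    "params": {"max_length": 100},
}
```

Calling `predict(url, token, payload)` with your own endpoint URL and token then returns the parsed response dictionary.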

---

## Supported parameters for output control

### Length control

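* **max_length**: Maximum total length in tokens, counting the prompt plus the generated tokens. Overridden by `max_new_tokens` when both are set.
* **max_new_tokens**: Maximum number of tokens to generate, not counting the prompt.
* **min_length / min_new_tokens**: Minimum total length and minimum number of generated tokens, respectively.
* **early_stopping**: Boolean or the string `"never"`; controls when beam search stops.
* **max_time**: Maximum compute time in seconds. The generation step in progress still finishes after the limit passes.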
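
The interaction between `max_length` and `max_new_tokens` is easy to get wrong, so here is a small arithmetic sketch. It assumes the endpoint forwards `params` to Hugging Face `generate`, where `max_new_tokens` takes precedence over `max_length`; the function name is hypothetical.

```python
from typing import Optional


def generation_budget(
    prompt_tokens: int,
    max_length: Optional[int] = None,
    max_new_tokens: Optional[int] = None,
) -> Optional[int]:
    """Upper bound on newly generated tokens, or None if no limit is given."""
    if max_new_tokens is not None:
        # max_new_tokens ignores the prompt length and wins over max_length.
        return max_new_tokens
    if max_length is not None:
        # max_length counts prompt tokens too, so the prompt eats into it.
        return max(0, max_length - prompt_tokens)
    return None


print(generation_budget(30, max_length=100))                     # 70
print(generation_budget(30, max_length=100, max_new_tokens=50))  # 50
print(generation_budget(120, max_length=100))                    # 0
```

With a long prompt, `max_length` can leave no room for output, which is why `max_new_tokens` is often the safer choice.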

### Generation strategy

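* **do_sample**: Sample from the token distribution instead of always picking the most likely token (greedy decoding).
* **num_beams / num_beam_groups**: Number of beams for beam search (1 disables it) and the number of groups the beams are split into to encourage diversity between groups.
* **penalty_alpha**: Balances model confidence against the degeneration penalty in contrastive search.
* **use_cache**: Reuses cached attention key/value pairs from earlier steps to speed up decoding.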
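
These parameters combine to select a decoding strategy. The sketch below is a simplified approximation of how Hugging Face Transformers picks the mode, assuming the endpoint passes `params` through to `generate`; the function name and the default values are assumptions and it does not cover every mode.

```python
def decoding_mode(params: dict) -> str:
    """Classify generation params into a decoding strategy (simplified)."""
    num_beams = params.get("num_beams", 1)
    num_beam_groups = params.get("num_beam_groups", 1)
    do_sample = params.get("do_sample", False)
    penalty_alpha = params.get("penalty_alpha") or 0.0
    top_k = params.get("top_k", 50)  # assumed default

    if num_beams > 1:
        if num_beam_groups > 1:
            return "group beam search"
        return "beam sampling" if do_sample else "beam search"
    if do_sample:
        return "sampling"
    if penalty_alpha > 0 and top_k > 1:
        return "contrastive search"
    return "greedy search"


for example in (
    {},
    {"do_sample": True, "temperature": 0.7},
    {"num_beams": 4},
    {"penalty_alpha": 0.6, "top_k": 4},
):
    print(example, "->", decoding_mode(example))
```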

### Logit manipulation

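* **temperature, top_k, top_p, typical_p**: Control randomness. `temperature` rescales the next-token probabilities, `top_k` keeps only the k most likely tokens, `top_p` keeps the smallest set of tokens whose probabilities sum to at least `top_p`, and `typical_p` filters by local typicality.
* **epsilon_cutoff, eta_cutoff**: Probability thresholds for truncation sampling; tokens below the threshold are filtered out.
* **diversity_penalty, repetition_penalty**: `diversity_penalty` penalizes a beam for producing the same token as a beam in another group; `repetition_penalty` penalizes tokens that have already appeared (1.0 means no penalty).
* **length_penalty**: Exponential length penalty for beam-based generation; values above 0 favor longer sequences, values below 0 favor shorter ones.
* **no_repeat_ngram_size**: Prevents any n-gram of this size from appearing more than once.
* **bad_words_ids / force_words_ids**: Lists of token IDs that must not or must appear in the output, respectively.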
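
To make the effect of `temperature`, `top_k`, and `top_p` concrete, here is a toy illustration on a four-token vocabulary in plain Python. It is a simplified sketch of the ideas, not the library's implementation (which works on logits and masks filtered tokens); all function names are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out tokens not in `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probabilities sum to >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
print("temperature 1.0:", probs)
print("temperature 0.5:", softmax(logits, temperature=0.5))
print("top_k=2:", top_k_filter(probs, 2))
print("top_p=0.8:", top_p_filter(probs, 0.8))
```

Lower temperatures and smaller `top_k` or `top_p` values concentrate probability on fewer tokens, which makes sampled output more predictable.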

### Output configuration

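* **num_return_sequences**: Number of sequences returned per prompt.
* **output_attentions, output_hidden_states, output_scores**: Return attention weights, hidden states, or prediction scores alongside the generated tokens.
* **return_dict_in_generate**: Return a structured output object instead of a plain tuple.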

### Special tokens

* **pad_token_id, bos_token_id, eos_token_id**: IDs of the padding, beginning-of-sequence, and end-of-sequence tokens.

### Encoder-decoder specific

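* **encoder_no_repeat_ngram_size**: Prevents n-grams of this size that occur in the encoder input from appearing in the decoder output.
* **decoder_start_token_id**: ID of the token the decoder starts with when it differs from `bos_token_id`.

These options target encoder-decoder models; MPT-7B-CHAT is a decoder-only model.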

---

## Troubleshooting and best practices

* Ensure correct MinIO credentials before upload.
* Verify GPU availability before model deployment.
* If inference fails, check logs in the dashboard.
* Use lower `max_new_tokens` for faster testing.
* Always test endpoints with smaller inputs first.


---