# Deploy inference endpoint for MPT-7B-CHAT

In this tutorial, we will create a model endpoint for the **MPT-7B-CHAT** model.

The tutorial focuses on:

* Creating a model endpoint using a pre-built container.
* Creating a model endpoint with custom model weights.
* Supported parameters for text generation.

---

## Creating a ready-to-use model endpoint

### Steps to create the endpoint

1. Go to the **Dashboard**.
2. Navigate to **Model Endpoints**.
3. Click **Create New Endpoint**.
4. Choose the **MPT-7B-CHAT** option.
5. Select a suitable GPU plan.
6. Complete endpoint creation.
7. After setup, go to the **Sample API Request** section to test your endpoint using `curl`.

---

## Creating model endpoint with custom model weights

To create an inference endpoint with custom model weights:

1. Download the **MPT-7B-CHAT** model from [Hugging Face](https://huggingface.co/mosaicml/mpt-7b-chat).
2. Upload the model to the **Model Bucket (EOS)**.
3. Create an inference endpoint to serve API requests.

---

## Step 1: Define a model in the dashboard

1. Go to the [AI Platform](https://tir.e2enetworks.com).
2. Choose a project.
3. Navigate to the **Model** section.
4. Click **Create Model**.
5. Enter a model name (e.g., `mpt-7b-chat-1`).
6. Set **Model Type** to **Custom**.
7. Click **CREATE**.
8. Note the details of the EOS bucket created for your model.
9. EOS provides an S3-compatible API, so we will use the **MinIO CLI** (`mc`) to upload the model.
10. Copy the **Setup Host** command from the **Setup MinIO CLI** tab.

---

## Step 2: Start a new Instance

1. In the dashboard, go to **Instance**.
2. Launch an Instance with the **mpt-7b-chat** image and a GPU plan (e.g., A10080).
3. Open the Jupyter Labs environment.
4. Create a new terminal and run the **MinIO setup command** copied earlier.
5. Verify that the **mc CLI** is configured correctly.

---

## Step 3: Download the MPT-7B-CHAT model
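Install the required libraries in the notebook:

```bash
pip install transformers accelerate xformers einops
```

Then build a text-generation pipeline. The first run downloads the model weights into the Hugging Face cache.

```python
import torch
import transformers
from transformers import AutoTokenizer

# MPT-7B-CHAT uses the EleutherAI/gpt-neox-20b tokenizer.
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

# trust_remote_code=True is needed because MPT ships custom model code.
# Passing the model name lets the pipeline load the weights once,
# in bfloat16, placed automatically on the available GPU(s).
pipe = transformers.pipeline(
    "text-generation",
    model='mosaicml/mpt-7b-chat',
    tokenizer=tokenizer,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
```

Run a sample inference: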
```python
output = pipe(
    'Here is a recipe for vegan banana bread:\n',
    do_sample=True,
    max_new_tokens=100,
    use_cache=True,
)
print(output)
```
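The text-generation pipeline returns a list of dictionaries with a `generated_text` key, and by default that text includes the prompt. As a small sketch, the helper below (the name `completion_only` is illustrative, not part of any library) strips the prompt so that only the newly generated text remains. It runs on a hand-written sample here, so no model is needed to try it.

```python
def completion_only(prompt, pipeline_output):
    """Return the generated continuations with the prompt prefix removed."""
    texts = []
    for item in pipeline_output:
        text = item["generated_text"]
        # The pipeline echoes the prompt by default; drop it if present.
        if text.startswith(prompt):
            text = text[len(prompt):]
        texts.append(text)
    return texts


# Hand-written sample in the same shape as the pipeline output.
sample_prompt = "Here is a recipe:\n"
sample_output = [{"generated_text": "Here is a recipe:\nMix flour and bananas."}]
completions = completion_only(sample_prompt, sample_output)
print(completions)  # ['Mix flour and bananas.']
```

With the real pipeline, the call would be `completion_only(prompt, pipe(prompt, max_new_tokens=100))`.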
---

## Step 4: Upload the model to EOS

After confirming the model works, upload it to EOS.
```bash
# Navigate to the Hugging Face model directory
cd $HOME/.cache/huggingface/hub/models--mosaicml--mpt-7b-chat/snapshots

# Upload all contents to EOS.
# Replace the destination with the alias/bucket shown in the Setup MinIO CLI tab.
mc cp -r * mpt-7b-chat-1/mpt-b-chat-1-b9402a
```
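The Hugging Face cache stores each model under a `models--<org>--<name>` directory, with one subdirectory per revision inside `snapshots`. As a rough sketch, the helper below (the name `find_snapshots` is illustrative) locates those snapshot directories from Python. It assumes the default cache location and that no cache-related environment variable overrides it. The demo runs against a throwaway directory with a placeholder revision name `abc123`, so it works without downloading anything.

```python
import tempfile
from pathlib import Path


def find_snapshots(repo_id, cache_dir=None):
    """List snapshot directories for repo_id in a Hugging Face hub cache."""
    # Assumption: the default cache location is used unless cache_dir is given.
    cache = Path(cache_dir) if cache_dir else Path.home() / ".cache" / "huggingface" / "hub"
    snapshots = cache / ("models--" + repo_id.replace("/", "--")) / "snapshots"
    if not snapshots.is_dir():
        return []
    return sorted(p for p in snapshots.iterdir() if p.is_dir())


# Demo on a temporary directory that mimics the cache layout.
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "models--mosaicml--mpt-7b-chat" / "snapshots" / "abc123"
    fake.mkdir(parents=True)
    names = [p.name for p in find_snapshots("mosaicml/mpt-7b-chat", cache_dir=tmp)]

print(names)  # ['abc123']
```

On the Instance, `find_snapshots("mosaicml/mpt-7b-chat")` with no `cache_dir` would look in the default cache used by the download step.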
If the model directory name is different on your Instance, list the cache directory to identify it:

```bash
ls $HOME/.cache/huggingface/hub
```

---

## Step 5: Create the model endpoint

Once the upload completes, return to the dashboard and create a new endpoint. Select the uploaded model under **Model Details**. For the remaining steps, refer to the earlier section on [Creating a ready-to-use model endpoint](#creating-a-ready-to-use-model-endpoint).

---

## Inference

After launching the endpoint, wait a few minutes for the model to load. Then send an API request for prediction.

### Inference request example

```json
{
  "instances": [
    { "text": "here's the Recipe of pancakes" }
  ],
  "params": {
    "max_length": 100
  }
}
```

### Inference response example

```json
{
  "predictions": [
    [
      "here's the Recipe of pancakes with strawberries and cream..."
    ]
  ]
}
```

You can also pass multiple prompts:

```json
{
  "instances": [
    { "text": "here's the tea recipe" },
    { "text": "here's the pancakes recipe" }
  ],
  "params": {
    "max_length": 100
  }
}
```

---

## Supported parameters for output control

### Length control

* **max_length**: Maximum total length in tokens (prompt plus generated tokens).
* **max_new_tokens**: Maximum number of tokens to generate, not counting the prompt.
* **min_length / min_new_tokens**: Minimum total length and minimum number of generated tokens.
* **early_stopping**: Boolean or string controlling beam-search stopping behavior.
* **max_time**: Maximum compute time (seconds).

### Generation strategy

* **do_sample**: Enable sampling instead of greedy decoding.
* **num_beams / num_beam_groups**: Number of beams for beam search, and number of groups for diverse beam search.
* **penalty_alpha**: Balances model confidence against the degeneration penalty in contrastive search.
* **use_cache**: Reuses cached key/value attention states to speed up decoding.

### Logit manipulation

* **temperature, top_k, top_p, typical_p**: Control the randomness of sampling.
* **epsilon_cutoff, eta_cutoff**: Probability thresholds below which tokens are filtered out.
* **diversity_penalty, repetition_penalty**: Penalize beams that repeat tokens from other beam groups, and tokens that have already been generated.
* **length_penalty**: Exponential length penalty used with beam-based generation.
* **no_repeat_ngram_size**: Prevent n-grams of this size from being repeated.
* **bad_words_ids / force_words_ids**: Token sequences that must not, or must, appear in the output.

### Output configuration

* **num_return_sequences**: Number of sequences returned per prompt.
* **output_attentions, output_hidden_states, output_scores**: Return model internals.
* **return_dict_in_generate**: Return structured output.

### Special tokens

* **pad_token_id, bos_token_id, eos_token_id**: IDs of the padding, beginning-of-sequence, and end-of-sequence tokens.

### Encoder-decoder specific

* **encoder_no_repeat_ngram_size**: Prevent n-grams of this size that occur in the encoder input from appearing in the output.
* **decoder_start_token_id**: Define the decoder start token.

---

## Troubleshooting and best practices

* Ensure the MinIO credentials are correct before uploading.
* Verify GPU availability before deploying the model.
* If inference fails, check the logs in the dashboard.
* Use a lower `max_new_tokens` for faster testing.
* Always test endpoints with smaller inputs first.

---
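To tie the request format and the parameters together, the request body can be assembled programmatically. The sketch below only builds the JSON body in the `instances` / `params` shape shown in the Inference section. The helper name `build_payload` is illustrative, and the endpoint URL and authentication are deliberately left out: take those from the **Sample API Request** section of your endpoint. In line with the advice above, it uses a small `max_new_tokens` for a quick first test.

```python
import json


def build_payload(prompts, **params):
    """Build a request body with one instance per prompt and shared params."""
    if isinstance(prompts, str):
        prompts = [prompts]
    return {
        "instances": [{"text": p} for p in prompts],
        "params": params,
    }


# A short first test: two prompts, few new tokens, mild sampling.
payload = build_payload(
    ["here's the tea recipe", "here's the pancakes recipe"],
    max_new_tokens=32,
    do_sample=True,
    temperature=0.7,
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be pasted into the `curl` command from the **Sample API Request** section as the request body.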