vLLM with OpenAI Client
vLLM provides an HTTP server that implements the OpenAI Completions and Chat Completions APIs.
Using OpenAI Completions API with vLLM
Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for any application that uses the OpenAI API. For example, another way to query the server is with the OpenAI Python package:
import openai

auth_token = "$AUTH_TOKEN"  # put your auth token here...
openai.api_key = auth_token
openai.base_url = ""  # set this to your endpoint's OpenAI-compatible base URL

# Request a text completion from the deployed model (openai>=1.0 client)
completion = openai.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
)
print("Completion result:", completion)
Parameters
When using the completions feature of the vLLM Serverless Endpoint Worker, you can customize your requests with the following parameters.
Supported Completions Inputs and Descriptions
Parameter | Type | Default Value | Description |
---|---|---|---|
model | str | | The model repo that you've deployed on your RunPod Serverless Endpoint. If you are unsure what the name is or are baking the model in, use the guide in the Examples: Using your RunPod endpoint with OpenAI section to get the list of available models. |
prompt | Union[List[int], List[List[int]], str, List[str]] | | A string, array of strings, array of tokens, or array of token arrays to be used as the input for the model. |
suffix | Optional[str] | None | A string to be appended to the end of the generated text. |
max_tokens | Optional[int] | 16 | Maximum number of tokens to generate per output sequence. |
temperature | Optional[float] | 1.0 | Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. |
top_p | Optional[float] | 1.0 | Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
n | Optional[int] | 1 | Number of output sequences to return for the given prompt. |
stream | Optional[bool] | False | Whether to stream the output. |
logprobs | Optional[int] | None | Number of log probabilities to return per output token. |
echo | Optional[bool] | False | Whether to echo back the prompt in addition to the completion. |
stop | Optional[Union[str, List[str]]] | list | List of strings that stop the generation when they are generated. The returned output will not contain the stop strings. |
seed | Optional[int] | None | Random seed to use for the generation. |
presence_penalty | Optional[float] | 0.0 | Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. |
frequency_penalty | Optional[float] | 0.0 | Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. |
best_of | Optional[int] | None | Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This parameter influences the diversity of the output. |
logit_bias | Optional[Dict[str, float]] | None | Dictionary of token IDs to biases. |
user | Optional[str] | None | User identifier for personalizing responses. (Unsupported by vLLM) |
Additional Parameters Supported by vLLM
Parameter | Type | Default Value | Description |
---|---|---|---|
top_k | Optional[int] | -1 | Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens. |
ignore_eos | Optional[bool] | False | Whether to ignore the End Of Sentence token and continue generating tokens after the EOS token is generated. |
use_beam_search | Optional[bool] | False | Whether to use beam search instead of sampling for generating outputs. |
stop_token_ids | Optional[List[int]] | list | List of tokens that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens. |
skip_special_tokens | Optional[bool] | True | Whether to skip special tokens in the output. |
spaces_between_special_tokens | Optional[bool] | True | Whether to add spaces between special tokens in the output. Defaults to True. |
repetition_penalty | Optional[float] | 1.0 | Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. |
min_p | Optional[float] | 0.0 | Float that represents the minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. |
length_penalty | Optional[float] | 1.0 | Float that penalizes sequences based on their length. Used in beam search. |
include_stop_str_in_output | Optional[bool] | False | Whether to include the stop strings in output text. Defaults to False. |
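The OpenAI-standard parameters above are passed as keyword arguments to the client's create call. The vLLM-specific parameters in the second table are not part of the OpenAI client's signature, so one way to send them is through the client's extra_body argument, which merges extra fields into the request body. A minimal sketch, assuming your endpoint passes these fields through to the vLLM sampler:

```python
import openai

openai.api_key = "$AUTH_TOKEN"  # placeholder auth token
openai.base_url = ""            # your endpoint's OpenAI-compatible base URL

completion = openai.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
    # OpenAI-standard sampling parameters
    max_tokens=64,
    temperature=0.7,
    top_p=0.9,
    stop=["\n\n"],
    # vLLM-specific parameters are not named arguments on the OpenAI client,
    # so they are forwarded in the request body via extra_body.
    extra_body={"top_k": 40, "repetition_penalty": 1.1},
)
print(completion.choices[0].text)
```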
Using OpenAI Chat API with vLLM for Streaming and Non-Streaming
The vLLM server is designed to support the OpenAI Chat Completions API, allowing you to engage in dynamic conversations with the model. The chat interface is a more interactive way to communicate with the model: back-and-forth exchanges can be stored in the chat history, which is useful for tasks that require context or more detailed explanations.
- Python3 Streaming
- Python3 Non-Streaming
import openai

auth_token = "$AUTH_TOKEN"  # put your auth token here...
openai.api_key = auth_token
openai.base_url = ""  # set this to your endpoint's OpenAI-compatible base URL

# Stream a chat completion and print tokens as they arrive
streamer = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "What are large language models?",
        },
    ],
    stream=True,
)
for chunk in streamer:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
import openai

auth_token = "$AUTH_TOKEN"  # put your auth token here...
openai.api_key = auth_token
openai.base_url = ""  # set this to your endpoint's OpenAI-compatible base URL

# Request a single, non-streaming chat completion
completion = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "What is artificial intelligence?",
        },
    ],
)
print(completion.choices[0].message.content)
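Because the chat interface works over a running list of messages, context is carried between turns by appending each reply to the history before the next request. A minimal sketch using the same placeholder token and base URL as above:

```python
import openai

openai.api_key = "$AUTH_TOKEN"  # placeholder auth token
openai.base_url = ""            # your endpoint's OpenAI-compatible base URL

# Chat history that accumulates across turns
messages = [{"role": "user", "content": "What are large language models?"}]

first = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=messages,
)

# Store the assistant's reply, then ask a follow-up that depends on it
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Summarize that in one sentence."})

follow_up = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=messages,
)
print(follow_up.choices[0].message.content)
```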
Parameters
When using the chat completion feature of the vLLM Serverless Endpoint Worker, you can customize your requests with the following parameters.
Supported Chat Completions Inputs and Descriptions
Parameter | Type | Default Value | Description |
---|---|---|---|
model | str | | The model repo that you've deployed on your RunPod Serverless Endpoint. If you are unsure what the name is or are baking the model in, use the guide in the Examples: Using your RunPod endpoint with OpenAI section to get the list of available models. |
prompt | Union[List[int], List[List[int]], str, List[str]] | | A string, array of strings, array of tokens, or array of token arrays to be used as the input for the model. |
suffix | Optional[str] | None | A string to be appended to the end of the generated text. |
max_tokens | Optional[int] | 16 | Maximum number of tokens to generate per output sequence. |
temperature | Optional[float] | 1.0 | Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. |
top_p | Optional[float] | 1.0 | Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
n | Optional[int] | 1 | Number of output sequences to return for the given prompt. |
stream | Optional[bool] | False | Whether to stream the output. |
logprobs | Optional[int] | None | Number of log probabilities to return per output token. |
echo | Optional[bool] | False | Whether to echo back the prompt in addition to the completion. |
stop | Optional[Union[str, List[str]]] | list | List of strings that stop the generation when they are generated. The returned output will not contain the stop strings. |
seed | Optional[int] | None | Random seed to use for the generation. |
presence_penalty | Optional[float] | 0.0 | Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. |
frequency_penalty | Optional[float] | 0.0 | Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. |
best_of | Optional[int] | None | Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This parameter influences the diversity of the output. |
logit_bias | Optional[Dict[str, float]] | None | Dictionary of token IDs to biases. |
user | Optional[str] | None | User identifier for personalizing responses. (Unsupported by vLLM) |
Additional Parameters Supported by vLLM
Parameter | Type | Default Value | Description |
---|---|---|---|
top_k | Optional[int] | -1 | Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens. |
ignore_eos | Optional[bool] | False | Whether to ignore the End Of Sentence token and continue generating tokens after the EOS token is generated. |
use_beam_search | Optional[bool] | False | Whether to use beam search instead of sampling for generating outputs. |
stop_token_ids | Optional[List[int]] | list | List of tokens that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens. |
skip_special_tokens | Optional[bool] | True | Whether to skip special tokens in the output. |
spaces_between_special_tokens | Optional[bool] | True | Whether to add spaces between special tokens in the output. Defaults to True. |
repetition_penalty | Optional[float] | 1.0 | Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. |
min_p | Optional[float] | 0.0 | Float that represents the minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. |
length_penalty | Optional[float] | 1.0 | Float that penalizes sequences based on their length. Used in beam search. |
include_stop_str_in_output | Optional[bool] | False | Whether to include the stop strings in output text. Defaults to False. |
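As with the Completions API, the OpenAI-standard chat parameters map to keyword arguments on the create call, while one way to forward the vLLM-specific parameters is through the client's extra_body argument. A minimal streaming sketch under the same placeholder assumptions as the examples above:

```python
import openai

openai.api_key = "$AUTH_TOKEN"  # placeholder auth token
openai.base_url = ""            # your endpoint's OpenAI-compatible base URL

stream = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What are large language models?"}],
    # OpenAI-standard parameters
    max_tokens=256,
    temperature=0.7,
    presence_penalty=0.5,
    stream=True,
    # vLLM-specific parameters forwarded in the request body
    extra_body={"top_k": 40, "min_p": 0.05},
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```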