vLLM with the OpenAI Client
vLLM provides an HTTP server that implements OpenAI's Completions and Chat APIs, allowing seamless integration with existing OpenAI-compatible tools.
Using OpenAI Completions API with vLLM
Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for any application built on the OpenAI client. Below is an example of querying the vLLM server using the OpenAI Python package.
```python
import openai

# Point the OpenAI client at your vLLM endpoint.
auth_token = "$AUTH_TOKEN"  # put your auth token here
openai.api_key = auth_token
openai.base_url = "<your-vllm-endpoint-url>"

# Text completion request (Completions API, openai>=1.0 client)
completion = openai.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
)

print("Completion result:", completion)
```
Parameters
When using the completion feature of the vLLM Serverless Endpoint Worker, you can customize requests using the parameters listed below.
Supported Completions inputs and descriptions
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | – | The model repo deployed on your vLLM endpoint. |
| prompt | str / list | – | The input prompt text or tokens for generation. |
| suffix | str | None | Text appended after the generated sequence. |
| max_tokens | int | 16 | Max number of tokens to generate. |
| temperature | float | 1.0 | Controls randomness of output. Lower = deterministic. |
| top_p | float | 1.0 | Nucleus sampling threshold (0–1). |
| n | int | 1 | Number of outputs to return. |
| stream | bool | False | Whether to stream output responses. |
| logprobs | int | None | Number of log probabilities per output token. |
| echo | bool | False | Echo prompt in the completion output. |
| stop | list / str | [] | Strings at which generation stops. |
| seed | int | None | Random seed for reproducibility. |
| presence_penalty | float | 0.0 | Penalizes tokens already used. |
| frequency_penalty | float | 0.0 | Penalizes frequent tokens. |
| best_of | int | None | Number of candidate completions generated server-side; the best n are returned. |
| logit_bias | dict | None | Bias generation toward/against specific tokens. |
| user | str | None | Optional user identifier. |
Additional Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| top_k | int | -1 | Consider top-k tokens. |
| ignore_eos | bool | False | Continue generating after the end-of-sequence (EOS) token. |
| use_beam_search | bool | False | Enable beam search generation. |
| stop_token_ids | list | [] | Token IDs that halt generation when produced. |
| skip_special_tokens | bool | True | Skip special tokens in output. |
| repetition_penalty | float | 1.0 | Penalize repetition. |
| length_penalty | float | 1.0 | Penalize long sequences. |
| min_p | float | 0.0 | Minimum relative probability cutoff. |
| include_stop_str_in_output | bool | False | Include stop strings in final text. |
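The standard OpenAI fields are passed as regular keyword arguments to the client. The vLLM-specific extensions in the second table are not part of the OpenAI client's method signature; with the openai>=1.0 Python client they can be sent through `extra_body`, which merges additional fields into the request JSON. The sketch below is illustrative: the parameter values are arbitrary, and it assumes your endpoint accepts the extensions listed above.

```python
import openai

openai.api_key = "$AUTH_TOKEN"
openai.base_url = "<your-vllm-endpoint-url>"

# Standard OpenAI parameters go in as keyword arguments; vLLM-specific
# extensions (top_k, repetition_penalty, min_p, ...) are forwarded in
# the request body via extra_body.
completion = openai.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
    max_tokens=64,
    temperature=0.7,
    stop=["\n\n"],
    extra_body={
        "top_k": 50,
        "repetition_penalty": 1.1,
        "min_p": 0.05,
    },
)

print(completion.choices[0].text)
```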
Using OpenAI Chat API with vLLM (Streaming and Non-Streaming)
The vLLM server supports the OpenAI Chat API, allowing interactive, context-aware conversation generation. You can use either streaming or non-streaming modes.
Python3 Streaming Example
```python
import openai

auth_token = "$AUTH_TOKEN"  # your auth token
openai.api_key = auth_token
openai.base_url = "<your-vllm-endpoint-url>"

# Streaming chat request: tokens are printed as they arrive.
streamer = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What are large language models?"}],
    stream=True,
)

for chunk in streamer:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
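Each streamed chunk carries an incremental delta rather than the full message; the loop above concatenates `chunk.choices[0].delta.content` to reconstruct the complete reply as it is generated.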
Python3 Non-Streaming Example
```python
import openai

auth_token = "$AUTH_TOKEN"  # your auth token
openai.api_key = auth_token
openai.base_url = "<your-vllm-endpoint-url>"

# Non-streaming chat request: the full reply is returned in one response.
completion = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is artificial intelligence?"}],
)

print(completion.choices[0].message.content)
```
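The Chat API itself is stateless, so conversation context is carried by the client: append each assistant reply to the `messages` list before sending the next user turn. A minimal sketch (the prompts are placeholders):

```python
import openai

openai.api_key = "$AUTH_TOKEN"
openai.base_url = "<your-vllm-endpoint-url>"

# The server does not store conversation state; the client keeps the
# running message history and resends it with every request.
messages = [{"role": "user", "content": "What is artificial intelligence?"}]

first = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=messages,
)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# Follow-up question that relies on the earlier context.
messages.append({"role": "user", "content": "Give a one-sentence example of it."})
second = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=messages,
)
print(second.choices[0].message.content)
```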
Parameters for Chat Completions
When using the chat completion feature of the vLLM Serverless Endpoint Worker, the following parameters are available:
Supported Chat Completions Inputs and Descriptions
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | – | The deployed model name on your vLLM endpoint. |
| messages | list | – | A list of role/content pairs for the conversation. |
| max_tokens | int | 16 | Max tokens to generate in response. |
| temperature | float | 1.0 | Controls randomness of generation. |
| top_p | float | 1.0 | Nucleus sampling parameter. |
| stream | bool | False | Whether to enable streaming responses. |
| stop | list | [] | Stop generation at the specified sequences. |
| seed | int | None | Random seed. |
| presence_penalty | float | 0.0 | Penalize token reuse. |
| frequency_penalty | float | 0.0 | Penalizes frequently repeated tokens. |
| repetition_penalty | float | 1.0 | Multiplicative penalty applied to repeated tokens. |
| top_k | int | -1 | Consider top-k probable tokens. |
| ignore_eos | bool | False | Ignore EOS token during generation. |
| length_penalty | float | 1.0 | Control length influence in beam search. |
| include_stop_str_in_output | bool | False | Whether to include the matched stop string in the output text. |
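As with the Completions API, the standard chat parameters are keyword arguments, while the vLLM-specific fields from the table can be sent via `extra_body`. The following sketch combines streaming with several of the parameters above; the values are arbitrary and assume your endpoint accepts these extensions.

```python
import openai

openai.api_key = "$AUTH_TOKEN"
openai.base_url = "<your-vllm-endpoint-url>"

# Streaming chat request combining several chat parameters; the
# vLLM-specific fields again travel in the request body via extra_body.
stream = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize what vLLM does in two sentences."}],
    max_tokens=128,
    temperature=0.7,
    top_p=0.9,
    seed=42,
    stream=True,
    extra_body={"repetition_penalty": 1.05, "top_k": 40},
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```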
Summary
vLLM integrates seamlessly with OpenAI clients, offering both the Completions and Chat APIs. Requests can be customized through the sampling parameters above, and responses can be returned synchronously or streamed. The setup provides a smooth developer experience while maintaining full compatibility with existing OpenAI-based tools.