vLLM with OpenAI Client

vLLM provides an HTTP server that implements OpenAI’s Completions and Chat Completions APIs, so it integrates with existing OpenAI-compatible tools.

Using OpenAI Completions API with vLLM

Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for any application built on the OpenAI client. Below is an example of querying the vLLM server using the OpenAI Python package.


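import openai  # requires openai>=1.0

auth_token = "$AUTH_TOKEN"  # put your auth token here

# Point the client at the vLLM server instead of the default OpenAI endpoint.
openai.api_key = auth_token
openai.base_url = "<your-vllm-endpoint-url>"

# In openai>=1.0 the Completions API is exposed as openai.completions.create;
# the older openai.Completion.create call is no longer supported by this client.
completion = openai.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
)

print("Completion result:", completion)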
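Hard-coding the token and URL is fine for a quick test, but reading them from the environment keeps secrets out of source files. The sketch below is one way to do that. The variable names `VLLM_API_KEY` and `VLLM_BASE_URL` and the helper `load_endpoint_config` are hypothetical, chosen for this example; neither vLLM nor the OpenAI package defines them.

```python
import os


def load_endpoint_config(env=None):
    """Read the auth token and endpoint URL from environment variables.

    VLLM_API_KEY and VLLM_BASE_URL are hypothetical names used only in this sketch.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("VLLM_API_KEY", "VLLM_BASE_URL") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"api_key": env["VLLM_API_KEY"], "base_url": env["VLLM_BASE_URL"]}


# Demonstration with an explicit mapping instead of the real environment.
config = load_endpoint_config(
    {"VLLM_API_KEY": "example-token", "VLLM_BASE_URL": "https://example.invalid/v1"}
)
print(config["base_url"])
```

The two returned values can then be assigned to `openai.api_key` and `openai.base_url` in place of the literals used in the example.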

Parameters

When using the completion feature of the vLLM Serverless Endpoint Worker, you can customize requests using the parameters listed below.

Supported Completions inputs and descriptions

Parameter	Type	Default	Description
model	str	–	The model repo deployed on your vLLM endpoint.
prompt	str / list	–	The input prompt text or tokens for generation.
suffix	str	None	Text appended after the generated sequence.
max_tokens	int	16	Max number of tokens to generate.
temperature	float	1.0	Controls randomness of output. Lower values make output more deterministic.
top_p	float	1.0	Nucleus sampling threshold (0–1).
n	int	1	Number of outputs to return.
stream	bool	False	Whether to stream output responses.
logprobs	int	None	Number of log probabilities per output token.
echo	bool	False	Echo prompt in the completion output.
stop	list / str	[]	One or more strings at which generation stops.
seed	int	None	Random seed for reproducibility.
presence_penalty	float	0.0	Penalizes tokens that have already appeared, regardless of how often.
frequency_penalty	float	0.0	Penalizes tokens in proportion to how often they have already appeared.
best_of	int	None	Generate multiple candidates and return top n.
logit_bias	dict	None	Bias generation toward/against specific tokens.
user	str	None	Optional user identifier.
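As a rough illustration of what `top_p` does, the sketch below applies nucleus filtering to a toy next-token distribution. It is a conceptual example with made-up tokens and probabilities, not vLLM's sampler implementation; the function name `top_p_filter` is hypothetical.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities to sum to 1."""
    kept = {}
    total = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        total += p
        if total >= top_p:
            break
    return {token: p / total for token, p in kept.items()}


# Toy distribution over candidate next tokens (values are invented).
toy = {"city": 0.5, "town": 0.3, "place": 0.15, "fog": 0.05}
filtered = top_p_filter(toy, top_p=0.75)
print(sorted(filtered))
```

With `top_p=0.75`, only the two most likely tokens survive, and sampling then happens among them. With `top_p=1.0`, every token stays eligible.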

Additional Parameters

Parameter	Type	Default	Description
top_k	int	-1	Number of highest-probability tokens to consider; -1 considers all tokens.
ignore_eos	bool	False	Keep generating after the end-of-sequence (EOS) token is produced.
use_beam_search	bool	False	Enable beam search generation.
stop_token_ids	list	[]	Token IDs that halt generation.
skip_special_tokens	bool	True	Skip special tokens in output.
repetition_penalty	float	1.0	Penalizes repeated tokens; values above 1.0 discourage repetition.
length_penalty	float	1.0	Controls how sequence length influences scoring in beam search.
min_p	float	0.0	Minimum token probability, relative to the most likely token, for a token to be considered.
include_stop_str_in_output	bool	False	Include stop strings in final text.
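The parameters in this second table are not part of the OpenAI API, so the OpenAI Python client does not accept them as named arguments. The client does provide an `extra_body` argument whose contents are merged into the request JSON, and the sketch below routes the vLLM-specific parameters there. The helper name `build_request_kwargs` is hypothetical, and whether your endpoint honors each parameter is an assumption to check against your deployment.

```python
# vLLM-specific parameter names, copied from the "Additional Parameters" table.
VLLM_ONLY = frozenset({
    "top_k", "ignore_eos", "use_beam_search", "stop_token_ids",
    "skip_special_tokens", "repetition_penalty", "length_penalty",
    "min_p", "include_stop_str_in_output",
})


def build_request_kwargs(**params):
    """Split parameters into standard OpenAI arguments and an extra_body dict."""
    standard = {k: v for k, v in params.items() if k not in VLLM_ONLY}
    extra = {k: v for k, v in params.items() if k in VLLM_ONLY}
    if extra:
        standard["extra_body"] = extra  # merged into the request JSON by the client
    return standard


kwargs = build_request_kwargs(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
    max_tokens=32,
    top_k=40,
    repetition_penalty=1.1,
)
print(kwargs)
```

With the client configured as in the example earlier on this page, the result can be passed on as `openai.completions.create(**kwargs)`.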

Using OpenAI Chat API with vLLM (Streaming and Non-Streaming)

The vLLM server also supports the OpenAI Chat Completions API for multi-turn, context-aware conversations. Responses can be streamed token by token or returned in a single reply.

Python3 Streaming Example


import openai

auth_token = "$AUTH_TOKEN"  # your auth token
openai.api_key = auth_token
openai.base_url = "<your-vllm-endpoint-url>"

streamer = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What are large language models?"}],
    stream=True
)

for chunk in streamer:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
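If you need the full reply as a single string rather than printing it piece by piece, the chunks can be accumulated. The sketch below does this and also skips chunks that carry no choices or no text. The `fake_chunk` helper is a stand-in that mimics only the attributes used here, so the example runs without a live endpoint; both helper names are hypothetical.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate the text deltas of a chat-completion stream into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # skip None or empty deltas, such as a role-only first chunk
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Build an object shaped like a streamed chunk, for offline testing only."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [
    fake_chunk(None),
    fake_chunk("Large "),
    fake_chunk("language models"),
    SimpleNamespace(choices=[]),
]
full_text = collect_stream(demo)
print(full_text)
```

In real use, pass the `streamer` object from the example above to `collect_stream` instead of `demo`.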

Python3 Non-Streaming Example


import openai

auth_token = "$AUTH_TOKEN"  # your auth token
openai.api_key = auth_token
openai.base_url = "<your-vllm-endpoint-url>"

completion = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is artificial intelligence?"}]
)

print(completion.choices[0].message.content)
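The request above sends a single user message. Because the API is stateless, a follow-up question only has context if the earlier turns are sent again in `messages`. The sketch below shows one way to build that history. The helper `add_turn` is hypothetical, and its three-role check is a simplification for this example; whether a system message is honored depends on the deployed model's chat template.

```python
def add_turn(history, role, content):
    """Return a new message list with one more role/content entry appended."""
    if role not in ("system", "user", "assistant"):
        raise ValueError(f"unexpected role: {role!r}")
    return history + [{"role": role, "content": content}]


messages = add_turn([], "system", "You are a concise assistant.")
messages = add_turn(messages, "user", "What is artificial intelligence?")
# After each call, append the model's reply before asking a follow-up.
# The placeholder stands for completion.choices[0].message.content.
messages = add_turn(messages, "assistant", "<reply text from the previous call>")
messages = add_turn(messages, "user", "Give one everyday example.")
print([m["role"] for m in messages])
```

The resulting list is what you would pass as `messages=` on the next `openai.chat.completions.create` call.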

Parameters for Chat Completions

When using the chat completion feature of the vLLM Serverless Endpoint Worker, the following parameters are available:

Supported Chat Completions Inputs and Descriptions

Parameter	Type	Default	Description
model	str	–	The deployed model name on your vLLM endpoint.
messages	list	–	A list of role/content pairs for the conversation.
max_tokens	int	16	Max tokens to generate in response.
temperature	float	1.0	Controls randomness of generation.
top_p	float	1.0	Nucleus sampling parameter.
stream	bool	False	Whether to enable streaming responses.
stop	list	[]	Stop generation on specified sequences.
seed	int	None	Random seed.
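presence_penalty	float	0.0	Penalizes tokens that have already appeared, regardless of how often.
frequency_penalty	float	0.0	Penalizes tokens in proportion to how often they have already appeared.
repetition_penalty	float	1.0	Penalizes repeated tokens; values above 1.0 discourage repetition.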
top_k	int	-1	Consider top-k probable tokens.
ignore_eos	bool	False	Ignore EOS token during generation.
length_penalty	float	1.0	Control length influence in beam search.
include_stop_str_in_output	bool	False	Whether to include stop strings in output.

Summary

vLLM integrates with OpenAI clients through both the Completions and Chat Completions APIs. Request parameters give fine-grained control over generation, and responses can be returned in one piece or streamed. Existing OpenAI-based tools can be pointed at the vLLM endpoint with minimal changes.


Last updated on May 15, 2026.
