vLLM with OpenAI Client

vLLM provides an HTTP server that implements OpenAI’s Completions and Chat API, allowing seamless integration with existing OpenAI-compatible tools.


Using OpenAI Completions API with vLLM

Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for any application built on the OpenAI client. Below is an example of querying the vLLM server using the OpenAI Python package.

import openai

auth_token = "$AUTH_TOKEN" # put your auth token here...

openai.api_key = auth_token
openai.base_url = "<your-vllm-endpoint-url>"

completion = openai.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
)

print("Completion result:", completion)

Parameters

When using the completion feature of the vLLM Serverless Endpoint Worker, you can customize requests using the parameters listed below. A short usage sketch follows each table.

Supported Completions inputs and descriptions
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | str |  | The model repo deployed on your vLLM endpoint. |
| prompt | str / list |  | The input prompt text or tokens for generation. |
| suffix | str | None | Text appended after the generated sequence. |
| max_tokens | int | 16 | Max number of tokens to generate. |
| temperature | float | 1.0 | Controls randomness of output. Lower = more deterministic. |
| top_p | float | 1.0 | Nucleus sampling threshold (0–1). |
| n | int | 1 | Number of outputs to return. |
| stream | bool | False | Whether to stream output responses. |
| logprobs | int | None | Number of log probabilities per output token. |
| echo | bool | False | Echo the prompt in the completion output. |
| stop | list / str | [] | Tokens or strings that stop generation. |
| seed | int | None | Random seed for reproducibility. |
| presence_penalty | float | 0.0 | Penalizes tokens already used. |
| frequency_penalty | float | 0.0 | Penalizes frequent tokens. |
| best_of | int | None | Generate multiple candidates and return the top n. |
| logit_bias | dict | None | Bias generation toward/against specific tokens. |
| user | str | None | Optional user identifier. |
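As a quick illustration, the sketch below passes a few of these standard fields to the same completions call shown earlier. The values are illustrative, and it assumes openai.api_key and openai.base_url have already been set as in the first example.

# A minimal sketch: standard Completions parameters (illustrative values).
completion = openai.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
    max_tokens=64,        # cap the generated length (default is 16)
    temperature=0.7,      # lower values make output more deterministic
    top_p=0.9,            # nucleus sampling threshold
    n=2,                  # return two candidate completions
    stop=["\n\n"],        # stop generating at a blank line
)

for choice in completion.choices:
    print(choice.text)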

Additional Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| top_k | int | -1 | Consider only the top-k tokens. |
| ignore_eos | bool | False | Ignore end-of-sentence tokens. |
| use_beam_search | bool | False | Enable beam search generation. |
| stop_token_ids | list | [] | Token IDs that halt generation. |
| skip_special_tokens | bool | True | Skip special tokens in output. |
| repetition_penalty | float | 1.0 | Penalize repetition. |
| length_penalty | float | 1.0 | Penalize long sequences. |
| min_p | float | 0.0 | Minimum relative probability cutoff. |
| include_stop_str_in_output | bool | False | Include stop strings in final text. |
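These additional fields are vLLM extensions rather than part of the standard OpenAI schema, so the OpenAI Python client does not accept them as named arguments. One common way to forward them is the client's extra_body argument, which merges extra fields into the request JSON. The sketch below assumes that route and uses illustrative values.

# A minimal sketch: vLLM-specific sampling options forwarded via extra_body.
# extra_body merges these fields into the request body; values are illustrative.
completion = openai.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
    max_tokens=64,
    extra_body={
        "top_k": 40,                # consider only the 40 most likely tokens
        "repetition_penalty": 1.1,  # mildly discourage repetition
        "min_p": 0.05,              # drop tokens far below the top probability
    },
)

print(completion.choices[0].text)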

Using OpenAI Chat API with vLLM (Streaming and Non-Streaming)

The vLLM server supports the OpenAI Chat API, allowing interactive, context-aware conversation generation. You can use either streaming or non-streaming modes.

Python3 Streaming Example

import openai

auth_token = "$AUTH_TOKEN" # your auth token
openai.api_key = auth_token
openai.base_url = "<your-vllm-endpoint-url>"

streamer = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What are large language models?"}],
    stream=True,
)

for chunk in streamer:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Python3 Non-Streaming Example

import openai

auth_token = "$AUTH_TOKEN" # your auth token
openai.api_key = auth_token
openai.base_url = "<your-vllm-endpoint-url>"

completion = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is artificial intelligence?"}],
)

print(completion.choices[0].message.content)

Parameters for Chat Completions

When using the chat completion feature of the vLLM Serverless Endpoint Worker, the following parameters are available. A usage sketch follows the table.

Supported Chat Completions Inputs and Descriptions
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | str |  | The deployed model name on your vLLM endpoint. |
| messages | list |  | A list of role/content pairs for the conversation. |
| max_tokens | int | 16 | Max tokens to generate in the response. |
| temperature | float | 1.0 | Controls randomness of generation. |
| top_p | float | 1.0 | Nucleus sampling parameter. |
| stream | bool | False | Whether to enable streaming responses. |
| stop | list | [] | Stop generation on specified sequences. |
| seed | int | None | Random seed. |
| presence_penalty | float | 0.0 | Penalize token reuse. |
| frequency_penalty | float | 0.0 | Penalize frequent tokens. |
| repetition_penalty | float | 1.0 | Penalize repetition intensity. |
| top_k | int | -1 | Consider only the top-k probable tokens. |
| ignore_eos | bool | False | Ignore the EOS token during generation. |
| length_penalty | float | 1.0 | Control length influence in beam search. |
| include_stop_str_in_output | bool | False | Whether to include stop strings in the output. |
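Putting this together, the sketch below shows a parameterized chat request with illustrative values. Standard OpenAI fields are passed directly; the vLLM-specific ones (e.g. top_k, repetition_penalty) are assumed to be forwarded via extra_body, as in the completions sketch above.

# A minimal sketch: a parameterized chat request (illustrative values).
# Assumes openai.api_key and openai.base_url are set as in the earlier examples.
completion = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is artificial intelligence?"},
    ],
    max_tokens=128,
    temperature=0.7,
    stop=["\n\n"],
    extra_body={"top_k": 40, "repetition_penalty": 1.1},  # vLLM-specific fields
)

print(completion.choices[0].message.content)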

Summary

vLLM integrates seamlessly with OpenAI clients, offering both Completions and Chat APIs. It gives fine-grained control over generation through request parameters and supports both synchronous and streaming responses. The setup provides a smooth developer experience while maintaining full compatibility with existing OpenAI-based tools.