# vLLM with OpenAI Client
vLLM provides an HTTP server that implements OpenAI’s **Completions** and **Chat Completions** APIs, so existing OpenAI-compatible tools can talk to it with minimal changes.

---
## Using OpenAI Completions API with vLLM
Since this server is compatible with the OpenAI API, you can use it as a **drop-in replacement** for any application built on the OpenAI client. Below is an example of querying the vLLM server using the OpenAI Python package.
```python
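import openai

auth_token = "$AUTH_TOKEN"  # put your auth token here
openai.api_key = auth_token
openai.base_url = ""  # set this to your endpoint's OpenAI-compatible base URL

# Requires openai>=1.0, where the module-level client exposes `completions.create`
completion = openai.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
)
print("Completion result:", completion)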
```
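For orientation, a completions call of this kind boils down to a single authenticated HTTP POST. The sketch below builds that request with the standard library but does not send it. `BASE_URL` is a placeholder assumption, not a value from this guide, and `build_completion_request` is a hypothetical helper; substitute the OpenAI-compatible URL of your own endpoint.

```python
import json
import urllib.request

# Assumption: placeholder base URL; replace with your endpoint's OpenAI-compatible URL.
BASE_URL = "http://localhost:8000/v1"
AUTH_TOKEN = "$AUTH_TOKEN"  # put your auth token here


def build_completion_request(prompt, model, base_url=BASE_URL, token=AUTH_TOKEN, **params):
    """Build (but do not send) a POST request for the OpenAI-style /completions route."""
    body = {"model": model, "prompt": prompt, **params}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_completion_request(
    "San Francisco is a",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_tokens=32,
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(request)
```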
---
## Parameters
When using the completion feature of the vLLM Serverless Endpoint Worker, you can customize requests using the parameters listed below.
**Supported Completions inputs and descriptions**
| Parameter | Type | Default | Description |
| ----------------- | ---------- | ------- | ----------------------------------------------------- |
| model | str | – | The model repo deployed on your vLLM endpoint. |
| prompt | str / list | – | The input prompt text or tokens for generation. |
| suffix | str | None | Text appended after the generated sequence. |
| max_tokens | int | 16 | Max number of tokens to generate. |
| temperature | float | 1.0 | Controls randomness of output. Lower values are more deterministic. |
| top_p | float | 1.0 | Nucleus sampling threshold (0–1). |
| n | int | 1 | Number of outputs to return. |
| stream | bool | False | Whether to stream output responses. |
| logprobs | int | None | Number of log probabilities per output token. |
| echo | bool | False | Echo prompt in the completion output. |
| stop | list / str | [] | Tokens or strings to stop generation. |
| seed | int | None | Random seed for reproducibility. |
| presence_penalty | float | 0.0 | Penalizes tokens that have already appeared at least once. |
| frequency_penalty | float | 0.0 | Penalizes tokens in proportion to how often they have already appeared. |
| best_of | int | None | Generate multiple candidates and return top n. |
| logit_bias | dict | None | Bias generation toward/against specific tokens. |
| user | str | None | Optional user identifier. |
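
To make `temperature` and `top_p` concrete, here is a small self-contained sketch of the usual textbook definitions: logits are divided by the temperature before the softmax, and nucleus sampling keeps the smallest set of most probable tokens whose total probability reaches `top_p`. The logits are made up, and this illustrates the idea rather than vLLM's actual sampler. The sketch requires a temperature above 0.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Return indices of the smallest most-probable set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 2.0)
print("T=0.5:", [round(p, 3) for p in sharp])
print("T=2.0:", [round(p, 3) for p in flat])
print("top_p=0.9 keeps:", nucleus(softmax_with_temperature(logits, 1.0), 0.9))
```

A lower temperature concentrates probability on the top token, while a lower `top_p` shrinks the set of tokens that can be sampled at all.
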
**Additional Parameters**
| Parameter | Type | Default | Description |
| -------------------------- | ----- | ------- | ------------------------------------ |
| top_k | int | -1 | Consider only the k most probable tokens; -1 considers all. |
| ignore_eos | bool | False | Ignore the end-of-sequence (EOS) token and keep generating. |
| use_beam_search | bool | False | Enable beam search generation. |
| stop_token_ids | list | [] | Token IDs to halt generation. |
| skip_special_tokens | bool | True | Skip special tokens in output. |
| repetition_penalty | float | 1.0 | Penalize repetition. |
| length_penalty | float | 1.0 | Length-based penalty applied in beam search. |
| min_p | float | 0.0 | Minimum relative probability cutoff. |
| include_stop_str_in_output | bool | False | Include stop strings in final text. |
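
The OpenAI Python client (v1+) accepts only the standard OpenAI arguments as keywords; fields it does not know about can be passed through its `extra_body` argument. Below is a minimal sketch of splitting a parameter dict accordingly. It assumes your endpoint reads these extra fields from the request body, and `split_params` is a hypothetical helper, not part of any library.

```python
# Names from the "Additional Parameters" table; not keyword arguments of the OpenAI client.
VLLM_ONLY = {
    "top_k", "ignore_eos", "use_beam_search", "stop_token_ids",
    "skip_special_tokens", "repetition_penalty", "length_penalty",
    "min_p", "include_stop_str_in_output",
}


def split_params(params):
    """Split params into (standard client kwargs, extra_body dict)."""
    standard = {k: v for k, v in params.items() if k not in VLLM_ONLY}
    extra = {k: v for k, v in params.items() if k in VLLM_ONLY}
    return standard, extra


standard, extra = split_params(
    {"max_tokens": 64, "temperature": 0.7, "top_k": 40, "repetition_penalty": 1.1}
)
print("client kwargs:", standard)
print("extra_body:", extra)

# Hypothetical usage, with the `openai` module configured for your endpoint:
# completion = openai.completions.create(
#     model="meta-llama/Meta-Llama-3-8B-Instruct",
#     prompt="San Francisco is a",
#     extra_body=extra,
#     **standard,
# )
```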
---
## Using OpenAI Chat API with vLLM (Streaming and Non-Streaming)
The vLLM server supports the **OpenAI Chat API**, allowing interactive, context-aware conversation generation. You can use either **streaming** or **non-streaming** modes.
### Python3 Streaming Example
```python
import openai

auth_token = "$AUTH_TOKEN"  # your auth token
openai.api_key = auth_token
openai.base_url = ""  # set this to your endpoint's OpenAI-compatible base URL

streamer = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What are large language models?"}],
    stream=True,
)
# Print each text delta as it arrives; content can be None on some chunks
for chunk in streamer:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
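If you need the full reply as one string rather than printing deltas as they arrive, the chunks can be accumulated. The sketch below runs offline: `make_chunk` is a hypothetical stand-in that mimics only the `choices[0].delta.content` shape of a streamed chat chunk, and `collect_stream` is an illustrative helper, not a client API.

```python
from types import SimpleNamespace


def make_chunk(text):
    """Build a stand-in object shaped like a streamed chat chunk (offline testing only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect_stream(chunks):
    """Concatenate the text deltas from an iterable of chat chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # tolerate chunks that carry no choices
            continue
        content = chunk.choices[0].delta.content
        if content:  # skip None or empty deltas
            parts.append(content)
    return "".join(parts)


fake_stream = [
    make_chunk(None),
    make_chunk("Large language "),
    make_chunk("models"),
    make_chunk(""),
]
full_text = collect_stream(fake_stream)
print(full_text)
```

With a real endpoint, the same function can be passed the `streamer` object returned by a `stream=True` request.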
### Python3 Non-Streaming Example
```python
import openai

auth_token = "$AUTH_TOKEN"  # your auth token
openai.api_key = auth_token
openai.base_url = ""  # set this to your endpoint's OpenAI-compatible base URL

completion = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is artificial intelligence?"}],
)
print(completion.choices[0].message.content)
```
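The Chat API is stateless, so a context-aware conversation means resending the earlier turns in `messages` on every request. The sketch below shows one way to maintain that history. `add_turn` is a hypothetical helper, the reply text is a placeholder so the code runs without a server, and the role check covers only the three most common roles.

```python
def add_turn(history, role, content):
    """Return a new message list with one more role/content pair appended."""
    if role not in {"system", "user", "assistant"}:
        raise ValueError(f"unexpected role: {role!r}")
    return history + [{"role": role, "content": content}]


history = []
history = add_turn(history, "system", "You are a concise assistant.")
history = add_turn(history, "user", "What is artificial intelligence?")

# After a real call, append the model's reply so the next request carries the context:
# reply = completion.choices[0].message.content
reply = "(model reply goes here)"  # placeholder so the sketch runs offline
history = add_turn(history, "assistant", reply)
history = add_turn(history, "user", "Give one everyday example.")

# `history` is what you would pass as `messages=` in the next request.
for message in history:
    print(message["role"], "->", message["content"])
```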
---
## Parameters for Chat Completions
When using the chat completion feature of the vLLM Serverless Endpoint Worker, the following parameters are available:
**Supported Chat Completions inputs and descriptions**
| Parameter | Type | Default | Description |
| -------------------------- | ----- | ------- | -------------------------------------------------- |
| model | str | – | The deployed model name on your vLLM endpoint. |
| messages | list | – | A list of role/content pairs for the conversation. |
| max_tokens | int | 16 | Max tokens to generate in response. |
| temperature | float | 1.0 | Controls randomness of generation. |
| top_p | float | 1.0 | Nucleus sampling parameter. |
| stream | bool | False | Whether to enable streaming responses. |
| stop | list | [] | Stop generation on specified sequences. |
| seed | int | None | Random seed. |
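| presence_penalty | float | 0.0 | Penalizes tokens that have already appeared at least once. |
| frequency_penalty | float | 0.0 | Penalizes tokens in proportion to how often they have already appeared. |
| repetition_penalty | float | 1.0 | Penalizes repeated tokens; 1.0 applies no penalty. |
| top_k | int | -1 | Consider only the k most probable tokens; -1 considers all. |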
| ignore_eos | bool | False | Ignore EOS token during generation. |
| length_penalty | float | 1.0 | Control length influence in beam search. |
| include_stop_str_in_output | bool | False | Whether to include stop strings in the output. |
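
The difference between `presence_penalty` and `frequency_penalty` is easiest to see on numbers. The sketch below follows the formula given in the OpenAI API documentation: a flat deduction for any token that has already appeared, plus a deduction that grows with its count. Token ids and logits are made up, `apply_penalties` is a hypothetical helper, and vLLM's own implementation may differ in detail.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, presence_penalty=0.0, frequency_penalty=0.0):
    """Return a copy of logits adjusted for tokens that were already generated."""
    counts = Counter(generated_ids)
    adjusted = dict(logits)
    for token_id, count in counts.items():
        if token_id in adjusted:
            adjusted[token_id] -= presence_penalty  # flat, once per seen token
            adjusted[token_id] -= frequency_penalty * count  # grows with each repeat
    return adjusted


logits = {101: 2.0, 102: 2.0, 103: 2.0}  # made-up token ids and scores
generated = [101, 101, 101, 102]  # 101 appeared three times, 102 once
adjusted = apply_penalties(logits, generated, presence_penalty=0.5, frequency_penalty=0.25)
print(adjusted)
```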
---
## Summary
vLLM integrates with OpenAI clients through its Completions and Chat Completions APIs. Request parameters give fine-grained control over generation, and responses can be returned either in one piece or as a stream, so existing OpenAI-based tools can work with little or no modification.

---