# vLLM with OpenAI Client

vLLM provides an HTTP server that implements OpenAI’s **Completions** and **Chat** APIs, so existing OpenAI-compatible tools can talk to it without modification.

---

## Using OpenAI Completions API with vLLM

Because the server is compatible with the OpenAI API, you can use it as a **drop-in replacement** in any application built on the OpenAI client. Below is an example of querying the vLLM server using the OpenAI Python package.
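```python
import openai

auth_token = "$AUTH_TOKEN"  # put your auth token here

# Configure the module-level client (openai>=1.0 interface).
openai.api_key = auth_token
openai.base_url = ""  # set this to the base URL of your vLLM endpoint

# openai.Completion.create was removed in openai 1.0; use openai.completions.create.
completion = openai.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
)

print("Completion result:", completion)
```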
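Because the server speaks plain HTTP, it can help to see roughly what the client sends. The following is a minimal sketch that builds, but does not send, an equivalent request with the Python standard library. It assumes the conventional OpenAI layout, where completions are posted to `/completions` under the base URL with a bearer token. The base URL `https://example.invalid/v1` and the helper name `build_completion_request` are placeholders for illustration, not values from this guide.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str, auth_token: str, model: str, prompt: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Completions endpoint."""
    payload = {"model": model, "prompt": prompt}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {auth_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values for illustration only.
request = build_completion_request(
    base_url="https://example.invalid/v1",
    auth_token="$AUTH_TOKEN",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it. In practice the OpenAI client handles this step, along with retries and response parsing.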
---

## Parameters

When using the completion feature of the vLLM Serverless Endpoint Worker, you can customize requests using the parameters listed below.
**Supported Completions inputs and descriptions**

| Parameter         | Type       | Default | Description                                                          |
| ----------------- | ---------- | ------- | -------------------------------------------------------------------- |
| model             | str        | –       | The model repo deployed on your vLLM endpoint.                       |
| prompt            | str / list | –       | The input prompt text or tokens for generation.                      |
| suffix            | str        | None    | Text appended after the generated sequence.                          |
| max_tokens        | int        | 16      | Maximum number of tokens to generate.                                |
| temperature       | float      | 1.0     | Controls randomness of output. Lower values are more deterministic.  |
| top_p             | float      | 1.0     | Nucleus sampling threshold (0–1).                                    |
| n                 | int        | 1       | Number of outputs to return.                                         |
| stream            | bool       | False   | Whether to stream output responses.                                  |
| logprobs          | int        | None    | Number of log probabilities to return per output token.              |
| echo              | bool       | False   | Echo the prompt in the completion output.                            |
| stop              | list / str | list    | Tokens or strings that stop generation.                              |
| seed              | int        | None    | Random seed for reproducibility.                                     |
| presence_penalty  | float      | 0.0     | Penalizes tokens that have already appeared.                         |
| frequency_penalty | float      | 0.0     | Penalizes tokens according to how often they have already appeared.  |
| best_of           | int        | None    | Generate multiple candidates and return the top n.                   |
| logit_bias        | dict       | None    | Bias generation toward or against specific tokens.                   |
| user              | str        | None    | Optional user identifier.                                            |

**Additional Parameters**

| Parameter                  | Type  | Default | Description                              |
| -------------------------- | ----- | ------- | ---------------------------------------- |
| top_k                      | int   | -1      | Consider only the top-k tokens.          |
| ignore_eos                 | bool  | False   | Ignore the end-of-sequence (EOS) token.  |
| use_beam_search            | bool  | False   | Enable beam search generation.           |
| stop_token_ids             | list  | list    | Token IDs that halt generation.          |
| skip_special_tokens        | bool  | True    | Skip special tokens in the output.       |
| repetition_penalty         | float | 1.0     | Penalize repetition.                     |
| length_penalty             | float | 1.0     | Penalize long sequences.                 |
| min_p                      | float | 0.0     | Minimum relative probability cutoff.     |
| include_stop_str_in_output | bool  | False   | Include stop strings in the final text.  |
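The OpenAI Python client accepts only its own documented keyword arguments, so parameters from the second table are usually sent through the client's `extra_body` argument instead. The sketch below separates the two groups. It assumes the endpoint reads the additional fields from the request body; the helper name `build_completion_kwargs` is hypothetical.

```python
# Keyword arguments from the first table, which the OpenAI client accepts directly.
OPENAI_COMPLETION_PARAMS = {
    "model", "prompt", "suffix", "max_tokens", "temperature", "top_p", "n",
    "stream", "logprobs", "echo", "stop", "seed", "presence_penalty",
    "frequency_penalty", "best_of", "logit_bias", "user",
}


def build_completion_kwargs(**params):
    """Split parameters into client keyword arguments and an extra_body dict."""
    kwargs = {}
    extra_body = {}
    for name, value in params.items():
        if name in OPENAI_COMPLETION_PARAMS:
            kwargs[name] = value
        else:
            # Anything else (e.g. top_k, min_p) travels in the raw request body.
            extra_body[name] = value
    if extra_body:
        kwargs["extra_body"] = extra_body
    return kwargs


kwargs = build_completion_kwargs(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
    max_tokens=64,
    top_k=40,
    repetition_penalty=1.1,
)
print(kwargs)
```

The result can then be unpacked into the call, for example `openai.completions.create(**kwargs)`.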
---

## Using OpenAI Chat API with vLLM (Streaming and Non-Streaming)

The vLLM server supports the **OpenAI Chat API**, allowing interactive, context-aware conversation generation. You can use either **streaming** or **non-streaming** modes.

### Python3 Streaming Example
```python
import openai

auth_token = "$AUTH_TOKEN"  # your auth token

openai.api_key = auth_token
openai.base_url = ""

streamer = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What are large language models?"}],
    stream=True,
)

for chunk in streamer:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
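If you need the full reply as one string rather than printing it piece by piece, the streamed deltas can be accumulated. The following is a minimal sketch. The helper names `collect_stream` and `fake_chunk` are hypothetical, and the stand-in objects only mimic the `chunk.choices[0].delta.content` shape used in the streaming example, so the snippet runs without a server.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a chat completion stream into one string."""
    parts = []
    for chunk in chunks:
        # Skip chunks that carry no choices at all.
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        # Some chunks carry no text (content is None or empty).
        if content:
            parts.append(content)
    return "".join(parts)


def fake_chunk(content):
    """Stand-in for a streamed chunk, for demonstration only."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_stream = [
    fake_chunk("Large "),
    fake_chunk(None),
    fake_chunk("language models"),
    SimpleNamespace(choices=[]),
]
print(collect_stream(demo_stream))  # prints "Large language models"
```

With a live endpoint, the same function could be called on the `streamer` object in place of `demo_stream`.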
### Python3 Non-Streaming Example
```python
import openai

auth_token = "$AUTH_TOKEN"  # your auth token

openai.api_key = auth_token
openai.base_url = ""

completion = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is artificial intelligence?"}],
)

print(completion.choices[0].message.content)
```
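Each Chat API request is independent, so a context-aware conversation depends on resending the earlier turns in `messages`. The following is a minimal sketch of keeping that history as a list of role/content pairs. The helper name `add_turn` is hypothetical, and the set of roles is limited to the three common ones as an assumption; the assistant reply shown is an illustrative placeholder, not model output.

```python
from typing import Dict, List

Message = Dict[str, str]

# Assumed role set for this sketch; a deployment may accept others.
ALLOWED_ROLES = {"system", "user", "assistant"}


def add_turn(history: List[Message], role: str, content: str) -> List[Message]:
    """Return a new history list with one more role/content pair appended."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unsupported role: {role!r}")
    return history + [{"role": role, "content": content}]


history: List[Message] = []
history = add_turn(history, "user", "What is artificial intelligence?")
# After a request, store the reply text so the next request carries the context.
history = add_turn(history, "assistant", "<reply text from the previous response>")
history = add_turn(history, "user", "Give me a one-sentence summary.")
print(len(history))  # prints 3
```

The resulting list would be passed as `messages=history` on the next call.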
---

## Parameters for Chat Completions

When using the chat completion feature of the vLLM Serverless Endpoint Worker, the following parameters are available:
**Supported Chat Completions inputs and descriptions**

| Parameter                  | Type  | Default | Description                                                         |
| -------------------------- | ----- | ------- | ------------------------------------------------------------------- |
| model                      | str   | –       | The deployed model name on your vLLM endpoint.                      |
| messages                   | list  | –       | A list of role/content pairs for the conversation.                  |
| max_tokens                 | int   | 16      | Maximum number of tokens to generate in the response.               |
| temperature                | float | 1.0     | Controls randomness of generation.                                  |
| top_p                      | float | 1.0     | Nucleus sampling parameter.                                         |
| stream                     | bool  | False   | Whether to enable streaming responses.                              |
| stop                       | list  | list    | Stop generation on the specified sequences.                         |
| seed                       | int   | None    | Random seed.                                                        |
| presence_penalty           | float | 0.0     | Penalize tokens that have already appeared.                         |
| frequency_penalty          | float | 0.0     | Penalize tokens according to how often they have already appeared.  |
| repetition_penalty         | float | 1.0     | Penalize repetition.                                                |
| top_k                      | int   | -1      | Consider only the top-k most probable tokens.                       |
| ignore_eos                 | bool  | False   | Ignore the EOS token during generation.                             |
| length_penalty             | float | 1.0     | Controls the influence of sequence length in beam search.           |
| include_stop_str_in_output | bool  | False   | Whether to include stop strings in the output.                      |
---

## Summary

vLLM integrates with OpenAI clients through both the Completions and Chat APIs. Request parameters give fine-grained control over generation, and responses can be returned either all at once or as a stream. Existing OpenAI-based tools can therefore be pointed at a vLLM endpoint with minimal changes.

---