# vLLM with OpenAI Client

vLLM provides an HTTP server that implements OpenAI’s **Completions** and **Chat Completions** APIs, so existing OpenAI-compatible tools can talk to it with little or no change.

---

## Using OpenAI Completions API with vLLM

Since this server is compatible with the OpenAI API, you can use it as a **drop-in replacement** for any application built on the OpenAI client. Below is an example of querying the vLLM server using the OpenAI Python package.

<details>
<summary>Click to expand code</summary>

```python
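import openai  # requires the openai package, version 1.0 or later

auth_token = "$AUTH_TOKEN"  # put your auth token here...

# Point the module-level client at the vLLM endpoint instead of api.openai.com.
openai.api_key = auth_token
openai.base_url = "<your-vllm-endpoint-url>"

# Request a plain text completion for the prompt.
completion = openai.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
)

print("Completion result:", completion.choices[0].text)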
```

</details>

---

## Parameters

When using the completion feature of the vLLM Serverless Endpoint Worker, you can customize requests using the parameters listed below.

<details>
<summary>Supported Completions inputs and descriptions</summary>

| Parameter         | Type       | Default | Description                                           |
| ----------------- | ---------- | ------- | ----------------------------------------------------- |
| model             | str        | –       | The model repo deployed on your vLLM endpoint.        |
| prompt            | str / list | –       | The input prompt text or tokens for generation.       |
| suffix            | str        | None    | Text appended after the generated sequence.           |
| max_tokens        | int        | 16      | Max number of tokens to generate.                     |
| temperature       | float      | 1.0     | Sampling randomness; lower is more deterministic.     |
| top_p             | float      | 1.0     | Nucleus sampling threshold (0–1).                     |
| n                 | int        | 1       | Number of outputs to return.                          |
| stream            | bool       | False   | Whether to stream output responses.                   |
| logprobs          | int        | None    | Number of log probabilities per output token.         |
| echo              | bool       | False   | Echo prompt in the completion output.                 |
| stop              | list / str | []      | Strings that end generation when produced.            |
| seed              | int        | None    | Random seed for reproducibility.                      |
| presence_penalty  | float      | 0.0     | Penalizes tokens that have already appeared.          |
| frequency_penalty | float      | 0.0     | Penalizes tokens by how often they have appeared.     |
| best_of           | int        | None    | Generate this many candidates and return the best n.  |
| logit_bias        | dict       | None    | Bias generation toward/against specific tokens.       |
| user              | str        | None    | Optional user identifier.                             |

**Additional Parameters**

| Parameter                  | Type  | Default | Description                          |
| -------------------------- | ----- | ------- | ------------------------------------ |
| top_k                      | int   | -1      | Top-k sampling cutoff; -1 disables.  |
| ignore_eos                 | bool  | False   | Ignore the end-of-sequence token.    |
| use_beam_search            | bool  | False   | Enable beam search generation.       |
| stop_token_ids             | list  | []      | Token IDs that halt generation.      |
| skip_special_tokens        | bool  | True    | Skip special tokens in output.       |
| repetition_penalty         | float | 1.0     | Penalize repetition.                 |
| length_penalty             | float | 1.0     | Length penalty for beam search.      |
| min_p                      | float | 0.0     | Minimum relative probability cutoff. |
| include_stop_str_in_output | bool  | False   | Include stop strings in final text.  |

</details>
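
The parameters in the "Additional Parameters" table are not part of the OpenAI API, so the OpenAI Python client rejects them as keyword arguments. A minimal sketch of one way to send them follows, assuming the endpoint accepts them as top-level request fields, as vLLM's OpenAI-compatible server does. It uses the client's `extra_body` argument, which merges extra fields into the request JSON. The names `VLLM_EXTRA_PARAMS`, `build_completion_kwargs`, and `run_completion` are hypothetical helpers written for this illustration.

```python
# Names from the "Additional Parameters" table. The OpenAI client does not
# accept these as keyword arguments, so they travel in `extra_body`.
VLLM_EXTRA_PARAMS = {
    "top_k", "ignore_eos", "use_beam_search", "stop_token_ids",
    "skip_special_tokens", "repetition_penalty", "length_penalty",
    "min_p", "include_stop_str_in_output",
}


def build_completion_kwargs(model, prompt, **params):
    """Split parameters into standard keyword arguments and `extra_body`."""
    kwargs = {"model": model, "prompt": prompt}
    extra_body = {}
    for name, value in params.items():
        if name in VLLM_EXTRA_PARAMS:
            extra_body[name] = value
        else:
            kwargs[name] = value
    if extra_body:
        kwargs["extra_body"] = extra_body
    return kwargs


def run_completion(**kwargs):
    """Send the request. Needs openai>=1.0 and a reachable endpoint."""
    import openai  # imported lazily so the helper above works without it

    openai.api_key = "$AUTH_TOKEN"
    openai.base_url = "<your-vllm-endpoint-url>"
    return openai.completions.create(**kwargs)


request = build_completion_kwargs(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "San Francisco is a",
    max_tokens=64,
    temperature=0.7,
    top_k=40,                # vLLM-specific, moved into extra_body
    repetition_penalty=1.1,  # vLLM-specific, moved into extra_body
)
print(request)

# With a live endpoint:
# completion = run_completion(**request)
# print(completion.choices[0].text)
```

Keeping the split in one helper means call sites can pass every parameter the same way, whichever table it comes from.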

---

## Using OpenAI Chat API with vLLM (Streaming and Non-Streaming)

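The vLLM server also supports the **OpenAI Chat Completions API**, which takes a list of role-tagged messages and returns the assistant's next reply. Responses can be delivered in **streaming** mode, piece by piece as they are generated, or in **non-streaming** mode, as a single response once generation finishes.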

### Python3 Streaming Example

<details>
<summary>Click to expand code</summary>

```python
import openai

auth_token = "$AUTH_TOKEN"  # your auth token
openai.api_key = auth_token
openai.base_url = "<your-vllm-endpoint-url>"

streamer = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What are large language models?"}],
    stream=True
)

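for chunk in streamer:
    # Some chunks carry no text (for example the final one), so skip them.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)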
```

</details>
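
A streaming loop prints each piece as it arrives, but applications often need the complete reply afterwards as well. The sketch below shows one way to accumulate the deltas into a single string. `collect_stream` and `fake_chunk` are hypothetical helpers written for this illustration; `fake_chunk` only imitates the shape of a streamed chunk so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks, echo=True):
    """Join the text deltas of a chat completion stream into one string."""
    parts = []
    for chunk in chunks:
        # Skip chunks that have no choices at all.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        # Skip chunks whose delta has no text (None or empty string).
        if text:
            parts.append(text)
            if echo:
                print(text, end="", flush=True)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in shaped like a streamed chat chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Simulated stream: two text chunks, one empty delta, one chunk without choices.
demo_stream = [
    fake_chunk("Large "),
    fake_chunk(None),
    fake_chunk("language models"),
    SimpleNamespace(choices=[]),
]
reply = collect_stream(demo_stream, echo=False)
print(reply)
```

With a live endpoint, the object returned by `openai.chat.completions.create(..., stream=True)` can be passed to `collect_stream` in place of `demo_stream`.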

### Python3 Non-Streaming Example

<details>
<summary>Click to expand code</summary>

```python
import openai

auth_token = "$AUTH_TOKEN"  # your auth token
openai.api_key = auth_token
openai.base_url = "<your-vllm-endpoint-url>"

completion = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is artificial intelligence?"}]
)

print(completion.choices[0].message.content)
```

</details>

---

## Parameters for Chat Completions

When using the chat completion feature of the vLLM Serverless Endpoint Worker, the following parameters are available:

<details>
<summary>Supported Chat Completions Inputs and Descriptions</summary>

| Parameter                  | Type  | Default | Description                                        |
| -------------------------- | ----- | ------- | -------------------------------------------------- |
| model                      | str   | –       | The deployed model name on your vLLM endpoint.     |
| messages                   | list  | –       | A list of role/content pairs for the conversation. |
| max_tokens                 | int   | 16      | Max tokens to generate in response.                |
| temperature                | float | 1.0     | Controls randomness of generation.                 |
| top_p                      | float | 1.0     | Nucleus sampling parameter.                        |
| stream                     | bool  | False   | Whether to enable streaming responses.             |
| stop                       | list  | []      | Stop generation on specified sequences.            |
| seed                       | int   | None    | Random seed.                                       |
| presence_penalty           | float | 0.0     | Penalize token reuse.                              |
| frequency_penalty          | float | 0.0     | Penalize tokens by how often they have appeared.   |
| repetition_penalty         | float | 1.0     | Penalize repeated tokens; 1.0 disables.            |
| top_k                      | int   | -1      | Consider top-k probable tokens.                    |
| ignore_eos                 | bool  | False   | Ignore EOS token during generation.                |
| length_penalty             | float | 1.0     | Control length influence in beam search.           |
| include_stop_str_in_output | bool  | False   | Whether to include stop strings in output.         |

</details>
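
The `messages` parameter carries the whole conversation, so a multi-turn chat means resending the earlier turns with each request. The sketch below shows one way to build that history. `add_turn` and `VALID_ROLES` are hypothetical names introduced here, and restricting roles to `system`, `user`, and `assistant` is an assumption of this example, not a limit stated by the API.

```python
# Roles accepted by this sketch (an assumption for illustration).
VALID_ROLES = {"system", "user", "assistant"}


def add_turn(messages, role, content):
    """Return a new history with one more role/content pair appended."""
    if role not in VALID_ROLES:
        raise ValueError(f"unsupported role: {role!r}")
    return messages + [{"role": role, "content": content}]


history = []
history = add_turn(history, "system", "You are a concise assistant.")
history = add_turn(history, "user", "What is artificial intelligence?")

# With a live endpoint, the reply would come from the server:
# completion = openai.chat.completions.create(
#     model="meta-llama/Meta-Llama-3-8B-Instruct",
#     messages=history,
# )
# reply_text = completion.choices[0].message.content
reply_text = "(placeholder reply)"  # stands in for the model's answer

# Append the reply, then the follow-up question, so the next request
# carries the full context.
history = add_turn(history, "assistant", reply_text)
history = add_turn(history, "user", "Give one example.")

print([message["role"] for message in history])
```

Because `add_turn` returns a new list instead of mutating its input, earlier snapshots of the conversation stay intact, which helps when retrying a request.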

---

## Summary

vLLM works with OpenAI clients through both the Completions and Chat Completions APIs. Request parameters give fine-grained control over generation, and responses can be returned either all at once or as a stream. Existing OpenAI-based tools only need to point their base URL and API key at the vLLM endpoint.


---