
Performance Benchmarking of vLLM Inference Endpoints

Introduction

E2E Networks provides high-performance inference services on its TIR: AI/ML Platform, leveraging state-of-the-art hardware such as H200, H100, and A100 GPUs. This document presents the results of performance benchmarking conducted on various models using vLLM inference. The benchmark evaluates Request Throughput, Output Token Throughput, Total Token Throughput, Mean Time per Output Token (TPOT), and Mean Inter-token Latency (ITL) across multiple models.

The benchmarking was conducted using vLLM's official benchmarking script, benchmark_serving.py, which is available in the vLLM repository.

The benchmark command used for each model is shown below, with model_name replaced by the Hugging Face ID of the model under test:

python3 benchmark_serving.py \
--backend vllm \
--base-url http://0.0.0.0:8080 \
--model model_name \
--request-rate inf \
--num-prompts 10000 \
--dataset-name random \
--random-input-len 200 \
--random-output-len 200
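
On TIR, the inference endpoint itself is created through the platform. For a local reproduction, the target server can be started roughly as follows; this is a minimal sketch that assumes a recent vLLM release, in which the vllm backend of the benchmark script talks to the OpenAI-compatible server, and model_name is again a placeholder for the model ID:

# Start an OpenAI-compatible vLLM server matching the --base-url used above
python3 -m vllm.entrypoints.openai.api_server \
--model model_name \
--host 0.0.0.0 \
--port 8080

Note that --request-rate inf sends requests without pacing, so the 10,000 prompts are dispatched essentially at once and the server stays saturated for the entire run.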

Benchmark Methodology

The inference service was evaluated for 10,000 requests, ensuring a comprehensive assessment of the system's behavior under high-load conditions. The metrics evaluated are:

  • Request Throughput (req/s): The number of successful requests processed per second.
  • Output Token Throughput (tok/s): The number of output tokens generated per second.
  • Total Token Throughput (tok/s): The combined number of input and output tokens processed per second.
  • Mean Time per Output Token (ms): The average time taken to generate each output token.
  • Mean Inter-token Latency (ms): The average delay between two consecutive tokens in a single request.

The results are detailed in the following sections for H200, H100 and A100 GPUs.

Performance Results

H200 GPU Performance

Because the H200 GPUs offer more GPU memory, users can pass additional vLLM arguments to improve inference performance. The following arguments can be included when launching your inference endpoint:

["--block-size", "32", "--max-num-batched-tokens", "1024", "--max-num-seqs", "512"]

These parameters allow larger batches and more concurrent sequences to be scheduled, resulting in more efficient processing and higher throughput.
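
As a rough sketch of how these arguments map onto a vLLM server launch outside of TIR (the exact launch mechanism on the platform may differ, and model_name is a placeholder), they are simply appended to the serving command:

# H200 example: server launch with the tuning flags above appended
# --block-size: KV-cache block size (tokens per block)
# --max-num-batched-tokens: cap on tokens scheduled per engine step
# --max-num-seqs: cap on sequences processed concurrently
python3 -m vllm.entrypoints.openai.api_server \
--model model_name \
--host 0.0.0.0 \
--port 8080 \
--block-size 32 \
--max-num-batched-tokens 1024 \
--max-num-seqs 512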

| Model | Request Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean Time per Output Token (ms) | Mean Inter-token Latency (ms) |
|---|---|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 36.75 | 3893.57 | 22707.11 | 51.33 | 54.75 |
| meta-llama/Llama-3.1-8B-Instruct (FP8) | 46.58 | 4912.36 | 28758.92 | 38.7 | 41.38 |
| meta-llama/Llama-3.2-1B-Instruct | 68.56 | 6626.54 | 41730.03 | 49.99 | 25.93 |
| meta-llama/Llama-3.2-1B-Instruct (FP8) | 68.41 | 6812.81 | 41836.73 | 53.64 | 28.54 |
| meta-llama/Llama-3.2-3B-Instruct | 55.93 | 5873.59 | 34507.61 | 26.73 | 25.4 |
| meta-llama/Llama-3.2-3B-Instruct (FP8) | 56.69 | 5917.79 | 34940.91 | 26.25 | 24.95 |
| mistralai/Mistral-7B-Instruct-v0.1 | 56.32 | 8723.54 | 20456.87 | 35.89 | 29.37 |
| google/gemma-7b | 41.77 | 7581.3 | 16439.26 | 45.82 | 34.78 |
| microsoft/Phi-3-mini-128k-instruct | 49.45 | 8019.63 | 17672.54 | 34.78 | 32.41 |
Note

The benchmarking tests were conducted with an input length of 512 and an output length of 128.
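
To reproduce this configuration, the benchmark command shown earlier would be run with the input and output lengths adjusted, for example:

python3 benchmark_serving.py \
--backend vllm \
--base-url http://0.0.0.0:8080 \
--model model_name \
--request-rate inf \
--num-prompts 10000 \
--dataset-name random \
--random-input-len 512 \
--random-output-len 128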

H100 GPU Performance

| Model | Request Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean Time per Output Token (ms) | Mean Inter-token Latency (ms) |
|---|---|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 49.07 | 7743.7 | 17556.85 | 53.33 | 43.37 |
| meta-llama/Llama-3.1-8B-Instruct (FP8) | 53.1 | 8310.42 | 18930.4 | 59.57 | 46.97 |
| meta-llama/Llama-3.2-1B-Instruct | 58.22 | 8923.18 | 20567.9 | 91.45 | 35.92 |
| meta-llama/Llama-3.2-1B-Instruct (FP8) | 60.8 | 9135.93 | 21296.92 | 80.64 | 34.29 |
| meta-llama/Llama-3.2-3B-Instruct | 61.13 | 9062.68 | 21289.49 | 38.84 | 35.66 |
| meta-llama/Llama-3.2-3B-Instruct (FP8) | 63.41 | 9327.05 | 22009.04 | 36.76 | 34.22 |
| mistralai/Mistral-7B-Instruct-v0.1 | 41.43 | 6416.95 | 14702.07 | 42.43 | 38.75 |
| google/gemma-7b | 33.21 | 5637.46 | 12280.08 | 184.52 | 41.47 |
| microsoft/Phi-3-mini-128k-instruct | 36.4 | 7109.79 | 14389.03 | 34.89 | 34.24 |
Note

The benchmarking tests were conducted with an input length of 200 and an output length of 200.

A100 GPU Performance

| Model | Request Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean Time per Output Token (ms) | Mean Inter-token Latency (ms) |
|---|---|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 24.63 | 3881.2 | 8807.82 | 78.21 | 64.55 |
| meta-llama/Llama-3.1-8B-Instruct (FP8) | 22.03 | 3473.98 | 7879.2 | 86.41 | 72.92 |
| meta-llama/Llama-3.2-1B-Instruct | 36.13 | 5541.71 | 12768.21 | 139.06 | 57.24 |
| meta-llama/Llama-3.2-1B-Instruct (FP8) | 39.77 | 6008.84 | 13962.49 | 117.44 | 52.9 |
| meta-llama/Llama-3.2-3B-Instruct | 35.87 | 5316.48 | 12490.87 | 67.52 | 61.79 |
| meta-llama/Llama-3.2-3B-Instruct (FP8) | 38.69 | 5630.16 | 13367.59 | 54.55 | 50.28 |
| mistralai/Mistral-7B-Instruct-v0.1 | 22.66 | 3515.99 | 8048.87 | 79.28 | 71.76 |
| google/gemma-7b | 17.53 | 2982.41 | 6487.63 | 336.73 | 75.53 |
| microsoft/Phi-3-mini-128k-instruct | 21.29 | 3984.5 | 8689.26 | 83.67 | 75.3 |
Note

The benchmarking tests were conducted with an input length of 200 and an output length of 200.

Conclusion

This benchmarking study demonstrates the ability of the vLLM inference service to handle high-throughput workloads across various GPU types (H200, H100, and A100). The results give a clear picture of the system's capabilities, enabling users to make informed decisions about the optimal hardware for their specific AI/ML inference needs.

The H200 GPU stands out for its improved throughput across most models, thanks to its larger memory and better memory handling, which are essential for an efficient inference process. Users who need to process larger batches or who require lower latency may benefit from choosing the H200 over the other options.