Performance Benchmarking of vLLM Inference Endpoints
Introduction
E2E Networks provides high-performance inference services on its TIR: AI/ML Platform, leveraging state-of-the-art hardware such as NVIDIA H200, H100, and A100 GPUs. This document presents the results of performance benchmarking conducted on various models served with vLLM. The benchmark evaluates Request Throughput, Output Token Throughput, Total Token Throughput, Mean Time per Output Token (TPOT), and Mean Inter-token Latency (ITL) across multiple models.
The benchmarking was conducted using vLLM's official benchmarking script, benchmark_serving.py, available in the benchmarks directory of the vLLM repository.
The benchmark command used for each model is:
python3 benchmark_serving.py \
--backend vllm \
--base-url http://0.0.0.0:8080 \
--model model_name \
--request-rate inf \
--num-prompts 10000 \
--dataset-name random \
--random-input-len 200 \
--random-output-len 200
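For context, the benchmark assumes a vLLM OpenAI-compatible server is already running at http://0.0.0.0:8080. A minimal sketch of launching such an endpoint is shown below; the model name and port are assumptions chosen to match the command above, and recent vLLM versions expose the vllm serve entry point used here:
# Start the vLLM OpenAI-compatible server that the benchmark script targets
# (model name and port are illustrative).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8080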
Benchmark Methodology
The inference service was evaluated over 10,000 requests, ensuring a comprehensive assessment of the system's behavior under high-load conditions. The metrics evaluated are:
Request Throughput (req/s): The number of successful requests processed per second.
Output Token Throughput (tok/s): The number of output tokens generated per second.
Total Token Throughput (tok/s): The combined number of input and output tokens processed per second.
Mean Time per Output Token (ms): The average time taken to generate each output token.
Mean Inter-token Latency (ms): The average delay between two consecutive tokens in a single request.
The results are detailed in the following sections for H200, H100 and A100 GPUs.
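Each benchmark request is an OpenAI-compatible completion call against the serving endpoint. For illustration, a single request to the same server looks roughly like the following; the model name, prompt, and token budget are placeholders chosen to mirror the benchmark settings:
# A single OpenAI-compatible completion request against the benchmarked endpoint
# (model name, prompt, and max_tokens are illustrative).
curl http://0.0.0.0:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Explain GPU inference in one sentence.", "max_tokens": 200}'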
Performance Results
H200 GPU Performance
Because H200 GPUs offer larger GPU memory, users can pass additional vLLM arguments to further improve inference performance. The following arguments can be included when launching the inference service:
["--block-size", "32", "--max-num-batched-tokens", "1024", "--max-num-seqs", "512"]
These parameters enable better handling of larger batches and longer sequences, allowing more efficient processing and higher throughput.
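If the endpoint is launched directly with the vLLM server, the same arguments can be passed on the command line, for example (the model name is an assumption):
# Example: launching vLLM with the H200-oriented arguments listed above
# (model name is illustrative).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8080 \
--block-size 32 \
--max-num-batched-tokens 1024 \
--max-num-seqs 512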
Model | Request Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean Time per Output Token (ms) | Mean Inter-token Latency (ms)
---|---|---|---|---|---
meta-llama/Llama-3.1-8B-Instruct | 36.75 | 3893.57 | 22707.11 | 51.33 | 54.75
meta-llama/Llama-3.1-8B-Instruct (FP8) | 46.58 | 4912.36 | 28758.92 | 38.7 | 41.38
meta-llama/Llama-3.2-1B-Instruct | 68.56 | 6626.54 | 41730.03 | 49.99 | 25.93
meta-llama/Llama-3.2-1B-Instruct (FP8) | 68.41 | 6812.81 | 41836.73 | 53.64 | 28.54
meta-llama/Llama-3.2-3B-Instruct | 55.93 | 5873.59 | 34507.61 | 26.73 | 25.4
meta-llama/Llama-3.2-3B-Instruct (FP8) | 56.69 | 5917.79 | 34940.91 | 26.25 | 24.95
mistralai/Mistral-7B-Instruct-v0.1 | 56.32 | 8723.54 | 20456.87 | 35.89 | 29.37
google/gemma-7b | 41.77 | 7581.3 | 16439.26 | 45.82 | 34.78
microsoft/Phi-3-mini-128k-instruct | 49.45 | 8019.63 | 17672.54 | 34.78 | 32.41
Note
The benchmarking tests were conducted with an input length of 512 and an output length of 128.
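Assuming the same benchmark script was used for the H200 runs, only the random length flags change relative to the command shown earlier, for example:
python3 benchmark_serving.py \
--backend vllm \
--base-url http://0.0.0.0:8080 \
--model model_name \
--request-rate inf \
--num-prompts 10000 \
--dataset-name random \
--random-input-len 512 \
--random-output-len 128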
H100 GPU Performance
Model | Request Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean Time per Output Token (ms) | Mean Inter-token Latency (ms)
---|---|---|---|---|---
meta-llama/Llama-3.1-8B-Instruct | 49.07 | 7743.7 | 17556.85 | 53.33 | 43.37
meta-llama/Llama-3.1-8B-Instruct (FP8) | 53.1 | 8310.42 | 18930.4 | 59.57 | 46.97
meta-llama/Llama-3.2-1B-Instruct | 58.22 | 8923.18 | 20567.9 | 91.45 | 35.92
meta-llama/Llama-3.2-1B-Instruct (FP8) | 60.8 | 9135.93 | 21296.92 | 80.64 | 34.29
meta-llama/Llama-3.2-3B-Instruct | 61.13 | 9062.68 | 21289.49 | 38.84 | 35.66
meta-llama/Llama-3.2-3B-Instruct (FP8) | 63.41 | 9327.05 | 22009.04 | 36.76 | 34.22
mistralai/Mistral-7B-Instruct-v0.1 | 41.43 | 6416.95 | 14702.07 | 42.43 | 38.75
google/gemma-7b | 33.21 | 5637.46 | 12280.08 | 184.52 | 41.47
microsoft/Phi-3-mini-128k-instruct | 36.4 | 7109.79 | 14389.03 | 34.89 | 34.24
Note
The benchmarking tests were conducted with an input length of 200 and an output length of 200.
A100 GPU Performance
Model | Request Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean Time per Output Token (ms) | Mean Inter-token Latency (ms)
---|---|---|---|---|---
meta-llama/Llama-3.1-8B-Instruct | 24.63 | 3881.2 | 8807.82 | 78.21 | 64.55
meta-llama/Llama-3.1-8B-Instruct (FP8) | 22.03 | 3473.98 | 7879.2 | 86.41 | 72.92
meta-llama/Llama-3.2-1B-Instruct | 36.13 | 5541.71 | 12768.21 | 139.06 | 57.24
meta-llama/Llama-3.2-1B-Instruct (FP8) | 39.77 | 6008.84 | 13962.49 | 117.44 | 52.9
meta-llama/Llama-3.2-3B-Instruct | 35.87 | 5316.48 | 12490.87 | 67.52 | 61.79
meta-llama/Llama-3.2-3B-Instruct (FP8) | 38.69 | 5630.16 | 13367.59 | 54.55 | 50.28
mistralai/Mistral-7B-Instruct-v0.1 | 22.66 | 3515.99 | 8048.87 | 79.28 | 71.76
google/gemma-7b | 17.53 | 2982.41 | 6487.63 | 336.73 | 75.53
microsoft/Phi-3-mini-128k-instruct | 21.29 | 4159.58 | 8416.83 | 58 | 58.11
Note
The benchmarking tests were conducted with an input length of 200 and an output length of 200.
Key Insights
From the performance results of different models across H100, H200, and A100 GPUs, several key insights can be drawn:
H200 GPUs demonstrate the highest performance among the three GPU types, offering exceptional Request Throughput and Output Token Throughput. Models like meta-llama/Llama-3.2-3B-Instruct and mistralai/Mistral-7B-Instruct-v0.1 achieve superior throughput on H200s, making them ideal for demanding applications (note that the H200 runs used an input length of 512 and an output length of 128, versus 200/200 on H100 and A100).
H100 GPUs consistently outperform A100 GPUs in both Request Throughput and Output Token Throughput. Models such as meta-llama/Llama-3.2-3B-Instruct and mistralai/Mistral-7B-Instruct-v0.1 show higher throughput on H100s compared to A100s, though they fall slightly behind H200s.
Inter-token Latency (ITL) is generally lowest on H200 GPUs, followed by H100 GPUs, resulting in faster token generation and quicker responses. This makes H200s the optimal choice for applications requiring minimal latency.
FP8 Precision contributes to a notable increase in Output Token Throughput, especially evident in models like meta-llama/Llama-3.2-3B-Instruct (FP8). This precision mode is ideal for high-performance, low-latency applications and is effectively utilized by H100 and H200 GPUs.
A100 GPUs offer reliable performance, despite lower throughput compared to H100 and H200 GPUs. Models such as meta-llama/Llama-3.2-1B-Instruct maintain steady throughput on A100 GPUs, making them a solid choice for environments where cost-efficiency is a priority.
Conclusion
The benchmarking results demonstrate that E2E Networks' inference services on the TIR: AI/ML Platform are at the forefront of performance and innovation. As the first provider in India to offer H200 GPUs, E2E Networks ensures that clients can access the highest levels of Request Throughput and Output Token Throughput for demanding AI/ML applications. The integration of H200 GPUs, alongside H100 GPUs, positions the platform as a leader in cutting-edge performance, with FP8 precision models further boosting throughput and reducing latency, making them ideal for real-time AI scenarios. With the latest hardware and vLLM serving, E2E Networks continues to deliver superior results, empowering clients to excel in their AI/ML tasks.