Inference Benchmarks
Introduction
E2E Networks provides high-performance LLM inference with TIR, an end-to-end AI/ML platform, leveraging state-of-the-art GPU infrastructure.
This page covers the methodology and performance highlights for latency and throughput of LLM inference on GPUs such as the H100, H200, and A100.
Serving Performance
Generative AI applications often employ an inference engine to perform LLM inference. vLLM is one of the most widely used inference engines, supporting features such as continuous batching and OpenAI API compatibility.
In this section, we evaluate the performance of vLLM at serving concurrent requests with a single GPU (e.g., H100). The numbers reported here are based on the default vLLM configuration to showcase a generic setup; our team can help you tune them further for your specific use case.
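Because vLLM exposes an OpenAI-compatible API, the served endpoint can be exercised with any OpenAI-style client. The request below is a minimal sketch against a locally running server; the host, port, model name, and prompt are placeholders and should be adjusted to match your deployment.
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
          "messages": [{"role": "user", "content": "Summarize vLLM in one sentence."}],
          "max_tokens": 128
        }'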
- Benchmarking Framework: the official vLLM benchmarking script with a random-token dataset
- Metrics:
  - Time to First Token (TTFT): The latency between the initial inference request to the model and the return of the first token.
  - Inter-Token Latency (ITL): The latency between successive tokens after the first (reported as Time Per Output Token, TPOT, in the tables below).
  - Total Latency: TTFT + (number of output tokens × ITL); see the worked example after this list.
  - Token Throughput: The total number of tokens generated per second.
  - Request Throughput: The total number of requests completed per second.
- Inputs:
  - Input and output token lengths based on task type: e.g., summarization often involves a long input and a short output.
  - Request rate: number of requests per second. We settled on these values after trial and error, based on the best results for each combination of input/output tokens.
- Endpoint Protocol: HTTPS
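As a quick sanity check of the Total Latency formula, the arithmetic below uses round, illustrative numbers rather than values from the tables (TTFT = 50 ms, ITL = 16 ms, 128 output tokens):
$ echo $((50 + 128 * 16))   # TTFT + (output tokens * ITL), in ms
2098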
LLAMA 3.1 (8B) + H200
Model | GPUs | Task Type | Input-Output | Request Rate | Requests/Sec | Throughput (tokens/sec) | TTFT (ms) | TPOT (ms) | Total Latency (ms) |
---|---|---|---|---|---|---|---|---|---|
LLAMA3.1 (8B) | 1 | Simple Chat | [128:128] | 64 | 37.32 | 8955 | 54.37 | 16.09 | 2083.26 |
LLAMA3.1 (8B) | 1 | Summarization / RAG | [2024:128] | 12 | 8.95 | 19428 | 204.51 | 32.45 | 4325.66 |
LLAMA3.1 (8B) | 1 | Classification | [1024:30] | 24 | 19.03 | 20088 | 89.05 | 19.20 | 645.85 |
LLAMA3.1 (8B) | 1 | Creative Writing | [200:2024] | 4 | 1.32 | 2543 | 24.32 | 9.30 | 18838.22 |
LLAMA 3.1 (8B) + H100
Model | GPUs | Task Type | Input - Output | Request Rate | Requests/Sec | Throughput | Time to First Token (TTFT) | Time Per Output Token (TPOT) | Total Latency |
---|---|---|---|---|---|---|---|---|---|
LLAMA3.1 (8B) | 1 | Simple Chat | [128:128] | 64 | 35.04 | 8399 tokens/sec | 28.84 ms | 22.27 ms | 2857.12 ms |
LLAMA3.1 (8B) | 1 | Summarization / RAG | [2024:128] | 12 | 8.95 | 19127 tokens/sec | 200.06 ms | 35.51 ms | 4709.83 ms |
LLAMA3.1 (8B) | 1 | Classification | [1024:30] | 24 | 19.03 | 19979 tokens/sec | 92.74 ms | 22.07 ms | 732.77 ms |
LLAMA3.1 (8B) | 1 | Creative Writing | [200:2024] | 8 | 3.42 | 5060 tokens/sec | 47.25 ms | 21.57 ms | 43682.36 ms |
LLAMA 3.1 (70B) + H200
Model | GPUs | Task Type | Input - Output | Request Rate | Requests/Sec | Throughput | Time to First Token (TTFT) | Time Per Output Token (TPOT) | Total Latency |
---|---|---|---|---|---|---|---|---|---|
llama3.1 (70B) | 4 | Simple Chat | [128:128] | 24 | 15.3 | 3542 tokens/sec | 90.59 ms | 29.53 ms | 3840.9 ms |
llama3.1 (70B) | 4 | Summarization / RAG | [2024:128] | 4 | 3.45 | 6139 tokens/sec | 455.96 ms | 46.11 ms | 6311.93 ms |
llama3.1 (70B) | 4 | Classification | [1024:30] | 10 | 7 | 7202 tokens/sec | 341.49 ms | 59.28 ms | 2060.61 ms |
llama3.1 (70B) | 4 | Creative Writing | [200:2024] | 8 | 2.35 | 2782 tokens/sec | 69.44 ms | 25.09 ms | 50826.51 ms |
LLAMA 3.1 (70B) + H100
Model | GPUs | Task Type | Input - Output | Request Rate | Requests/Sec | Throughput | Time to First Token (TTFT) | Time Per Output Token (TPOT) | Total Latency |
---|---|---|---|---|---|---|---|---|---|
llama3.1 (70B) | 4 | Simple Chat | [128:128] | 24 | 15 | 3438 tokens/sec | 84.84 ms | 30.63 ms | 3976.12 ms |
llama3.1 (70B) | 4 | Summarization / RAG | [2024:128] | 4 | 3.45 | 5884.31 tokens/sec | 461.72 ms | 48.43 ms | 6612.33 ms |
llama3.1 (70B) | 4 | Classification | [1024:30] | 10 | 6.95 | 7129 tokens/sec | 331.69 ms | 62.41 ms | 2141.58 ms |
llama3.1 (70B) | 4 | Creative Writing | [200:2024] | 8 | 2.04 | 2626.71 tokens/sec | 74.20 ms | 27.31 ms | 55322.33 ms |
Offline Performance
While serving inference over an API often introduces network latency and concurrency overhead, in offline or batch applications both factors are largely under the application's control.
Consider the example of a video generation pipeline that runs asynchronously in the background and writes the generated video to an object storage bucket. In such scenarios, raw token throughput is the key metric to optimize for.
In this section, we benchmark LLM inference for offline/batch applications that have local GPU access.
Benchmarking Framework: TensorRT-LLM benchmarking tool (trtllm-bench). The values in the tables below are token throughput (tokens/sec).
LLAMA 3.1 (8B)
Model | TP / GPUs | Input - Output | H200 | H100 | H100 | H100 | A100 |
---|---|---|---|---|---|---|---|
LLaMA 3.1 (8B) | 1 | 128, 128 | 30637.29 | 27006.16 | 15861.63 | 16119.63 | 6051.62 |
LLaMA 3.1 (8B) | 1 | 128, 2048 | 20787.93 | 20190.18 | 9536.49 | 9625.49 | 4910.03 |
LLaMA 3.1 (8B) | 1 | 128, 4096 | 12348.35 | 12876.06 | 5357.98 | 5807.98 | 3157.45 |
LLaMA 3.1 (8B) | 1 | 500, 2000 | 19079.02 | 15363.20 | 8043.10 | 7927.10 | 3897.62 |
LLaMA 3.1 (8B) | 1 | 1000, 1000 | 15807.43 | 14555.28 | 7425.07 | 7513.07 | 3747.46 |
LLaMA 3.1 (8B) | 1 | 2048, 128 | 3570.99 | 3095.41 | 1625.07 | 2000.07 | 905.56 |
LLaMA 3.1 (8B) | 1 | 2048, 2048 | 9026.93 | 8632.42 | 4055.61 | 4128.61 | 1826.86 |
LLaMA 3.1 (8B) | 1 | 5000, 500 | 3782.15 | 2979.61 | 1555.20 | 1784.20 | 831.09 |
LLaMA 3.1 (8B) | 1 | 20000, 2000 | 1487.86 | 1503.69 | 852.58 | 528.58 | 81.47 |
LLAMA 3.1 (70B)
Model | TP / GPUs | Input - Output | H200 (FP8) | H100 (FP8) | H100 (mixed) |
---|---|---|---|---|---|
LLaMA 3.1 (70B) | 1 | 128, 128 | 3959.23 | 2633.65 | |
LLaMA 3.1 (70B) | 1 | 128, 2048 | 1787.33 | 709.73 | |
LLaMA 3.1 (70B) | 1 | 128, 4096 | 883.15 | | |
LLaMA 3.1 (70B) | 1 | 500, 2000 | 1144.49 | 868.24 | |
LLaMA 3.1 (70B) | 1 | 1000, 1000 | 1535.40 | 646.07 | |
LLaMA 3.1 (70B) | 1 | 2048, 128 | 528.06 | 325.44 | |
LLaMA 3.1 (70B) | 1 | 2048, 2048 | 722.55 | 427.02 | |
LLaMA 3.1 (70B) | 1 | 5000, 500 | 394.32 | 169.98 | |
LLaMA 3.1 (70B) | 1 | 20000, 2000 | 121.02 | | |
LLaMA 3.1 (70B) | 4 | 128, 128 | 10226.19 | 6085.74 | |
LLaMA 3.1 (70B) | 4 | 128, 2048 | 11155.97 | 5860.27 | |
LLaMA 3.1 (70B) | 4 | 128, 4096 | 7454.89 | 3400.03 | |
LLaMA 3.1 (70B) | 4 | 500, 2000 | 9670.33 | 4419.09 | |
LLaMA 3.1 (70B) | 4 | 1000, 1000 | 7081.35 | 4141.52 | |
LLaMA 3.1 (70B) | 4 | 2048, 128 | 1558.33 | 772.81 | |
LLaMA 3.1 (70B) | 4 | 2048, 2048 | 4777.41 | 2650.39 | |
LLaMA 3.1 (70B) | 4 | 5000, 500 | 987.63 | 787.96 | |
LLaMA 3.1 (70B) | 4 | 20000, 2000 | 720.06 | 261.36 | |
Benchmarking Kit
vLLM Scripts
- Clone Repo
$ git clone https://github.com/vllm-project/vllm
- Start vLLM from a terminal
$ pip install vllm
$ vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --port 8080 --tensor-parallel-size <gpu-count>
- Run the benchmark script from another terminal
$ cd vllm/benchmarks
$ python3 benchmark_serving.py --backend openai --host localhost --port 8080 --dataset-name=random --random-input-len=<token-size> --random-output-len=<token-size> --model <model-name> --num-prompts 200 --request-rate <request-rate>
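As a concrete example, the invocation below plugs in the parameters for the Simple Chat row of the LLaMA 3.1 (8B) tables above (128 input tokens, 128 output tokens, request rate 64); the model ID is the Hugging Face name we assume for the 8B runs.
$ python3 benchmark_serving.py --backend openai --host localhost --port 8080 --dataset-name=random --random-input-len=128 --random-output-len=128 --model meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 200 --request-rate 64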
TensorRT-LLM Scripts
- Create a node in TIR with the TensorRT-LLM Builder image
- Click on the JupyterLab URL and open a terminal to clone the repo:
$ git clone https://github.com/NVIDIA/TensorRT-LLM
$ cd TensorRT-LLM
- Create dataset
$ python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist --num-requests=$num_requests --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0 > $dataset_file
- Build a TensorRT-LLM engine
$ trtllm-bench --model $model_name build --tp_size $tp_size --pp_size $pp_size --quantization FP8 --dataset $dataset_file
- Run a benchmark with a dataset
$ trtllm-bench --model $model_name throughput --dataset $dataset_file --engine_dir $engine_dir
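As an illustration, the variable values below correspond to the LLaMA 3.1 (8B), 128/2048 row of the offline table above; the request count and dataset path are placeholders, and the engine directory is the path printed by the build step.
$ model_name=meta-llama/Meta-Llama-3.1-8B
$ tp_size=1
$ pp_size=1
$ isl=128
$ osl=2048
$ num_requests=1000
$ dataset_file=/tmp/llama31-8b-128-2048.jsonl
$ engine_dir=<path printed by trtllm-bench build>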