
Inference Benchmarks

Introduction

E2E Networks provides high-performance LLM inference through TIR, its end-to-end AI/ML platform, built on state-of-the-art GPU infrastructure.

This page covers the benchmarking methodology and performance highlights (latency and throughput) for LLM inference on GPUs such as the H100, H200, and A100.

Serving Performance

Generative AI applications often employ an inference engine to perform LLM inference. vLLM is one of the most widely used inference engines, with support for features such as continuous batching and an OpenAI-compatible API.

In this section, we evaluate the performance of vLLM at serving concurrent requests on H100 and H200 GPUs. The numbers reported here use the default vLLM configuration to showcase a generic setup; our team can help you tune them further for your specific use case.

  • Benchmarking Framework: vLLM's official benchmarking script with a random-token dataset
  • Metrics:
    • Time to First Token (TTFT): The latency from when the inference request is sent to the model until the first token is returned.
    • Inter-Token Latency (ITL): The latency between successive tokens after the first; reported as Time Per Output Token (TPOT) in the tables below.
    • Total Latency: TTFT + (number of output tokens * ITL); see the worked example after this list.
    • Token Throughput: The total number of tokens generated per second.
    • Request Throughput: The total number of requests completed per second.
  • Inputs:
    • Input and output token lengths based on task type: e.g., summarization often involves a long input and a short output.
    • Request rate: number of requests per second. We settled on these values, after trial and error, based on the best results for each input/output token combination.
  • Endpoint Protocol: HTTPS
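
As a quick sanity check of the total-latency formula, take the Simple Chat row for LLAMA 3.1 (8B) on H200 in the first table below: 54.37 + (128 * 16.09) ≈ 2114 ms, which is close to the reported total latency of 2083.26 ms (the reported figures are averages across requests, so the formula is only an approximation).
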
LLAMA 3.1 (8B) + H200
Model | GPUs | Task Type | Input-Output | Request Rate | Requests/Sec | Throughput (tokens/sec) | TTFT (ms) | TPOT (ms) | Total Latency (ms)
LLAMA 3.1 (8B) | 1 | Simple Chat | [128:128] | 64 | 37.32 | 8955 | 54.37 | 16.09 | 2083.26
LLAMA 3.1 (8B) | 1 | Summarization / RAG | [2024:128] | 12 | 8.95 | 19428 | 204.51 | 32.45 | 4325.66
LLAMA 3.1 (8B) | 1 | Classification | [1024:30] | 24 | 19.03 | 20088 | 89.05 | 19.20 | 645.85
LLAMA 3.1 (8B) | 1 | Creative Writing | [200:2024] | 4 | 1.32 | 2543 | 24.32 | 9.30 | 18838.22
LLAMA 3.1 (8B) + H100
Model | GPUs | Task Type | Input-Output | Request Rate | Requests/Sec | Throughput (tokens/sec) | TTFT (ms) | TPOT (ms) | Total Latency (ms)
LLAMA 3.1 (8B) | 1 | Simple Chat | [128:128] | 64 | 35.04 | 8399 | 28.84 | 22.27 | 2857.12
LLAMA 3.1 (8B) | 1 | Summarization / RAG | [2024:128] | 12 | 8.95 | 19127 | 200.06 | 35.51 | 4709.83
LLAMA 3.1 (8B) | 1 | Classification | [1024:30] | 24 | 19.03 | 19979 | 92.74 | 22.07 | 732.77
LLAMA 3.1 (8B) | 1 | Creative Writing | [200:2024] | 8 | 3.42 | 5060 | 47.25 | 21.57 | 43682.36
LLAMA 3.1 (70B) + H200
Model | GPUs | Task Type | Input-Output | Request Rate | Requests/Sec | Throughput (tokens/sec) | TTFT (ms) | TPOT (ms) | Total Latency (ms)
LLAMA 3.1 (70B) | 4 | Simple Chat | [128:128] | 24 | 15.3 | 3542 | 90.59 | 29.53 | 3840.9
LLAMA 3.1 (70B) | 4 | Summarization / RAG | [2024:128] | 4 | 3.45 | 6139 | 455.96 | 46.11 | 6311.93
LLAMA 3.1 (70B) | 4 | Classification | [1024:30] | 10 | 7 | 7202 | 341.49 | 59.28 | 2060.61
LLAMA 3.1 (70B) | 4 | Creative Writing | [200:2024] | 8 | 2.35 | 2782 | 69.44 | 25.09 | 50826.51
LLAMA 3.1 (70B) + H100
Model | GPUs | Task Type | Input-Output | Request Rate | Requests/Sec | Throughput (tokens/sec) | TTFT (ms) | TPOT (ms) | Total Latency (ms)
LLAMA 3.1 (70B) | 4 | Simple Chat | [128:128] | 24 | 15 | 3438 | 84.84 | 30.63 | 3976.12
LLAMA 3.1 (70B) | 4 | Summarization / RAG | [2024:128] | 4 | 3.45 | 5884.31 | 461.72 | 48.43 | 6612.33
LLAMA 3.1 (70B) | 4 | Classification | [1024:30] | 10 | 6.95 | 7129 | 331.69 | 62.41 | 2141.58
LLAMA 3.1 (70B) | 4 | Creative Writing | [200:2024] | 8 | 2.04 | 2626.71 | 74.20 | 27.31 | 55322.33

Offline Performance

While serving inference over an API often introduces network latency and concurrency overhead, in offline or batch applications both factors are largely under your control.

Consider a video generation pipeline that runs asynchronously in the background and writes the generated video to an object storage bucket. In such scenarios, raw token throughput is the key metric to optimize for.

In this section, we benchmark LLM Inference in offline/batch applications that have local GPU access.

Benchmarking Framework: TensorRT-LLM benchmarking tool (trtllm-bench). The tables below report token throughput (tokens/sec).

LLAMA 3.1 (8B)

Model | TP / GPUs | Input - Output | H200 | H100 | H100 | H100 | A100
LLaMA 3.1 (8B) | 1 | 128, 128 | 30637.29 | 27006.16 | 15861.63 | 16119.63 | 6051.62
LLaMA 3.1 (8B) | 1 | 128, 2048 | 20787.93 | 20190.18 | 9536.49 | 9625.49 | 4910.03
LLaMA 3.1 (8B) | 1 | 128, 4096 | 12348.35 | 12876.06 | 5357.98 | 5807.98 | 3157.45
LLaMA 3.1 (8B) | 1 | 500, 2000 | 19079.02 | 15363.20 | 8043.10 | 7927.10 | 3897.62
LLaMA 3.1 (8B) | 1 | 1000, 1000 | 15807.43 | 14555.28 | 7425.07 | 7513.07 | 3747.46
LLaMA 3.1 (8B) | 1 | 2048, 128 | 3570.99 | 3095.41 | 1625.07 | 2000.07 | 905.56
LLaMA 3.1 (8B) | 1 | 2048, 2048 | 9026.93 | 8632.42 | 4055.61 | 4128.61 | 1826.86
LLaMA 3.1 (8B) | 1 | 5000, 500 | 3782.15 | 2979.61 | 1555.20 | 1784.20 | 831.09
LLaMA 3.1 (8B) | 1 | 20000, 2000 | 1487.86 | 1503.69 | 852.58 | 528.58 | 81.47

LLAMA 3.1 (70B)

Model | TP / GPUs | Input - Output | H200 (FP8) | H100 (FP8) | H100 (mixed)
LLaMA 3.1 (70B) | 1 | 128, 128 | 3959.23 | 2633.65 | -
LLaMA 3.1 (70B) | 1 | 128, 2048 | 1787.33 | 709.73 | -
LLaMA 3.1 (70B) | 1 | 128, 4096 | 883.15 | - | -
LLaMA 3.1 (70B) | 1 | 500, 2000 | 1144.49 | 868.24 | -
LLaMA 3.1 (70B) | 1 | 1000, 1000 | 1535.40 | 646.07 | -
LLaMA 3.1 (70B) | 1 | 2048, 128 | 528.06 | 325.44 | -
LLaMA 3.1 (70B) | 1 | 2048, 2048 | 722.55 | 427.02 | -
LLaMA 3.1 (70B) | 1 | 5000, 500 | 394.32 | 169.98 | -
LLaMA 3.1 (70B) | 1 | 20000, 2000 | 121.02 | - | -
LLaMA 3.1 (70B) | 4 | 128, 128 | 10226.19 | 6085.74 | -
LLaMA 3.1 (70B) | 4 | 128, 2048 | 11155.97 | 5860.27 | -
LLaMA 3.1 (70B) | 4 | 128, 4096 | 7454.89 | 3400.03 | -
LLaMA 3.1 (70B) | 4 | 500, 2000 | 9670.33 | 4419.09 | -
LLaMA 3.1 (70B) | 4 | 1000, 1000 | 7081.35 | 4141.52 | -
LLaMA 3.1 (70B) | 4 | 2048, 128 | 1558.33 | 772.81 | -
LLaMA 3.1 (70B) | 4 | 2048, 2048 | 4777.41 | 2650.39 | -
LLaMA 3.1 (70B) | 4 | 5000, 500 | 987.63 | 787.96 | -
LLaMA 3.1 (70B) | 4 | 20000, 2000 | 720.06 | 261.36 | -

Benchmarking Kit

vLLM Scripts

  1. Clone Repo
$ git clone https://github.com/vllm-project/vllm
  2. Start vLLM from a terminal
$ pip install vllm 
$ vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --port 8080 --tensor-parallel-size <gpu-count>
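The 70B serving results above were obtained with 4 GPUs (see the GPUs column), so <gpu-count> would be 4 in that case. Once the server is up, you can optionally sanity-check the OpenAI-compatible endpoint before benchmarking; this is a minimal check, assuming the server is listening on port 8080 as above:
$ curl http://localhost:8080/v1/models
$ curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Meta-Llama-3.1-70B-Instruct", "prompt": "Hello", "max_tokens": 16}'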
  3. Run the benchmark script from another terminal
$ cd vllm/benchmarks
$ python3 benchmark_serving.py --backend openai --host localhost --port 8080 --dataset-name=random --random-input-len=<token-size> --random-output-len=<token-size> --model <model-name> --num-prompts 200 --request-rate <request-rate>
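For example, to run the Summarization / RAG configuration for the 70B model from the tables above (2024 input tokens, 128 output tokens, request rate 4), the placeholders would be filled in as follows:
$ python3 benchmark_serving.py --backend openai --host localhost --port 8080 --dataset-name=random --random-input-len=2024 --random-output-len=128 --model meta-llama/Meta-Llama-3.1-70B-Instruct --num-prompts 200 --request-rate 4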

TensorRT-LLM Scripts

  1. Create a node in TIR with the Tensor-RT LLM Builder image and click the JupyterLab URL
  2. Open a terminal and clone the repo:
$ git clone https://github.com/NVIDIA/TensorRT-LLM
$ cd TensorRT-LLM
  3. Create a dataset
$ python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist --num-requests=$num_requests --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0 > $dataset_file
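The command above expects a few shell variables to be set. As an illustration only (these values are example assumptions, not necessarily the ones used for the published numbers):
$ # example values only; adjust for your own run
$ model_name=meta-llama/Meta-Llama-3.1-8B-Instruct
$ isl=128
$ osl=2048
$ num_requests=1000
$ dataset_file=synthetic_dataset.txt
$ tp_size=1
$ pp_size=1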
  4. Build a TensorRT engine
$ trtllm-bench --model $model_name build --tp_size $tp_size --pp_size $pp_size --quantization FP8 --dataset $dataset_file
  5. Run the benchmark with the dataset
$ trtllm-bench --model $model_name throughput --dataset $dataset_file --engine_dir $engine_dir
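The $engine_dir value should point to the engine produced by the build step (trtllm-bench build typically prints the engine location when it finishes). As a hypothetical example:
$ engine_dir=./engines/llama-3.1-8b-fp8   # hypothetical path; use the one reported by the build step
$ trtllm-bench --model $model_name throughput --dataset $dataset_file --engine_dir $engine_dir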