Inference Benchmarks
Introduction
E2E Networks provides high-performance LLM inference with TIR, an end-to-end AI/ML platform, leveraging state-of-the-art GPU infrastructure.
This page covers the methodology and performance highlights for latency and throughput of LLM inference on GPUs such as the H100, H200, and A100.
Serving Performance
Generative AI applications often employ an inference engine to perform LLM inference. vLLM is one of the most widely used inference engines, supporting features such as continuous batching and OpenAI API compatibility.
In this section, we evaluate the performance of vLLM at serving concurrent requests with a single GPU (e.g., H100). The numbers reported here are based on the default vLLM configuration to showcase a generic setup; our team can help you tune them further for your specific use case.
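Because vLLM exposes an OpenAI-compatible API, the served endpoint can be exercised with any OpenAI-style client. The request below is a minimal sketch against a locally running server; the host, port, model name, and prompt are placeholders and should be adjusted to match your deployment.
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
          "messages": [{"role": "user", "content": "Summarize vLLM in one sentence."}],
          "max_tokens": 128
        }'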
- Benchmarking Framework: the official vLLM benchmarking script with a random-token dataset
- Metrics:
  - Time to First Token (TTFT): The latency between the initial inference request to the model and the return of the first token.
  - Inter-Token Latency (ITL): The latency between successive tokens after the first (reported as Time Per Output Token, TPOT, in the tables below).
  - Total Latency: TTFT + (number of output tokens × ITL); see the worked example after this list.
  - Token Throughput: The total number of tokens generated per second.
  - Request Throughput: The total number of requests completed per second.
- Inputs:
  - Input and output token lengths based on task type: e.g., summarization often involves a long input and a short output.
  - Request rate: number of requests per second. We settled on these values after trial and error, based on the best results for each combination of input/output tokens.
- Endpoint Protocol: HTTPS
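As a quick sanity check of the Total Latency formula, the arithmetic below uses round, illustrative numbers rather than values from the tables (TTFT = 50 ms, ITL = 16 ms, 128 output tokens):
$ echo $((50 + 128 * 16))   # TTFT + (output tokens * ITL), in ms
2098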
LLAMA 3.1 (8B) + H200
Model | GPUs | Task Type | Input-Output | Request Rate | Requests/Sec | Throughput (tokens/sec) | TTFT (ms) | TPOT (ms) | Total Latency (ms) |
---|---|---|---|---|---|---|---|---|---|
LLAMA3.1 (8B) | 1 | Simple Chat | [128:128] | 64 | 37.32 | 8955 | 54.37 | 16.09 | 2083.26 |
LLAMA3.1 (8B) | 1 | Summarization / RAG | [2024:128] | 12 | 8.95 | 19428 | 204.51 | 32.45 | 4325.66 |
LLAMA3.1 (8B) | 1 | Classification | [1024:30] | 24 | 19.03 | 20088 | 89.05 | 19.20 | 645.85 |
LLAMA3.1 (8B) | 1 | Creative Writing | [200:2024] | 4 | 1.32 | 2543 | 24.32 | 9.30 | 18838.22 |
LLAMA 3.1 (8B) + H100
Model | GPUs | Task Type | Input - Output | Request Rate | Requests/Sec | Throughput | Time to First Token (TTFT) | Time Per Output Token (TPOT) | Total Latency |
---|---|---|---|---|---|---|---|---|---|
LLAMA3.1 (8B) | 1 | Simple Chat | [128:128] | 64 | 35.04 | 8399 tokens/sec | 28.84 ms | 22.27 ms | 2857.12 ms |
LLAMA3.1 (8B) | 1 | Summarization / RAG | [2024:128] | 12 | 8.95 | 19127 tokens/sec | 200.06 ms | 35.51 ms | 4709.83 ms |
LLAMA3.1 (8B) | 1 | Classification | [1024:30] | 24 | 19.03 | 19979 tokens/sec | 92.74 ms | 22.07 ms | 732.77 ms |
LLAMA3.1 (8B) | 1 | Creative Writing | [200:2024] | 8 | 3.42 | 5060 tokens/sec | 47.25 ms | 21.57 ms | 43682.36 ms |
LLAMA 3.1 (70B) + H200
Model | GPUs | Task Type | Input - Output | Request Rate | Requests/Sec | Throughput | Time to First Token (TTFT) | Time Per Output Token (TPOT) | Total Latency |
---|---|---|---|---|---|---|---|---|---|
llama3.1 (70B) | 4 | Simple Chat | [128:128] | 24 | 15.3 | 3542 tokens/sec | 90.59 ms | 29.53 ms | 3840.9 ms |
llama3.1 (70B) | 4 | Summarization / RAG | [2024:128] | 4 | 3.45 | 6139 tokens/sec | 455.96 ms | 46.11 ms | 6311.93 ms |
llama3.1 (70B) | 4 | Classification | [1024:30] | 10 | 7 | 7202 tokens/sec | 341.49 ms | 59.28 ms | 2060.61 ms |
llama3.1 (70B) | 4 | Creative Writing | [200:2024] | 8 | 2.35 | 2782 tokens/sec | 69.44 ms | 25.09 ms | 50826.51 ms |
LLAMA 3.1 (70B) + H100
Model | GPUs | Task Type | Input - Output | Request Rate | Requests/Sec | Throughput | Time to First Token (TTFT) | Time Per Output Token (TPOT) | Total Latency |
---|---|---|---|---|---|---|---|---|---|
llama3.1 (70B) | 4 | Simple Chat | [128:128] | 24 | 15 | 3438 tokens/sec | 84.84 ms | 30.63 ms | 3976.12 ms |
llama3.1 (70B) | 4 | Summarization / RAG | [2024:128] | 4 | 3.45 | 5884.31 tokens/sec | 461.72 ms | 48.43 ms | 6612.33 ms |
llama3.1 (70B) | 4 | Classification | [1024:30] | 10 | 6.95 | 7129 tokens/sec | 331.69 ms | 62.41 ms | 2141.58 ms |
llama3.1 (70B) | 4 | Creative Writing | [200:2024] | 8 | 2.04 | 2626.71 tokens/sec | 74.20 ms | 27.31 ms | 55322.33 ms |
Offline Performance
While serving inference over an API often introduces network latency and concurrency overhead, in offline or batch applications both factors are largely under the application's control.
Consider the example of a video generation pipeline that runs asynchronously in the background and writes the generated video to an object storage bucket. In such scenarios, raw token throughput is the key metric to optimize for.
In this section, we benchmark LLM inference for offline/batch applications that have local GPU access.
Benchmarking Framework: TensorRT-LLM benchmarking tool (trtllm-bench). The values in the tables below are token throughput (tokens/sec).
LLAMA 3.1 (8B)
Model | TP / GPUs | Input - Output | H200 | H100 | H100 | H100 | A100 |
---|---|---|---|---|---|---|---|
LLaMA 3.1 (8B) | 1 | 128, 128 | 30637.29 | 27006.16 | 15861.63 | 16119.63 | 6051.62 |
LLaMA 3.1 (8B) | 1 | 128, 2048 | 20787.93 | 20190.18 | 9536.49 | 9625.49 | 4910.03 |
LLaMA 3.1 (8B) | 1 | 128, 4096 | 12348.35 | 12876.06 | 5357.98 | 5807.98 | 3157.45 |
LLaMA 3.1 (8B) | 1 | 500, 2000 | 19079.02 | 15363.20 | 8043.10 | 7927.10 | 3897.62 |
LLaMA 3.1 (8B) | 1 | 1000, 1000 | 15807.43 | 14555.28 | 7425.07 | 7513.07 | 3747.46 |
LLaMA 3.1 (8B) | 1 | 2048, 128 | 3570.99 | 3095.41 | 1625.07 | 2000.07 | 905.56 |
LLaMA 3.1 (8B) | 1 | 2048, 2048 | 9026.93 | 8632.42 | 4055.61 | 4128.61 | 1826.86 |
LLaMA 3.1 (8B) | 1 | 5000, 500 | 3782.15 | 2979.61 | 1555.20 | 1784.20 | 831.09 |
LLaMA 3.1 (8B) | 1 | 20000, 2000 | 1487.86 | 1503.69 | 852.58 | 528.58 | 81.47 |
LLAMA 3.1 (70B)
Model | TP / GPUs | Input - Output | H200 (FP8) | H100 (FP8) | H100 (mixed) |
---|---|---|---|---|---|
LLaMA 3.1 (70B) | 1 | 128, 128 | 3959.23 | 2633.65 | |
LLaMA 3.1 (70B) | 1 | 128, 2048 | 1787.33 | 709.73 | |
LLaMA 3.1 (70B) | 1 | 128, 4096 | 883.15 | | |
LLaMA 3.1 (70B) | 1 | 500, 2000 | 1144.49 | 868.24 | |
LLaMA 3.1 (70B) | 1 | 1000, 1000 | 1535.40 | 646.07 | |
LLaMA 3.1 (70B) | 1 | 2048, 128 | 528.06 | 325.44 | |
LLaMA 3.1 (70B) | 1 | 2048, 2048 | 722.55 | 427.02 | |
LLaMA 3.1 (70B) | 1 | 5000, 500 | 394.32 | 169.98 | |
LLaMA 3.1 (70B) | 1 | 20000, 2000 | 121.02 | | |
LLaMA 3.1 (70B) | 4 | 128, 128 | 10226.19 | 6085.74 | |
LLaMA 3.1 (70B) | 4 | 128, 2048 | 11155.97 | 5860.27 | |
LLaMA 3.1 (70B) | 4 | 128, 4096 | 7454.89 | 3400.03 | |
LLaMA 3.1 (70B) | 4 | 500, 2000 | 9670.33 | 4419.09 | |
LLaMA 3.1 (70B) | 4 | 1000, 1000 | 7081.35 | 4141.52 | |
LLaMA 3.1 (70B) | 4 | 2048, 128 | 1558.33 | 772.81 | |
LLaMA 3.1 (70B) | 4 | 2048, 2048 | 4777.41 | 2650.39 | |
LLaMA 3.1 (70B) | 4 | 5000, 500 | 987.63 | 787.96 | |
LLaMA 3.1 (70B) | 4 | 20000, 2000 | 720.06 | 261.36 | |
Benchmarking Kit
vLLM Scripts
- Clone Repo
$ git clone https://github.com/vllm-project/vllm
- Start vLLM from a terminal
$ pip install vllm
$ vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --port 8080 --tensor-parallel-size <gpu-count>
- Run the benchmark script from another terminal
$ cd vllm/benchmarks
$ python3 benchmark_serving.py --backend openai --host localhost --port 8080 --dataset-name=random --random-input-len=<token-size> --random-output-len=<token-size> --model <model-name> --num-prompts 200 --request-rate <request-rate>
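As a concrete example, the invocation below plugs in the parameters for the Simple Chat row of the LLaMA 3.1 (8B) tables above (128 input tokens, 128 output tokens, request rate 64); the model ID is the Hugging Face name we assume for the 8B runs.
$ python3 benchmark_serving.py --backend openai --host localhost --port 8080 --dataset-name=random --random-input-len=128 --random-output-len=128 --model meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 200 --request-rate 64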
TensorRT-LLM Scripts
- Create a node in TIR with the TensorRT-LLM Builder image
- Click on the JupyterLab URL and open a terminal to clone the repo:
$ git clone https://github.com/NVIDIA/TensorRT-LLM
$ cd TensorRT-LLM
- Create dataset
$ python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist --num-requests=$num_requests --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0 > $dataset_file
- Build a TensorRT-LLM engine
$ trtllm-bench --model $model_name build --tp_size $tp_size --pp_size $pp_size --quantization FP8 --dataset $dataset_file
- Run a benchmark with a dataset
$ trtllm-bench --model $model_name throughput --dataset $dataset_file --engine_dir $engine_dir
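As an illustration, the variable values below correspond to the LLaMA 3.1 (8B), 128/2048 row of the offline table above; the request count and dataset path are placeholders, and the engine directory is the path printed by the build step.
$ model_name=meta-llama/Meta-Llama-3.1-8B
$ tp_size=1
$ pp_size=1
$ isl=128
$ osl=2048
$ num_requests=1000
$ dataset_file=/tmp/llama31-8b-128-2048.jsonl
$ engine_dir=<path printed by trtllm-bench build>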