Serve LLM Inference

vLLM and Text Generation Inference (TGI) are the two production-grade LLM inference servers in common use on E2E GPU nodes. Both serve an OpenAI-compatible HTTP API, both ship as Docker images with CUDA pre-installed, and both support batched generation, paged attention, and quantization.

This guide shows the minimal pattern for each. For container basics, see Run GPU Workloads in Docker.

Pick a Card vLLM TGI Expose API Sizing

Pick a Card

The card has to fit the model weights plus the KV cache for your maximum context length and concurrency.

Model size	Minimum card (FP16)	Comfortable card
7B	L4 24 GB (tight), A30 24 GB	L40S 48 GB, A100 40 GB
13B	L40S 48 GB, A100 40 GB	A100 80 GB, H100 80 GB
34B	A100 80 GB, H100 80 GB	2× A100 80 GB, 2× H100
70B	2× A100 80 GB, 2× H100 80 GB	4× A100 80 GB, 4× H100
70B with long context (>16K)	4× A100 80 GB	4× H100 (FP8 cuts memory by ~50%)

Sharding across multiple cards costs latency. For latency-sensitive workloads, prefer a card whose memory comfortably fits the model on its own.

For full card comparisons, see Choose a GPU Card.

Serve a Model with vLLM

vLLM is the highest-throughput option for batched generation on most GPUs. The image exposes an OpenAI-compatible API on port 8000.

docker run -d --name vllm \
  --gpus all \
  --restart unless-stopped \
  -p 127.0.0.1:8000:8000 \
  -v /data/models:/root/.cache/huggingface \
  -e HF_TOKEN=<your-token> \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192

For multi-GPU tensor parallelism:

docker run -d --name vllm \
  --gpus all \
  --restart unless-stopped \
  --shm-size=16g \
  -p 127.0.0.1:8000:8000 \
  -v /data/models:/root/.cache/huggingface \
  -e HF_TOKEN=<your-token> \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192

Test it from the same node:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 32}'

Serve a Model with TGI

Hugging Face TGI is the alternative when you want first-class support for streaming, safetensors loading, and Hugging Face Hub integration. The image exposes the API on port 80 inside the container.

docker run -d --name tgi \
  --gpus all \
  --restart unless-stopped \
  --shm-size=16g \
  -p 127.0.0.1:8080:80 \
  -v /data/models:/data \
  -e HF_TOKEN=<your-token> \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --max-input-length 4096 \
  --max-total-tokens 8192

For multi-GPU sharding, add --num-shard <N> and use a multi-card plan.

Test it:

curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 32}}'

Exposing the API

Both servers in the patterns above bind to 127.0.0.1. That is intentional - do not open the inference port directly to the internet.

Recommended ways to expose the API to clients:

Path	When to use it
SSH local-forward tunnel (`ssh -NL 8000:localhost:8000 root@<node-ip>`)	Personal access, demos, internal use. No port exposed to the internet.
Application Load Balancer in front of the node	Production traffic. Adds TLS, rate limiting, and request logging.
Same-VPC client (private IP only)	Internal services calling the model from another node in the same VPC.

If you must expose the port directly, restrict the security group to specific source IPs - never 0.0.0.0/0 on an inference endpoint.

Sizing Reminders

Pick a card whose memory fits the model weights plus the KV cache for your maximum context length and concurrency.
Watch GPU used memory in the Monitoring tab. If it stays near 100%, lower --max-model-len or reduce concurrency.
Persistent low GPU utilization usually means the request rate is the bottleneck, not the card. Increase concurrency or batch size before moving to a bigger card.
For long-context models (>16K tokens), the KV cache dominates memory. Use FP8 on H100 to roughly halve KV cache memory.

For metric-driven troubleshooting, see Troubleshoot GPU Nodes and the monitoring guidance in Manage GPU Nodes.

Resource	Use it for
Run GPU Workloads in Docker	Container basics and toolkit install.
Bake and Reuse a GPU Image	Pre-bake vLLM or TGI to cut new-node boot to 3–6 min.
Choose a GPU Card	Sizing guidance per model size.
Troubleshoot GPU Nodes	OOM, CUDA mismatch, container failures.
Application Load Balancer	Production TLS and routing in front of inference endpoints.
TIR AI/ML Platform	Managed model endpoints if you don't want to operate the server.

For AI agents, crawlers, and chatbots: append .md to any /docs/ URL (strip the trailing slash) to fetch the raw markdown source — view this page as markdown.

Last updated on June 26, 2026.

Pick a Card​

Serve a Model with vLLM​

Serve a Model with TGI​

Exposing the API​

Sizing Reminders​

Related Resources​

Pick a Card

Serve a Model with vLLM

Serve a Model with TGI

Exposing the API

Sizing Reminders

Related Resources