Serve LLM Inference

vLLM and Text Generation Inference (TGI) are the two production-grade LLM inference servers in common use on E2E GPU nodes. Both serve an OpenAI-compatible HTTP API, both ship as Docker images with CUDA pre-installed, and both support batched generation, paged attention, and quantization.

This guide shows the minimal pattern for each. For container basics, see Run GPU Workloads in Docker.


Pick a Card

The card has to fit the model weights plus the KV cache for your maximum context length and concurrency.

| Model size | Minimum card (FP16) | Comfortable card |
| --- | --- | --- |
| 7B | L4 24 GB (tight), A30 24 GB | L40S 48 GB, A100 40 GB |
| 13B | L40S 48 GB, A100 40 GB | A100 80 GB, H100 80 GB |
| 34B | A100 80 GB, H100 80 GB | 2× A100 80 GB, 2× H100 |
| 70B | 2× A100 80 GB, 2× H100 80 GB | 4× A100 80 GB, 4× H100 |
| 70B with long context (>16K) | 4× A100 80 GB | 4× H100 (FP8 cuts memory by ~50%) |
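
As a rough cross-check of the table, weight memory is approximately the parameter count times the bytes per parameter. The sketch below is illustrative only: `weights_gb` is a hypothetical helper name, and the figure it prints excludes the KV cache, activations, and runtime overhead, all of which need headroom on top.

```shell
#!/usr/bin/env bash
# weights_gb PARAMS_BILLIONS BYTES_PER_PARAM
# Prints the approximate weight footprint in GB (10^9 bytes):
# one billion parameters at N bytes each is N GB.
weights_gb() {
  local params_b=$1 bytes_per_param=$2
  echo $(( params_b * bytes_per_param ))
}

weights_gb 7 2    # 7B in FP16  -> 14
weights_gb 70 2   # 70B in FP16 -> 140
weights_gb 70 1   # 70B in FP8  -> 70
```

A 7B model in FP16 therefore leaves about 10 GB of a 24 GB card for everything else, which is why the table marks that pairing as tight.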

Sharding across multiple cards costs latency. For latency-sensitive workloads, prefer a card whose memory comfortably fits the model on its own.

For full card comparisons, see Choose a GPU Card.


Serve a Model with vLLM

vLLM is typically the higher-throughput option for batched generation. The image exposes an OpenAI-compatible API on port 8000.

docker run -d --name vllm \
--gpus all \
--restart unless-stopped \
-p 127.0.0.1:8000:8000 \
-v /data/models:/root/.cache/huggingface \
-e HF_TOKEN=<your-token> \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192

For multi-GPU tensor parallelism:

docker run -d --name vllm \
--gpus all \
--restart unless-stopped \
--shm-size=16g \
-p 127.0.0.1:8000:8000 \
-v /data/models:/root/.cache/huggingface \
-e HF_TOKEN=<your-token> \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 8192
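
Before launching with --tensor-parallel-size, it can help to confirm that the node exposes at least that many GPUs. This is a minimal sketch, assuming `nvidia-smi` is available on the host; `check_tp_size` is a hypothetical helper name and not part of vLLM.

```shell
#!/usr/bin/env bash
# check_tp_size TP [GPU_COUNT]
# Succeeds if at least TP GPUs are available. GPU_COUNT is optional;
# when omitted it is read from `nvidia-smi -L`, which prints one
# "GPU <index>: ..." line per physical GPU.
check_tp_size() {
  local tp=$1 gpus=${2:-}
  if [ -z "$gpus" ]; then
    gpus=$(nvidia-smi -L | grep -c '^GPU ')
  fi
  if [ "$gpus" -lt "$tp" ]; then
    echo "need $tp GPUs for --tensor-parallel-size $tp, found $gpus" >&2
    return 1
  fi
}

# Usage: check_tp_size 4 && echo "enough GPUs for the 70B example"
```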

Test it from the same node. The model field must match the value passed to --model when the container was started:

curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 32}'
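
The first request can fail if it is sent while the weights are still downloading or loading. One way to handle this is to poll until the server responds. The sketch below assumes the vLLM OpenAI-compatible server answers GET /health with HTTP 200 once it is ready; `wait_for_ready` is a hypothetical helper name, and the 600-second default timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# wait_for_ready URL [TIMEOUT_SECONDS]
# Polls URL every 5 seconds until curl gets a successful HTTP response,
# or gives up with a non-zero status once TIMEOUT_SECONDS have elapsed.
wait_for_ready() {
  local url=$1 timeout=${2:-600} waited=0
  until curl -sf -o /dev/null "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for $url" >&2
      return 1
    fi
    sleep 5
    waited=$(( waited + 5 ))
  done
}

# Usage: wait_for_ready http://localhost:8000/health && echo "vLLM is up"
```

If the wait times out, `docker logs vllm` usually shows whether the container is still loading or has exited with an error.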

Serve a Model with TGI

Hugging Face TGI is the alternative when you want tight Hugging Face Hub integration, safetensors loading, and token streaming. The server listens on port 80 inside the container; the command below publishes it on host port 8080.

docker run -d --name tgi \
--gpus all \
--restart unless-stopped \
--shm-size=16g \
-p 127.0.0.1:8080:80 \
-v /data/models:/data \
-e HF_TOKEN=<your-token> \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-3.1-8B-Instruct \
--max-input-length 4096 \
--max-total-tokens 8192

For multi-GPU sharding, add --num-shard <N>, where N is the number of GPUs to shard across, and use a multi-card plan.

Test it from the same node:

curl http://localhost:8080/generate \
-H "Content-Type: application/json" \
-d '{"inputs": "Hello", "parameters": {"max_new_tokens": 32}}'
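
The /generate route above is TGI's native API. For clients written against the OpenAI API, a streamed chat request might look like the sketch below. It assumes a TGI version that includes the Messages API at /v1/chat/completions; `tgi_chat` is a hypothetical helper name, and "tgi" is only a placeholder value for the model field, since the container serves a single model.

```shell
#!/usr/bin/env bash
# tgi_chat PROMPT [BASE_URL]
# Sends one user message to the OpenAI-style chat route and streams the
# reply as server-sent events (-N turns off curl's output buffering).
# PROMPT is pasted into the JSON body as-is, so it must not contain
# double quotes or backslashes.
tgi_chat() {
  local prompt=$1 base_url=${2:-http://localhost:8080}
  curl -sN "$base_url/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"tgi\", \"stream\": true, \"max_tokens\": 32, \"messages\": [{\"role\": \"user\", \"content\": \"$prompt\"}]}"
}

# Usage: tgi_chat "Hello"
```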

Exposing the API

Both patterns above publish the API port only on 127.0.0.1 (the -p 127.0.0.1:... mapping), so the API is reachable only from the node itself. That is intentional: do not open the inference port directly to the internet.

Recommended ways to expose the API to clients:

| Path | When to use it |
| --- | --- |
| SSH local-forward tunnel (ssh -NL 8000:localhost:8000 root@<node-ip>) | Personal access, demos, internal use. No port exposed to the internet. |
| Application Load Balancer in front of the node | Production traffic. Adds TLS, rate limiting, and request logging. |
| Same-VPC client (private IP only) | Internal services calling the model from another node in the same VPC. |

If you must expose the port directly, restrict the security group to specific source IPs. Never allow 0.0.0.0/0 on an inference endpoint.
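
To confirm that a running container is published on loopback only, the output of `docker port` can be checked. The sketch below assumes the usual output format of one `<port>/<proto> -> <host-ip>:<host-port>` line per mapping; `only_loopback` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# only_loopback
# Reads `docker port <container>` output on stdin and succeeds only if
# every published port is bound to a loopback address.
only_loopback() {
  local line host exposed=0
  while IFS= read -r line; do
    [ -z "$line" ] && continue
    host=${line##* -> }          # e.g. "127.0.0.1:8000"
    case $host in
      127.*|"[::1]:"*) ;;        # loopback binding, fine
      *) echo "non-loopback binding: $line" >&2; exposed=1 ;;
    esac
  done
  return "$exposed"
}

# Usage: docker port vllm | only_loopback && echo "loopback only"
```

A passing check covers only the Docker port mapping; security group rules still apply to any port that is published on other interfaces.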


Sizing Reminders

  • Pick a card whose memory fits the model weights plus the KV cache for your maximum context length and concurrency.
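  • Watch GPU used memory in the Monitoring tab. If it stays near 100% or the server reports out-of-memory errors, lower the context limit (--max-model-len for vLLM, --max-total-tokens for TGI) or reduce concurrency. vLLM reserves a fixed share of GPU memory at startup (--gpu-memory-utilization, 0.9 by default), so high used memory on its own is expected there.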
  • Persistent low GPU utilization usually means the request rate is the bottleneck, not the card. Increase concurrency or batch size before moving to a bigger card.
  • For long-context models (>16K tokens), the KV cache dominates memory. Use FP8 on H100 to roughly halve KV cache memory relative to FP16.
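
The KV cache term in these reminders can be estimated with simple arithmetic: each token stores one key and one value vector per layer and per KV head. The sketch below is an upper bound that treats every concurrent sequence as filled to the full context length. `kv_cache_mib` is a hypothetical helper name, and the shape used (32 layers, 8 KV heads, head dimension 128) is assumed to match Llama-3.1-8B; check the model's config.json for the real values.

```shell
#!/usr/bin/env bash
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM BYTES_PER_VALUE CONTEXT_TOKENS SEQUENCES
# Bytes per token = 2 (key and value) * layers * kv_heads * head_dim * bytes.
# Prints the total for SEQUENCES full-length sequences, in MiB.
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 bytes=$4 tokens=$5 seqs=$6
  echo $(( 2 * layers * kv_heads * head_dim * bytes * tokens * seqs / 1024 / 1024 ))
}

kv_cache_mib 32 8 128 2 8192 1    # FP16, one 8K sequence   -> 1024
kv_cache_mib 32 8 128 2 8192 16   # FP16, 16 such sequences -> 16384
kv_cache_mib 32 8 128 1 8192 16   # FP8, 16 such sequences  -> 8192
```

Under these assumptions, 16 concurrent 8K-token sequences need about 16 GiB of KV cache in FP16, which is more than the FP16 weights of an 8B model, and one byte per value halves that figure.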

For metric-driven troubleshooting, see Troubleshoot GPU Nodes and the monitoring guidance in Manage GPU Nodes.


| Resource | Use it for |
| --- | --- |
| Run GPU Workloads in Docker | Container basics and toolkit install. |
| Bake and Reuse a GPU Image | Pre-bake vLLM or TGI to cut new-node boot to 3–6 min. |
| Choose a GPU Card | Sizing guidance per model size. |
| Troubleshoot GPU Nodes | OOM, CUDA mismatch, container failures. |
| Application Load Balancer | Production TLS and routing in front of inference endpoints. |
| TIR AI/ML Platform | Managed model endpoints if you don't want to operate the server. |
Last updated on May 26, 2026.