Serve LLM Inference
vLLM and Text Generation Inference (TGI) are the two production-grade LLM inference servers in common use on E2E GPU nodes. Both serve an OpenAI-compatible HTTP API, both ship as Docker images with CUDA pre-installed, and both support batched generation, paged attention, and quantization.
This guide shows the minimal pattern for each. For container basics, see Run GPU Workloads in Docker.
Pick a Card
The card has to fit the model weights plus the KV cache for your maximum context length and concurrency.
| Model size | Minimum card (FP16) | Comfortable card |
|---|---|---|
| 7B | L4 24 GB (tight), A30 24 GB | L40S 48 GB, A100 40 GB |
| 13B | L40S 48 GB, A100 40 GB | A100 80 GB, H100 80 GB |
| 34B | A100 80 GB, H100 80 GB | 2× A100 80 GB, 2× H100 |
| 70B | 2× A100 80 GB, 2× H100 80 GB | 4× A100 80 GB, 4× H100 |
| 70B with long context (>16K) | 4× A100 80 GB | 4× H100 (FP8 cuts memory by ~50%) |
Sharding across multiple cards costs latency. For latency-sensitive workloads, prefer a card whose memory comfortably fits the model on its own.
For full card comparisons, see Choose a GPU Card.
Serve a Model with vLLM
vLLM is the highest-throughput option for batched generation on most GPUs. The image exposes an OpenAI-compatible API on port 8000.
docker run -d --name vllm \
--gpus all \
--restart unless-stopped \
-p 127.0.0.1:8000:8000 \
-v /data/models:/root/.cache/huggingface \
-e HF_TOKEN=<your-token> \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192
For multi-GPU tensor parallelism:
docker run -d --name vllm \
--gpus all \
--restart unless-stopped \
--shm-size=16g \
-p 127.0.0.1:8000:8000 \
-v /data/models:/root/.cache/huggingface \
-e HF_TOKEN=<your-token> \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 8192
Test it from the same node:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 32}'
Serve a Model with TGI
Hugging Face TGI is the alternative when you want first-class support for streaming, safetensors loading, and Hugging Face Hub integration. The image exposes the API on port 80 inside the container.
docker run -d --name tgi \
--gpus all \
--restart unless-stopped \
--shm-size=16g \
-p 127.0.0.1:8080:80 \
-v /data/models:/data \
-e HF_TOKEN=<your-token> \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-3.1-8B-Instruct \
--max-input-length 4096 \
--max-total-tokens 8192
For multi-GPU sharding, add --num-shard <N> and use a multi-card plan.
Test it:
curl http://localhost:8080/generate \
-H "Content-Type: application/json" \
-d '{"inputs": "Hello", "parameters": {"max_new_tokens": 32}}'
Exposing the API
Both servers in the patterns above bind to 127.0.0.1. That is intentional — do not open the inference port directly to the internet.
Recommended ways to expose the API to clients:
| Path | When to use it |
|---|---|
SSH local-forward tunnel (ssh -NL 8000:localhost:8000 root@<node-ip>) | Personal access, demos, internal use. No port exposed to the internet. |
| Application Load Balancer in front of the node | Production traffic. Adds TLS, rate limiting, and request logging. |
| Same-VPC client (private IP only) | Internal services calling the model from another node in the same VPC. |
If you must expose the port directly, restrict the security group to specific source IPs — never 0.0.0.0/0 on an inference endpoint.
Sizing Reminders
- Pick a card whose memory fits the model weights plus the KV cache for your maximum context length and concurrency.
- Watch GPU used memory in the Monitoring tab. If it stays near 100%, lower
--max-model-lenor reduce concurrency. - Persistent low GPU utilization usually means the request rate is the bottleneck, not the card. Increase concurrency or batch size before moving to a bigger card.
- For long-context models (>16K tokens), the KV cache dominates memory. Use FP8 on H100 to roughly halve KV cache memory.
For metric-driven troubleshooting, see Troubleshoot GPU Nodes and the monitoring guidance in Manage GPU Nodes.
Related Resources
| Resource | Use it for |
|---|---|
| Run GPU Workloads in Docker | Container basics and toolkit install. |
| Bake and Reuse a GPU Image | Pre-bake vLLM or TGI to cut new-node boot to 3–6 min. |
| Choose a GPU Card | Sizing guidance per model size. |
| Troubleshoot GPU Nodes | OOM, CUDA mismatch, container failures. |
| Application Load Balancer | Production TLS and routing in front of inference endpoints. |
| TIR AI/ML Platform | Managed model endpoints if you don't want to operate the server. |