---
title: Serve LLM Inference
---

# Serve LLM Inference

vLLM and Text Generation Inference (TGI) are the two production-grade LLM inference servers in common use on E2E GPU nodes. Both serve an OpenAI-compatible HTTP API, both ship as Docker images with CUDA pre-installed, and both support batched generation, paged attention, and quantization.

This guide shows the minimal pattern for each. For container basics, see [Run GPU Workloads in Docker](./run-gpu-containers).

---

## Pick a Card

The card has to fit the model weights **plus** the KV cache for your maximum context length and concurrency.

| Model size                  | Minimum card (FP16)            | Comfortable card                              |
| --------------------------- | ------------------------------ | --------------------------------------------- |
| 7B                          | L4 24 GB (tight), A30 24 GB    | L40S 48 GB, A100 40 GB                        |
| 13B                         | L40S 48 GB, A100 40 GB         | A100 80 GB, H100 80 GB                        |
| 34B                         | A100 80 GB, H100 80 GB         | 2× A100 80 GB, 2× H100                        |
| 70B                         | 2× A100 80 GB, 2× H100 80 GB   | 4× A100 80 GB, 4× H100                        |
| 70B with long context (>16K)| 4× A100 80 GB                  | 4× H100 (FP8 cuts memory by ~50%)             |

Sharding across multiple cards costs latency. For latency-sensitive workloads, prefer a card whose memory comfortably fits the model on its own.

For full card comparisons, see [Choose a GPU Card](/docs/myaccount/gpu/getting-started/choose-gpu-card).

---

## Serve a Model with vLLM

vLLM is the highest-throughput option for batched generation on most GPUs. The image exposes an OpenAI-compatible API on port `8000`.

```bash
docker run -d --name vllm \
  --gpus all \
  --restart unless-stopped \
  -p 127.0.0.1:8000:8000 \
  -v /data/models:/root/.cache/huggingface \
  -e HF_TOKEN=<your-token> \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192
```

For multi-GPU tensor parallelism:

```bash
docker run -d --name vllm \
  --gpus all \
  --restart unless-stopped \
  --shm-size=16g \
  -p 127.0.0.1:8000:8000 \
  -v /data/models:/root/.cache/huggingface \
  -e HF_TOKEN=<your-token> \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192
```

Test it from the same node:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 32}'
```

---

## Serve a Model with TGI

Hugging Face TGI is the alternative when you want first-class support for streaming, safetensors loading, and Hugging Face Hub integration. The image exposes the API on port `80` inside the container.

```bash
docker run -d --name tgi \
  --gpus all \
  --restart unless-stopped \
  --shm-size=16g \
  -p 127.0.0.1:8080:80 \
  -v /data/models:/data \
  -e HF_TOKEN=<your-token> \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --max-input-length 4096 \
  --max-total-tokens 8192
```

For multi-GPU sharding, add `--num-shard <N>` and use a multi-card plan.

Test it:

```bash
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 32}}'
```

---

## Exposing the API

Both servers in the patterns above bind to `127.0.0.1`. That is intentional — do not open the inference port directly to the internet.

Recommended ways to expose the API to clients:

| Path                                                                                       | When to use it                                                                |
| ------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------- |
| SSH local-forward tunnel (`ssh -NL 8000:localhost:8000 root@<node-ip>`)                    | Personal access, demos, internal use. No port exposed to the internet.        |
| Application Load Balancer in front of the node                                             | Production traffic. Adds TLS, rate limiting, and request logging.             |
| Same-VPC client (private IP only)                                                          | Internal services calling the model from another node in the same VPC.       |

If you must expose the port directly, restrict the security group to specific source IPs — never `0.0.0.0/0` on an inference endpoint.

---

## Sizing Reminders

- Pick a card whose memory fits the model weights **plus** the KV cache for your maximum context length and concurrency.
- Watch GPU used memory in the Monitoring tab. If it stays near 100%, lower `--max-model-len` or reduce concurrency.
- Persistent low GPU utilization usually means the request rate is the bottleneck, not the card. Increase concurrency or batch size before moving to a bigger card.
- For long-context models (>16K tokens), the KV cache dominates memory. Use FP8 on H100 to roughly halve KV cache memory.

For metric-driven troubleshooting, see [Troubleshoot GPU Nodes](/docs/myaccount/gpu/troubleshoot) and the monitoring guidance in [Manage GPU Nodes](/docs/myaccount/gpu/manage).

---

## Related Resources

| Resource                                                                          | Use it for                                                       |
| --------------------------------------------------------------------------------- | ---------------------------------------------------------------- |
| [Run GPU Workloads in Docker](./run-gpu-containers)                               | Container basics and toolkit install.                            |
| [Bake and Reuse a GPU Image](./save-and-reuse-images)                             | Pre-bake vLLM or TGI to cut new-node boot to 3–6 min.            |
| [Choose a GPU Card](/docs/myaccount/gpu/getting-started/choose-gpu-card)          | Sizing guidance per model size.                                  |
| [Troubleshoot GPU Nodes](/docs/myaccount/gpu/troubleshoot)                        | OOM, CUDA mismatch, container failures.                          |
| [Application Load Balancer](/docs/myaccount/appliance/Application-Load-Balancer)  | Production TLS and routing in front of inference endpoints.      |
| [TIR AI/ML Platform](/docs/tir/)                                                  | Managed model endpoints if you don't want to operate the server. |