---
title: Serve LLM Inference
---

# Serve LLM Inference

vLLM and Text Generation Inference (TGI) are the two production-grade LLM inference servers in common use on E2E GPU nodes. Both serve an OpenAI-compatible HTTP API, both ship as Docker images with CUDA pre-installed, and both support batched generation, paged attention, and quantization.

This guide shows the minimal pattern for each. For container basics, see [Run GPU Workloads in Docker](./run-gpu-containers).

---

## Pick a Card

The card has to fit the model weights **plus** the KV cache for your maximum context length and concurrency.

| Model size                   | Minimum card (FP16)          | Comfortable card                  |
| ---------------------------- | ---------------------------- | --------------------------------- |
| 7B                           | L4 24 GB (tight), A30 24 GB  | L40S 48 GB, A100 40 GB            |
| 13B                          | L40S 48 GB, A100 40 GB       | A100 80 GB, H100 80 GB            |
| 34B                          | A100 80 GB, H100 80 GB       | 2× A100 80 GB, 2× H100            |
| 70B                          | 2× A100 80 GB, 2× H100 80 GB | 4× A100 80 GB, 4× H100            |
| 70B with long context (>16K) | 4× A100 80 GB                | 4× H100 (FP8 cuts memory by ~50%) |

Sharding across multiple cards costs latency. For latency-sensitive workloads, prefer a card whose memory comfortably fits the model on its own.

For full card comparisons, see [Choose a GPU Card](/docs/myaccount/gpu/getting-started/choose-gpu-card).

---

## Serve a Model with vLLM

vLLM is the highest-throughput option for batched generation on most GPUs. The image exposes an OpenAI-compatible API on port `8000`.
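Before launching, the sizing rule from Pick a Card (weights plus KV cache) can be checked with rough arithmetic. The sketch below is an estimate only. The layer count, KV-head count, head dimension, and parameter count are assumed values for an 8B Llama-style model and should be read from the model's `config.json`; the function names are hypothetical. The result ignores activations, CUDA context, and server overhead, so treat it as a lower bound.

```shell
# Rough FP16 memory estimate: model weights + KV cache (integer arithmetic only).
# All model figures passed in below are ASSUMPTIONS for illustration.

# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value
kv_bytes_per_token() {
  local layers=$1 kv_heads=$2 head_dim=$3 bytes=$4
  echo $(( 2 * layers * kv_heads * head_dim * bytes ))
}

# Estimated total in MiB for a context length and a number of concurrent sequences.
# params_millions: parameter count in millions; bytes: bytes per weight (2 for FP16).
estimate_mib() {
  local params_millions=$1 bytes=$2 per_token=$3 ctx=$4 seqs=$5
  local weights=$(( params_millions * 1000000 * bytes ))
  local kv=$(( per_token * ctx * seqs ))
  echo $(( (weights + kv) / 1048576 ))
}

# Assumed shape: 32 layers, 8 KV heads, head dimension 128, 2-byte values.
per_token=$(kv_bytes_per_token 32 8 128 2)
echo "KV cache per token: ${per_token} bytes"

# Assumed load: ~8000M parameters, 8192-token context, 8 concurrent sequences.
echo "Estimated total: $(estimate_mib 8000 2 "$per_token" 8192 8) MiB"
```

If the estimate lands close to the card's memory, lowering the context length or the number of concurrent sequences is the cheapest adjustment. With a rough figure in hand, the single-GPU launch looks like this: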
```bash
docker run -d --name vllm \
  --gpus all \
  --restart unless-stopped \
  -p 127.0.0.1:8000:8000 \
  -v /data/models:/root/.cache/huggingface \
  -e HF_TOKEN= \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192
```

For multi-GPU tensor parallelism:

```bash
docker run -d --name vllm \
  --gpus all \
  --restart unless-stopped \
  --shm-size=16g \
  -p 127.0.0.1:8000:8000 \
  -v /data/models:/root/.cache/huggingface \
  -e HF_TOKEN= \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192
```

Test it from the same node:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 32}'
```

---

## Serve a Model with TGI

Hugging Face TGI is the alternative when you want first-class support for streaming, safetensors loading, and Hugging Face Hub integration. The image exposes the API on port `80` inside the container.

```bash
docker run -d --name tgi \
  --gpus all \
  --restart unless-stopped \
  --shm-size=16g \
  -p 127.0.0.1:8080:80 \
  -v /data/models:/data \
  -e HF_TOKEN= \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --max-input-length 4096 \
  --max-total-tokens 8192
```

For multi-GPU sharding, add `--num-shard ` and use a multi-card plan.

Test it:

```bash
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 32}}'
```

---

## Exposing the API

Both servers in the patterns above bind to `127.0.0.1`. That is intentional: do not open the inference port directly to the internet.
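One way to confirm the loopback binding is to inspect the published port mappings. The sketch below assumes the line format printed by `docker port <container>` (for example `8000/tcp -> 127.0.0.1:8000`); the helper name `loopback_only` is hypothetical. It reads mappings on stdin and succeeds only if every one is bound to `127.0.0.1`.

```shell
# Succeeds (exit 0) only if every port mapping on stdin is bound to 127.0.0.1.
# Fails if any mapping uses another address, or if there are no mappings at all.
loopback_only() {
  awk '
    NF == 0 { next }                        # skip blank lines
    { seen = 1 }                            # at least one mapping was found
    $NF !~ /^127\.0\.0\.1:/ { bad = 1 }     # last field is host_address:port
    END { if (seen && !bad) exit 0; exit 1 }
  '
}

# Demonstration with sample lines (no Docker needed to run this part).
if printf '8000/tcp -> 127.0.0.1:8000\n' | loopback_only; then
  echo "loopback only"
fi
if ! printf '8000/tcp -> 0.0.0.0:8000\n' | loopback_only; then
  echo "exposed on a non-loopback address"
fi
```

On the node, `docker port vllm | loopback_only` applies the same check to the vLLM container from the pattern above, and `docker port tgi | loopback_only` does the same for TGI.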
Recommended ways to expose the API to clients:

| Path                                                           | When to use it                                                         |
| -------------------------------------------------------------- | ---------------------------------------------------------------------- |
| SSH local-forward tunnel (`ssh -NL 8000:localhost:8000 root@`) | Personal access, demos, internal use. No port exposed to the internet. |
| Application Load Balancer in front of the node                 | Production traffic. Adds TLS, rate limiting, and request logging.      |
| Same-VPC client (private IP only)                              | Internal services calling the model from another node in the same VPC. |

If you must expose the port directly, restrict the security group to specific source IPs. Never allow `0.0.0.0/0` on an inference endpoint.

---

## Sizing Reminders

- Pick a card whose memory fits the model weights **plus** the KV cache for your maximum context length and concurrency.
- Watch GPU used memory in the Monitoring tab. If it stays near 100%, lower `--max-model-len` or reduce concurrency.
- Persistent low GPU utilization usually means the request rate is the bottleneck, not the card. Increase concurrency or batch size before moving to a bigger card.
- For long-context models (>16K tokens), the KV cache dominates memory. Use FP8 on H100 to roughly halve KV cache memory.

For metric-driven troubleshooting, see [Troubleshoot GPU Nodes](/docs/myaccount/gpu/troubleshoot) and the monitoring guidance in [Manage GPU Nodes](/docs/myaccount/gpu/manage).

---

## Related Resources

| Resource                                                                         | Use it for                                                       |
| -------------------------------------------------------------------------------- | ---------------------------------------------------------------- |
| [Run GPU Workloads in Docker](./run-gpu-containers)                              | Container basics and toolkit install.                            |
| [Bake and Reuse a GPU Image](./save-and-reuse-images)                            | Pre-bake vLLM or TGI to cut new-node boot to 3–6 min.            |
| [Choose a GPU Card](/docs/myaccount/gpu/getting-started/choose-gpu-card)         | Sizing guidance per model size.                                  |
| [Troubleshoot GPU Nodes](/docs/myaccount/gpu/troubleshoot)                       | OOM, CUDA mismatch, container failures.                          |
| [Application Load Balancer](/docs/myaccount/appliance/Application-Load-Balancer) | Production TLS and routing in front of inference endpoints.      |
| [TIR AI/ML Platform](/docs/tir/)                                                 | Managed model endpoints if you don't want to operate the server. |