Design LLM Inference Serving (Triton + vLLM + PagedAttention)
Triton Inference Server, dynamic batching, KV-cache management, tensor parallelism, paged attention.
Theory
Explanation
Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.
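LLM inference is autoregressive: the model generates one token at a time, and each step depends on every token before it. Naive (static) batching wastes compute because requests in a batch finish at different times, so short requests hold idle slots until the longest one completes. Together, continuous batching, a paged KV cache, and tensor parallelism give high throughput, low latency, and room for long contexts.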
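To make the cost of uneven finishing concrete, here is a minimal simulation that counts decode steps for the same workload under static batching and under a scheduler that refills freed slots immediately. It is a sketch under simplifying assumptions: every active request emits exactly one token per step, prefill cost is ignored, and the request lengths and function names are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # remaining output tokens for each active request
    steps = 0
    while waiting or running:
        # Fill free slots from the queue before each decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; drop those that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps


# Output tokens per request (illustrative values, not measurements).
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

In this toy workload the short requests no longer wait for the long ones, so the total step count drops. Real gains depend on the length distribution and on prefill cost, which this model leaves out.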
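Triton (or vLLM) accepts requests and batches them dynamically. Each request carries a prompt and a generation budget (a maximum number of output tokens). With continuous batching, a slot is refilled with a waiting request as soon as its current request finishes, so there are no stalls at batch boundaries. Because requests vary in length, contiguous per-request KV-cache allocations fragment GPU memory. Paged attention avoids this by splitting the KV cache into fixed-size blocks (16 tokens here) and allocating them on demand, much like pages in virtual memory. Tensor parallelism splits attention heads across GPUs so that models too large for one device can still be served. Speculative decoding adds a small draft model that proposes several tokens, which the large model then verifies in a single parallel pass. This can yield roughly a 2-3x speedup, depending on how often the draft tokens are accepted.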
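The paged KV cache idea can be sketched as a free list of fixed-size blocks plus a per-request block table. The code below is a toy illustration, not vLLM's implementation: `BlockAllocator`, its methods, and the model shape used for the sizing calculation are all hypothetical. Only the 16-token block size comes from the text.

```python
BLOCK_TOKENS = 16  # block size mentioned in the text


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of K and V stored for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


class BlockAllocator:
    """Toy paged KV cache: a free list of fixed-size physical blocks."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_token(self, req_id):
        """Reserve room for one more token; allocate a block only when needed."""
        table = self.block_tables.setdefault(req_id, [])
        count = self.num_tokens.get(req_id, 0)
        if count == len(table) * BLOCK_TOKENS:  # last block full, or none yet
            if not self.free:
                raise MemoryError("no free KV blocks; caller must queue or preempt")
            table.append(self.free.pop())
        self.num_tokens[req_id] = count + 1

    def release(self, req_id):
        """Return all blocks of a finished request to the free list."""
        self.free.extend(self.block_tables.pop(req_id, []))
        self.num_tokens.pop(req_id, None)


alloc = BlockAllocator(num_blocks=8)
for _ in range(20):
    alloc.append_token("a")  # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("b")  # 5 tokens need 1 block
blocks_a = len(alloc.block_tables["a"])
blocks_b = len(alloc.block_tables["b"])
alloc.release("a")
free_after_release = len(alloc.free)

# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 2-byte values.
per_token = kv_bytes_per_token(32, 8, 128)
print(blocks_a, blocks_b, free_after_release, per_token)
```

Because blocks are allocated only when the previous one fills up, the waste per request is bounded by one partially filled block, and blocks freed by a finished request can be reused by any other request regardless of length.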
When to use
High-throughput LLM serving (chat, agents, code completion).
When not to
Single-user batch jobs. Tiny models that fit on CPU.
flowchart TB
  Req1[Request 1 · 100 tokens] --> Queue[Continuous Batcher]
  Req2[Request 2 · 50 tokens] --> Queue
  Req3[Request 3 · 200 tokens] --> Queue
  Queue --> Engine[vLLM Engine]
  Engine --> Paged[(Paged KV Cache · 16-token blocks)]
  Engine --> TP[Tensor Parallel · 8 GPUs]
  TP --> GPUs[8 H100 GPUs]
  Engine -.token stream.-> Req1
  Engine -.token stream.-> Req2
  Engine -.token stream.-> Req3
  Spec[Draft Model · 1B] -.proposes.-> Engine
  Engine -.verifies.-> Spec
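The tensor-parallel stage in the diagram can be illustrated with a column-parallel matrix multiply: each shard holds a slice of the weight columns (for attention, a group of heads), computes its part of the output, and the parts are concatenated. This is a minimal sketch in plain Python where lists stand in for GPUs and concatenation stands in for an all-gather. The helper names and the small matrices are invented for the example.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Give each shard a contiguous slice of columns (e.g. a group of heads)."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    step = cols // num_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(num_shards)]


def column_parallel_matmul(x, w, num_shards):
    """Each shard computes its slice of the output; slices are concatenated."""
    out = []
    for shard in split_columns(w, num_shards):
        out.extend(matmul(x, shard))  # stands in for an all-gather across GPUs
    return out


x = [1, 2, 3]
w = [[1, 0, 2, 1],
     [0, 1, 1, 3],
     [2, 1, 0, 1]]
full = matmul(x, w)
sharded = column_parallel_matmul(x, w, num_shards=2)
print(full, sharded)
```

Each shard stores only its share of the weights, which is what lets a model larger than one GPU's memory be served. The price is a communication step per layer to recombine the partial results.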
Key insights
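- KV cache is the memory hog: for every request it grows linearly with sequence length. PagedAttention manages it in fixed-size blocks, like pages in virtual memory.
- Continuous batching serves N requests concurrently and refills finished slots mid-batch, for a reported 2-10x throughput gain over static batching.
- Tensor parallelism is for models that do not fit on one GPU, typically those around 70B parameters and larger.
- Speculative decoding: a small model proposes K tokens and the large model verifies them in one pass. Under greedy decoding, the longest prefix that matches the large model's own choices is accepted; with sampling, each draft token is accepted with a probability based on the ratio of the two models' probabilities.
- Time-to-first-token (TTFT) and throughput pull in different directions: an optimization that helps one often hurts the other (for example, larger batches raise throughput but can delay the first token).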
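The verification step of speculative decoding can be sketched for the greedy case as follows. The function name and token ids are hypothetical. The sketch assumes the target model's greedy choice at each draft position (plus one position past the last draft token) is already available from a single forward pass. The sampling variant, which accepts each draft token with probability min(1, p_target / p_draft), is not shown.

```python
def verify_greedy(draft_tokens, target_argmax):
    """Return the tokens to emit for one speculative step under greedy decoding.

    draft_tokens:  K tokens proposed by the draft model.
    target_argmax: K + 1 tokens, the target model's greedy choice at each
                   draft position plus one position past the last draft token.
    """
    assert len(target_argmax) == len(draft_tokens) + 1
    accepted = []
    for draft, target in zip(draft_tokens, target_argmax):
        if draft != target:
            # First mismatch: emit the target's token and discard the rest,
            # since later target outputs were conditioned on a wrong token.
            accepted.append(target)
            return accepted
        accepted.append(draft)
    # All K drafts matched, so the extra target token is also valid.
    accepted.append(target_argmax[-1])
    return accepted


partial = verify_greedy([5, 9, 2, 7], [5, 9, 4, 1, 3])
all_match = verify_greedy([5, 9, 2], [5, 9, 2, 8])
print(partial, all_match)
```

Every call emits at least one token that the target model itself would have produced, so the output matches plain greedy decoding with the large model. The speedup comes from emitting several tokens per large-model pass when the draft model guesses well.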