
Design LLM Inference Serving (Triton + vLLM + PagedAttention)

Triton Inference Server, dynamic batching, KV-cache management, tensor parallelism, paged attention.

expert · 5h · cuda · ml-ai · python · system-design

Theory

Explanation

Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.

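LLM inference is autoregressive: the model generates one token at a time, each conditioned on all previous tokens. Naive (static) batching wastes compute because requests in a batch finish at different times, leaving slots idle until the longest one completes. Combining continuous batching, a paged KV cache, and tensor parallelism yields high throughput, low latency, and support for long contexts.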
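To make the cost of uneven finishing concrete, here is a minimal sketch that counts decode steps under static and continuous batching. It is a toy model under stated assumptions: every decode step costs the same, prefill is ignored, and the function names and request lengths are illustrative rather than taken from Triton or vLLM.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Refill any free slots from the queue before every step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # A request with one token left finishes in this step and leaves.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical generation lengths (tokens per request), two slots.
lengths = [100, 50, 200, 10, 10, 10]
print(static_batching_steps(lengths, batch_size=2))      # 310
print(continuous_batching_steps(lengths, batch_size=2))  # 250
```

In this toy workload the saving comes entirely from short requests no longer waiting behind long ones; the gain grows as request lengths become more varied.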

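Triton (or vLLM) accepts requests and batches them dynamically. Each request carries a prompt and a generation budget (a maximum number of new tokens).

Continuous batching: as soon as one request finishes, its slot is filled by a waiting request, so the engine never stalls at a batch boundary.

Paged attention: because requests vary in length, contiguous per-request KV-cache allocations fragment GPU memory. PagedAttention splits the KV cache into fixed-size blocks (16 tokens each in this design) and allocates them on demand, the way an operating system pages virtual memory.

Tensor parallelism: for models too large for one GPU, the weights are sharded across GPUs, with attention heads divided among them.

Speculative decoding: a small draft model proposes several tokens, and the large model verifies them in a single parallel pass. Speedups of 2-3x are possible when most proposed tokens are accepted.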
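The paged KV cache idea can be sketched as a block table per sequence plus a shared free list. This is a bookkeeping-only illustration, not vLLM's implementation: the class `PagedKVAllocator` and its methods are hypothetical names, and no tensors are stored.

```python
class PagedKVAllocator:
    """Toy block-table bookkeeping for a paged KV cache (stores no tensors)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; return (block_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("out of KV blocks; caller must queue or preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8)
for _ in range(20):
    alloc.append_token("a")  # 20 tokens need 2 blocks of 16
for _ in range(5):
    alloc.append_token("b")  # 5 tokens need 1 block
print(len(alloc.block_tables["a"]), len(alloc.block_tables["b"]))  # 2 1
alloc.free("a")
print(len(alloc.free_blocks))  # 7
```

Because blocks are allocated only when a sequence crosses a block boundary, the wasted space per sequence is at most one partially filled block, and freed blocks are immediately reusable by any other sequence.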

When to use

High-throughput LLM serving (chat, agents, code completion).

When not to

Single-user offline batch jobs, where there is no concurrent traffic to batch. Tiny models that run fast enough on CPU.

flowchart TB
  Req1[Request 1 · 100 tokens] --> Queue[Continuous Batcher]
  Req2[Request 2 · 50 tokens] --> Queue
  Req3[Request 3 · 200 tokens] --> Queue
  Queue --> Engine[vLLM Engine]
  Engine --> Paged[(Paged KV Cache · 16-token blocks)]
  Engine --> TP[Tensor Parallel · 8 GPUs]
  TP --> GPUs[8 H100 GPUs]
  Engine -.token stream.-> Req1
  Engine -.token stream.-> Req2
  Engine -.token stream.-> Req3
  Spec[Draft Model · 1B] -.proposes.-> Engine
  Engine -.verifies.-> Spec

Key insights

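KV cache is the memory hog: per request, it grows linearly with sequence length (about 2 × layers × KV heads × head dimension × bytes per element for each token). PagedAttention manages it like virtual memory.
Continuous batching serves N requests concurrently and refills finished slots mid-batch. It typically gives 2-10x the throughput of static batching, depending on how much request lengths vary.
Tensor parallelism is for models that do not fit on one GPU, roughly 70B parameters and up at 16-bit precision on 80 GB cards.
Speculative decoding: a small model proposes K tokens and the large model verifies them in one pass. Under greedy decoding, accept the longest prefix that matches the large model's own choices; under sampling, accept or reject each token by comparing the two models' probabilities.
Time-to-first-token (TTFT) vs throughput: the optimizations conflict. Larger batches raise throughput, but queueing and long prefills delay the first token.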
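The speculative decoding insight above can be illustrated for the greedy case. This is a minimal sketch that assumes the target model has already produced its argmax token at each of the K + 1 positions in one forward pass; the function name `accept_greedy` and the token ids are hypothetical. Sampled decoding would replace the equality check with a probabilistic accept/reject test.

```python
def accept_greedy(draft_tokens, target_tokens):
    """Return the tokens emitted by one speculative step under greedy decoding.

    draft_tokens: K tokens proposed by the draft model.
    target_tokens: K + 1 argmax tokens from the target model, where
        target_tokens[i] is its choice given the context plus draft_tokens[:i].
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score K + 1 positions")
    accepted = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(target)
            return accepted
        accepted.append(draft)
    # All K proposals matched, so the extra target token comes for free.
    accepted.append(target_tokens[-1])
    return accepted


print(accept_greedy([5, 7, 9], [5, 7, 2, 4]))  # [5, 7, 2]
print(accept_greedy([5, 7, 9], [5, 7, 9, 4]))  # [5, 7, 9, 4]
```

Each step emits at least one token, so the output always matches what the target model alone would have produced; the speedup depends on how often the draft's proposals are accepted.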