System Design Notes All designs

AI / ML Infrastructure

LLM KV-Cache Management

During autoregressive decoding, a transformer re-reads the keys and values of every prior token at every step. Caching those K/V tensors turns generation from quadratic recompute into a single append per token — but the cache grows linearly with sequence length and, across many concurrent requests, becomes the dominant consumer of GPU HBM. KV-cache management is the craft of packing the most sequences into a fixed memory budget: paging it like OS virtual memory, sharing common prefixes, evicting under pressure, and shrinking it with GQA, quantization, and sliding windows.

Background: what the KV cache is

In a causal decoder, each token's query attends over the keys of all earlier tokens and reads a weighted sum of their values. Token t attends to tokens 0..t. The crucial observation: for a frozen model, the K and V vectors of past tokens never change as generation proceeds. A naive decoder would recompute K and V for the entire prefix at every step — quadratic work. Instead we compute each token's K/V once and keep them in the KV cache, so every later step does just one append plus an attention read.

Inference therefore splits into two phases with different cost profiles:

Per sequence the cache has logical shape [layers][2][kv_heads][seq_len][head_dim] — it grows by one slot per layer and head for every generated token. After the model weights, the KV cache is the second major memory consumer, and at serving scale it is the largest variable one. It is usually the cache, not the FLOPs, that caps how many requests you can batch together.

sequenceDiagram
    participant U as Client
    participant E as Inference Engine
    participant KV as KV Cache (HBM)
    U->>E: Prompt of N tokens
    Note over E,KV: Prefill (parallel)
    E->>KV: Compute and store K,V for all N tokens
    E-->>U: First token (TTFT)
    Note over E,KV: Decode (loop per token)
    E->>KV: Append K,V for the new token
    KV-->>E: All past K,V for attention
    E-->>U: Next token (ITL)
      

Functional requirements

Non-functional requirements

Scale & memory math

The per-token cost is a clean back-of-envelope formula:

kv_bytes_per_token = 2 × layers × kv_heads × head_dim × dtype_bytes

The leading 2 counts K and V; dtype_bytes = 2 for FP16/BF16. The product kv_heads × head_dim is the KV hidden size, which equals d_model only under full multi-head attention — grouped-query attention shrinks it directly. Per-sequence cost is just kv_bytes_per_token × seq_len.

Worked example — a 70B model (e.g. Llama-2-70B): layers = 80, d_model = 8192, 64 query heads of head_dim = 128, GQA with 8 KV heads.

Config KV hidden (kv_heads × head_dim) Bytes / token (FP16)
Full MHA (64 KV heads) 8192 ≈ 2.5 MB
GQA (8 KV heads) 1024 ≈ 320 KB
GQA + FP8 KV 1024 ≈ 160 KB

GQA already buys an reduction versus full MHA. Now multiply by context length to see the blow-up (using the realistic GQA figure of 320 KB/token, with the full-MHA worst case for contrast):

Context length Per sequence (GQA) Per sequence (full MHA)
2K tokens ≈ 0.6 GB ≈ 5 GB
8K tokens ≈ 2.5 GB ≈ 20 GB
32K tokens ≈ 10 GB ≈ 80 GB
128K tokens ≈ 40 GB ≈ 320 GB

Put that against hardware. A 70B model in FP16 needs ≈ 140 GB of weights, so it already spans 2× H100 (160 GB), leaving only ≈ 20 GB for KV. At 8K context (2.5 GB/seq) that is roughly 8 concurrent sequences; a single 128K sequence (40 GB) does not even fit. Quantizing weights (FP8/INT4) frees room, but the ceiling is fundamentally set by how well you manage the cache.

The cache OOMs you, not the math

The danger is not steady-state size but reservation. A naive engine pre-allocates a contiguous buffer of max_seq_len per request up front, so 20 GB of free HBM might admit only 2–3 requests even when their actual outputs are short. Reservation plus fragmentation — not arithmetic — is the first thing that kills concurrency, and the first thing paging fixes.

PagedAttention: the core idea

Classic serving stores each sequence's KV in a single contiguous tensor sized to the maximum possible length. That causes two wastes. Internal fragmentation: the slab is reserved for max_seq_len but most sequences finish far earlier, so the tail sits idle. External fragmentation: variable-sized slabs leave gaps that no new sequence fits into. Together these can waste 60–80% of KV memory.

PagedAttention (introduced by vLLM) borrows OS virtual memory. Each sequence's KV is split into fixed-size blocks/pages (e.g. 16 tokens per block). Blocks need not be contiguous in HBM. A per-sequence block table maps each logical block number to a physical block address — exactly a page table mapping virtual pages to physical frames. Blocks are allocated on demand, one at a time, as the sequence generates tokens.

flowchart LR
    subgraph Logical["Logical token sequence (Seq A)"]
        T0["Tokens 0-15"]
        T1["Tokens 16-31"]
        T2["Tokens 32-47"]
    end
    BT["Block Table (per seq)"]
    subgraph Physical["Physical KV blocks in HBM"]
        P7["Block 7"]
        P2["Block 2"]
        P9["Block 9"]
    end
    T0 --> BT
    T1 --> BT
    T2 --> BT
    BT -->|"logical 0"| P7
    BT -->|"logical 1"| P2
    BT -->|"logical 2"| P9
      

Because allocation is block-granular, the only waste is the partial last block of each sequence — under one block, so total waste drops to under 4%. The attention kernel is modified to gather K/V from the non-contiguous physical blocks via the block table. With fragmentation gone, an engine can pack 2–4× more sequences into the same HBM, which is most of where vLLM's large throughput gains over naive serving come from.

Mapping to OS virtual memory

Logical block ↔ virtual page; physical block ↔ frame; block table ↔ page table; on-demand growth ↔ demand paging; and — next section — shared blocks ↔ copy-on-write fork(). If you know how an OS manages RAM, you already know how to manage the KV cache.

Prefix sharing & copy-on-write

Once KV lives in blocks, identical blocks can be shared by reference instead of duplicated. Three big wins:

Sharing stays safe via copy-on-write. A shared block is read-only. When a sequence must write into a shared block — typically the partial last block where two sequences begin to diverge — the engine decrements the refcount, allocates a fresh block, copies the contents, and writes there, leaving co-sharers untouched.

flowchart TD
    P0["Prefix block 0 (refcount 2)"]
    P1["Prefix block 1 (refcount 2)"]
    P0 --> P1
    P1 --> A["Seq A: new block via COW"]
    P1 --> B["Seq B: new block via COW"]
      

Trade-offs: refcount bookkeeping and COW copies add a little latency at divergence points; shared blocks cannot be evicted while any sequence references them; and prefix-cache hits require that both the tokens and their positions match, so correctness hinges on careful hashing and eviction of the prefix cache.

Eviction, preemption & offload

Memory is finite, so the scheduler over-commits and admits more sequences than worst-case capacity. When every block is allocated and a running sequence needs a new one, the engine must free memory by preempting a victim sequence. There are two recovery strategies:

Strategy How it frees HBM Cost to resume
Swap (offload) Copy victim's KV blocks out to host DRAM, free the HBM Copy back in — gated by PCIe/NVLink bandwidth
Recompute Drop victim's KV entirely Re-run prefill to rebuild KV (parallel, often cheaper)

For short and medium sequences, recompute is frequently as fast as or faster than swapping, because prefill is a single parallel pass whereas a swap pays the transfer twice (out and back). Swapping wins when the prefix is expensive to rebuild or can be reused.

Beyond preemption, KV can be offloaded across a memory hierarchy — HBM → host DRAM → NVMe — to serve very long contexts or very high concurrency, trading latency for capacity, with hot blocks kept resident in HBM.

flowchart TD
    N["New tokens need a block"]
    N --> F{"Free block available?"}
    F -->|Yes| AL["Allocate block"]
    F -->|No| V{"Pick a victim seq"}
    V -->|Swap| SW["Copy blocks to host DRAM"]
    V -->|Recompute| RC["Drop blocks, replay prefill on resume"]
    SW --> AL
    RC --> AL
      

Scheduling policies: admission is usually FCFS with preemption; allocation per decode step is all-or-nothing (a sequence runs only if all its blocks fit); victim selection commonly preempts the newest or lowest-priority sequence (LIFO-ish) to avoid cascading preemptions. This pairs directly with continuous batching (iteration-level scheduling, à la Orca): the batch is re-formed every decode iteration — finished sequences free their blocks and new ones are admitted token-by-token — which is precisely why paging and continuous batching are co-designed.

Shrinking the cache

Paging packs the cache efficiently; these techniques make the cache smaller in the first place, attacking the per-token term of the formula or bounding seq_len itself.

Technique Mechanism KV reduction Trade-off
MQA (multi-query) All query heads share a single K/V head ÷ num_heads (e.g. 64×) Largest quality hit
GQA (grouped-query) KV heads grouped (e.g. 8 groups) ÷ (num_heads / groups), e.g. 8× Small quality hit; the modern default
Quantized KV Store K/V as INT8/FP8 (or INT4) with scales 2–4× Accuracy loss; keys are sensitive
Sliding window Attend only to the last W tokens (rolling buffer) Bounds KV to W (constant) Loses far context; pair with attention sinks
Prefix caching Reuse prefill KV for shared prefixes across requests Saves recompute + memory Hashing/eviction; positions must match
MLA (latent attention) Compress K/V into a low-rank latent (DeepSeek) Large Architecture change, trained-in

These compose. A typical long-context server runs GQA weights with an FP8 KV cache under PagedAttention, plus prefix caching for shared system prompts; streaming chat workloads may add a sliding window with attention sinks (keep the first few "sink" tokens plus a recent window) so the cache stays bounded over very long sessions.

Bottlenecks & scaling

Bottleneck Why it bites Mitigation
HBM capacity Weights + KV share ~80 GB/GPU; KV grows with load Quantize weights + KV; tensor/pipeline parallelism; paging; offload
Fragmentation Contiguous max-length reservation wastes 60–80% PagedAttention fixed blocks (waste under 4%)
Long-context blow-up KV is linear in length; 128K ≈ 40 GB/seq (GQA) GQA/MQA, sliding window + sinks, quantized KV, MLA
Swap bandwidth PCIe ≈ tens of GB/s gates swap-out/in Prefer recompute; NVLink; overlap copy with compute; keep hot blocks resident
Concurrency ceiling At long context, serving is memory-bound, not FLOP-bound Maximize batch via paging + sharing + continuous batching
Kernel gather cost Reading non-contiguous blocks adds indirection Tuned PagedAttention kernels; block-size tuning (e.g. 16)

Summary

The KV cache is the memory that makes autoregressive decoding cheap, and managing it is what makes LLM serving economical. Treat HBM like an operating system would: page the cache into fixed blocks with a per-sequence block table to erase fragmentation; share common prefixes by reference with copy-on-write; preempt under pressure with recompute-or-swap and continuous batching; and shrink the cache up front with GQA/MQA, quantization, sliding windows, and prefix reuse. Do all four and a fixed GPU budget serves many times the concurrent sequences — the difference between a model that is too expensive to deploy and one that is not.