AI / ML Infrastructure
LLM KV-Cache Management
During autoregressive decoding, a transformer re-reads the keys and values of every prior token at every step. Caching those K/V tensors turns generation from quadratic recompute into a single append per token — but the cache grows linearly with sequence length and, across many concurrent requests, becomes the dominant consumer of GPU HBM. KV-cache management is the craft of packing the most sequences into a fixed memory budget: paging it like OS virtual memory, sharing common prefixes, evicting under pressure, and shrinking it with GQA, quantization, and sliding windows.
Background: what the KV cache is
In a causal decoder, each token's query attends over
the keys of all earlier tokens and reads a weighted
sum of their values. Token t attends to
tokens 0..t. The crucial observation: for a frozen model,
the K and V vectors of past tokens never change as generation
proceeds. A naive decoder would recompute K and V for the entire
prefix at every step — quadratic work. Instead we compute each token's
K/V once and keep them in the
KV cache, so every later step does just one append
plus an attention read.
Inference therefore splits into two phases with different cost profiles:
-
Prefill — feed the whole
N-token prompt in one parallel pass, compute and store K/V for allNpositions, and emit the first token. Compute-bound; it sets TTFT (time-to-first-token). - Decode — for each new token compute only its single Q/K/V, append the K/V to the cache, attend over the entire cache, and emit one token. Memory-bandwidth-bound; it sets ITL (inter-token latency).
Per sequence the cache has logical shape
[layers][2][kv_heads][seq_len][head_dim] — it grows by
one slot per layer and head for every generated token. After
the model weights, the KV cache is the second major memory consumer,
and at serving scale it is the largest variable one.
It is usually the cache, not the FLOPs, that caps how many requests
you can batch together.
sequenceDiagram
participant U as Client
participant E as Inference Engine
participant KV as KV Cache (HBM)
U->>E: Prompt of N tokens
Note over E,KV: Prefill (parallel)
E->>KV: Compute and store K,V for all N tokens
E-->>U: First token (TTFT)
Note over E,KV: Decode (loop per token)
E->>KV: Append K,V for the new token
KV-->>E: All past K,V for attention
E-->>U: Next token (ITL)
Functional requirements
- Serve concurrent decode for many sequences of variable, unknown length (generation halts at EOS or a token cap).
- Support both prefill and decode, plus long context (tens to hundreds of thousands of tokens) and streaming output.
- Support cache-touching features: shared prefixes (system prompts), beam search / parallel sampling, and prefix reuse across requests.
Non-functional requirements
- Maximize throughput and concurrency (tokens/sec, live sequences) within a fixed HBM budget — and never OOM.
- Low TTFT and low ITL; predictable tail latency under load.
- High memory utilization — minimal fragmentation and reservation waste.
- Graceful degradation: preempt and reschedule rather than crash; fairness across tenants.
Scale & memory math
The per-token cost is a clean back-of-envelope formula:
kv_bytes_per_token = 2 × layers × kv_heads × head_dim ×
dtype_bytes
The leading 2 counts K and V;
dtype_bytes = 2 for FP16/BF16. The product
kv_heads × head_dim is the KV hidden size, which
equals d_model only under full multi-head attention —
grouped-query attention shrinks it directly. Per-sequence cost is just
kv_bytes_per_token × seq_len.
Worked example — a 70B model (e.g. Llama-2-70B):
layers = 80, d_model = 8192,
64 query heads of head_dim = 128, GQA with
8 KV heads.
| Config | KV hidden (kv_heads × head_dim) | Bytes / token (FP16) |
|---|---|---|
| Full MHA (64 KV heads) | 8192 |
≈ 2.5 MB |
| GQA (8 KV heads) | 1024 |
≈ 320 KB |
| GQA + FP8 KV | 1024 |
≈ 160 KB |
GQA already buys an 8× reduction versus full MHA. Now multiply by context length to see the blow-up (using the realistic GQA figure of 320 KB/token, with the full-MHA worst case for contrast):
| Context length | Per sequence (GQA) | Per sequence (full MHA) |
|---|---|---|
| 2K tokens | ≈ 0.6 GB | ≈ 5 GB |
| 8K tokens | ≈ 2.5 GB | ≈ 20 GB |
| 32K tokens | ≈ 10 GB | ≈ 80 GB |
| 128K tokens | ≈ 40 GB | ≈ 320 GB |
Put that against hardware. A 70B model in FP16 needs ≈ 140 GB of weights, so it already spans 2× H100 (160 GB), leaving only ≈ 20 GB for KV. At 8K context (2.5 GB/seq) that is roughly 8 concurrent sequences; a single 128K sequence (40 GB) does not even fit. Quantizing weights (FP8/INT4) frees room, but the ceiling is fundamentally set by how well you manage the cache.
The cache OOMs you, not the math
The danger is not steady-state size but
reservation. A naive engine pre-allocates a
contiguous buffer of max_seq_len per request up front,
so 20 GB of free HBM might admit only 2–3 requests even when their
actual outputs are short. Reservation plus fragmentation — not
arithmetic — is the first thing that kills concurrency, and the
first thing paging fixes.
PagedAttention: the core idea
Classic serving stores each sequence's KV in a single
contiguous tensor sized to the maximum possible
length. That causes two wastes.
Internal fragmentation: the slab is reserved for
max_seq_len but most sequences finish far earlier, so the
tail sits idle. External fragmentation:
variable-sized slabs leave gaps that no new sequence fits into.
Together these can waste 60–80% of KV memory.
PagedAttention (introduced by vLLM) borrows OS
virtual memory. Each sequence's KV is split into fixed-size
blocks/pages (e.g. 16 tokens per block).
Blocks need not be contiguous in HBM. A per-sequence
block table maps each logical block number to a
physical block address — exactly a page table mapping virtual pages to
physical frames. Blocks are allocated on demand, one
at a time, as the sequence generates tokens.
flowchart LR
subgraph Logical["Logical token sequence (Seq A)"]
T0["Tokens 0-15"]
T1["Tokens 16-31"]
T2["Tokens 32-47"]
end
BT["Block Table (per seq)"]
subgraph Physical["Physical KV blocks in HBM"]
P7["Block 7"]
P2["Block 2"]
P9["Block 9"]
end
T0 --> BT
T1 --> BT
T2 --> BT
BT -->|"logical 0"| P7
BT -->|"logical 1"| P2
BT -->|"logical 2"| P9
Because allocation is block-granular, the only waste is the partial last block of each sequence — under one block, so total waste drops to under 4%. The attention kernel is modified to gather K/V from the non-contiguous physical blocks via the block table. With fragmentation gone, an engine can pack 2–4× more sequences into the same HBM, which is most of where vLLM's large throughput gains over naive serving come from.
Mapping to OS virtual memory
Logical block ↔ virtual page; physical block ↔ frame; block table ↔
page table; on-demand growth ↔ demand paging; and — next section —
shared blocks ↔ copy-on-write fork(). If you know how
an OS manages RAM, you already know how to manage the KV cache.
Eviction, preemption & offload
Memory is finite, so the scheduler over-commits and admits more sequences than worst-case capacity. When every block is allocated and a running sequence needs a new one, the engine must free memory by preempting a victim sequence. There are two recovery strategies:
| Strategy | How it frees HBM | Cost to resume |
|---|---|---|
| Swap (offload) | Copy victim's KV blocks out to host DRAM, free the HBM | Copy back in — gated by PCIe/NVLink bandwidth |
| Recompute | Drop victim's KV entirely | Re-run prefill to rebuild KV (parallel, often cheaper) |
For short and medium sequences, recompute is frequently as fast as or faster than swapping, because prefill is a single parallel pass whereas a swap pays the transfer twice (out and back). Swapping wins when the prefix is expensive to rebuild or can be reused.
Beyond preemption, KV can be offloaded across a memory hierarchy — HBM → host DRAM → NVMe — to serve very long contexts or very high concurrency, trading latency for capacity, with hot blocks kept resident in HBM.
flowchart TD
N["New tokens need a block"]
N --> F{"Free block available?"}
F -->|Yes| AL["Allocate block"]
F -->|No| V{"Pick a victim seq"}
V -->|Swap| SW["Copy blocks to host DRAM"]
V -->|Recompute| RC["Drop blocks, replay prefill on resume"]
SW --> AL
RC --> AL
Scheduling policies: admission is usually FCFS with preemption; allocation per decode step is all-or-nothing (a sequence runs only if all its blocks fit); victim selection commonly preempts the newest or lowest-priority sequence (LIFO-ish) to avoid cascading preemptions. This pairs directly with continuous batching (iteration-level scheduling, à la Orca): the batch is re-formed every decode iteration — finished sequences free their blocks and new ones are admitted token-by-token — which is precisely why paging and continuous batching are co-designed.
Shrinking the cache
Paging packs the cache efficiently; these techniques make the cache
smaller in the first place, attacking the per-token
term of the formula or bounding seq_len itself.
| Technique | Mechanism | KV reduction | Trade-off |
|---|---|---|---|
| MQA (multi-query) | All query heads share a single K/V head | ÷ num_heads (e.g. 64×) | Largest quality hit |
| GQA (grouped-query) | KV heads grouped (e.g. 8 groups) | ÷ (num_heads / groups), e.g. 8× | Small quality hit; the modern default |
| Quantized KV | Store K/V as INT8/FP8 (or INT4) with scales | 2–4× | Accuracy loss; keys are sensitive |
| Sliding window |
Attend only to the last W tokens (rolling buffer)
|
Bounds KV to W (constant) |
Loses far context; pair with attention sinks |
| Prefix caching | Reuse prefill KV for shared prefixes across requests | Saves recompute + memory | Hashing/eviction; positions must match |
| MLA (latent attention) | Compress K/V into a low-rank latent (DeepSeek) | Large | Architecture change, trained-in |
These compose. A typical long-context server runs GQA weights with an FP8 KV cache under PagedAttention, plus prefix caching for shared system prompts; streaming chat workloads may add a sliding window with attention sinks (keep the first few "sink" tokens plus a recent window) so the cache stays bounded over very long sessions.
Bottlenecks & scaling
| Bottleneck | Why it bites | Mitigation |
|---|---|---|
| HBM capacity | Weights + KV share ~80 GB/GPU; KV grows with load | Quantize weights + KV; tensor/pipeline parallelism; paging; offload |
| Fragmentation | Contiguous max-length reservation wastes 60–80% | PagedAttention fixed blocks (waste under 4%) |
| Long-context blow-up | KV is linear in length; 128K ≈ 40 GB/seq (GQA) | GQA/MQA, sliding window + sinks, quantized KV, MLA |
| Swap bandwidth | PCIe ≈ tens of GB/s gates swap-out/in | Prefer recompute; NVLink; overlap copy with compute; keep hot blocks resident |
| Concurrency ceiling | At long context, serving is memory-bound, not FLOP-bound | Maximize batch via paging + sharing + continuous batching |
| Kernel gather cost | Reading non-contiguous blocks adds indirection | Tuned PagedAttention kernels; block-size tuning (e.g. 16) |
Summary
The KV cache is the memory that makes autoregressive decoding cheap, and managing it is what makes LLM serving economical. Treat HBM like an operating system would: page the cache into fixed blocks with a per-sequence block table to erase fragmentation; share common prefixes by reference with copy-on-write; preempt under pressure with recompute-or-swap and continuous batching; and shrink the cache up front with GQA/MQA, quantization, sliding windows, and prefix reuse. Do all four and a fixed GPU budget serves many times the concurrent sequences — the difference between a model that is too expensive to deploy and one that is not.