System Design Notes All designs

AI / ML Infrastructure

LLM Inference Serving Platform

An LLM serving platform turns a rack of GPUs and a few hundred gigabytes of model weights into a low-latency, OpenAI-compatible token API. The entire design is a fight over one scarce resource — GPU high-bandwidth memory (HBM) and its bandwidth — mediated by two very different phases (prefill and decode), a batching scheduler that keeps the silicon busy, and a KV cache that quietly dictates how many users you can serve at once. Get those three right and everything else (cost, autoscaling, multi-model routing) follows.

Requirements

We are building the serving layer only — training, fine-tuning, and data pipelines are out of scope. The job is to take trained weights and answer generation requests fast and cheaply.

Functional Non-functional
Chat / completion API — OpenAI-compatible /v1/chat/completions and /v1/completions. Low TTFT (time-to-first-token): p50 under 300 ms, p99 under ~1 s. Dominated by prefill + queueing.
Token streaming — emit tokens over SSE as they are generated, not one giant blob at the end. High throughput — tokens/sec per replica and per GPU; low TPOT (time-per-output-token, e.g. under 40 ms ⇒ 25+ tok/s per user).
Multiple models — serve many models / sizes / versions (7B, 70B, embeddings) behind one endpoint; route by model name. High GPU utilization — keep model-FLOPs utilization (MFU) high; idle GPUs burn money.
Per-request controlsmax_tokens, temperature, stop, tool/function calling, JSON mode, logprobs. Cost / token — minimize $ per 1M tokens; this is the headline business metric.
Cancellation — client disconnect frees the slot immediately (decode is expensive). Autoscaling & isolation — absorb bursty traffic; fairness / SLO tiers across tenants; graceful degradation, never drop an in-flight stream.

The two latency numbers that matter

Users feel TTFT (how long until something appears) and TPOT (how fast it streams after that). They are governed by different phases with opposite hardware characteristics — which is the central tension of the whole platform and the reason the rest of this page exists.

Scale & back-of-envelope math

Everything starts with HBM. A weight in fp16 is 2 bytes, so weight memory = params × 2 bytes. On top of that, every active sequence holds a KV cache that grows with each token and competes with weights for the same 80 GB of GPU memory.

Model fp16 weights Min GPUs (80 GB) KV / token (GQA, fp16) Typical deployment
7B ~14 GB 1 ~128 KB 1 GPU; lots of room for big batches / long context.
70B ~140 GB 2 (TP=2) ~320 KB 4 GPUs (TP=4) in practice — weights leave little KV headroom on 2.
405B ~810 GB 8 in fp8 / 16 in fp16 ~500 KB Multi-node TP + PP; usually served in fp8 (~405 GB) on 8×H100.

KV cache size per token = 2 (K,V) × layers × kv_heads × head_dim × bytes. For a 70B model (80 layers, 8 KV heads via GQA, head_dim 128, fp16): 2 × 80 × 8 × 128 × 2 ≈ 320 KB/token. Multi-head attention (no GQA) is several times larger — which is exactly why modern models use grouped-query attention.

Worked example — KV cache eats the GPU

Serve 70B (TP=4 on H100, 4×80 = 320 GB). Weights take ~140 GB, leaving ~180 GB for KV cache and activations. At ~320 KB/token that is room for roughly 180 GB / 320 KB ≈ 560k tokens resident at once — e.g. ~270 concurrent sequences at 2K context, or far fewer at 32K context. Context length and concurrency trade directly against each other.

Throughput / GPU-count sizing. Target 1,000 concurrent users at 30 tok/s each ⇒ 1000 × 30 = 30,000 tok/s of aggregate decode. A single 70B replica with continuous batching delivers on the order of 3,000–6,000 tok/s aggregate, so you need roughly 30,000 / 4,000 ≈ 8 replicas × 4 GPUs = ~32 H100s for steady state, plus headroom for bursts and prefill spikes. The point of the math is not precision — it is to show that replica count is driven by aggregate tokens/sec, while concurrency is driven by KV memory.

High-level design

A request flows through a stateless gateway, a model-aware router, and a batching scheduler that feeds GPU replicas; generated tokens stream back over SSE. A model registry and an autoscaler sit to the side.

flowchart TD
    Client["Clients (chat / completion)"] --> GW["API Gateway"]
    GW --> RT["Router / Load Balancer"]
    REG["Model Registry"] --> RT
    RT --> SCH["Batching Scheduler"]
    subgraph Pool["Model Replica Pool"]
        subgraph R1["Replica A: 70B, TP=2"]
            G0["GPU 0 shard"]
            G1["GPU 1 shard"]
        end
        KV["KV Cache (paged)"]
    end
    SCH --> R1
    R1 --> KV
    KV --> R1
    R1 --> STR["Token Streamer (SSE)"]
    STR --> Client
    R1 --> MET["Metrics: TTFT, tokens/sec, util"]
    MET --> AS["Autoscaler"]
    AS --> Pool
      

Prefill vs decode — the key deep dive

A generation request runs in two phases with opposite hardware profiles. Understanding this split explains TTFT, TPOT, batching, and almost every optimization on this page.

Phase What happens Bottleneck Drives
Prefill Process all N prompt tokens in one forward pass; build the KV cache for the prompt. Large matrix–matrix multiplies (GEMM), high arithmetic intensity. Compute-bound (GPU FLOPs saturated) TTFT
Decode Autoregressive: generate one token, append its KV, repeat. Each step reads all weights + the whole KV cache from HBM but does tiny (one-token) compute — matrix–vector (GEMV). Memory-bandwidth-bound (HBM reads dominate) TPOT / tokens-per-sec

Why they differ. Prefill has many tokens to crunch in parallel, so the GPU's compute units are the limit. Decode produces a single token per step, so the GPU spends its time reading the model and KV cache out of HBM rather than computing — arithmetic intensity is low and bandwidth is the wall. The practical consequence: batching barely helps prefill but helps decode enormously, because many sequences can share a single read of the weights.

sequenceDiagram
    participant C as Client
    participant G as Gateway
    participant S as Scheduler
    participant P as Prefill GPUs
    participant D as Decode GPUs
    C->>G: POST prompt with N tokens
    G->>S: Enqueue request
    S->>P: Run prefill, compute bound
    P->>P: Build KV cache for N tokens
    P-->>C: First token at TTFT
    loop Decode until EOS
        S->>D: Schedule one step, batched
        D->>D: Read KV, compute next token
        D-->>C: Stream token
        D->>D: Append new token to KV
    end
    D-->>G: End of stream
      

Disaggregated prefill / decode serving

Because the phases stress different resources, a long prefill sharing a GPU with active decodes causes head-of-line blocking: one user's 8K-token prompt stalls everyone else's streaming. The fix is to run prefill and decode on separate GPU pools and transfer the KV cache between them. Each pool is tuned for its phase — prefill for TTFT, decode for tokens/sec — and a slow prefill never freezes ongoing streams.

flowchart LR
    Q["Request Queue"] --> PF["Prefill Pool (compute-bound)"]
    PF --> KVT["KV Cache Transfer"]
    KVT --> DC["Decode Pool (bandwidth-bound)"]
    DC --> OUT["Streamed Tokens"]
    PF --> TTFT["Optimize TTFT"]
    DC --> TPOT["Optimize TPOT and tokens-per-sec"]
      

Chunked prefill

The cheaper, single-pool alternative: split a long prompt's prefill into bounded chunks and interleave those chunks with decode steps in the same batch. No one prefill monopolizes the GPU, decodes keep flowing, and you smoothly trade a little TTFT for steadier TPOT. It pairs naturally with continuous batching below.

Trade-off in one line

Disaggregation gives the cleanest TTFT/TPOT isolation but pays a KV-transfer cost and needs two pools to size. Chunked prefill keeps one pool and is simpler, at the cost of slightly higher TTFT for very long prompts.

Batching — static vs continuous

Batching is how decode reaches high throughput: many sequences amortize one read of the weights from HBM. How you batch is the single biggest lever on GPU utilization.

Static batching gathers N requests, runs them together, and only returns the GPU when all N finish. But outputs have wildly different lengths, so the batch runs at the speed of its longest sequence while finished slots sit idle, and newly arrived requests wait for the whole batch to drain. Utilization craters under realistic traffic.

Continuous (in-flight) batching uses iteration-level scheduling: the scheduler makes a decision every decode step. The instant a sequence emits EOS it is evicted and a queued request is admitted into the freed slot — the GPU never waits for the slowest sequence. This is the default in modern engines (vLLM, TGI, TensorRT-LLM) and is what chunked prefill plugs into.

Dimension Static batching Continuous batching
Scheduling granularity Once per batch Every iteration (per token step)
GPU utilization Low — idles on finished slots High — slots refilled immediately
Wait for new requests Until batch fully drains Joins on the next step
Throughput vs latency Poor under mixed lengths Tunable via max batch size / max tokens — bigger batch = more throughput but higher TPOT and KV pressure

The lever to name in an interview

Throughput and latency are set by max batch size and max batched tokens. Raise them for cheaper tokens at the cost of TPOT and KV memory; lower them to protect tail latency. SLO-aware schedulers cap batch size per tier and reserve KV headroom.

Parallelism, quantization & speculative decoding

Big models do not fit on one GPU, and decode is bandwidth-bound — so we shard models across GPUs and shrink the bytes we must move.

Sharding the model

flowchart TD
    IN["Input hidden state"] --> SP["Split across GPUs"]
    SP --> A0["GPU 0: heads 0..k"]
    SP --> A1["GPU 1: heads k..2k"]
    A0 --> AR["All-Reduce sum"]
    A1 --> AR
    AR --> OUT["Merged output"]
      

Quantization

Precision Bytes / param Effect Notes
fp16 / bf16 2 Baseline quality Default serving precision.
fp8 / int8 1 ~2× smaller, ~2× less HBM to read ⇒ faster decode Near-lossless with good schemes (fp8 native on H100; SmoothQuant / AWQ for int8).
int4 0.5 ~4× smaller; big decode speedup Weight-only (GPTQ / AWQ); some quality loss — great for memory-constrained or latency-sensitive serving.

Because decode is memory-bandwidth-bound, quantization speeds it up directly — fewer bytes per weight means fewer bytes to stream from HBM each step. You can also quantize the KV cache (e.g. fp8) to fit more concurrent sequences.

Speculative decoding & paged KV cache

Bottlenecks & scaling

Every limit on this platform traces back to HBM capacity, HBM bandwidth, or the latency of bringing GPUs online.

Bottleneck Symptom Mitigation
KV cache memory Caps concurrent sequences & context length; admission stalls when blocks run out. Paged KV, GQA/MQA, quantized KV (fp8), prefix sharing, evict / swap cold sequences to host memory.
Decode is bandwidth-bound Low compute utilization; tokens/sec plateaus. Bigger batches, quantization, speculative decoding, fused / FlashAttention kernels.
Batch size vs latency Bigger batch raises TPOT and TTFT for everyone. SLO-aware max batch, chunked prefill, priority lanes, reserve KV headroom.
Cold starts Loading 100s of GB of weights takes minutes before a replica serves. Warm pools, stream weights from local NVMe, cache artifacts on the node, scale-to-zero only for the cold tail.
Autoscaling lag GPUs take minutes to provision; reactive scaling arrives after the spike. Scale on leading indicators (queue depth, TTFT, KV util), headroom buffers, predictive scaling, admission control + request queue to shed load gracefully.
Long-tail prompt / output lengths Head-of-line blocking; TTFT and TPOT interfere. Disaggregated prefill/decode, chunked prefill, length-aware scheduling.

Staff-level summary

An LLM serving platform is a memory system wearing a compute system's clothes. Prefill is compute-bound and sets TTFT; decode is memory-bandwidth-bound and sets TPOT. A continuous-batching scheduler keeps the GPU saturated; a paged KV cache decides how many users fit at once; tensor/pipeline parallelism makes big models fit; and quantization + speculative decoding attack the bandwidth wall on decode. Scale replicas by aggregate tokens/sec, scale concurrency by KV memory, and autoscale on leading indicators because GPUs are slow to wake. Nail those and cost-per-token — the metric the business actually cares about — falls out.