
LLM Inference Serving Platform

An LLM serving platform turns a rack of GPUs and a few hundred gigabytes of model weights into a low-latency, OpenAI-compatible token API. The entire design is a fight over one scarce resource — GPU high-bandwidth memory (HBM) and its bandwidth — mediated by two very different phases (prefill and decode), a batching scheduler that keeps the silicon busy, and a KV cache that quietly dictates how many users you can serve at once. Get those three right and everything else (cost, autoscaling, multi-model routing) follows.

Requirements

We are building the serving layer only — training, fine-tuning, and data pipelines are out of scope. The job is to take trained weights and answer generation requests fast and cheaply.

Functional	Non-functional
Chat / completion API — OpenAI-compatible `/v1/chat/completions` and `/v1/completions`.	Low TTFT (time-to-first-token): p50 under 300 ms, p99 under ~1 s. Dominated by prefill + queueing.
Token streaming — emit tokens over SSE as they are generated, not one giant blob at the end.	High throughput — tokens/sec per replica and per GPU; low TPOT (time-per-output-token, e.g. under 40 ms ⇒ 25+ tok/s per user).
Multiple models — serve many models / sizes / versions (7B, 70B, embeddings) behind one endpoint; route by model name.	High GPU utilization — keep model-FLOPs utilization (MFU) high; idle GPUs burn money.
Per-request controls — `max_tokens`, `temperature`, `stop`, tool/function calling, JSON mode, logprobs.	Cost / token — minimize $ per 1M tokens; this is the headline business metric.
Cancellation — client disconnect frees the slot immediately (decode is expensive).	Autoscaling & isolation — absorb bursty traffic; fairness / SLO tiers across tenants; graceful degradation, never drop an in-flight stream.

The two latency numbers that matter

Users feel TTFT (how long until something appears) and TPOT (how fast it streams after that). They are governed by different phases with opposite hardware characteristics — which is the central tension of the whole platform and the reason the rest of this page exists.
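
As a rough illustration of how the two numbers are usually derived, the sketch below computes TTFT and mean TPOT from client-side timestamps. The function name `ttft_and_tpot` and the sample timestamps are hypothetical, chosen only for this example.

```python
from typing import List, Tuple


def ttft_and_tpot(request_sent_s: float, token_times_s: List[float]) -> Tuple[float, float]:
    """Return (TTFT, mean TPOT) in seconds from client-side timestamps."""
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    # TTFT: time from sending the request until the first token arrives.
    ttft = token_times_s[0] - request_sent_s
    if len(token_times_s) == 1:
        return ttft, 0.0
    # TPOT: average gap between consecutive tokens after the first one.
    tpot = (token_times_s[-1] - token_times_s[0]) / (len(token_times_s) - 1)
    return ttft, tpot


# Made-up timestamps: first token after 250 ms, then one token every 40 ms.
ttft, tpot = ttft_and_tpot(0.0, [0.25, 0.29, 0.33, 0.37])
print(f"TTFT={ttft * 1000:.0f} ms, TPOT={tpot * 1000:.0f} ms, {1 / tpot:.0f} tok/s per user")
```

Measuring at the client includes queueing and network time, which is what the user actually feels; server-side timers would isolate prefill and decode alone.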

Scale & back-of-envelope math

Everything starts with HBM. A weight in fp16 is 2 bytes, so weight memory = params × 2 bytes. On top of that, every active sequence holds a KV cache that grows with each token and competes with weights for the same 80 GB of GPU memory.

Model	fp16 weights	Min GPUs (80 GB)	KV / token (GQA, fp16)	Typical deployment
7B	~14 GB	1	~128 KB	1 GPU; lots of room for big batches / long context.
70B	~140 GB	2 (TP=2)	~320 KB	4 GPUs (TP=4) in practice — weights leave little KV headroom on 2.
405B	~810 GB	8 in fp8 / 16 in fp16	~500 KB	Multi-node TP + PP; usually served in `fp8` (~405 GB) on 8×H100.

KV cache size per token = 2 (K,V) × layers × kv_heads × head_dim × bytes. For a 70B model (80 layers, 8 KV heads via GQA, head_dim 128, fp16): 2 × 80 × 8 × 128 × 2 ≈ 320 KB/token. Multi-head attention (no GQA) is several times larger — which is exactly why modern models use grouped-query attention.
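
The formula above translates directly into a few lines of Python. This is a minimal sketch: the function names are made up for the example, and the 64-head multi-head variant is an assumption used only to show the GQA saving (the text does not state the query-head count).

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    # The leading 2 accounts for storing both K and V at every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def weight_bytes(params: float, bytes_per_param: float = 2) -> float:
    # fp16 / bf16 weights take 2 bytes per parameter.
    return params * bytes_per_param


gqa_70b = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
# Assumption: 64 query heads, so plain multi-head attention would keep 64 KV heads.
mha_70b = kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128)

print(gqa_70b // 1024, "KiB per token with GQA")
print(mha_70b // gqa_70b, "x larger with one KV head per query head")
print(weight_bytes(70e9) / 1e9, "GB of fp16 weights")
```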

Worked example — KV cache eats the GPU

Serve 70B (TP=4 on H100, 4×80 = 320 GB). Weights take ~140 GB, leaving ~180 GB for KV cache and activations. At ~320 KB/token that is room for roughly 180 GB / 320 KB ≈ 560k tokens resident at once — e.g. ~270 concurrent sequences at 2K context, or far fewer at 32K context. Context length and concurrency trade directly against each other.
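
The same arithmetic as a small sketch, so the context-versus-concurrency trade can be replayed with other numbers. `kv_capacity` and its `reserve_gb` parameter are hypothetical names; the reserve defaults to zero, which ignores activation memory and so overstates real capacity somewhat.

```python
from typing import Tuple


def kv_capacity(total_hbm_gb: float, weights_gb: float, kv_kb_per_token: float,
                context_tokens: int, reserve_gb: float = 0.0) -> Tuple[int, int]:
    """Return (resident tokens, concurrent sequences at the given context length)."""
    free_bytes = (total_hbm_gb - weights_gb - reserve_gb) * 1e9
    resident_tokens = int(free_bytes // (kv_kb_per_token * 1e3))
    return resident_tokens, resident_tokens // context_tokens


for context in (2048, 32768):
    tokens, seqs = kv_capacity(total_hbm_gb=320, weights_gb=140,
                               kv_kb_per_token=320, context_tokens=context)
    print(f"context={context}: {tokens} resident tokens, {seqs} concurrent sequences")
```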

Throughput / GPU-count sizing. Target 1,000 concurrent users at 30 tok/s each ⇒ 1000 × 30 = 30,000 tok/s of aggregate decode. A single 70B replica with continuous batching delivers on the order of 3,000–6,000 tok/s aggregate, so you need roughly 30,000 / 4,000 ≈ 8 replicas × 4 GPUs = ~32 H100s for steady state, plus headroom for bursts and prefill spikes. The point of the math is not precision — it is to show that replica count is driven by aggregate tokens/sec, while concurrency is driven by KV memory.
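
The sizing estimate can be written as a short helper. This is a sketch under the same assumptions as the paragraph above (4,000 tok/s per replica, 4 GPUs per replica); the `headroom` parameter and the 30% figure in the second call are illustrative assumptions, not numbers from the text.

```python
import math
from typing import Tuple


def size_fleet(users: int, tok_s_per_user: float, replica_tok_s: float,
               gpus_per_replica: int, headroom: float = 0.0) -> Tuple[float, int, int]:
    """Return (aggregate tok/s, replicas, GPUs) for a steady-state decode load."""
    aggregate = users * tok_s_per_user
    # Round up: a fractional replica still costs a whole replica.
    replicas = math.ceil(aggregate * (1 + headroom) / replica_tok_s)
    return aggregate, replicas, replicas * gpus_per_replica


print(size_fleet(1000, 30, 4000, 4))                 # steady state
print(size_fleet(1000, 30, 4000, 4, headroom=0.3))   # with an assumed 30% burst buffer
```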

High-level design

A request flows through a stateless gateway, a model-aware router, and a batching scheduler that feeds GPU replicas; generated tokens stream back over SSE. A model registry and an autoscaler sit to the side.

flowchart TD
    Client["Clients (chat / completion)"] --> GW["API Gateway"]
    GW --> RT["Router / Load Balancer"]
    REG["Model Registry"] --> RT
    RT --> SCH["Batching Scheduler"]
    subgraph Pool["Model Replica Pool"]
        subgraph R1["Replica A: 70B, TP=2"]
            G0["GPU 0 shard"]
            G1["GPU 1 shard"]
        end
        KV["KV Cache (paged)"]
    end
    SCH --> R1
    R1 --> KV
    KV --> R1
    R1 --> STR["Token Streamer (SSE)"]
    STR --> Client
    R1 --> MET["Metrics: TTFT, tokens/sec, util"]
    MET --> AS["Autoscaler"]
    AS --> Pool

API Gateway — auth, rate limiting, quota / billing, request validation, OpenAI-compatible schema. Stateless and horizontally scaled.
Router — resolves the model name to a replica set and load-balances on live capacity (queue depth, free KV blocks), not round-robin. Holds the SSE connection open and proxies the token stream.
Batching scheduler — the brain. Forms and continuously reshapes the GPU batch every iteration (see batching), admitting and evicting sequences to keep the GPU saturated within latency SLOs.
Model replicas — one model loaded across N GPUs via tensor parallelism (see parallelism). Each replica owns its KV cache memory.
KV cache — paged blocks of attention key/value state; the true limiter on concurrency.
Model registry — versioned weights in object storage (S3/GCS) + metadata (precision, parallel layout, tokenizer); drives rollouts and lets replicas pull the right artifact.
Autoscaler — scales replicas on leading signals (queue depth, TTFT, KV utilization), because spinning up a GPU + loading weights takes minutes.
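
To make "load-balances on live capacity" concrete, here is a minimal sketch of a router choosing a replica by queue depth and free KV blocks. `ReplicaStats`, `pick_replica`, and the tie-breaking rule are hypothetical; a real router would also weigh locality, prefix-cache hits, and SLO tier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReplicaStats:
    name: str
    model: str
    queue_depth: int      # requests waiting for admission on this replica
    free_kv_blocks: int   # unallocated KV cache blocks


def pick_replica(replicas: List[ReplicaStats], model: str,
                 blocks_needed: int) -> Optional[ReplicaStats]:
    """Pick the least-loaded replica of `model` that can hold the request's KV."""
    candidates = [r for r in replicas
                  if r.model == model and r.free_kv_blocks >= blocks_needed]
    if not candidates:
        return None  # caller queues the request or sheds load
    # Shortest queue first; break ties by the most free KV blocks.
    return min(candidates, key=lambda r: (r.queue_depth, -r.free_kv_blocks))


pool = [
    ReplicaStats("a", "70b", queue_depth=3, free_kv_blocks=900),
    ReplicaStats("b", "70b", queue_depth=1, free_kv_blocks=50),
    ReplicaStats("c", "70b", queue_depth=1, free_kv_blocks=400),
    ReplicaStats("d", "7b", queue_depth=0, free_kv_blocks=2000),
]
print(pick_replica(pool, "70b", blocks_needed=128).name)
```

Round-robin would have sent this request to a replica regardless of whether it had KV room, which is why the text calls out live capacity.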

Prefill vs decode — the key deep dive

A generation request runs in two phases with opposite hardware profiles. Understanding this split explains TTFT, TPOT, batching, and almost every optimization on this page.

Phase	What happens	Bottleneck	Drives
Prefill	Process all N prompt tokens in one forward pass; build the KV cache for the prompt. Large matrix–matrix multiplies (GEMM), high arithmetic intensity.	Compute-bound (GPU FLOPs saturated)	TTFT
Decode	Autoregressive: generate one token, append its KV, repeat. Each step reads all weights + the whole KV cache from HBM but does tiny (one-token) compute — matrix–vector (GEMV).	Memory-bandwidth-bound (HBM reads dominate)	TPOT / tokens-per-sec

Why they differ. Prefill has many tokens to crunch in parallel, so the GPU's compute units are the limit. Decode produces a single token per step, so the GPU spends its time reading the model and KV cache out of HBM rather than computing — arithmetic intensity is low and bandwidth is the wall. The practical consequence: batching barely helps prefill but helps decode enormously, because many sequences can share a single read of the weights.
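
A rough roofline-style check makes the split visible. The sketch assumes each fp16 weight (2 bytes) contributes one multiply-add (2 FLOPs) per token that shares the weight read, and it ignores attention FLOPs and KV cache traffic. The hardware defaults (about 989 TFLOP/s dense bf16 and 3.35 TB/s of HBM bandwidth, approximate H100 SXM figures) are assumptions for illustration.

```python
def arithmetic_intensity(tokens_per_weight_read: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read from HBM."""
    return 2.0 * tokens_per_weight_read / bytes_per_param


def bound(tokens_per_weight_read: int, peak_flops: float = 989e12,
          hbm_bytes_per_s: float = 3.35e12) -> str:
    # Ridge point: the intensity at which compute and bandwidth are balanced.
    ridge = peak_flops / hbm_bytes_per_s
    if arithmetic_intensity(tokens_per_weight_read) >= ridge:
        return "compute-bound"
    return "bandwidth-bound"


print("decode, batch of 1:   ", bound(1))
print("decode, batch of 64:  ", bound(64))
print("prefill, 2048 tokens: ", bound(2048))
```

Under these assumptions a single prompt of a few hundred tokens already crosses the ridge, while decode needs a batch of several hundred sequences to get there, which is why batching matters so much more for decode.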

sequenceDiagram
    participant C as Client
    participant G as Gateway
    participant S as Scheduler
    participant P as Prefill GPUs
    participant D as Decode GPUs
    C->>G: POST prompt with N tokens
    G->>S: Enqueue request
    S->>P: Run prefill, compute bound
    P->>P: Build KV cache for N tokens
    P-->>C: First token at TTFT
    loop Decode until EOS
        S->>D: Schedule one step, batched
        D->>D: Read KV, compute next token
        D-->>C: Stream token
        D->>D: Append new token to KV
    end
    D-->>G: End of stream

Disaggregated prefill / decode serving

Because the phases stress different resources, a long prefill sharing a GPU with active decodes causes head-of-line blocking: one user's 8K-token prompt stalls everyone else's streaming. The fix is to run prefill and decode on separate GPU pools and transfer the KV cache between them. Each pool is tuned for its phase — prefill for TTFT, decode for tokens/sec — and a slow prefill never freezes ongoing streams.

flowchart LR
    Q["Request Queue"] --> PF["Prefill Pool (compute-bound)"]
    PF --> KVT["KV Cache Transfer"]
    KVT --> DC["Decode Pool (bandwidth-bound)"]
    DC --> OUT["Streamed Tokens"]
    PF --> TTFT["Optimize TTFT"]
    DC --> TPOT["Optimize TPOT and tokens-per-sec"]

Chunked prefill

The cheaper, single-pool alternative: split a long prompt's prefill into bounded chunks and interleave those chunks with decode steps in the same batch. No one prefill monopolizes the GPU, decodes keep flowing, and you smoothly trade a little TTFT for steadier TPOT. It pairs naturally with continuous batching below.
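
A minimal sketch of the chunking idea: each step has a fixed token budget, active decodes take one token each, and the prompt's prefill gets whatever is left. `plan_steps` and the budget of 2,048 tokens are hypothetical, and the number of decoding sequences is held constant for simplicity.

```python
from typing import List, Tuple


def plan_steps(prompt_tokens: int, decoding_seqs: int, token_budget: int) -> List[Tuple[int, int]]:
    """Return (decode_tokens, prefill_chunk_tokens) for each step of one prompt's prefill."""
    if decoding_seqs >= token_budget:
        raise ValueError("token budget leaves no room for prefill chunks")
    steps = []
    remaining = prompt_tokens
    while remaining > 0:
        # Decodes are scheduled first; the prefill chunk fills the rest of the budget.
        chunk = min(remaining, token_budget - decoding_seqs)
        steps.append((decoding_seqs, chunk))
        remaining -= chunk
    return steps


plan = plan_steps(prompt_tokens=8192, decoding_seqs=64, token_budget=2048)
for i, (decode, chunk) in enumerate(plan):
    print(f"step {i}: {decode} decode tokens + {chunk} prefill tokens")
```

The 64 active streams emit a token on every step while the 8K prompt is absorbed over several steps, instead of stalling for one long prefill pass.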

Trade-off in one line

Disaggregation gives the cleanest TTFT/TPOT isolation but pays a KV-transfer cost and needs two pools to size. Chunked prefill keeps one pool and is simpler, at the cost of slightly higher TTFT for very long prompts.

Batching — static vs continuous

Batching is how decode reaches high throughput: many sequences amortize one read of the weights from HBM. How you batch is the single biggest lever on GPU utilization.

Static batching gathers N requests, runs them together, and only returns the GPU when all N finish. But outputs have wildly different lengths, so the batch runs at the speed of its longest sequence while finished slots sit idle, and newly arrived requests wait for the whole batch to drain. Utilization craters under realistic traffic.

Continuous (in-flight) batching uses iteration-level scheduling: the scheduler makes a decision every decode step. The instant a sequence emits EOS it is evicted and a queued request is admitted into the freed slot — the GPU never waits for the slowest sequence. This is the default in modern engines (vLLM, TGI, TensorRT-LLM) and is what chunked prefill plugs into.

Dimension	Static batching	Continuous batching
Scheduling granularity	Once per batch	Every iteration (per token step)
GPU utilization	Low — idles on finished slots	High — slots refilled immediately
Wait for new requests	Until batch fully drains	Joins on the next step
Throughput vs latency	Poor under mixed lengths	Tunable via max batch size / max tokens — bigger batch = more throughput but higher TPOT and KV pressure
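
The difference is easy to see in a toy simulation that counts decode steps. This is a sketch, not a model of any real engine: all requests are assumed to arrive at time zero, every sequence costs one slot, and the output lengths are made up.

```python
from collections import deque
from typing import List


def static_steps(lengths: List[int], slots: int) -> int:
    # Each batch runs until its longest sequence finishes.
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_steps(lengths: List[int], slots: int) -> int:
    queue = deque(lengths)
    running: List[int] = []   # remaining tokens for each in-flight sequence
    steps = 0
    while queue or running:
        # Iteration-level scheduling: refill free slots before every step.
        while queue and len(running) < slots:
            running.append(queue.popleft())
        steps += 1
        # Sequences that just emitted their last token are evicted.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print("static:    ", static_steps(lengths, slots=4), "steps")
print("continuous:", continuous_steps(lengths, slots=4), "steps")
```

With mixed lengths the static schedule pays for the longest sequence in every batch, while the continuous schedule overlaps the two long sequences and backfills the short ones.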

The lever to name in an interview

Throughput and latency are set by max batch size and max batched tokens. Raise them for cheaper tokens at the cost of TPOT and KV memory; lower them to protect tail latency. SLO-aware schedulers cap batch size per tier and reserve KV headroom.

Parallelism, quantization & speculative decoding

Big models do not fit on one GPU, and decode is bandwidth-bound — so we shard models across GPUs and shrink the bytes we must move.

Sharding the model

Tensor parallelism (TP) — split each layer's weight matrices across GPUs (attention heads, MLP columns). Every GPU computes a slice, then an all-reduce merges results. Cuts both memory and latency, but needs a fast interconnect (NVLink); keep TP within a node.
Pipeline parallelism (PP) — split the layers into stages across GPUs/nodes; micro-batches flow down the pipeline. Scales across nodes for huge models but adds a pipeline bubble (stages sit idle while the pipeline fills and drains).
Replicas (data parallel) — independent copies of the whole model for throughput; this is what the autoscaler adds and removes.

flowchart TD
    IN["Input hidden state"] --> SP["Split across GPUs"]
    SP --> A0["GPU 0: heads 0..k"]
    SP --> A1["GPU 1: heads k..2k"]
    A0 --> AR["All-Reduce sum"]
    A1 --> AR
    AR --> OUT["Merged output"]
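
The split-then-all-reduce pattern in the diagram can be checked numerically. The sketch below uses plain Python lists in place of GPUs and shows one common case, a row-parallel linear layer: each shard holds a slice of the input and the matching rows of the weight matrix, and the partial outputs are summed. The function names are hypothetical, and the input length is assumed to divide evenly by the number of shards.

```python
from typing import List

Matrix = List[List[int]]


def matvec(x: List[int], w: Matrix) -> List[int]:
    """Compute x @ w, where w has len(x) rows."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def tensor_parallel_matvec(x: List[int], w: Matrix, num_gpus: int) -> List[int]:
    shard = len(x) // num_gpus
    partials = []
    for g in range(num_gpus):
        rows = slice(g * shard, (g + 1) * shard)
        # Each "GPU" stores only its rows of w and sees only its slice of x.
        partials.append(matvec(x[rows], w[rows]))
    # All-reduce (sum): every shard ends up with the full output.
    return [sum(vals) for vals in zip(*partials)]


x = [1, 2, 3, 4]
w = [[1, 0, 2], [0, 1, 1], [3, 1, 0], [2, 2, 2]]
print(matvec(x, w))
print(tensor_parallel_matvec(x, w, num_gpus=2))
```

Each shard holds half the weights, which is the memory saving; the sum at the end is the communication step that makes a fast interconnect necessary.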

Quantization

Precision	Bytes / param	Effect	Notes
`fp16 / bf16`	2	Baseline quality	Default serving precision.
`fp8 / int8`	1	~2× smaller, ~2× less HBM to read ⇒ faster decode	Near-lossless with good schemes (fp8 native on H100; SmoothQuant for int8 weights and activations).
`int4`	0.5	~4× smaller; big decode speedup	Weight-only (GPTQ / AWQ); some quality loss — great for memory-constrained or latency-sensitive serving.

Because decode is memory-bandwidth-bound, quantization speeds it up directly — fewer bytes per weight means fewer bytes to stream from HBM each step. You can also quantize the KV cache (e.g. fp8) to fit more concurrent sequences.
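
A back-of-envelope floor on decode step time follows from this: every step must stream all the weights once, so the step cannot be faster than weight bytes divided by aggregate HBM bandwidth. The sketch assumes about 3.35 TB/s per GPU (an approximate H100 SXM figure) and ignores KV cache reads, all-reduce time, and kernel overhead, so real steps are slower.

```python
def decode_step_floor_ms(params: float, bytes_per_param: float, num_gpus: int,
                         hbm_bytes_per_s: float = 3.35e12) -> float:
    """Lower bound on one decode step: time to read every weight once."""
    weight_bytes = params * bytes_per_param
    return weight_bytes / (num_gpus * hbm_bytes_per_s) * 1e3


for name, width in (("fp16", 2.0), ("fp8", 1.0), ("int4", 0.5)):
    floor = decode_step_floor_ms(params=70e9, bytes_per_param=width, num_gpus=4)
    print(f"{name}: at least {floor:.2f} ms per decode step")
```

Halving the bytes per weight halves the floor, which is the direct decode speedup the paragraph above describes.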

Speculative decoding & paged KV cache

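Speculative decoding — a small, fast draft model proposes K tokens; the big model verifies them in a single forward pass and accepts the longest prefix that matches what it would have generated itself. Because decode is bandwidth-bound, verifying K tokens costs roughly the same as one decode step yet can yield several tokens, so it raises tokens/sec without changing the output, provided the draft's guesses are accepted often enough to pay for the extra draft passes.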
Paged KV cache (brief) — manage KV in fixed-size blocks like OS virtual memory (PagedAttention) instead of one contiguous buffer per sequence. This kills fragmentation, lets the batch pack tightly, and enables prefix sharing (reuse the KV of a common system prompt across requests). It is the mechanism that makes continuous batching memory-efficient — see the prefill/decode deep dive and the data-structures notes for the underlying block-table idea.
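
The draft-and-verify loop from the speculative decoding bullet can be sketched for the greedy case. Everything here is a toy: `target` and `draft` are hypothetical stand-ins that map a token list to the next token, and verification is written as a loop for clarity, whereas a real engine scores all K positions in one forward pass. Sampling-based decoding uses a probabilistic acceptance rule that this sketch does not cover.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Return the tokens emitted by one draft-and-verify round (greedy decoding)."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))
    # 2. The target checks each proposal against its own greedy choice.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
    # 3. All k proposals matched: the target's next token comes for free.
    accepted.append(target(context + accepted))
    return accepted


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except right after token 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


out = speculative_step(toy_target, toy_draft, context=[0], k=4)
print(out)
```

The emitted tokens are exactly what the target alone would have produced; the draft only changes how many target passes it takes to get them.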

Bottlenecks & scaling

Every limit on this platform traces back to HBM capacity, HBM bandwidth, or the latency of bringing GPUs online.

Bottleneck	Symptom	Mitigation
KV cache memory	Caps concurrent sequences & context length; admission stalls when blocks run out.	Paged KV, GQA/MQA, quantized KV (fp8), prefix sharing, evict / swap cold sequences to host memory.
Decode is bandwidth-bound	Low compute utilization; tokens/sec plateaus.	Bigger batches, quantization, speculative decoding, fused / FlashAttention kernels.
Batch size vs latency	Bigger batch raises TPOT and TTFT for everyone.	SLO-aware max batch, chunked prefill, priority lanes, reserve KV headroom.
Cold starts	Loading 100s of GB of weights takes minutes before a replica serves.	Warm pools, stream weights from local NVMe, cache artifacts on the node, scale-to-zero only for the cold tail.
Autoscaling lag	GPUs take minutes to provision; reactive scaling arrives after the spike.	Scale on leading indicators (queue depth, TTFT, KV util), headroom buffers, predictive scaling, admission control + request queue to shed load gracefully.
Long-tail prompt / output lengths	Head-of-line blocking; TTFT and TPOT interfere.	Disaggregated prefill/decode, chunked prefill, length-aware scheduling.
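
The KV cache row above leans on paging and prefix sharing, so a minimal sketch of a block table may help. `PagedKVCache` and its methods are hypothetical names, only whole blocks are shared, and copy-on-write and swapping to host memory are left out.

```python
import math
from typing import Dict, List, Optional


class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.refcount: Dict[int, int] = {}            # physical block -> number of users
        self.block_tables: Dict[str, List[int]] = {}  # sequence id -> physical blocks

    def add_sequence(self, seq_id: str, num_tokens: int,
                     share_from: Optional[str] = None, shared_tokens: int = 0) -> None:
        total = math.ceil(num_tokens / self.block_size)
        shared: List[int] = []
        if share_from is not None:
            # Only whole blocks of a common prefix can be reused.
            n_shared = min(shared_tokens // self.block_size, total)
            shared = self.block_tables[share_from][:n_shared]
        needed = total - len(shared)
        if needed > len(self.free_blocks):
            raise MemoryError("out of KV blocks: admission must wait or evict")
        for block in shared:
            self.refcount[block] += 1
        fresh = [self.free_blocks.pop() for _ in range(needed)]
        for block in fresh:
            self.refcount[block] = 1
        self.block_tables[seq_id] = shared + fresh

    def free_sequence(self, seq_id: str) -> None:
        for block in self.block_tables.pop(seq_id):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                self.free_blocks.append(block)


cache = PagedKVCache(num_blocks=8, block_size=16)
cache.add_sequence("a", num_tokens=40)                                     # 3 fresh blocks
cache.add_sequence("b", num_tokens=48, share_from="a", shared_tokens=32)  # 2 shared + 1 fresh
print(len(cache.free_blocks), "blocks free with a shared prefix")
```

Without sharing, the two sequences would need six blocks; with a shared 32-token prefix they need four, and a block returns to the free list only when its last user finishes.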

Staff-level summary

An LLM serving platform is a memory system wearing a compute system's clothes. Prefill is compute-bound and sets TTFT; decode is memory-bandwidth-bound and sets TPOT. A continuous-batching scheduler keeps the GPU saturated; a paged KV cache decides how many users fit at once; tensor/pipeline parallelism makes big models fit; and quantization + speculative decoding attack the bandwidth wall on decode. Scale replicas by aggregate tokens/sec, scale concurrency by KV memory, and autoscale on leading indicators because GPUs are slow to wake. Nail those and cost-per-token — the metric the business actually cares about — falls out.