AI / ML Infrastructure
LLM Inference Serving Platform
An LLM serving platform turns a rack of GPUs and a few hundred gigabytes of model weights into a low-latency, OpenAI-compatible token API. The entire design is a fight over one scarce resource — GPU high-bandwidth memory (HBM) and its bandwidth — mediated by two very different phases (prefill and decode), a batching scheduler that keeps the silicon busy, and a KV cache that quietly dictates how many users you can serve at once. Get those three right and everything else (cost, autoscaling, multi-model routing) follows.
Requirements
We are building the serving layer only — training, fine-tuning, and data pipelines are out of scope. The job is to take trained weights and answer generation requests fast and cheaply.
| Functional | Non-functional |
|---|---|
| Chat / completion API — OpenAI-compatible /v1/chat/completions and /v1/completions. | Low TTFT (time-to-first-token): p50 under 300 ms, p99 under ~1 s. Dominated by prefill + queueing. |
| Token streaming — emit tokens over SSE as they are generated, not one giant blob at the end. | High throughput — tokens/sec per replica and per GPU; low TPOT (time-per-output-token, e.g. under 40 ms ⇒ 25+ tok/s per user). |
| Multiple models — serve many models / sizes / versions (7B, 70B, embeddings) behind one endpoint; route by model name. | High GPU utilization — keep model-FLOPs utilization (MFU) high; idle GPUs burn money. |
| Per-request controls — max_tokens, temperature, stop, tool/function calling, JSON mode, logprobs. | Cost / token — minimize $ per 1M tokens; this is the headline business metric. |
| Cancellation — client disconnect frees the slot immediately (decode is expensive). | Autoscaling & isolation — absorb bursty traffic; fairness / SLO tiers across tenants; graceful degradation, never drop an in-flight stream. |
The two latency numbers that matter
Users feel TTFT (how long until something appears) and TPOT (how fast it streams after that). They are governed by different phases with opposite hardware characteristics — which is the central tension of the whole platform and the reason the rest of this page exists.
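Both numbers fall straight out of per-token timestamps. A minimal sketch, assuming the client records wall-clock seconds when it sends the request and when each streamed token arrives (the function name `ttft_and_tpot` is illustrative, not part of any API):

```python
from statistics import mean

def ttft_and_tpot(request_sent_s, token_times_s):
    """Return (TTFT, mean TPOT) in seconds from wall-clock timestamps."""
    if not token_times_s:
        raise ValueError("no tokens were produced")
    # TTFT: delay between sending the request and seeing the first token.
    ttft = token_times_s[0] - request_sent_s
    # TPOT: average gap between consecutive tokens after the first one.
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    tpot = mean(gaps) if gaps else 0.0
    return ttft, tpot

ttft, tpot = ttft_and_tpot(0.0, [0.25, 0.29, 0.33, 0.37])
print(f"TTFT={ttft * 1000:.0f} ms, TPOT={tpot * 1000:.0f} ms")
```

In production you would track percentiles of both per tenant and per model rather than a single mean.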
Scale & back-of-envelope math
Everything starts with HBM. A weight in fp16 is 2 bytes, so weight memory = params × 2 bytes. On top of that, every active sequence holds a KV cache that grows with each token and competes with weights for the same 80 GB of GPU memory.
| Model | fp16 weights | Min GPUs (80 GB) | KV / token (GQA, fp16) | Typical deployment |
|---|---|---|---|---|
| 7B | ~14 GB | 1 | ~128 KB | 1 GPU; lots of room for big batches / long context. |
| 70B | ~140 GB | 2 (TP=2) | ~320 KB | 4 GPUs (TP=4) in practice — weights leave little KV headroom on 2. |
| 405B | ~810 GB | 8 in fp8 / 16 in fp16 | ~500 KB | Multi-node TP + PP; usually served in fp8 (~405 GB) on 8×H100. |
KV cache size per token = 2 (K,V) × layers × kv_heads × head_dim × bytes. For a 70B model (80 layers, 8 KV heads via GQA, head_dim 128, fp16): 2 × 80 × 8 × 128 × 2 ≈ 320 KB/token. Multi-head attention (no GQA) is several times larger, which is exactly why modern models use grouped-query attention.
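The two formulas above are easy to turn into a sizing helper. This is a sketch under the same assumptions as the text (fp16, 80 layers, 8 KV heads, head_dim 128). The 64-head figure used for the no-GQA comparison is an assumed head count for illustration, not something stated above.

```python
def weight_bytes(params, bytes_per_param=2):
    """Weight memory: one value per parameter (fp16 = 2 bytes)."""
    return params * bytes_per_param

def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes added by every token of every active sequence."""
    # The leading 2 covers one K vector and one V vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

gqa = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
mha = kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128)  # assumed: 64 heads, no GQA

print(f"70B weights: {weight_bytes(70_000_000_000) / 1e9:.0f} GB")
print(f"KV/token with GQA: {gqa / 1024:.0f} KiB, without: {mha / 1024:.0f} KiB")
```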
Worked example — KV cache eats the GPU
Serve 70B (TP=4 on H100, 4×80 = 320 GB). Weights take ~140 GB, leaving ~180 GB for KV cache and activations. At ~320 KB/token that is room for roughly 180 GB / 320 KB ≈ 560k tokens resident at once, e.g. ~270 concurrent sequences at 2K context, or only ~17 at 32K context. Context length and concurrency trade directly against each other.
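The same arithmetic as a small calculator, so you can vary context length and see concurrency move. It is a rough sketch: it uses decimal GB/KB like the text and ignores activation memory and fragmentation, so real numbers will be somewhat lower. The function names are illustrative.

```python
def max_resident_tokens(total_hbm_gb, weight_gb, kv_kb_per_token):
    """Tokens of KV cache that fit once the weights are loaded."""
    free_bytes = (total_hbm_gb - weight_gb) * 10**9
    return free_bytes // (kv_kb_per_token * 10**3)

def max_sequences(resident_tokens, context_len):
    """Concurrent sequences if every one uses its full context."""
    return resident_tokens // context_len

tokens = max_resident_tokens(total_hbm_gb=320, weight_gb=140, kv_kb_per_token=320)
for ctx in (2048, 8192, 32768):
    print(f"{ctx:>6}-token context -> {max_sequences(tokens, ctx)} sequences")
```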
Throughput / GPU-count sizing. Target 1,000 concurrent users at 30 tok/s each ⇒ 1000 × 30 = 30,000 tok/s of aggregate decode. A single 70B replica with continuous batching delivers on the order of 3,000–6,000 tok/s aggregate. Taking ~4,000 tok/s as a working figure, you need roughly 30,000 / 4,000 ≈ 8 replicas × 4 GPUs = ~32 H100s for steady state, plus headroom for bursts and prefill spikes. The point of the math is not precision; it is to show that replica count is driven by aggregate tokens/sec, while concurrency is driven by KV memory.
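The sizing rule is one line of arithmetic, sketched below. The `headroom` parameter is a hypothetical knob for the burst buffer mentioned above. The per-replica throughput is the rough range from the text, not a benchmark.

```python
import math

def replicas_needed(users, tok_per_s_per_user, replica_tok_per_s, headroom=0.0):
    """Replica count from aggregate decode demand; headroom=0.3 adds 30%."""
    demand = users * tok_per_s_per_user
    return math.ceil(demand * (1 + headroom) / replica_tok_per_s)

GPUS_PER_REPLICA = 4  # TP=4, as in the worked example

for per_replica in (3000, 4000, 6000):
    n = replicas_needed(1000, 30, per_replica)
    print(f"{per_replica} tok/s per replica -> {n} replicas, {n * GPUS_PER_REPLICA} GPUs")
```

Running it across the 3,000–6,000 tok/s range shows how sensitive the GPU bill is to per-replica throughput, which is why the batching and quantization levers matter so much.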
High-level design
A request flows through a stateless gateway, a model-aware router, and a batching scheduler that feeds GPU replicas; generated tokens stream back over SSE. A model registry and an autoscaler sit to the side.
flowchart TD
Client["Clients (chat / completion)"] --> GW["API Gateway"]
GW --> RT["Router / Load Balancer"]
REG["Model Registry"] --> RT
RT --> SCH["Batching Scheduler"]
subgraph Pool["Model Replica Pool"]
subgraph R1["Replica A: 70B, TP=2"]
G0["GPU 0 shard"]
G1["GPU 1 shard"]
end
KV["KV Cache (paged)"]
end
SCH --> R1
R1 --> KV
KV --> R1
R1 --> STR["Token Streamer (SSE)"]
STR --> Client
R1 --> MET["Metrics: TTFT, tokens/sec, util"]
MET --> AS["Autoscaler"]
AS --> Pool
- API Gateway — auth, rate limiting, quota / billing, request validation, OpenAI-compatible schema. Stateless and horizontally scaled.
- Router — resolves the model name to a replica set and load-balances on live capacity (queue depth, free KV blocks), not round-robin. Holds the SSE connection open and proxies the token stream.
- Batching scheduler — the brain. Forms and continuously reshapes the GPU batch every iteration (see batching), admitting and evicting sequences to keep the GPU saturated within latency SLOs.
- Model replicas — one model loaded across N GPUs via tensor parallelism (see parallelism). Each replica owns its KV cache memory.
- KV cache — paged blocks of attention key/value state; the true limiter on concurrency.
- Model registry — versioned weights in object storage (S3/GCS) + metadata (precision, parallel layout, tokenizer); drives rollouts and lets replicas pull the right artifact.
- Autoscaler — scales replicas on leading signals (queue depth, TTFT, KV utilization), because spinning up a GPU + loading weights takes minutes.
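To make the router's "live capacity, not round-robin" rule concrete, here is a minimal sketch. The `ReplicaStats` fields and the tie-breaking order are assumptions for illustration. A real router would get these numbers from replica heartbeats or a metrics feed.

```python
from dataclasses import dataclass

@dataclass
class ReplicaStats:
    name: str
    model: str
    queue_depth: int      # requests waiting in this replica's scheduler
    free_kv_blocks: int   # unallocated KV cache blocks

def pick_replica(replicas, model, blocks_needed):
    """Choose the least-loaded replica of `model` that can hold the request."""
    candidates = [
        r for r in replicas
        if r.model == model and r.free_kv_blocks >= blocks_needed
    ]
    if not candidates:
        return None  # caller should queue the request or shed load
    # Shortest queue wins; ties go to the replica with the most free KV blocks.
    return min(candidates, key=lambda r: (r.queue_depth, -r.free_kv_blocks))

pool = [
    ReplicaStats("a", "70b", queue_depth=5, free_kv_blocks=100),
    ReplicaStats("b", "70b", queue_depth=2, free_kv_blocks=10),
    ReplicaStats("c", "70b", queue_depth=2, free_kv_blocks=50),
]
print(pick_replica(pool, "70b", blocks_needed=20).name)
```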
Prefill vs decode — the key deep dive
A generation request runs in two phases with opposite hardware profiles. Understanding this split explains TTFT, TPOT, batching, and almost every optimization on this page.
| Phase | What happens | Bottleneck | Drives |
|---|---|---|---|
| Prefill | Process all N prompt tokens in one forward pass; build the KV cache for the prompt. Large matrix–matrix multiplies (GEMM), high arithmetic intensity. | Compute-bound (GPU FLOPs saturated) | TTFT |
| Decode | Autoregressive: generate one token, append its KV, repeat. Each step reads all weights + the whole KV cache from HBM but does tiny (one-token) compute — matrix–vector (GEMV). | Memory-bandwidth-bound (HBM reads dominate) | TPOT / tokens-per-sec |
Why they differ. Prefill has many tokens to crunch in parallel, so the GPU's compute units are the limit. Decode produces a single token per step, so the GPU spends its time reading the model and KV cache out of HBM rather than computing — arithmetic intensity is low and bandwidth is the wall. The practical consequence: batching barely helps prefill but helps decode enormously, because many sequences can share a single read of the weights.
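A roofline-style estimate shows where the switch from bandwidth-bound to compute-bound happens. This is a deliberately crude sketch: it models one device, counts only weight reads and roughly 2 FLOPs per weight per token, and ignores KV reads and attention FLOPs. The hardware constants are approximate H100 figures taken as assumptions, not values from this page.

```python
PEAK_FLOPS = 989e12   # assumed dense bf16 peak, FLOP/s
HBM_BW = 3.35e12      # assumed HBM bandwidth, bytes/s

def step_times(params, tokens_in_step, bytes_per_param=2):
    """Return (compute_seconds, memory_seconds) for one forward pass."""
    flops = 2 * params * tokens_in_step            # ~2 FLOPs per weight per token
    compute_s = flops / PEAK_FLOPS
    memory_s = params * bytes_per_param / HBM_BW   # weights are read once per pass
    return compute_s, memory_s

def bound(params, tokens_in_step):
    compute_s, memory_s = step_times(params, tokens_in_step)
    return "compute" if compute_s > memory_s else "memory"

# tokens_in_step is the batch size for decode, or the prompt length for prefill.
for tokens in (1, 64, 512, 2048):
    print(tokens, bound(70e9, tokens))
```

Under these assumptions the crossover sits at a few hundred tokens per pass: a decode step with a small batch is far below it, while a long prompt's prefill is far above it.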
sequenceDiagram
participant C as Client
participant G as Gateway
participant S as Scheduler
participant P as Prefill GPUs
participant D as Decode GPUs
C->>G: POST prompt with N tokens
G->>S: Enqueue request
S->>P: Run prefill, compute bound
P->>P: Build KV cache for N tokens
P-->>C: First token at TTFT
loop Decode until EOS
S->>D: Schedule one step, batched
D->>D: Read KV, compute next token
D-->>C: Stream token
D->>D: Append new token to KV
end
D-->>G: End of stream
Disaggregated prefill / decode serving
Because the phases stress different resources, a long prefill sharing a GPU with active decodes causes head-of-line blocking: one user's 8K-token prompt stalls everyone else's streaming. The fix is to run prefill and decode on separate GPU pools and transfer the KV cache between them. Each pool is tuned for its phase — prefill for TTFT, decode for tokens/sec — and a slow prefill never freezes ongoing streams.
flowchart LR
Q["Request Queue"] --> PF["Prefill Pool (compute-bound)"]
PF --> KVT["KV Cache Transfer"]
KVT --> DC["Decode Pool (bandwidth-bound)"]
DC --> OUT["Streamed Tokens"]
PF --> TTFT["Optimize TTFT"]
DC --> TPOT["Optimize TPOT and tokens-per-sec"]
Chunked prefill
The cheaper, single-pool alternative: split a long prompt's prefill into bounded chunks and interleave those chunks with decode steps in the same batch. No one prefill monopolizes the GPU, decodes keep flowing, and you smoothly trade a little TTFT for steadier TPOT. It pairs naturally with continuous batching below.
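A sketch of the chunking logic, assuming a fixed per-step token budget shared between one long prefill and the currently decoding sequences. `token_budget` is a hypothetical name for that cap, and real schedulers juggle many prompts at once.

```python
def plan_steps(prompt_len, decode_seqs, token_budget):
    """Return a list of (prefill_tokens, decode_tokens), one entry per step."""
    if token_budget <= decode_seqs:
        raise ValueError("budget must leave room for at least one prefill token")
    remaining = prompt_len
    steps = []
    while remaining > 0:
        # Decodes get their one token each first; prefill takes what is left.
        chunk = min(remaining, token_budget - decode_seqs)
        steps.append((chunk, decode_seqs))
        remaining -= chunk
    return steps

steps = plan_steps(prompt_len=8192, decode_seqs=64, token_budget=2048)
print(len(steps), "steps:", steps)
```

The 8K prompt is spread over several steps, and every one of those steps still advances all 64 decoding sequences by a token.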
Trade-off in one line
Disaggregation gives the cleanest TTFT/TPOT isolation but pays a KV-transfer cost and needs two pools to size. Chunked prefill keeps one pool and is simpler, at the cost of slightly higher TTFT for very long prompts.
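The KV-transfer cost is easy to estimate: prompt tokens times KV bytes per token, divided by link bandwidth. The sketch below uses the ~320 KB/token figure from the sizing section. The 50 GB/s link speed is a placeholder assumption, so substitute whatever your interconnect actually sustains.

```python
def kv_transfer_ms(prompt_tokens, kv_bytes_per_token, link_gb_per_s):
    """Milliseconds to ship one prompt's KV cache from prefill to decode pool."""
    total_bytes = prompt_tokens * kv_bytes_per_token
    return total_bytes / (link_gb_per_s * 1e9) * 1e3

KV_BYTES_70B = 2 * 80 * 8 * 128 * 2   # per-token KV size from the sizing section

for prompt in (1024, 8192, 32768):
    ms = kv_transfer_ms(prompt, KV_BYTES_70B, link_gb_per_s=50)  # assumed link speed
    print(f"{prompt:>6} tokens -> {ms:.1f} ms")
```

That transfer time lands on TTFT, so disaggregation pays off when it is small next to the head-of-line stalls it removes.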
Batching — static vs continuous
Batching is how decode reaches high throughput: many sequences amortize one read of the weights from HBM. How you batch is the single biggest lever on GPU utilization.
Static batching gathers N requests, runs them together, and only returns the GPU when all N finish. But outputs have wildly different lengths, so the batch runs at the speed of its longest sequence while finished slots sit idle, and newly arrived requests wait for the whole batch to drain. Utilization craters under realistic traffic.
Continuous (in-flight) batching uses iteration-level scheduling: the scheduler makes a decision every decode step. The instant a sequence emits EOS it is evicted and a queued request is admitted into the freed slot — the GPU never waits for the slowest sequence. This is the default in modern engines (vLLM, TGI, TensorRT-LLM) and is what chunked prefill plugs into.
| Dimension | Static batching | Continuous batching |
|---|---|---|
| Scheduling granularity | Once per batch | Every iteration (per token step) |
| GPU utilization | Low — idles on finished slots | High — slots refilled immediately |
| Wait for new requests | Until batch fully drains | Joins on the next step |
| Throughput vs latency | Poor under mixed lengths | Tunable via max batch size / max tokens — bigger batch = more throughput but higher TPOT and KV pressure |
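A toy simulation makes the gap visible. It is a sketch that counts decode steps only, assuming each request is described just by its output length and that a freed slot is refilled before the next step. Prefill cost and KV limits are ignored.

```python
from collections import deque

def static_steps(lengths, slots):
    """Each batch of `slots` requests runs until its longest member finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_steps(lengths, slots):
    """Iteration-level scheduling: refill freed slots before every step."""
    queue = deque(lengths)
    active = []   # remaining tokens for each running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every active sequence emits one token; finished ones free their slot.
        active = [n - 1 for n in active if n > 1]
    return steps

workload = [100, 10, 10, 10] * 4   # one long output among short ones, repeated
print("static:", static_steps(workload, slots=4))
print("continuous:", continuous_steps(workload, slots=4))
```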
The lever to name in an interview
Throughput and latency are set by max batch size and max batched tokens. Raise them for cheaper tokens at the cost of TPOT and KV memory; lower them to protect tail latency. SLO-aware schedulers cap batch size per tier and reserve KV headroom.
Parallelism, quantization & speculative decoding
Big models do not fit on one GPU, and decode is bandwidth-bound — so we shard models across GPUs and shrink the bytes we must move.
Sharding the model
- Tensor parallelism (TP) — split each layer's weight matrices across GPUs (attention heads, MLP columns). Every GPU computes a slice, then an all-reduce merges results. Cuts both memory and latency, but needs a fast interconnect (NVLink); keep TP within a node.
- Pipeline parallelism (PP) — split the layers into stages across GPUs/nodes; micro-batches flow down the pipeline. Scales across nodes for huge models but adds a pipeline bubble (stages sit idle while the pipeline fills and drains).
- Replicas (data parallel) — independent copies of the whole model for throughput; this is what the autoscaler adds and removes.
flowchart TD
IN["Input hidden state"] --> SP["Split across GPUs"]
SP --> A0["GPU 0: heads 0..k"]
SP --> A1["GPU 1: heads k..2k"]
A0 --> AR["All-Reduce sum"]
A1 --> AR
AR --> OUT["Merged output"]
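The split-then-all-reduce pattern in the diagram can be shown with plain lists. This sketch implements one common arrangement, a row-parallel linear layer, where each simulated GPU holds a slice of the weight rows and the partial outputs are summed. It stands in for real collectives and is not how an engine would implement them.

```python
def matmul(a, b):
    """Plain-Python matrix multiply: (m x k) times (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def row_parallel_linear(x, w, num_gpus):
    """Compute x @ w with w's rows sharded across `num_gpus` simulated GPUs."""
    k = len(w)  # inner dimension shared by x's columns and w's rows
    if k % num_gpus != 0:
        raise ValueError("inner dimension must divide evenly across GPUs")
    shard = k // num_gpus
    partials = []
    for g in range(num_gpus):              # each iteration stands in for one GPU
        lo, hi = g * shard, (g + 1) * shard
        x_g = [row[lo:hi] for row in x]    # this GPU's slice of the activations
        w_g = w[lo:hi]                     # the matching rows of the weights
        partials.append(matmul(x_g, w_g))
    # All-reduce: element-wise sum of every GPU's partial output.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

x = [[1, 2, 3, 4]]
w = [[1, 0], [0, 1], [1, 1], [2, 0]]
print(row_parallel_linear(x, w, num_gpus=2), matmul(x, w))
```

Each GPU stores only its slice of the weights, which is the memory saving. The all-reduce in the last line is the step that needs NVLink-class bandwidth.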
Quantization
| Precision | Bytes / param | Effect | Notes |
|---|---|---|---|
| fp16 / bf16 | 2 | Baseline quality | Default serving precision. |
| fp8 / int8 | 1 | ~2× smaller, ~2× less HBM to read ⇒ faster decode | Near-lossless with good schemes (fp8 native on H100; SmoothQuant / AWQ for int8). |
| int4 | 0.5 | ~4× smaller; big decode speedup | Weight-only (GPTQ / AWQ); some quality loss — great for memory-constrained or latency-sensitive serving. |
Because decode is memory-bandwidth-bound, quantization speeds it up directly — fewer bytes per weight means fewer bytes to stream from HBM each step. You can also quantize the KV cache (e.g. fp8) to fit more concurrent sequences.
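To show what "fewer bytes per weight" means mechanically, here is a minimal symmetric absmax int8 quantizer. It is a much simpler scheme than SmoothQuant, AWQ or GPTQ and is shown only to illustrate the idea: store one small integer per weight plus a shared scale, and multiply back at use time.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale."""
    # Fall back to a scale of 1.0 when every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max error {worst:.5f}")
```

Each stored value now takes 1 byte instead of 2, so a decode step streams half as many weight bytes from HBM, at the price of a rounding error bounded by half the scale.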
Speculative decoding & paged KV cache
- Speculative decoding — a small, fast draft model proposes K tokens; the big model verifies them in a single forward pass and accepts the longest prefix that matches what it would have generated itself. Because decode is bandwidth-bound, verifying K tokens costs roughly the same as one decode step (the weights are read once either way) but can yield several tokens, so it raises tokens/sec with no quality loss.
- Paged KV cache (brief) — manage KV in fixed-size blocks like OS virtual memory (PagedAttention) instead of one contiguous buffer per sequence. This kills fragmentation, lets the batch pack tightly, and enables prefix sharing (reuse the KV of a common system prompt across requests). It is the mechanism that makes continuous batching memory-efficient — see the prefill/decode deep dive and the data-structures notes for the underlying block-table idea.
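The speculative-decoding loop from the first bullet can be sketched with toy models. This version assumes greedy decoding, where "correct" means "equals the target model's own next token". Sampling-based variants use a probabilistic acceptance test instead. The two `*_next` functions are stand-ins for real models, and the verification loop stands in for what an engine does in one batched forward pass.

```python
def speculative_step(target_next, draft_next, context, k):
    """Return the tokens produced by one draft-then-verify round."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))
    # 2. The target model checks each position in order.
    accepted = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)
    # 3. Everything matched: the same pass also yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted

def target_next(ctx):
    return (ctx[-1] + 1) % 10            # toy "big model": count upward

def draft_next(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10   # wrong after a 3

print(speculative_step(target_next, draft_next, [0], k=4))
```

Whatever the draft proposes, the emitted tokens are exactly what the target alone would have produced, which is why there is no quality loss. The speedup depends entirely on how often the draft is right.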
Bottlenecks & scaling
Every limit on this platform traces back to HBM capacity, HBM bandwidth, or the latency of bringing GPUs online.
| Bottleneck | Symptom | Mitigation |
|---|---|---|
| KV cache memory | Caps concurrent sequences & context length; admission stalls when blocks run out. | Paged KV, GQA/MQA, quantized KV (fp8), prefix sharing, evict / swap cold sequences to host memory. |
| Decode is bandwidth-bound | Low compute utilization; tokens/sec plateaus. | Bigger batches, quantization, speculative decoding, fused / FlashAttention kernels. |
| Batch size vs latency | Bigger batch raises TPOT and TTFT for everyone. | SLO-aware max batch, chunked prefill, priority lanes, reserve KV headroom. |
| Cold starts | Loading 100s of GB of weights takes minutes before a replica serves. | Warm pools, stream weights from local NVMe, cache artifacts on the node, scale-to-zero only for the cold tail. |
| Autoscaling lag | GPUs take minutes to provision; reactive scaling arrives after the spike. | Scale on leading indicators (queue depth, TTFT, KV util), headroom buffers, predictive scaling, admission control + request queue to shed load gracefully. |
| Long-tail prompt / output lengths | Head-of-line blocking; TTFT and TPOT interfere. | Disaggregated prefill/decode, chunked prefill, length-aware scheduling. |
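The first row's symptom, admission stalling when blocks run out, is easy to see in a toy paged allocator. This is a sketch of the block-table idea only: the class name, the default block size of 16 tokens, and the return-`False`-on-exhaustion convention are assumptions. Prefix sharing and swapping to host memory are left out.

```python
class PagedKVCache:
    """Hands out fixed-size KV blocks to sequences on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.block_table = {}                # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; return False if no block is free."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:         # last block is full, or none yet
            if not self.free:
                return False                 # scheduler must stall, evict or swap
            self.block_table.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return True

    def release(self, seq_id):
        """Return a finished or cancelled sequence's blocks to the free list."""
        self.free.extend(self.block_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=2, block_size=4)
grew = [cache.append_token("a") for _ in range(5)]   # 5 tokens need 2 blocks
print(grew, cache.append_token("b"))                 # "b" finds no free block
```

Because blocks are allocated only as a sequence grows, short outputs never reserve memory they do not use, and releasing a cancelled stream immediately frees capacity for the queue.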
Staff-level summary
An LLM serving platform is a memory system wearing a compute system's clothes. Prefill is compute-bound and sets TTFT; decode is memory-bandwidth-bound and sets TPOT. A continuous-batching scheduler keeps the GPU saturated; a paged KV cache decides how many users fit at once; tensor/pipeline parallelism makes big models fit; and quantization + speculative decoding attack the bandwidth wall on decode. Scale replicas by aggregate tokens/sec, scale concurrency by KV memory, and autoscale on leading indicators because GPUs are slow to wake. Nail those and cost-per-token — the metric the business actually cares about — falls out.