AI / ML Infrastructure
Continuous Batching Inference Engine
Static batching locks a batch until its slowest sequence finishes, so a request that stops at 20 tokens keeps an idle GPU slot while its neighbors generate hundreds more. Continuous batching — also called iteration-level or in-flight batching — flips the serving loop: the scheduler runs one decode step for the whole batch, then retires finished sequences and admits new ones every iteration. The batch composition changes constantly and no request waits for another, which is how engines like Orca, vLLM, and TGI push a single GPU replica from under 40% utilization to 80%+.
The problem: static batching wastes the GPU
LLM inference is throughput-bound: you batch requests so a single read of the multi-gigabyte weights feeds many sequences at once and keeps the tensor cores busy. But decoding is autoregressive — each forward step produces exactly one token per sequence — and output lengths vary wildly. One request emits a 20-token answer; the request next to it streams a 2,000-token essay.
Static batching assembles N requests,
runs the entire batch to completion, and returns them together. The
batch is therefore locked for the duration of its
slowest member. A sequence that hit its stop token at
step 20 keeps occupying its slot through step 2,000: the GPU computes
wasted rows (or padding), no new request can enter until the whole
batch drains, and arriving requests sit in the queue for a full
batch-runtime before they even start. The result is GPU utilization
that routinely falls below 40%, inflated tail latency for short
requests, and queueing latency that scales with the longest generation
in the batch.
The fix is to stop treating a request as the unit of scheduling and instead schedule a single iteration (one decode step) at a time — admitting and evicting sequences continuously.
| Functional requirements | Non-functional requirements |
|---|---|
| Accept a continuous stream of generation requests (prompt, sampling params, max tokens). | Maximize throughput (tokens/sec) and GPU utilization on a fixed replica. |
| Run them on one model replica and stream tokens back as they are produced. | Hold time-to-first-token (TTFT) and time-per-output-token (TPOT) within SLA. |
| Admit new requests and release finished ones without draining the batch. | Stay within a bounded KV-cache memory budget — never OOM the GPU. |
| Support variable prompt and output lengths; honor per-request stop conditions. | Fairness: no request starves behind a long-running neighbor. |
Key insight
The waste in static batching is structural, not a tuning problem: as long as the whole batch must finish together, the slowest sequence dictates everyone's latency and the GPU's idle time. You have to change the granularity of scheduling, not just the batch size.
Static, dynamic, and continuous batching
There are three points on the spectrum, distinguished by when the batch is allowed to change:
- Static batching — a fixed batch is formed once and runs to completion. Simplest to implement, but utilization collapses under variable output lengths.
- Dynamic batching (server-side request batching, e.g. Triton) — the server waits a short window to gather arrivals into a batch, then runs that batch to completion. It amortizes the cost of forming batches, but each batch is still completion-locked, so the head-of-line problem just moves to the batch granularity. Great for fixed-shape models (vision, embeddings); poor for autoregressive LLMs.
- Continuous / iteration-level / in-flight batching — introduced by Orca and popularized by vLLM and TGI. The scheduling unit is one forward iteration, not one request. After every step the scheduler retires finished sequences (freeing their KV cache) and admits waiting ones, so the batch composition is re-decided each iteration and no sequence ever waits for another to finish.
| Dimension | Static | Dynamic | Continuous |
|---|---|---|---|
| Scheduling unit | One request | One request batch | One iteration (decode step) |
| When batch composition changes | Never (until full drain) | When the batch completes | Every iteration |
| GPU slot when a seq finishes early | Idle until batch drains | Idle until batch drains | Freed immediately, reused |
| New request wait time | Up to full batch runtime | Batch window + batch runtime | ~ one iteration |
| Best for | Uniform, short outputs | Fixed-shape, non-autoregressive | Variable-length LLM generation |
flowchart TD
subgraph STATIC["Static batching"]
A1["Batch of 4 starts together"] --> A2["Run whole batch to completion"]
A2 --> A3["Seq R1 stops at token 20"]
A2 --> A4["Seq R2 still running at token 800"]
A3 --> A5["R1 slot idles for ~780 steps"]
A4 --> A6["Only now: free batch, admit next 4"]
end
subgraph CONT["Continuous batching"]
B1["Run one decode step"] --> B2["Retire finished seqs, free KV"]
B2 --> B3["Admit waiting seqs that fit"]
B3 --> B1
end
The continuous loop has no "drain" state — the same three operations (step, retire, admit) repeat forever, and the batch is reshaped on every pass.
The engine loop: per-iteration scheduler
The serving engine is a tight while loop. Each pass has
three phases: a scheduling phase that decides the
running set, a single fused model forward, and a
post-processing phase that samples the next token,
appends it, checks stop conditions, streams output, and frees the KV
blocks of anything that just finished.
Admission is gated by the KV-cache budget. A sequence
can only join the batch if there are enough free KV blocks for its
next step; a brand-new request additionally needs blocks to hold the
KV for its entire prompt during prefill. vLLM tracks free blocks and
admits greedily up to a max_num_seqs /
max_num_batched_tokens
limit. Requests therefore move through a small state machine:
- WAITING — queued, no KV allocated yet.
- RUNNING — in the batch, KV allocated, producing tokens.
- PREEMPTED / SWAPPED — evicted under cache pressure; its KV is recomputed or swapped back later (see trade-offs).
The forward pass is "fused": in a single step the engine can run prefill for newly admitted requests and decode for in-flight ones together. Because each produced token is pushed to its client immediately (over SSE or a gRPC stream), TTFT for a request is essentially the latency of the iteration that completes its prefill.
sequenceDiagram
participant Q as Waiting queue
participant S as Scheduler
participant K as KV cache
participant G as GPU forward
participant C as Client
loop Every iteration
S->>K: How many KV blocks are free
K-->>S: Free block count
S->>Q: Admit requests that fit the budget
Q-->>S: New seqs join the running set
S->>G: Run one fused prefill plus decode step
G-->>S: One new token per running seq
S->>C: Stream tokens out
S->>S: Retire finished seqs and free their KV
end
Everything the scheduler needs is cheap to track: the running set, the waiting queue, the count of free KV blocks, and each sequence's current length and sampling state. The whole decision is a greedy pass per iteration — which is exactly why scheduling overhead itself can become a bottleneck at high token rates (covered below).
Mixing prefill and decode
Prefill and decode have opposite compute profiles, and that is the central tension the scheduler must manage:
- Prefill processes the whole prompt in parallel — large matrix multiplies, one big compute-bound step that is heavy on FLOPs.
- Decode produces a single token — a tiny matrix multiply that is memory-bandwidth-bound, dominated by re-reading the weights and the KV cache.
Naively admitting a long prompt forces a huge prefill step that stalls every in-flight decode in the same iteration, producing a visible TPOT spike for every user in the batch. This is the prefill–decode interference problem.
Chunked prefill (Sarathi-Serve, also in vLLM) is the standard remedy: split a long prompt into fixed token-budget chunks and process one chunk per iteration, co-scheduled with ongoing decodes. Each step's compute stays bounded, so decode latency stays smooth while utilization climbs. A per-step token budget then lets you bias the engine: protect TPOT by prioritizing decodes, or protect TTFT by prioritizing prefills. At larger scale, some systems go further and disaggregate prefill and decode onto separate replicas entirely so the two phases never contend.
flowchart LR
NR["New request, long prompt"] --> CP["Split prompt into prefill chunks"]
CP --> SCH["Scheduler: token budget per step"]
RUN["In-flight decodes"] --> SCH
SCH --> STEP["One mixed step: many decodes plus one prefill chunk"]
STEP --> TOK["Stream one token per decode"]
STEP --> SCH
Cross-reference
Admission and eviction here are entirely driven by KV-cache availability — see LLM KV-Cache Management for PagedAttention, block allocation, prefix sharing, and offloading. Continuous batching is the scheduling core of the broader LLM Inference Serving Platform, which wraps it with routing, tensor parallelism, autoscaling, and SLAs.
Throughput vs latency knobs
Continuous batching gives you a small set of dials, and almost every one trades throughput against latency or fairness. Tuning is the art of staying on the good edge of that Pareto frontier for your workload.
| Knob | Turn it up to gain | Cost / risk |
|---|---|---|
Max batch size (max_num_seqs)
|
More concurrency, higher token throughput | More KV pressure and longer per-step time → worse TPOT; risks OOM and preemption |
KV / token budget
(max_num_batched_tokens)
|
Bigger batches and longer contexts per step | Set too high → OOM; too low → the GPU is left underfilled |
| Prefill chunk size | Larger chunks finish prefill faster → better TTFT | Larger steps stall decodes → worse, spikier TPOT |
| Scheduling policy (FCFS vs priority) | Target SLAs or premium tiers | Priority can starve low-priority work; pure FCFS lets one huge job hog the cache |
| Preemption policy (recompute vs swap) | Frees KV under pressure to keep admitting | Recompute wastes prior compute; swap burns PCIe bandwidth and adds latency |
| Admission aggressiveness | Packs more sequences in → throughput | Over-admitting causes preemption thrash and blows up tail latency |
Watch for starvation
Greedy, throughput-first admission plus a few very long generations can starve short requests or trigger a preemption storm where sequences are evicted and re-admitted repeatedly. Bound the maximum tokens per request, add aging so waiting requests gain priority over time, and reserve headroom in the KV budget so the batch does not run permanently at the edge of OOM.
Bottlenecks & scaling
Once the iteration loop is in place, a predictable set of bottlenecks decides how far a single replica scales and when you must shard or add replicas.
| Bottleneck | Why it bites | Mitigation |
|---|---|---|
| KV-cache memory | Caps concurrent sequences and context length; grows linearly with batch × sequence length | PagedAttention paging, quantized KV, prefix sharing, CPU/NVMe offload, preemption |
| Long prefills block decodes | One big prefill step spikes TPOT for the entire batch | Chunked prefill, per-step token budget, prefill/decode disaggregation |
| Scheduling overhead | Per-iteration CPU/Python scheduling can fail to keep the GPU fed at high token rates | Keep the scheduler lean, overlap scheduling with GPU compute, async output handling, CUDA graphs |
| Tail latency / fairness | Head-of-line blocking from large jobs and preemption churn | Priority with aging, bound max tokens, separate queues per SLA class |
| Single-replica ceiling | One GPU or replica eventually saturates memory bandwidth | Tensor / pipeline parallelism within a replica; autoscale and load-balance across replicas with a KV-aware router |
Summary
Continuous batching turns the serving loop inside-out: it schedules per iteration, not per request, so a finished sequence frees its GPU slot and KV blocks immediately while new requests stream in. Paired with paged KV and chunked prefill, it lifts a single replica from under 40% to 80%+ utilization and multiplies throughput while holding TTFT and TPOT SLAs. The remaining engineering is all in the scheduler — balancing batch size, KV budget, and prefill chunking against latency and fairness.