AI / ML Infrastructure

Continuous Batching Inference Engine

Static batching locks a batch until its slowest sequence finishes, so a request that stops at 20 tokens keeps an idle GPU slot while its neighbors generate hundreds more. Continuous batching — also called iteration-level or in-flight batching — flips the serving loop: the scheduler runs one decode step for the whole batch, then retires finished sequences and admits new ones every iteration. The batch composition changes constantly and no request waits for another, which is how engines like Orca, vLLM, and TGI push a single GPU replica from under 40% utilization to 80%+.

The problem: static batching wastes the GPU

LLM inference is throughput-bound: you batch requests so a single read of the multi-gigabyte weights feeds many sequences at once and keeps the tensor cores busy. But decoding is autoregressive — each forward step produces exactly one token per sequence — and output lengths vary wildly. One request emits a 20-token answer; the request next to it streams a 2,000-token essay.

Static batching assembles N requests, runs the entire batch to completion, and returns them together. The batch is therefore locked for the duration of its slowest member. A sequence that hit its stop token at step 20 keeps occupying its slot through step 2,000: the GPU computes wasted rows (or padding), no new request can enter until the whole batch drains, and arriving requests sit in the queue for a full batch-runtime before they even start. The result is GPU utilization that routinely falls below 40%, inflated tail latency for short requests, and queueing latency that scales with the longest generation in the batch.

The fix is to stop treating a request as the unit of scheduling and instead schedule a single iteration (one decode step) at a time — admitting and evicting sequences continuously.

Functional requirements	Non-functional requirements
Accept a continuous stream of generation requests (prompt, sampling params, max tokens).	Maximize throughput (tokens/sec) and GPU utilization on a fixed replica.
Run them on one model replica and stream tokens back as they are produced.	Hold time-to-first-token (TTFT) and time-per-output-token (TPOT) within SLA.
Admit new requests and release finished ones without draining the batch.	Stay within a bounded KV-cache memory budget — never OOM the GPU.
Support variable prompt and output lengths; honor per-request stop conditions.	Fairness: no request starves behind a long-running neighbor.

Key insight

The waste in static batching is structural, not a tuning problem: as long as the whole batch must finish together, the slowest sequence dictates everyone's latency and the GPU's idle time. You have to change the granularity of scheduling, not just the batch size.

Static, dynamic, and continuous batching

There are three points on the spectrum, distinguished by when the batch is allowed to change:

Static batching — a fixed batch is formed once and runs to completion. Simplest to implement, but utilization collapses under variable output lengths.
Dynamic batching (server-side request batching, e.g. Triton) — the server waits a short window to gather arrivals into a batch, then runs that batch to completion. It amortizes the cost of forming batches, but each batch is still completion-locked, so the head-of-line problem just moves to the batch granularity. Great for fixed-shape models (vision, embeddings); poor for autoregressive LLMs.
Continuous / iteration-level / in-flight batching — introduced by Orca and popularized by vLLM and TGI. The scheduling unit is one forward iteration, not one request. After every step the scheduler retires finished sequences (freeing their KV cache) and admits waiting ones, so the batch composition is re-decided each iteration and no sequence ever waits for another to finish.

Dimension	Static	Dynamic	Continuous
Scheduling unit	One request	One request batch	One iteration (decode step)
When batch composition changes	Never (until full drain)	When the batch completes	Every iteration
GPU slot when a seq finishes early	Idle until batch drains	Idle until batch drains	Freed immediately, reused
New request wait time	Up to full batch runtime	Batch window + batch runtime	~ one iteration
Best for	Uniform, short outputs	Fixed-shape, non-autoregressive	Variable-length LLM generation

flowchart TD
    subgraph STATIC["Static batching"]
        A1["Batch of 4 starts together"] --> A2["Run whole batch to completion"]
        A2 --> A3["Seq R1 stops at token 20"]
        A2 --> A4["Seq R2 still running at token 800"]
        A3 --> A5["R1 slot idles for ~780 steps"]
        A4 --> A6["Only now: free batch, admit next 4"]
    end
    subgraph CONT["Continuous batching"]
        B1["Run one decode step"] --> B2["Retire finished seqs, free KV"]
        B2 --> B3["Admit waiting seqs that fit"]
        B3 --> B1
    end

The continuous loop has no "drain" state — the same three operations (step, retire, admit) repeat forever, and the batch is reshaped on every pass.

The engine loop: per-iteration scheduler

The serving engine is a tight while loop. Each pass has three phases: a scheduling phase that decides the running set, a single fused model forward, and a post-processing phase that samples the next token, appends it, checks stop conditions, streams output, and frees the KV blocks of anything that just finished.

Admission is gated by the KV-cache budget. A sequence can only join the batch if there are enough free KV blocks for its next step; a brand-new request additionally needs blocks to hold the KV for its entire prompt during prefill. vLLM tracks free blocks and admits greedily up to a max_num_seqs / max_num_batched_tokens limit. Requests therefore move through a small state machine:

WAITING — queued, no KV allocated yet.
RUNNING — in the batch, KV allocated, producing tokens.
PREEMPTED / SWAPPED — evicted under cache pressure; its KV is recomputed or swapped back later (see trade-offs).

The forward pass is "fused": in a single step the engine can run prefill for newly admitted requests and decode for in-flight ones together. Because each produced token is pushed to its client immediately (over SSE or a gRPC stream), TTFT for a request is essentially the latency of the iteration that completes its prefill.

sequenceDiagram
    participant Q as Waiting queue
    participant S as Scheduler
    participant K as KV cache
    participant G as GPU forward
    participant C as Client
    loop Every iteration
        S->>K: How many KV blocks are free
        K-->>S: Free block count
        S->>Q: Admit requests that fit the budget
        Q-->>S: New seqs join the running set
        S->>G: Run one fused prefill plus decode step
        G-->>S: One new token per running seq
        S->>C: Stream tokens out
        S->>S: Retire finished seqs and free their KV
    end

Everything the scheduler needs is cheap to track: the running set, the waiting queue, the count of free KV blocks, and each sequence's current length and sampling state. The whole decision is a greedy pass per iteration — which is exactly why scheduling overhead itself can become a bottleneck at high token rates (covered below).

Mixing prefill and decode

Prefill and decode have opposite compute profiles, and that is the central tension the scheduler must manage:

Prefill processes the whole prompt in parallel — large matrix multiplies, one big compute-bound step that is heavy on FLOPs.
Decode produces a single token — a tiny matrix multiply that is memory-bandwidth-bound, dominated by re-reading the weights and the KV cache.

Naively admitting a long prompt forces a huge prefill step that stalls every in-flight decode in the same iteration, producing a visible TPOT spike for every user in the batch. This is the prefill–decode interference problem.

Chunked prefill (Sarathi-Serve, also in vLLM) is the standard remedy: split a long prompt into fixed token-budget chunks and process one chunk per iteration, co-scheduled with ongoing decodes. Each step's compute stays bounded, so decode latency stays smooth while utilization climbs. A per-step token budget then lets you bias the engine: protect TPOT by prioritizing decodes, or protect TTFT by prioritizing prefills. At larger scale, some systems go further and disaggregate prefill and decode onto separate replicas entirely so the two phases never contend.

flowchart LR
    NR["New request, long prompt"] --> CP["Split prompt into prefill chunks"]
    CP --> SCH["Scheduler: token budget per step"]
    RUN["In-flight decodes"] --> SCH
    SCH --> STEP["One mixed step: many decodes plus one prefill chunk"]
    STEP --> TOK["Stream one token per decode"]
    STEP --> SCH

Cross-reference

Admission and eviction here are entirely driven by KV-cache availability — see LLM KV-Cache Management for PagedAttention, block allocation, prefix sharing, and offloading. Continuous batching is the scheduling core of the broader LLM Inference Serving Platform, which wraps it with routing, tensor parallelism, autoscaling, and SLAs.

Throughput vs latency knobs

Continuous batching gives you a small set of dials, and almost every one trades throughput against latency or fairness. Tuning is the art of staying on the good edge of that Pareto frontier for your workload.

Knob	Turn it up to gain	Cost / risk
Max batch size (`max_num_seqs`)	More concurrency, higher token throughput	More KV pressure and longer per-step time → worse TPOT; risks OOM and preemption
KV / token budget (`max_num_batched_tokens`)	Bigger batches and longer contexts per step	Set too high → OOM; too low → the GPU is left underfilled
Prefill chunk size	Larger chunks finish prefill faster → better TTFT	Larger steps stall decodes → worse, spikier TPOT
Scheduling policy (FCFS vs priority)	Target SLAs or premium tiers	Priority can starve low-priority work; pure FCFS lets one huge job hog the cache
Preemption policy (recompute vs swap)	Frees KV under pressure to keep admitting	Recompute wastes prior compute; swap burns PCIe bandwidth and adds latency
Admission aggressiveness	Packs more sequences in → throughput	Over-admitting causes preemption thrash and blows up tail latency

Watch for starvation

Greedy, throughput-first admission plus a few very long generations can starve short requests or trigger a preemption storm where sequences are evicted and re-admitted repeatedly. Bound the maximum tokens per request, add aging so waiting requests gain priority over time, and reserve headroom in the KV budget so the batch does not run permanently at the edge of OOM.

Bottlenecks & scaling

Once the iteration loop is in place, a predictable set of bottlenecks decides how far a single replica scales and when you must shard or add replicas.

Bottleneck	Why it bites	Mitigation
KV-cache memory	Caps concurrent sequences and context length; grows linearly with batch × sequence length	PagedAttention paging, quantized KV, prefix sharing, CPU/NVMe offload, preemption
Long prefills block decodes	One big prefill step spikes TPOT for the entire batch	Chunked prefill, per-step token budget, prefill/decode disaggregation
Scheduling overhead	Per-iteration CPU/Python scheduling can fail to keep the GPU fed at high token rates	Keep the scheduler lean, overlap scheduling with GPU compute, async output handling, CUDA graphs
Tail latency / fairness	Head-of-line blocking from large jobs and preemption churn	Priority with aging, bound max tokens, separate queues per SLA class
Single-replica ceiling	One GPU or replica eventually saturates memory bandwidth	Tensor / pipeline parallelism within a replica; autoscale and load-balance across replicas with a KV-aware router

Summary

Continuous batching turns the serving loop inside-out: it schedules per iteration, not per request, so a finished sequence frees its GPU slot and KV blocks immediately while new requests stream in. Paired with paged KV and chunked prefill, it lifts a single replica from under 40% to 80%+ utilization and multiplies throughput while holding TTFT and TPOT SLAs. The remaining engineering is all in the scheduler — balancing batch size, KV budget, and prefill chunking against latency and fairness.