AI / ML Infrastructure
Multi-Datacenter Model Serving with Latency SLAs
Users are everywhere; GPUs are not. A global inference service has to answer requests from every continent inside a tight p99 latency SLA, yet the accelerators that run a 70B model are scarce, expensive, and unevenly distributed — plentiful in one region, almost absent in another. The whole design is a negotiation between three forces: geography (speed of light sets a network floor), capacity (a region can run out of GPUs), and cost (you cannot replicate every big model into every region). Get routing, overflow, and failover right and you keep the SLA without paying to over-provision the planet.
Requirements
Scope is the serving and routing layer across regions: training, the model registry internals, and single-replica inference mechanics are covered elsewhere (see LLM inference platform and model registry). Here the job is to take an already-deployed model and answer a global stream of requests within an SLA while GPUs are scarce and lumpy across regions.
| Functional | Non-functional |
|---|---|
| Route to a model replica — resolve (model, version) to a healthy replica that has spare capacity, and proxy the (often streaming) response back to the caller. | p99 latency SLA — e.g. region-local p99 under 800 ms, end-to-end p99 under 1.2 s including cross-region hops. The SLA, not average load, sizes the system. |
| Multi-region deployment — the same model served from several regions (US, EU, APAC) so most users are handled close to home. | High availability — a full region outage must not drop global service; target 99.95%+ with no single-region dependency on the request path. |
| Failover — when a region is unhealthy or saturated, reroute its traffic to another region automatically, without manual paging. | GPU cost efficiency — keep accelerator utilization high; idle GPUs in a low-traffic region are the most expensive mistake in the design. |
| Affinity-aware routing — keep a session (or shared prompt prefix) pinned to a replica so the KV cache can be reused instead of recomputed. | Capacity in scarce regions — degrade gracefully (overflow, smaller model, queue) rather than fail when a region simply does not have enough GPUs. |
The one-line framing
This is not a sharding problem — every replica can answer any request. It is a placement and admission problem: which region, which replica, and what to do when the nearest one is full. Latency is mostly geography plus queueing; availability and cost come from how you spread and spill capacity.
Scale math
Concrete numbers keep the design honest. The table below is a plausible global LLM service; the exact figures matter less than the relationships — that one region holds most of the GPUs, and that a single cross-region hop can eat a large slice of the budget.
| Quantity | Estimate | Why it matters |
|---|---|---|
| Global peak QPS | ~50,000 req/s | Diurnal; the peak rotates around the globe by time zone. |
| Serving regions | 3 primary (US, EU, APAC) + edge PoPs | Few GPU regions; many edge points for TLS / routing. |
| Per-region GPU capacity | US ~600, EU ~300, APAC ~150 GPUs | Uneven — APAC is the constrained region. |
| Cross-region RTT | 50–150 ms | An overflow hop can cost more than local inference queueing. |
| Latency budget (p99 800 ms) | net ~40 + queue ~100 + inference ~600 + margin ~60 ms | Inference dominates; cross-region overflow blows the network slice. |
| Model footprint / region | 70B ~140 GB weights + tens of GB KV cache | Why you cannot replicate every large model into every region. |
Budget reading
With an 800 ms p99 SLA and ~600 ms of inference, only ~200 ms remains for network, queueing, and margin. A local request spends most of that on queueing; a 120 ms overflow hop to another region, on top of the ~40 ms of edge network time, leaves roughly 40 ms even if the whole margin is spent, so overflow is only viable while the inference itself is fast and the neighbor is lightly loaded.
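The budget arithmetic can be made explicit with a small sketch. The figures are the ones from the scale table; the class and method names (`LatencyBudget`, `queue_headroom_ms`, `can_overflow`) are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """Per-request p99 budget in milliseconds (figures from the scale table)."""
    sla_ms: float = 800.0
    network_ms: float = 40.0
    inference_ms: float = 600.0
    margin_ms: float = 60.0

    def queue_headroom_ms(self, overflow_rtt_ms: float = 0.0) -> float:
        """Time left for queueing after network, inference, margin and any overflow hop."""
        spent = self.network_ms + self.inference_ms + self.margin_ms + overflow_rtt_ms
        return self.sla_ms - spent

    def can_overflow(self, overflow_rtt_ms: float, neighbor_queue_ms: float) -> bool:
        """Overflow is viable only if the neighbor's expected queue wait fits what is left."""
        return neighbor_queue_ms <= self.queue_headroom_ms(overflow_rtt_ms)


budget = LatencyBudget()
print(budget.queue_headroom_ms())        # local request: 100.0
print(budget.queue_headroom_ms(50.0))    # 50 ms hop: 50.0
print(budget.queue_headroom_ms(120.0))   # 120 ms hop: -20.0 (fits only by spending margin)
```

With the margin held in reserve, a 120 ms hop already overshoots the budget by 20 ms; spending the full 60 ms margin recovers about 40 ms of queueing time, which is why the neighbor has to be nearly idle for overflow to pay off.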
Global routing
Routing is the heart of the system, and it happens in two tiers. A global tier gets the request to the right region; a regional tier picks the right replica and decides whether to serve, overflow, or shed. The goal at every hop is the same: the lowest-latency replica that still has capacity, without violating the SLA.
Tier 1 — getting to a region (GeoDNS / anycast + L7)
- GeoDNS / anycast — DNS resolves (or anycast BGP routes) the user to the network-nearest edge by geography and measured latency. This is coarse and cache-friendly but blind to live GPU load; it answers “which region is close”, not “which region has a free GPU”.
- L7 global load balancer — at the edge, an application-aware proxy makes the real decision. It knows each region's health and a recent capacity signal (queue depth, free KV blocks) and can steer to the nearest region with capacity, not merely the nearest region.
- Why two layers — GeoDNS handles the 95% common case cheaply but is slow to change, because resolvers cache answers for the DNS TTL; the L7 layer handles the fast-moving capacity decisions and overflow.
flowchart TD
U["Users (global)"] --> DNS["GeoDNS / Anycast"]
DNS --> GLB["Global L7 Load Balancer"]
GLB --> GUS["US Gateway"]
GLB --> GEU["EU Gateway"]
GLB --> GAP["APAC Gateway"]
GUS --> RUS["US GPU Replicas"]
GEU --> REU["EU GPU Replicas"]
GAP --> RAP["APAC GPU Replicas"]
GAP -.->|overflow| GUS
GEU -.->|overflow| GUS
RUS --> KUS["KV Cache US"]
REU --> KEU["KV Cache EU"]
RAP --> KAP["KV Cache APAC"]
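The "nearest region with capacity" decision made by the L7 tier could look roughly like the sketch below. It assumes each region reports health, queue depth and free KV blocks; the thresholds, field names and `pick_region` itself are hypothetical, not part of any real load balancer's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegionState:
    name: str
    rtt_ms: float          # measured RTT from this edge PoP to the region
    healthy: bool
    queue_depth: int       # recent capacity signal reported by the region
    free_kv_blocks: int


def pick_region(regions, max_queue_depth=32, min_free_kv_blocks=1) -> Optional[RegionState]:
    """Return the lowest-RTT region that is healthy and reports spare capacity."""
    candidates = [
        r for r in regions
        if r.healthy
        and r.queue_depth < max_queue_depth
        and r.free_kv_blocks >= min_free_kv_blocks
    ]
    # default=None signals "no region can take this": the caller queues or sheds.
    return min(candidates, key=lambda r: r.rtt_ms, default=None)


# An APAC user whose home region is saturated is steered to the next-closest region.
seen_from_apac_edge = [
    RegionState("APAC", rtt_ms=20, healthy=True, queue_depth=40, free_kv_blocks=0),
    RegionState("EU", rtt_ms=140, healthy=True, queue_depth=5, free_kv_blocks=200),
    RegionState("US", rtt_ms=110, healthy=True, queue_depth=10, free_kv_blocks=500),
]
print(pick_region(seen_from_apac_edge).name)   # US
```

Filtering first and minimizing RTT second keeps the common case local while letting a capacity signal override pure geography.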
Tier 2 — choosing a replica inside a region
Once inside a region, a stateless gateway runs the admission decision. Three ideas dominate, in priority order:
- Load-aware, not round-robin — pick the replica with the most headroom by a live signal: queue_depth, free_kv_blocks, and in-flight decode slots. Round-robin ignores that one replica may be stuck on a long generation while another is idle.
- Session / prefix affinity — route a multi-turn session, or requests sharing a long system prompt, to the same replica so its KV cache is reused instead of recomputed. This can remove most of the prefill cost — a first-class latency win, not a micro-optimization.
- Overflow / spillover — if no local replica can meet the SLA, spill to the next-nearest region only if the extra RTT still fits the budget; otherwise queue briefly or shed with backpressure. Overflow must be bounded so a busy region cannot tip its neighbor over (see failover).
flowchart TD
R["Request hits region gateway"] --> A{"Local capacity free?"}
A -->|yes| B{"Affinity key present?"}
B -->|hit| L["Sticky replica (reuse KV)"]
B -->|miss| LL["Least-loaded local replica"]
A -->|no| C{"Neighbor within budget?"}
C -->|yes| O["Overflow to neighbor region"]
C -->|no| D["Queue briefly, else shed (backpressure)"]
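The regional admission flow above can be sketched as a single function. This is a minimal illustration rather than a production gateway: `Replica`, `admit`, the `max_queue` threshold and the `rtt_budget_ms` parameter are all assumed names, and the affinity map is a plain in-memory dict.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    queue_depth: int
    free_kv_blocks: int

    def has_capacity(self, max_queue: int) -> bool:
        return self.queue_depth < max_queue and self.free_kv_blocks > 0


def admit(replicas, affinity_map, affinity_key, neighbor_rtt_ms, rtt_budget_ms, max_queue=8):
    """Return (action, target) following the regional admission flow."""
    free = [r for r in replicas if r.has_capacity(max_queue)]
    if free:
        sticky_name = affinity_map.get(affinity_key) if affinity_key else None
        for r in free:
            if r.name == sticky_name:
                return ("sticky", r.name)            # affinity hit: reuse the KV cache
        best = min(free, key=lambda r: (r.queue_depth, -r.free_kv_blocks))
        if affinity_key:
            affinity_map[affinity_key] = best.name   # remember for the next turn
        return ("least_loaded", best.name)
    # No local capacity: overflow only if the extra RTT still fits the budget.
    if neighbor_rtt_ms is not None and neighbor_rtt_ms <= rtt_budget_ms:
        return ("overflow", "neighbor")
    return ("queue_or_shed", None)


replicas = [Replica("r1", 7, 10), Replica("r2", 2, 50), Replica("r3", 8, 0)]
sessions = {"sess-a": "r1"}
print(admit(replicas, sessions, "sess-a", neighbor_rtt_ms=120, rtt_budget_ms=60))
# ('sticky', 'r1')
print(admit(replicas, sessions, "sess-b", neighbor_rtt_ms=120, rtt_budget_ms=60))
# ('least_loaded', 'r2')
```

Because the sticky replica is only honored while it still has capacity, a saturated sticky replica falls through to the least-loaded choice instead of blocking the request.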
Affinity vs balance is a real tension
Hard affinity maximizes KV reuse but creates hotspots; pure load-balancing maximizes utilization but recomputes prefixes. The staff answer is soft affinity: prefer the sticky replica, but break stickiness once its queue crosses a threshold, trading a little cache reuse for SLA protection.
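One way to get soft affinity without a shared session table is rendezvous (highest-random-weight) hashing over a prefix key. That technique is an assumption of this sketch, not something the design above prescribes, and hashing the first 256 characters is only a stand-in for hashing a token prefix.

```python
import hashlib


def prefix_key(prompt: str, prefix_chars: int = 256) -> str:
    """Hash the leading characters of a prompt as a stand-in for a token-prefix hash."""
    return hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).hexdigest()


def rank_replicas(key: str, replica_names):
    """Rendezvous hashing: a stable per-key preference order over replicas."""
    def score(name):
        return hashlib.sha256(f"{key}:{name}".encode("utf-8")).digest()
    return sorted(replica_names, key=score, reverse=True)


def soft_affinity_pick(key, queue_depths, break_threshold=8):
    """Prefer the key's sticky replica; fall through the ranking once it is too busy."""
    for name in rank_replicas(key, list(queue_depths)):
        if queue_depths[name] < break_threshold:
            return name
    # Every replica is over the threshold: fall back to the least-loaded one.
    return min(queue_depths, key=queue_depths.get)


key = prefix_key("You are a helpful assistant. " * 20)
depths = {"r1": 1, "r2": 3, "r3": 0}
sticky = soft_affinity_pick(key, depths)

depths[sticky] = 20                        # the sticky replica becomes a hotspot
fallback = soft_affinity_pick(key, depths)
print(sticky != fallback)                  # True: stickiness broke to protect the SLA
```

Every gateway computes the same ranking for the same prefix, so the second choice is also stable, and requests that break stickiness tend to land together and warm one alternate cache rather than scattering.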
Capacity & autoscaling across regions
Because GPUs are scarce and unevenly placed, capacity planning is a global optimization, not a per-region one. Three levers interact: where models live, how each region scales, and what happens at the edge of capacity.
- Follow-the-sun capacity — peak load rotates with the work day. Rather than provision every region for its local peak, shift a shared GPU pool (or autoscale aggressively) toward whichever region is currently busy, and let overflow cover the seams.
- Autoscaling on the right signal — scale on queue depth and KV-cache pressure, not CPU. But GPU scale-up is slow (cold starts: pull ~140 GB of weights, warm the runtime), so keep a warm buffer and scale early.
- Queue vs shed vs degrade — at saturation you may queue (bounded, only if the SLA still allows), shed lowest-priority traffic, or degrade to a cheaper model. Never queue unbounded — that converts a capacity problem into a latency-SLA breach.
- Prioritization — protect paying / interactive traffic with priority classes; batch and best-effort work yields first when GPUs are tight.
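The "never queue unbounded" and "best-effort yields first" rules can be combined in a bounded priority queue. The sketch below is illustrative only: `BoundedPriorityQueue` and its methods are hypothetical names, and a lower priority number is assumed to mean more important traffic.

```python
import heapq
import itertools


class BoundedPriorityQueue:
    """Admission queue with a hard size cap; lower priority number = more important."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._heap = []                      # entries: (priority, seq, request_id)
        self._seq = itertools.count()        # FIFO tie-break within a priority class

    def offer(self, request_id: str, priority: int):
        """Enqueue; return the request_id that was shed, or None if nothing was shed."""
        entry = (priority, next(self._seq), request_id)
        if len(self._heap) < self.max_size:
            heapq.heappush(self._heap, entry)
            return None
        worst = max(self._heap)              # least important, most recently queued
        if entry >= worst:
            return request_id                # newcomer is the least important: shed it
        self._heap.remove(worst)
        heapq.heapify(self._heap)            # restore the heap invariant after removal
        heapq.heappush(self._heap, entry)
        return worst[2]

    def take(self):
        """Dequeue the most important, oldest request, or None if empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None


q = BoundedPriorityQueue(max_size=2)
q.offer("batch-1", priority=2)
q.offer("chat-1", priority=0)
print(q.offer("chat-2", priority=0))    # batch-1 (best-effort work yields first)
print(q.offer("batch-2", priority=2))   # batch-2 (queue is full of interactive traffic)
```

The cap converts overload into an explicit shed decision at admission time, where the caller can retry elsewhere or degrade to a cheaper model, instead of a silent wait that breaches the SLA.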
Placement strategy — the core trade-off
Replicating a 140 GB model into every region buys the best latency but is the most expensive option, and it strands GPUs in low-traffic regions. The realistic designs sit between full replication and a single central pool.
| Strategy | Latency | GPU cost | Use when |
|---|---|---|---|
| Replicate everywhere | Best (always local) | Highest; idle GPUs in quiet regions | A few hot models with steady global demand. |
| Hub & spoke | Good locally, overflow pays RTT | Moderate; big models in 1–2 hubs | Large models with regional demand peaks. |
| Follow-the-sun | Good if pool tracks demand | Lowest for a global footprint | Strongly diurnal, time-zone-separated traffic. |
| Tiered models | Local for small, hop for large | Efficient; small model everywhere | Most traffic served by a small local model; large model central. |
Cost vs latency in one sentence
Every region you replicate a big model into is GPUs you must pay for at that region's peak, not its average — so replicate the small, hot, latency-critical models widely and keep the large, expensive ones in a few hubs reached by bounded overflow.
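The "peak, not average" point can be illustrated with a toy calculation. The demand series below is made up purely for illustration (GPUs needed per region in three time slots), and the shared-pool figure is an idealization that assumes capacity can follow demand freely.

```python
def gpus_replicate_everywhere(demand_by_region):
    """Each region is sized for its own peak: the sum of per-region maxima."""
    return sum(max(series) for series in demand_by_region.values())


def gpus_shared_pool(demand_by_region):
    """A pool that moves with demand is sized for the peak of the global sum."""
    totals = [sum(slot) for slot in zip(*demand_by_region.values())]
    return max(totals)


# Hypothetical GPUs needed per time slot; each region peaks at a different time.
demand = {
    "US":   [100, 400, 200],
    "EU":   [300, 150, 100],
    "APAC": [120,  80, 300],
}
print(gpus_replicate_everywhere(demand))   # 1000
print(gpus_shared_pool(demand))            # 630
```

The gap between the two numbers is the capacity that sits idle when every region is provisioned for a peak it only hits for part of the day; real systems land somewhere in between because GPUs and weights cannot move instantly.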
Model & version consistency
Multiple regions serving “the same model” must agree on which version they run, or identical prompts return different answers depending on geography. Rollout is therefore a distributed, staged change — never a simultaneous global flip.
- Artifact distribution — the model registry is the source of truth: immutable, content-addressed weights pushed to regional object stores / CDN (or a P2P fan-out) before any region is told to switch. Weights are pre-staged; the cutover is just a config change.
- Canary, region by region — promote one region (often the smallest, e.g. APAC) to the new version, watch SLO and quality metrics, then expand. A bad version is contained to one region, not the planet.
- Consistent routing during rollout — while vN and vN+1 coexist, pin a session to one version so a user does not flip mid-conversation, and make overflow version-aware so a spilled request lands on a compatible replica.
- Config replication — routing tables, model aliases, and feature flags propagate through a versioned control plane with bounded staleness, so “gpt-x = vN+1” becomes true everywhere within a known window.
flowchart LR
REG["Model Registry vN+1"] --> DIST["Artifact Distribution (CDN / P2P)"]
DIST --> S1["Canary: one region"]
S1 --> CHK{"SLO and quality OK?"}
CHK -->|yes| S2["Expand to more regions"]
S2 --> S3["Full fleet on vN+1"]
CHK -->|no| RB["Rollback to vN"]
Pre-stage, then flip
The expensive, slow step is moving 140 GB of weights to every region; the risky step is the cutover. Decouple them: distribute artifacts ahead of time so promotion is an instant, reversible pointer change — that is what makes per-region canary and fast rollback possible.
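A minimal sketch of the pre-stage-then-flip idea, assuming an in-memory table stands in for the versioned control plane. `RegionAliasTable` and its methods are hypothetical; a real system would persist and replicate this state.

```python
class RegionAliasTable:
    """Per-region alias -> version pointer, guarded by a pre-staging check."""

    def __init__(self, regions, alias, initial_version):
        self.alias = alias
        self.active = {r: initial_version for r in regions}
        self.previous = {r: None for r in regions}
        self.staged = {r: {initial_version} for r in regions}

    def stage(self, region, version):
        """Record that the weights for `version` are already present in `region`."""
        self.staged[region].add(version)

    def promote(self, region, version):
        """Flip the pointer; refuse if the artifact was not pre-staged."""
        if version not in self.staged[region]:
            raise ValueError(f"{version} not staged in {region}")
        self.previous[region] = self.active[region]
        self.active[region] = version

    def rollback(self, region):
        """Point the region back at whatever it served before the last promote."""
        if self.previous[region] is None:
            raise ValueError(f"nothing to roll back in {region}")
        self.active[region], self.previous[region] = self.previous[region], None

    def overflow_targets(self, session_version, candidates):
        """Version-aware overflow: only regions serving the session's pinned version."""
        return [r for r in candidates if self.active[r] == session_version]


table = RegionAliasTable(["US", "EU", "APAC"], alias="gpt-x", initial_version="vN")
for region in ("US", "EU", "APAC"):
    table.stage(region, "vN+1")            # slow step: done ahead of time everywhere
table.promote("APAC", "vN+1")              # fast step: canary one region
print(table.overflow_targets("vN", ["US", "EU", "APAC"]))   # ['US', 'EU']
table.rollback("APAC")                     # instant, because vN never left the region
print(table.active["APAC"])                # vN
```

Promotion and rollback touch only a pointer, so the canary region can be reverted in the time it takes the config to propagate rather than the time it takes to move weights.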
Failover & resilience
A region can fail outright (network partition, power, control-plane outage) or fail softly (GPUs saturated, latency climbing). The system must detect both, drain the bad region, and reroute — without tipping the rescuer into the same overload.
- Health checks & draining — continuous probes (liveness + a real inference canary) feed the global LB. An unhealthy region is drained: new traffic stops, in-flight requests finish, and the LB updates within seconds, not DNS-TTL minutes.
- Reroute to nearest healthy region — failed traffic shifts to the next-closest region with capacity. The extra RTT is now unavoidable, so the SLA may move to a documented degraded tier during the incident.
- Graceful degradation — if the rescuing region lacks GPUs for the full model, fall back to a smaller / quantized model, shorter max-tokens, or a cached answer. A correct, slightly weaker response beats a timeout.
- Avoid cascading overload — spillover is bounded by a rate limit and load shedding at the rescuer, plus circuit breakers so a sick region is not hammered by retries. Without caps, failover turns one region's outage into a global brownout.
sequenceDiagram
participant U as User (APAC)
participant G as Global LB
participant A as APAC Region
participant N as Neighbor (US)
participant H as Health Checker
H->>A: Probe healthz + canary
A-->>H: Timeout (unhealthy)
H->>G: Mark APAC drained
U->>G: Inference request
G->>N: Reroute to nearest healthy
N->>N: Admit, maybe smaller model
N-->>U: Response within degraded SLA
Note over G,N: Bounded spillover + circuit breaker prevent cascade
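Bounded spillover and circuit breaking can each be sketched in a few lines. This is an illustrative sketch under assumed parameters (rate, burst, failure threshold, cooldown), with an injectable clock so the behavior can be checked deterministically; it is not a substitute for a hardened library.

```python
import time


class TokenBucket:
    """Caps spillover admitted by the rescuer region to `rate` requests per second."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                         # over the cap: shed instead of cascading


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows trials after `cooldown_s`."""

    def __init__(self, threshold, cooldown_s, clock=time.monotonic):
        self.threshold, self.cooldown_s, self.clock = threshold, cooldown_s, clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True                      # closed: traffic flows normally
        return self.clock() - self.opened_at >= self.cooldown_s   # half-open trial

    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()


now = [0.0]                                  # fake clock for a deterministic demo
spill = TokenBucket(rate=2, burst=2, clock=lambda: now[0])
print([spill.allow() for _ in range(3)])     # [True, True, False]
```

The token bucket protects the rescuer from the failed region's traffic, while the breaker protects the sick region from retries; both are needed to keep one outage from becoming a global brownout.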
Failover is a capacity problem in disguise
Rerouting only works if somewhere has spare GPUs. Either keep N+1 regional headroom (expensive) or accept explicit degradation during a regional loss. Pick one on purpose and write it into the SLA — silent overload is the worst of both.
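The headroom question can be quantified with the GPU counts from the scale table. The sketch assumes GPUs are interchangeable and load can be redistributed freely, which ignores latency and model placement, so treat the result as an upper bound; the function names are hypothetical.

```python
REGION_GPUS = {"US": 600, "EU": 300, "APAC": 150}   # from the scale table


def surviving_fraction(region_gpus, lost_region):
    """Fraction of total GPU capacity left after losing one region."""
    total = sum(region_gpus.values())
    return (total - region_gpus[lost_region]) / total


def max_safe_utilization(region_gpus):
    """Highest fleet-wide utilization at which any single region loss is absorbable."""
    return min(surviving_fraction(region_gpus, r) for r in region_gpus)


for region in REGION_GPUS:
    print(region, round(surviving_fraction(REGION_GPUS, region), 3))
# US 0.429
# EU 0.714
# APAC 0.857
print(round(max_safe_utilization(REGION_GPUS), 3))   # 0.429
```

Because the US holds more than half the fleet, surviving its loss without degradation would mean running the whole fleet below roughly 43% utilization, which is why explicit degradation is usually the more realistic choice for the largest region.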
Bottlenecks & scaling
The recurring failure modes all trace back to the same root: capacity is finite and geography is fixed. Naming them — with the mitigation — is the staff-level move.
| Bottleneck | Symptom | Mitigation |
|---|---|---|
| Cross-region latency | Overflowed requests breach the SLA from RTT alone. | Route nearest-with-capacity; only overflow within budget; place replicas closer to demand. |
| GPU scarcity & imbalance | One region (APAC) saturates while another idles. | Follow-the-sun pooling, autoscale on queue/KV pressure, tiered models. |
| Overflow overload | Spillover tips the neighbor; cascade / brownout. | Bounded spillover, load shedding, circuit breakers, priority classes. |
| Replication cost | Idle GPUs holding big models in quiet regions. | Hub & spoke for large models; replicate only small, hot ones widely. |
| Cold starts | Scale-up too slow (140 GB load) to catch a spike. | Warm buffer, pre-staged weights, scale early on leading indicators. |
| Version skew | Same prompt differs by region mid-rollout. | Pre-stage artifacts, per-region canary, version-pinned sessions & overflow. |
Staff-level summary
Multi-DC model serving is a placement and admission problem wrapped around scarce GPUs. GeoDNS/anycast plus an L7 load balancer get a request to the nearest region with capacity; load-aware, KV-affinity routing picks the replica; bounded overflow covers the seams without cascading. Capacity is tamed by follow-the-sun pooling and tiered placement rather than replicating every big model everywhere; pre-staged artifacts with per-region canary keep versions consistent; and drain-and-reroute failover with graceful degradation survives a region loss. The single sentence to say out loud: route to the closest GPU that can still meet the SLA, spill only within the latency budget, and degrade on purpose rather than overload by accident.