System Design Notes

AI / ML Infrastructure

Multi-Datacenter Model Serving with Latency SLAs

Users are everywhere; GPUs are not. A global inference service has to answer requests from every continent inside a tight p99 latency SLA, yet the accelerators that run a 70B model are scarce, expensive, and unevenly distributed — plentiful in one region, almost absent in another. The whole design is a negotiation between three forces: geography (speed of light sets a network floor), capacity (a region can run out of GPUs), and cost (you cannot replicate every big model into every region). Get routing, overflow, and failover right and you keep the SLA without paying to over-provision the planet.

Requirements

Scope is the serving and routing layer across regions: training, the model registry internals, and single-replica inference mechanics are covered elsewhere (see LLM inference platform and model registry). Here the job is to take an already-deployed model and answer a global stream of requests within an SLA while GPUs are scarce and lumpy across regions.

| Functional | Non-functional |
| --- | --- |
| Route to a model replica: resolve (model, version) to a healthy replica that has spare capacity, and proxy the (often streaming) response back to the caller. | p99 latency SLA: e.g. region-local p99 under 800 ms, end-to-end p99 under 1.2 s including cross-region hops. The SLA, not average load, sizes the system. |
| Multi-region deployment: the same model served from several regions (US, EU, APAC) so most users are handled close to home. | High availability: a full region outage must not drop global service; target 99.95%+ with no single-region dependency on the request path. |
| Failover: when a region is unhealthy or saturated, reroute its traffic to another region automatically, without manual paging. | GPU cost efficiency: keep accelerator utilization high; idle GPUs in a low-traffic region are the most expensive mistake in the design. |
| Affinity-aware routing: keep a session (or shared prompt prefix) pinned to a replica so the KV cache can be reused instead of recomputed. | Capacity in scarce regions: degrade gracefully (overflow, smaller model, queue) rather than fail when a region simply does not have enough GPUs. |

The one-line framing

This is not a sharding problem — every replica can answer any request. It is a placement and admission problem: which region, which replica, and what to do when the nearest one is full. Latency is mostly geography plus queueing; availability and cost come from how you spread and spill capacity.

Scale math

Concrete numbers keep the design honest. The table below is a plausible global LLM service; the exact figures matter less than the relationships — that one region holds most of the GPUs, and that a single cross-region hop can eat a large slice of the budget.

| Quantity | Estimate | Why it matters |
| --- | --- | --- |
| Global peak QPS | ~50,000 req/s | Diurnal; the peak rotates around the globe by time zone. |
| Serving regions | 3 primary (US, EU, APAC) + edge PoPs | Few GPU regions; many edge points for TLS / routing. |
| Per-region GPU capacity | US ~600, EU ~300, APAC ~150 GPUs | Uneven; APAC is the constrained region. |
| Cross-region RTT | 50–150 ms | An overflow hop can cost more than local inference queueing. |
| Latency budget (p99 800 ms) | net ~40 + queue ~100 + inference ~600 + margin ~60 ms | Inference dominates; cross-region overflow blows the network slice. |
| Model footprint / region | 70B: ~140 GB weights + tens of GB KV cache | Why you cannot replicate every large model into every region. |

Budget reading

With an 800 ms p99 SLA and ~600 ms of inference, only ~200 ms remains for network plus queue. A local request spends most of that on queueing; a 120 ms overflow hop to another region leaves almost nothing for queueing, so overflow is only viable while the inference itself is fast and the neighbor is lightly loaded.
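The budget reading is plain arithmetic, so a minimal sketch makes it checkable. The function name `queue_budget_ms` and its defaults are illustrative; the figures come from the scale table (800 ms SLA, ~600 ms inference, ~60 ms margin, ~40 ms local network) and are estimates, not measurements.

```python
def queue_budget_ms(sla_ms=800, inference_ms=600, margin_ms=60, network_ms=40):
    """Time left for queueing after inference, safety margin, and network."""
    return sla_ms - inference_ms - margin_ms - network_ms


# Local request: only the ~40 ms local network slice is spent on the wire.
local = queue_budget_ms()

# Overflow request. Assumption: the 120 ms cross-region hop replaces the
# local network slice; stacking it on top of the ~40 ms would push the
# queue budget below zero.
overflow = queue_budget_ms(network_ms=120)

print(f"local queue budget:    {local} ms")     # 100 ms
print(f"overflow queue budget: {overflow} ms")  # 20 ms
```

Under these assumptions an overflowed request can tolerate roughly a fifth of the queueing a local one can, which is why the neighbor has to be lightly loaded for overflow to help.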

Global routing

Routing is the heart of the system, and it happens in two tiers. A global tier gets the request to the right region; a regional tier picks the right replica and decides whether to serve, overflow, or shed. The goal at every hop is the same: the lowest-latency replica that still has capacity, without violating the SLA.

Tier 1 — getting to a region (GeoDNS / anycast + L7)

```mermaid
flowchart TD
    U["Users (global)"] --> DNS["GeoDNS / Anycast"]
    DNS --> GLB["Global L7 Load Balancer"]
    GLB --> GUS["US Gateway"]
    GLB --> GEU["EU Gateway"]
    GLB --> GAP["APAC Gateway"]
    GUS --> RUS["US GPU Replicas"]
    GEU --> REU["EU GPU Replicas"]
    GAP --> RAP["APAC GPU Replicas"]
    GAP -.->|overflow| GUS
    GEU -.->|overflow| GUS
    RUS --> KUS["KV Cache US"]
    REU --> KEU["KV Cache EU"]
    RAP --> KAP["KV Cache APAC"]
```
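One way to read the Tier 1 goal ("nearest region that still has capacity") is as a filter followed by a minimum. The sketch below is an assumption about how a global load balancer might express that; the `Region` fields, the RTT values, and `pick_region` are hypothetical names, not part of any specific product.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    rtt_ms: float     # RTT from the client's edge PoP (assumed to be measured elsewhere)
    healthy: bool     # result of the most recent health probe
    free_slots: int   # spare admission capacity reported by the region gateway


def pick_region(regions):
    """Return the lowest-RTT region that is healthy and has capacity, else None."""
    candidates = [r for r in regions if r.healthy and r.free_slots > 0]
    return min(candidates, key=lambda r: r.rtt_ms, default=None)


# Illustrative snapshot for a user in APAC: the local region is full and
# EU is failing health checks, so the request lands in the US.
regions = [
    Region("apac", rtt_ms=20, healthy=True, free_slots=0),
    Region("eu", rtt_ms=110, healthy=False, free_slots=80),
    Region("us", rtt_ms=140, healthy=True, free_slots=50),
]
chosen = pick_region(regions)
print(chosen.name)  # us
```

Returning `None` when no region qualifies leaves the queue-or-shed decision to the caller instead of hiding it inside the router.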
      

Tier 2 — choosing a replica inside a region

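Once inside a region, a stateless gateway runs the admission decision. Three ideas dominate, in priority order: serve locally whenever the region has free capacity, prefer the replica that already holds the session's KV cache, and overflow to a neighbor only when its RTT fits the remaining latency budget. If none of these applies, the gateway queues briefly and then sheds.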

```mermaid
flowchart TD
    R["Request hits region gateway"] --> A{"Local capacity free?"}
    A -->|yes| B{"Affinity key present?"}
    B -->|hit| L["Sticky replica (reuse KV)"]
    B -->|miss| LL["Least-loaded local replica"]
    A -->|no| C{"Neighbor within budget?"}
    C -->|yes| O["Overflow to neighbor region"]
    C -->|no| D["Queue briefly, else shed (backpressure)"]
```
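The admission flowchart translates almost line for line into a small pure function. This is a sketch under simplifying assumptions: the inputs (`local_free`, `affinity_hit`, `neighbor_rtt_ms`, `remaining_budget_ms`) are hypothetical signals that a real gateway would derive from replica load reports and request deadlines.

```python
from enum import Enum


class Decision(Enum):
    STICKY = "sticky replica (reuse KV)"
    LEAST_LOADED = "least-loaded local replica"
    OVERFLOW = "overflow to neighbor region"
    QUEUE_OR_SHED = "queue briefly, else shed"


def admit(local_free, affinity_hit, neighbor_rtt_ms, remaining_budget_ms):
    """Mirror the flowchart: local capacity, then affinity, then bounded overflow.

    neighbor_rtt_ms is None when no neighbor region is available at all.
    """
    if local_free:
        return Decision.STICKY if affinity_hit else Decision.LEAST_LOADED
    if neighbor_rtt_ms is not None and neighbor_rtt_ms <= remaining_budget_ms:
        return Decision.OVERFLOW
    return Decision.QUEUE_OR_SHED


# The region is full, and a neighbor 120 ms away only fits a 150 ms budget.
print(admit(False, False, 120, 150).value)  # overflow to neighbor region
print(admit(False, False, 120, 80).value)   # queue briefly, else shed
```

Keeping the decision free of side effects makes it easy to replay against recorded traffic before changing any threshold.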
      

Affinity vs balance is a real tension

Hard affinity maximizes KV reuse but creates hotspots; pure load-balancing maximizes utilization but recomputes prefixes. The staff answer is soft affinity: prefer the sticky replica, but break stickiness once its queue crosses a threshold, trading a little cache reuse for SLA protection.
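Soft affinity can be sketched as "hash to a preferred replica, fall back to least-loaded past a queue threshold". The names below (`pick_replica`, `max_sticky_queue`) and the threshold of 8 are illustrative assumptions. The sketch uses simple modulo hashing for brevity; a production router would more likely use consistent or rendezvous hashing so that adding a replica does not remap most keys.

```python
import hashlib


def pick_replica(affinity_key, queue_depths, max_sticky_queue=8):
    """Pick a replica for a session or prompt-prefix key.

    queue_depths maps replica name -> current queue depth.
    """
    names = sorted(queue_depths)
    digest = hashlib.sha256(affinity_key.encode()).digest()
    sticky = names[int.from_bytes(digest[:8], "big") % len(names)]

    # Prefer the sticky replica while its queue is short enough...
    if queue_depths[sticky] <= max_sticky_queue:
        return sticky
    # ...otherwise give up the KV-cache hit to protect the SLA.
    return min(names, key=lambda n: queue_depths[n])


depths = {"replica-0": 2, "replica-1": 3, "replica-2": 1}
preferred = pick_replica("session-42", depths)

# Same key, but the preferred replica is now a hotspot: stickiness breaks.
hot = dict(depths)
hot[preferred] = 50
fallback = pick_replica("session-42", hot)
print(preferred, fallback)
```

The threshold is the knob that trades cache reuse for tail latency: raising it keeps more sessions sticky, lowering it spreads load sooner.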

Capacity & autoscaling across regions

Because GPUs are scarce and unevenly placed, capacity planning is a global optimization, not a per-region one. Three levers interact: where models live, how each region scales, and what happens at the edge of capacity.

Placement strategy — the core trade-off

Replicating a 140 GB model into every region buys the best latency but is the most expensive option, and it strands GPUs in low-traffic regions. The realistic designs sit between full replication and a single central pool.

| Strategy | Latency | GPU cost | Use when |
| --- | --- | --- | --- |
| Replicate everywhere | Best (always local) | Highest; idle GPUs in quiet regions | A few hot models with steady global demand. |
| Hub & spoke | Good locally, overflow pays RTT | Moderate; big models in 1–2 hubs | Large models with regional demand peaks. |
| Follow-the-sun | Good if pool tracks demand | Lowest for a global footprint | Strongly diurnal, time-zone-separated traffic. |
| Tiered models | Local for small, hop for large | Efficient; small model everywhere | Most traffic served by a small local model; large model central. |

Cost vs latency in one sentence

Every region you replicate a big model into is GPUs you must pay for at that region's peak, not its average — so replicate the small, hot, latency-critical models widely and keep the large, expensive ones in a few hubs reached by bounded overflow.
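The "peak, not average" point can be made concrete by comparing the sum of regional peaks with the peak of the global sum. The demand curves below are invented for illustration; only their peaks (600, 300, 150) are chosen to match the per-region capacities in the scale table. The shared-pool figure is an idealized lower bound that ignores the latency cost of serving remote users.

```python
# Hypothetical GPU demand per region across four 6-hour blocks of a day.
demand = {
    "us":   [100, 200, 600, 300],
    "eu":   [150, 300, 200, 100],
    "apac": [150,  50,  50, 100],
}

# Replicate everywhere: each region must be sized for its own peak.
sum_of_peaks = sum(max(series) for series in demand.values())

# Idealized follow-the-sun pool: sized for the busiest moment globally.
peak_of_sum = max(sum(block) for block in zip(*demand.values()))

print(f"sized per region: {sum_of_peaks} GPUs")   # 1050
print(f"shared pool:      {peak_of_sum} GPUs")    # 850
print(f"difference:       {sum_of_peaks - peak_of_sum} GPUs")  # 200
```

The gap widens as regional peaks move further apart in time and shrinks to zero when every region peaks at once.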

Model & version consistency

Multiple regions serving “the same model” must agree on which version they run, or identical prompts return different answers depending on geography. Rollout is therefore a distributed, staged change — never a simultaneous global flip.

```mermaid
flowchart LR
    REG["Model Registry vN+1"] --> DIST["Artifact Distribution (CDN / P2P)"]
    DIST --> S1["Canary: one region"]
    S1 --> CHK{"SLO and quality OK?"}
    CHK -->|yes| S2["Expand to more regions"]
    S2 --> S3["Full fleet on vN+1"]
    CHK -->|no| RB["Rollback to vN"]
```
      

Pre-stage, then flip

The expensive, slow step is moving 140 GB of weights to every region; the risky step is the cutover. Decouple them: distribute artifacts ahead of time so promotion is an instant, reversible pointer change — that is what makes per-region canary and fast rollback possible.
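A minimal sketch of the "pre-stage, then flip" split, assuming each region keeps a small pointer record per model. The class `RegionModelPointer` and its methods are hypothetical; in a real system `stage` would wrap the slow weight transfer, while `promote` and `rollback` stay cheap metadata updates.

```python
class RegionModelPointer:
    """Tracks which staged version a region is actively serving."""

    def __init__(self, active):
        self.staged = {active}   # versions whose weights are already local
        self.active = active
        self.previous = None

    def stage(self, version):
        """Slow step (stand-in): weights would be copied into the region here."""
        self.staged.add(version)

    def promote(self, version):
        """Fast step: flip the pointer, but only to a version already staged."""
        if version not in self.staged:
            raise ValueError(f"{version} is not staged in this region")
        self.previous, self.active = self.active, version

    def rollback(self):
        """Fast step: return to the version served before the last promote."""
        if self.previous is None:
            raise RuntimeError("nothing to roll back to")
        self.active, self.previous = self.previous, None


canary = RegionModelPointer(active="vN")
canary.stage("vN+1")      # done ahead of time, off the critical path
canary.promote("vN+1")    # cutover is a pointer change
canary.rollback()         # and so is the undo
print(canary.active)      # vN
```

Refusing to promote an unstaged version is what keeps the cutover instant: a promote never has to wait on a transfer.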

Failover & resilience

A region can fail outright (network partition, power, control-plane outage) or fail softly (GPUs saturated, latency climbing). The system must detect both, drain the bad region, and reroute — without tipping the rescuer into the same overload.

```mermaid
sequenceDiagram
    participant U as User (APAC)
    participant G as Global LB
    participant A as APAC Region
    participant N as Neighbor (US)
    participant H as Health Checker
    H->>A: Probe healthz + canary
    A-->>H: Timeout (unhealthy)
    H->>G: Mark APAC drained
    U->>G: Inference request
    G->>N: Reroute to nearest healthy
    N->>N: Admit, maybe smaller model
    N-->>U: Response within degraded SLA
    Note over G,N: Bounded spillover + circuit breaker prevent cascade
```
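The "bounded spillover + circuit breaker" note can be sketched as a guard the sending region consults before forwarding overflow. `SpilloverGuard` and its parameters are hypothetical, and the breaker is simplified: it omits the timed half-open state a production breaker would normally have and relies on an explicit `reset` instead.

```python
class SpilloverGuard:
    """Caps in-flight overflow and opens a breaker after repeated failures."""

    def __init__(self, max_inflight, trip_after):
        self.max_inflight = max_inflight   # bound on concurrent overflow requests
        self.trip_after = trip_after       # consecutive failures before tripping
        self.inflight = 0
        self.consecutive_failures = 0
        self.open = False                  # open breaker = stop sending overflow

    def try_acquire(self):
        """Return True if one more request may be sent to the neighbor."""
        if self.open or self.inflight >= self.max_inflight:
            return False
        self.inflight += 1
        return True

    def release(self, ok):
        """Record the outcome of an overflow request that was acquired earlier."""
        self.inflight -= 1
        if ok:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.trip_after:
            self.open = True

    def reset(self):
        """Close the breaker, e.g. after the neighbor reports healthy again."""
        self.consecutive_failures = 0
        self.open = False


guard = SpilloverGuard(max_inflight=2, trip_after=2)
print(guard.try_acquire(), guard.try_acquire(), guard.try_acquire())  # True True False
```

Requests refused by the guard fall back to the local queue-or-shed path, so the sender absorbs the overload rather than the neighbor.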
      

Failover is a capacity problem in disguise

Rerouting only works if somewhere has spare GPUs. Either keep N+1 regional headroom (expensive) or accept explicit degradation during a regional loss. Pick one on purpose and write it into the SLA — silent overload is the worst of both.
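Whether a fleet actually has N+1 regional headroom is a one-line check per region. The capacities below come from the scale table; the peak-demand figures are hypothetical, and the check is deliberately pessimistic in assuming all regional peaks coincide. It also ignores the latency penalty of serving a lost region's users remotely.

```python
capacity = {"us": 600, "eu": 300, "apac": 150}      # GPUs, from the scale table
peak_demand = {"us": 450, "eu": 220, "apac": 140}   # GPUs needed at peak (hypothetical)


def survives_loss(lost, capacity, demand):
    """True if the remaining regions can carry total demand at full quality."""
    remaining = sum(gpus for region, gpus in capacity.items() if region != lost)
    return remaining >= sum(demand.values())


for region in capacity:
    verdict = "full service" if survives_loss(region, capacity, peak_demand) else "must degrade"
    print(f"lose {region}: {verdict}")
# lose us: must degrade
# lose eu: must degrade
# lose apac: full service
```

With these numbers only the loss of the smallest region is survivable at full quality, so the SLA would need to state explicit degradation for a US or EU outage.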

Bottlenecks & scaling

The recurring failure modes all trace back to the same root: capacity is finite and geography is fixed. Naming them — with the mitigation — is the staff-level move.

| Bottleneck | Symptom | Mitigation |
| --- | --- | --- |
| Cross-region latency | Overflowed requests breach the SLA from RTT alone. | Route nearest-with-capacity; only overflow within budget; place replicas closer to demand. |
| GPU scarcity & imbalance | One region (APAC) saturates while another idles. | Follow-the-sun pooling, autoscale on queue/KV pressure, tiered models. |
| Overflow overload | Spillover tips the neighbor; cascade / brownout. | Bounded spillover, load shedding, circuit breakers, priority classes. |
| Replication cost | Idle GPUs holding big models in quiet regions. | Hub & spoke for large models; replicate only small, hot ones widely. |
| Cold starts | Scale-up too slow (140 GB load) to catch a spike. | Warm buffer, pre-staged weights, scale early on leading indicators. |
| Version skew | Same prompt differs by region mid-rollout. | Pre-stage artifacts, per-region canary, version-pinned sessions & overflow. |

Staff-level summary

Multi-DC model serving is a placement and admission problem wrapped around scarce GPUs. GeoDNS/anycast plus an L7 load balancer get a request to the nearest region with capacity; load-aware, KV-affinity routing picks the replica; bounded overflow covers the seams without cascading. Capacity is tamed by follow-the-sun pooling and tiered placement rather than replicating every big model everywhere; pre-staged artifacts with per-region canary keep versions consistent; and drain-and-reroute failover with graceful degradation survives a region loss. The single sentence to say out loud: route to the closest GPU that can still meet the SLA, spill only within the latency budget, and degrade on purpose rather than overload by accident.