Disaggregated GPU Memory

GPU HBM is the scarcest, most expensive resource in the rack: a few hundred gigabytes per node, soldered to the package, and almost always the thing that OOMs first. Models, optimizer states, and KV-caches routinely exceed what one GPU holds, while across a fleet memory is stranded — idle on half the nodes while the other half is pinned at capacity. Disaggregated memory answers this by treating capacity as a fabric-attached pool: keep the hot working set in HBM and transparently extend it with tiered remote memory over CXL (cache-coherent load/store) and RDMA (one-sided far-memory). The hard part is not capacity — it is the latency gap: HBM answers in hundreds of nanoseconds, the pool in microseconds. The entire design is about hiding that gap with tiering, prefetching, and overlap so the GPU never stalls.

The problem: HBM is scarce, expensive, and stranded

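High-bandwidth memory delivers the multi-terabyte-per-second bandwidth that keeps thousands of GPU cores fed, but it pays for that bandwidth with tight capacity and high cost. A flagship accelerator ships with roughly 80–192 GB of HBM that is physically stacked on the package and cannot be upgraded, and its dollars-per-gigabyte dwarf ordinary DRAM. That creates three structural problems at once: capacity is scarce (models, optimizer states, and KV-caches routinely exceed what one GPU holds), every gigabyte is expensive, and across a fleet memory is stranded, sitting idle on some nodes while others are pinned at capacity.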

The goal is to pool memory across the domain and extend HBM with tiered remote memory, so capacity becomes an elastic, shareable resource instead of a per-package constant.

Functional requirements

Non-functional requirements

The memory hierarchy: latency & bandwidth tiers

Disaggregation only makes sense against the shape of the memory hierarchy. Each step away from the GPU buys more capacity at lower cost but pays in higher latency and lower bandwidth. The numbers below are order-of-magnitude — exact values depend on generation, link width, and topology — but the ratios are what drive the design.

| Tier | Typical latency | Typical bandwidth | Role |
| --- | --- | --- | --- |
| Local HBM | ~200–500 ns | ~3–8 TB/s | Weights, active KV, the hot working set |
| Host DRAM over PCIe Gen5 | ~1–2 µs | ~50–64 GB/s per x16 link | Software offload tier (weights, optimizer, KV) |
| Host DRAM over NVLink-C2C | ~0.5–1 µs | ~450–900 GB/s | Coherent host-memory extension (superchip) |
| CXL-attached memory | ~300 ns–1 µs | ~tens of GB/s per link | Cache-coherent pooled capacity expansion |
| Remote memory over RDMA | ~1–3 µs | ~tens of GB/s per NIC | Disaggregated far-memory pool |
| NVMe SSD | ~10–100 µs | ~3–7 GB/s | Cold spill / capacity of last resort |

Read the table top to bottom and the story is a cliff, not a slope. HBM bandwidth is measured in terabytes per second; the PCIe, CXL, and RDMA tiers below it deliver tens of gigabytes per second, a 50–100× drop. Latency moves the other way: from hundreds of nanoseconds in HBM to single-digit microseconds at the pool and tens of microseconds at NVMe, a 10×–1000× gap across the hierarchy. That gap is the core challenge. A GPU streaming multiprocessor that stalls on a microsecond-scale load idles for thousands of clock cycles, so remote memory is only viable if accesses are predictable enough to prefetch and can be overlapped with compute. Note also that CXL and RDMA are different beasts: CXL exposes byte-addressable, cache-coherent load/store (the GPU or CPU touches it like NUMA-far DRAM), while RDMA moves pages or blocks with explicit one-sided verbs. NVLink-C2C is the outlier: it is so wide that coherent host DRAM behaves almost like a slow second HBM tier.
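
To put rough numbers on the cliff, the sketch below estimates a single transfer on each tier as latency plus size divided by bandwidth. The per-tier values are assumptions: single points picked from inside the order-of-magnitude ranges in the table, not measurements, and the names `TIERS`, `transfer_time`, and `slowdown_vs_hbm` are made up for this illustration.

```python
# Rough transfer-time model: time = latency + bytes / bandwidth.
# Each (latency_s, bandwidth_bytes_per_s) pair is an assumed point chosen
# from inside the table's ranges; treat them as illustrative only.
TIERS = {
    "Local HBM": (300e-9, 5e12),
    "Host DRAM (PCIe Gen5)": (1.5e-6, 64e9),
    "Host DRAM (NVLink-C2C)": (0.75e-6, 900e9),
    "CXL-attached memory": (600e-9, 30e9),
    "Remote memory (RDMA)": (2e-6, 25e9),
    "NVMe SSD": (50e-6, 5e9),
}


def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Seconds to move nbytes in one request on a tier."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def slowdown_vs_hbm(nbytes):
    """Per-tier transfer time relative to local HBM for the same size."""
    base = transfer_time(nbytes, *TIERS["Local HBM"])
    return {name: transfer_time(nbytes, *p) / base for name, p in TIERS.items()}


if __name__ == "__main__":
    for size, label in [(64, "64 B"), (1 << 30, "1 GiB")]:
        print(label)
        for name, ratio in slowdown_vs_hbm(size).items():
            print(f"  {name:24s} {ratio:10.1f}x")
```

Under these assumed points, a 64-byte access is dominated by latency (RDMA comes out about 7x slower than HBM), while a 1 GiB transfer is dominated by bandwidth (PCIe comes out roughly 80x slower). Different points within the table's ranges would shift the ratios but not the shape.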

flowchart TD
    GPU["GPU compute (SMs)"]
    HBM["Local HBM (TB/s, ~100s ns)"]
    DRAM["Host DRAM via PCIe / NVLink-C2C"]
    CXL["CXL-attached memory (pooled)"]
    RDMA["Remote memory over RDMA (~us)"]
    NVME["NVMe SSD (~10-100 us, cold spill)"]
    GPU --> HBM
    HBM -->|"capacity miss"| DRAM
    DRAM --> CXL
    CXL --> RDMA
    RDMA --> NVME
    HBM -. "latency gap 10x-1000x" .-> RDMA
      

Bandwidth hides; latency stalls

Two different problems hide inside "remote is slow". Bandwidth limits sustained throughput — you fix it by moving less data (keep the working set resident) or by streaming in parallel with compute. Latency limits a single dependent access — you fix it by issuing the fetch early (prefetch) so the answer is already in HBM before the kernel asks. Disaggregation succeeds when both can be hidden, and fails the moment an access is random, on the critical path, and un-prefetchable.
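
The difference between the two problems can be sketched with a small timing model. It compares a chain of dependent loads, where each one waits for the previous result, against the same loads issued up front so that latency is paid once and the link then streams. The function names, the 2 µs latency, the 25 GB/s link, and the 4 KiB page size are assumptions for illustration, and the pipelined case optimistically assumes the link can keep every request in flight.

```python
def dependent_time(n, latency_s, nbytes, bandwidth_bytes_per_s):
    """n loads issued one at a time; each waits for the previous result."""
    return n * (latency_s + nbytes / bandwidth_bytes_per_s)


def pipelined_time(n, latency_s, nbytes, bandwidth_bytes_per_s):
    """n loads issued early; latency is paid once, then the link streams."""
    return latency_s + n * nbytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Assumed figures: 1000 pages of 4 KiB over a 2 us, 25 GB/s link.
    args = (1000, 2e-6, 4096, 25e9)
    print(f"dependent: {dependent_time(*args) * 1e3:.2f} ms")
    print(f"pipelined: {pipelined_time(*args) * 1e3:.2f} ms")
```

With these assumed figures the dependent chain takes about 2.16 ms and the pipelined version about 0.17 ms: the bytes moved are identical, and only the exposed latency differs.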

Approaches: offload, CXL pooling, and RDMA far-memory

Three families of techniques extend HBM, and they sit at different points on the triangle of latency vs capacity vs transparency. Real systems combine them into a tier stack.

(a) Host-memory offload / tiering

The framework-level approach: spill cold or not-yet-needed tensors from HBM into host DRAM (and onward to NVMe), then stream them back just in time. This is what ZeRO-Offload and ZeRO-Infinity do for training — parking optimizer states, gradients, and even parameters in CPU memory — and what KV-cache offload does for inference. It is software-managed and fully transparent at the framework boundary: the user calls the same API and the runtime decides what lives where. Capacity is large (host DRAM is cheap and plentiful), but bandwidth is gated by the PCIe link (tens of GB/s), so it only works when transfers overlap compute. NVLink-C2C dramatically widens this path on coherent superchips, turning offload from a last resort into a routine tier.
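
Why overlap matters for offload can be seen in a simple double-buffering model: while layer i computes, the weights for layer i+1 are fetched from host memory. This is a timing sketch only, not how ZeRO-Offload or any particular runtime is implemented; the function names and the per-layer times are hypothetical.

```python
def serial_step_time(compute, fetch):
    """Every layer waits for its own fetch before computing."""
    return sum(compute) + sum(fetch)


def streamed_step_time(compute, fetch):
    """Layer i+1 is fetched while layer i computes (double buffering)."""
    if len(compute) != len(fetch):
        raise ValueError("need one fetch time per layer")
    if not compute:
        return 0
    total = fetch[0]  # the first fetch has nothing to hide behind
    for i, c in enumerate(compute):
        next_fetch = fetch[i + 1] if i + 1 < len(fetch) else 0
        # Compute of layer i and fetch of layer i+1 run concurrently,
        # so the slower of the two sets the pace for this stage.
        total += max(c, next_fetch)
    return total


if __name__ == "__main__":
    # Assumed per-layer times in milliseconds for a 4-layer model.
    compute_ms = [10, 10, 10, 10]
    print(serial_step_time(compute_ms, [8, 8, 8, 8]))        # 72
    print(streamed_step_time(compute_ms, [8, 8, 8, 8]))      # 48
    print(streamed_step_time(compute_ms, [15, 15, 15, 15]))  # 70
```

When each fetch (8 ms) fits under the compute time (10 ms), only the first fetch is exposed. Once fetches take longer than compute (15 ms), the link sets the pace and the step becomes transfer-bound, which is the case a wider path such as NVLink-C2C is meant to relieve.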

(b) CXL memory pooling

CXL (Compute Express Link) rides the PCIe physical layer but adds cache-coherent load/store semantics. A CXL memory device appears as an extra, byte-addressable NUMA node: software reads and writes it with ordinary instructions — no explicit copy, no app rewrite. CXL 2.0 introduces pooling, where one memory appliance is carved up and dynamically assigned to many hosts (so stranded capacity is reclaimed), and CXL 3.0 adds multi-level switching and memory sharing across hosts. The trade: latency sits above local DRAM (an extra hop of a few hundred nanoseconds) and coherence traffic has overhead, but it is the lowest-latency, most transparent way to add pooled capacity. GPUs reach it either through the host's coherent fabric or, increasingly, more directly.
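
On Linux, memory that appears as an extra NUMA node without any CPUs can be spotted through sysfs. The sketch below lists such CPU-less nodes by reading each node's `cpulist` file. It rests on two assumptions: that the CXL memory has been onlined as ordinary system RAM, and that a CPU-less node is a reasonable hint of expansion memory (other memory-only devices show up the same way, so this is a heuristic, not a CXL detector). The function name is made up for this example.

```python
from pathlib import Path


def cpuless_numa_nodes(root="/sys/devices/system/node"):
    """Return ids of NUMA nodes whose cpulist is empty (no local CPUs)."""
    nodes = []
    for node_dir in Path(root).glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        # A memory-only node reports an empty CPU list.
        if cpulist.is_file() and cpulist.read_text().strip() == "":
            nodes.append(int(node_dir.name[len("node"):]))
    return sorted(nodes)


if __name__ == "__main__":
    print("CPU-less NUMA nodes:", cpuless_numa_nodes())
```

A node id found this way can then be targeted with ordinary NUMA placement tools, which is what "no app rewrite" amounts to in practice: placement policy changes, application code does not.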

(c) RDMA far-memory / memory disaggregation

The most aggressive option treats another machine's DRAM as a memory pool, reached with one-sided RDMA reads/writes over InfiniBand or RoCE — no remote CPU involvement on the data path. Classic far-memory systems (Infiniswap, AIFM, and friends) page memory in and out at page or object granularity, often behind a fault handler so the application sees ordinary memory. This buys the largest, most flexible pool — terabytes, decoupled entirely from GPU count — at the cost of microsecond latency and page-granular, explicit movement. It demands aggressive prefetch and asynchrony to stay off the critical path.

| Approach | Latency | Capacity | Transparency | Granularity |
| --- | --- | --- | --- | --- |
| Host offload (PCIe/C2C) | µs (C2C lower) | Large (host DRAM + NVMe) | Framework-managed | Tensor / block |
| CXL pooling | Lowest of the three | Pool-sized, shared | Load/store, transparent | Cache line / byte |
| RDMA far-memory | ~1–3 µs | Largest, most elastic | Page-fault or explicit | Page / object |

Conceptually the GPU sits at the apex with its hot working set in HBM, and reaches outward through progressively slower, larger tiers — host DRAM beside it, a pooled CXL device on the local fabric, and a remote DRAM pool across the network:

flowchart LR
    GPU["GPU + HBM (hot working set)"]
    subgraph LOCAL["Local node"]
      HOST["Host DRAM (offload tier)"]
    end
    subgraph FABRIC["Shared memory fabric"]
      CXL["CXL memory (pooled)"]
      POOL["Remote DRAM pool"]
    end
    GPU -->|"a: PCIe / NVLink-C2C"| HOST
    GPU -->|"b: CXL load/store"| CXL
    HOST -->|"c: RDMA verbs"| POOL
    CXL --> POOL
    HOST --> CXL
      

Tiering & prefetching: hiding the latency gap

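Pooled capacity is only useful if the GPU rarely waits on it. The runtime therefore behaves like an operating system's virtual-memory manager, treating HBM as the precious tier and applying the same toolkit: classify pages as hot or cold, keep the hot set resident and demote cold pages to the pool, prefetch along predictable access streams, and overlap every transfer with compute.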

The access path below shows the fast HBM hit, the miss that pulls a page from the pool and demotes a cold one, and the crucial last step — prefetching the next page so the following access is a hit:

sequenceDiagram
    participant K as GPU Kernel
    participant H as Local HBM
    participant T as Tiering Manager
    participant R as CXL / Remote Pool
    K->>H: Access page P
    alt P resident in HBM
        H-->>K: Hit (fast path, ns)
    else P is cold (evicted)
        H->>T: Miss on P
        T->>R: Fetch P
        R-->>T: Page P bytes
        T->>H: Install P, demote cold page
        H-->>K: Resume kernel
    end
    Note over T,R: Overlap with compute
    T->>R: Prefetch P+1
    R-->>H: Stage next page
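
The access path in the diagram can be mimicked with a toy simulation: HBM is a small LRU-managed set of pages, a miss installs the page and demotes the coldest one, and an optional next-page prefetch stages page P+1 after every access. This is a sketch under strong assumptions: the class and method names are invented, prefetches always complete before the next access, and no real transfer takes place.

```python
from collections import OrderedDict


class TieredHBM:
    """Toy model: HBM holds `capacity` pages in LRU order; the rest are pooled."""

    def __init__(self, capacity, prefetch=True):
        self.capacity = capacity
        self.prefetch = prefetch
        self.resident = OrderedDict()  # page id -> None, coldest first
        self.hits = 0
        self.misses = 0

    def _install(self, page):
        if page in self.resident:
            self.resident.move_to_end(page)
            return
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # demote the coldest page
        self.resident[page] = None

    def access(self, page):
        if page in self.resident:
            self.hits += 1  # fast path: page already in HBM
            self.resident.move_to_end(page)
        else:
            self.misses += 1  # demand fetch from the pool; the kernel stalls
            self._install(page)
        if self.prefetch:
            self._install(page + 1)  # stage the next page during compute


if __name__ == "__main__":
    for prefetch in (False, True):
        hbm = TieredHBM(capacity=4, prefetch=prefetch)
        for page in range(100):  # one sequential sweep over 100 pages
            hbm.access(page)
        print(f"prefetch={prefetch}: hits={hbm.hits} misses={hbm.misses}")
```

On a sequential sweep that is far larger than HBM, LRU alone misses on every page, while the next-page prefetch turns all but the first access into a hit. That result depends entirely on the prefetch arriving in time, which is the assumption real systems have to earn through overlap.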
      

Predictability is the enabler

Disaggregation works for ML precisely because the access stream is regular — sequential layer sweeps and ordered KV reads — which makes prefetch accurate and overlap easy. The technique degrades sharply for irregular, data-dependent access (random embedding gather, pointer chasing) where the next address is unknown until the current load returns, leaving nothing to prefetch and the µs latency fully exposed.
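
The contrast between regular and irregular access can be made concrete by scoring the simplest possible predictor, "the next page is the previous page plus one", on two synthetic traces. Both traces are invented for illustration: one stands in for a sequential layer sweep, the other for a random embedding gather over a million pages. The function name and the fixed seed are arbitrary choices.

```python
import random


def next_page_accuracy(trace):
    """Fraction of accesses correctly predicted by 'next = previous + 1'."""
    if len(trace) < 2:
        return 0.0
    correct = sum(1 for prev, cur in zip(trace, trace[1:]) if cur == prev + 1)
    return correct / (len(trace) - 1)


# Sequential trace, standing in for a layer sweep or ordered KV read.
layer_sweep = list(range(1000))

# Random trace, standing in for a sparse gather; seeded for repeatability.
rng = random.Random(0)
sparse_gather = [rng.randrange(1_000_000) for _ in range(1000)]

if __name__ == "__main__":
    print("layer sweep  :", next_page_accuracy(layer_sweep))
    print("sparse gather:", next_page_accuracy(sparse_gather))
```

The sequential trace is predicted perfectly, while the random one is almost never predicted, so every access in it would pay the full pool latency. Smarter predictors narrow the gap for strided or repeating patterns, but not for addresses that are only known once the previous load returns.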

Use cases: where disaggregation pays off

| Disaggregation helps when… | …and hurts when |
| --- | --- |
| Access is predictable / prefetchable (layer sweeps, ordered KV) | Access is random & data-dependent (sparse gather, pointer chase) |
| Reuse is low: data is touched once per step or session | Reuse is high & tight: hot data thrashes across the fabric |
| Workload is capacity-bound and latency-tolerant (training, batch, long-context) | Latency-critical small-batch decode where every µs is on the critical path |
| Transfers overlap abundant compute | Steady state is already bandwidth-bound, with no slack to hide movement |

Bottlenecks & scaling

| Bottleneck | Why it bites | Mitigation |
| --- | --- | --- |
| Latency gap | HBM answers in ~100s ns, the pool in µs: a 10×–1000× gap that stalls SMs on any uncovered access | Prefetch ahead of use; overlap with compute; keep the working set resident; prefer CXL/C2C over RDMA for hot tiers |
| Interconnect bandwidth | PCIe/CXL/NIC links move tens of GB/s versus HBM's TB/s: a 50–100× cliff that caps sustained streaming | Move less data (good hot/cold split); compress/quantize; wider links (NVLink-C2C); stream in parallel with compute |
| Coherence overhead | Cache-coherent fabrics (CXL) pay snoop/directory traffic; shared pages across hosts add invalidation cost | Partition ownership; prefer read-mostly sharing; coarse-grained coherence; pin private hot data |
| Prefetch accuracy | A wrong prediction wastes bandwidth and still stalls on the real miss; irregular access defeats the predictor | Exploit known patterns (layer order, sequential KV); lookahead windows; fall back to demand-fetch gracefully |
| Cost & utilization | A pool only pays off if it is cheaper per GB and well utilized; idle pool capacity is just relocated stranding | Dynamic pooling/allocation across hosts; oversubscription with isolation; tier cold data to NVMe |
| Fault & tail domain | A pool node or link failure now affects many tenants; µs tiers add tail latency under contention | Replication/erasure for durable pools; bandwidth isolation/QoS; blast-radius limits; checkpoint critical state |

Summary

Disaggregated memory turns HBM from a hard per-package ceiling into the top tier of a managed hierarchy. Keep the hot working set in HBM, extend it with host DRAM, CXL pools, and RDMA far-memory, and let a virtual-memory-style manager classify hot/cold, prefetch along predictable access streams, and overlap every transfer with compute. The win is real — terabyte pools, reclaimed stranded capacity, and models that no longer have to fit one GPU — but it is bounded by a single hard constant: the latency gap between nanoseconds and microseconds. Pick the tier that matches the access pattern (CXL for transparent low-latency expansion, RDMA for elastic capacity, host offload for cheap bulk), exploit the regularity of ML access to hide that gap, and disaggregation buys capacity and utilization at a fraction of the cost of adding GPUs — fail to hide it, and the pool becomes a stall machine.