
Disaggregated GPU Memory

GPU HBM is the scarcest, most expensive resource in the rack: at most a couple of hundred gigabytes per GPU, soldered to the package, and almost always the thing that OOMs first. Models, optimizer states, and KV-caches routinely exceed what one GPU holds, while across a fleet memory is stranded: idle on half the nodes while the other half is pinned at capacity. Disaggregated memory answers this by treating capacity as a fabric-attached pool: keep the hot working set in HBM and transparently extend it with tiered remote memory over CXL (cache-coherent load/store) and RDMA (one-sided far-memory). The hard part is not capacity but the latency gap: HBM answers in hundreds of nanoseconds, the pool in microseconds. The entire design is about hiding that gap with tiering, prefetching, and overlap so the GPU does not stall.

The problem: HBM is scarce, expensive, and stranded

High-bandwidth memory delivers the multi-terabyte-per-second bandwidth that keeps thousands of GPU cores fed, but it pays for that bandwidth with tight capacity and high cost. A flagship accelerator ships with roughly 80–192 GB of HBM. That memory is physically stacked on the package and cannot be upgraded, and its cost per gigabyte dwarfs that of ordinary DRAM. That creates three structural problems at once:

Workloads exceed one GPU's memory. A 70B model in FP16 is ≈ 140 GB of weights before a single token is served; mixed-precision training adds gradients and optimizer states (Adam's two moments alone take ≈ 2× the parameter count in FP32, and an FP32 master copy of the weights adds another 1×); long-context inference grows the KV-cache linearly with sequence length until it, not the FLOPs, caps the batch. The data structure is simply larger than the package.
Memory is stranded. GPUs and DRAM are bought in fixed ratios per server, but real jobs are lopsided — one is memory-bound and spilling, its neighbor is compute-bound with tens of free gigabytes. That idle capacity is trapped behind a node boundary and cannot be lent out.
Capacity and compute scale on different curves. You should not have to buy another whole GPU just to get more bytes, yet without pooling that is exactly the trade you are forced into.
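
The arithmetic behind the first point can be made concrete. The sketch below is a rough accounting, assuming mixed-precision Adam with FP16 weights and gradients, FP32 moments, and an FP32 master copy. It ignores activations and KV-cache, and the function name and the 80 GB per-GPU figure are illustrative choices rather than anything taken from a framework.

```python
import math

GB = 1e9  # decimal gigabytes, matching the "70B in FP16 is about 140 GB" figure


def training_footprint_gb(n_params: float) -> dict:
    """Approximate model-state size for mixed-precision Adam (assumed layout)."""
    parts = {
        "weights_fp16": 2 * n_params,       # 2 bytes per parameter
        "grads_fp16": 2 * n_params,         # 2 bytes per parameter
        "adam_moments_fp32": 8 * n_params,  # first + second moment, 4 bytes each
        "master_fp32": 4 * n_params,        # FP32 master copy of the weights
    }
    parts["total"] = sum(parts.values())
    return {name: nbytes / GB for name, nbytes in parts.items()}


footprint = training_footprint_gb(70e9)
# Lower bound on GPUs needed just to hold model state with 80 GB of HBM each
# (activations, buffers, and fragmentation are ignored).
min_gpus_80gb = math.ceil(footprint["total"] / 80)

for name, gb in footprint.items():
    print(f"{name:>18}: {gb:7.0f} GB")
print("GPUs needed at 80 GB each:", min_gpus_80gb)
```

Under these assumptions model state alone comes to about 1,120 GB, so roughly 14 GPUs' worth of HBM is consumed before a single activation is stored. That is the gap offload and pooling aim to close.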

The goal is to pool memory across the domain and extend HBM with tiered remote memory, so capacity becomes an elastic, shareable resource instead of a per-package constant.

Functional requirements

Transparently extend capacity beyond local HBM — allocate buffers larger than one GPU's memory without rewriting kernels.
Share a memory pool across GPUs and nodes, so stranded capacity on one host is usable by another (dynamic attach / detach).
Tier automatically between HBM, host DRAM, CXL, and remote memory based on access patterns, with the hot working set kept resident.
Preserve correctness: consistent reads/writes and, where the fabric is coherent, ordinary load/store semantics.

Non-functional requirements

Acceptable latency & bandwidth. Added access cost must be hideable behind compute; effective throughput should not collapse versus an all-HBM baseline for the targeted workloads.
Capacity scaling. Grow the pool to terabytes independently of GPU count.
Cost efficiency. Tiered DRAM/CXL must be materially cheaper per gigabyte than adding HBM-equipped GPUs, and pooling must raise utilization.
Isolation & reliability. Pooled memory is a shared fault and noisy-neighbor domain; tenants need bandwidth/capacity isolation and graceful behavior when a pool node fails.

The memory hierarchy: latency & bandwidth tiers

Disaggregation only makes sense against the shape of the memory hierarchy. Each step away from the GPU buys more capacity at lower cost but pays in higher latency and lower bandwidth. The numbers below are order-of-magnitude — exact values depend on generation, link width, and topology — but the ratios are what drive the design.

Tier	Typical latency	Typical bandwidth	Role
Local HBM	~200–500 ns	~3–8 TB/s	Weights, active KV, the hot working set
Host DRAM over PCIe Gen5	~1–2 µs	~50–64 GB/s per x16 link	Software offload tier (weights, optimizer, KV)
Host DRAM over NVLink-C2C	~0.5–1 µs	~450–900 GB/s	Coherent host-memory extension (superchip)
CXL-attached memory	~300 ns–1 µs	~tens of GB/s per link	Cache-coherent pooled capacity expansion
Remote memory over RDMA	~1–3 µs	~tens of GB/s per NIC	Disaggregated far-memory pool
NVMe SSD	~10–100 µs	~3–7 GB/s	Cold spill / capacity of last resort

Read the table top to bottom and the story is a cliff, not a slope. HBM bandwidth is measured in terabytes per second; apart from NVLink-C2C, every tier below it delivers tens of gigabytes per second, a 50–100× drop. Latency moves the other way: from hundreds of nanoseconds in HBM to single-digit microseconds at the pool, roughly a 10× gap, and several hundred times once accesses fall through to NVMe. That gap is the core challenge. A GPU streaming multiprocessor that stalls on a microsecond-scale load idles for thousands of clock cycles, so remote memory is only viable if accesses are predictable enough to prefetch and can be overlapped with compute. Note also that CXL and RDMA are different beasts: CXL exposes byte-addressable, cache-coherent load/store (the GPU or CPU touches it like NUMA-far DRAM), while RDMA moves pages/blocks with explicit one-sided verbs. NVLink-C2C is the outlier: it is so wide that coherent host DRAM behaves almost like a slow second HBM tier.
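
To see the ratios rather than the raw numbers, the sketch below picks one illustrative point from inside each range in the table and compares every tier with HBM. The chosen values, the 1.5 GHz clock, and the function names are assumptions for illustration, not measurements of any particular hardware.

```python
# (latency_ns, bandwidth_GB_per_s): single illustrative points picked from
# inside the ranges in the table above, not measurements.
TIERS = {
    "Local HBM": (300, 4000),
    "Host DRAM over PCIe Gen5": (1500, 64),
    "Host DRAM over NVLink-C2C": (750, 900),
    "CXL-attached memory": (600, 50),
    "Remote memory over RDMA": (2000, 50),
    "NVMe SSD": (50_000, 5),
}


def stream_seconds(gigabytes: float, tier: str) -> float:
    """Time to stream a buffer once at the tier's sustained bandwidth."""
    return gigabytes / TIERS[tier][1]


def stall_cycles(tier: str, clock_ghz: float = 1.5) -> float:
    """Clock cycles one fully exposed access costs (ns times cycles per ns)."""
    return TIERS[tier][0] * clock_ghz


hbm_latency, hbm_bandwidth = TIERS["Local HBM"]
for name, (latency_ns, bandwidth) in TIERS.items():
    print(f"{name:<26} latency x{latency_ns / hbm_latency:6.1f}  "
          f"bandwidth /{hbm_bandwidth / bandwidth:6.1f}  "
          f"140 GB in {stream_seconds(140, name):7.2f} s  "
          f"stall {stall_cycles(name):8.0f} cycles")
```

With these assumed points, streaming 140 GB of weights takes a few hundredths of a second from HBM and more than two seconds over a single PCIe Gen5 link, which is why the resident set and the overlap strategy matter so much.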

flowchart TD
    GPU["GPU compute (SMs)"]
    HBM["Local HBM (TB/s, ~100s ns)"]
    DRAM["Host DRAM via PCIe / NVLink-C2C"]
    CXL["CXL-attached memory (pooled)"]
    RDMA["Remote memory over RDMA (~us)"]
    NVME["NVMe SSD (~10-100 us, cold spill)"]
    GPU --> HBM
    HBM -->|"capacity miss"| DRAM
    DRAM --> CXL
    CXL --> RDMA
    RDMA --> NVME
    HBM -. "latency gap ~10x" .-> RDMA

Bandwidth hides; latency stalls

Two different problems hide inside "remote is slow". Bandwidth limits sustained throughput — you fix it by moving less data (keep the working set resident) or by streaming in parallel with compute. Latency limits a single dependent access — you fix it by issuing the fetch early (prefetch) so the answer is already in HBM before the kernel asks. Disaggregation succeeds when both can be hidden, and fails the moment an access is random, on the critical path, and un-prefetchable.
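
A back-of-the-envelope model makes the two regimes visible. The sketch below assumes a far-memory link with 2 µs latency and 50 GB/s bandwidth and moves the same 1 GB either as one pipelined stream or as a chain of dependent 64-byte accesses. The link parameters and function names are hypothetical.

```python
def streamed_seconds(total_bytes: float, latency_s: float, bandwidth_Bps: float) -> float:
    """Independent, pipelined requests: latency is paid once, then bandwidth rules."""
    return latency_s + total_bytes / bandwidth_Bps


def dependent_seconds(total_bytes: float, access_bytes: int,
                      latency_s: float, bandwidth_Bps: float) -> float:
    """Pointer-chase style: each access waits for the previous one to return."""
    num_accesses = total_bytes / access_bytes
    return num_accesses * (latency_s + access_bytes / bandwidth_Bps)


# Assumed far-memory link: 2 us latency, 50 GB/s; 1 GB touched in 64-byte pieces.
LATENCY_S, BANDWIDTH_BPS = 2e-6, 50e9
bulk = streamed_seconds(1e9, LATENCY_S, BANDWIDTH_BPS)
chase = dependent_seconds(1e9, 64, LATENCY_S, BANDWIDTH_BPS)
print(f"streamed: {bulk * 1e3:.1f} ms, dependent: {chase:.1f} s, "
      f"ratio: {chase / bulk:.0f}x")
```

The streamed case is bandwidth-bound and finishes in about 20 ms; the dependent case is latency-bound and takes over 30 seconds for the same bytes. Prefetching is what converts the second shape into the first.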

Approaches: offload, CXL pooling, and RDMA far-memory

Three families of techniques extend HBM, and they sit at different points on the triangle of latency vs capacity vs transparency. Real systems combine them into a tier stack.

(a) Host-memory offload / tiering

The framework-level approach: spill cold or not-yet-needed tensors from HBM into host DRAM (and onward to NVMe), then stream them back just in time. This is what ZeRO-Offload and ZeRO-Infinity do for training — parking optimizer states, gradients, and even parameters in CPU memory — and what KV-cache offload does for inference. It is software-managed and fully transparent at the framework boundary: the user calls the same API and the runtime decides what lives where. Capacity is large (host DRAM is cheap and plentiful), but bandwidth is gated by the PCIe link (tens of GB/s), so it only works when transfers overlap compute. NVLink-C2C dramatically widens this path on coherent superchips, turning offload from a last resort into a routine tier.

(b) CXL memory pooling

CXL (Compute Express Link) rides the PCIe physical layer but adds cache-coherent load/store semantics. A CXL memory device appears as an extra, byte-addressable NUMA node: software reads and writes it with ordinary instructions — no explicit copy, no app rewrite. CXL 2.0 introduces pooling, where one memory appliance is carved up and dynamically assigned to many hosts (so stranded capacity is reclaimed), and CXL 3.0 adds multi-level switching and memory sharing across hosts. The trade: latency sits above local DRAM (an extra hop of a few hundred nanoseconds) and coherence traffic has overhead, but it is the lowest-latency, most transparent way to add pooled capacity. GPUs reach it either through the host's coherent fabric or, increasingly, more directly.
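
The pooling idea is essentially capacity bookkeeping across hosts. The toy model below shows only that bookkeeping: one appliance, capacity attached to and detached from hosts on demand. The class, method, and host names are hypothetical, and nothing here talks to real CXL hardware, where assignment is handled by fabric-management software and the operating system.

```python
class MemoryPool:
    """Toy model of one pooled memory appliance shared by several hosts."""

    def __init__(self, capacity_gb: int):
        self.capacity_gb = capacity_gb
        self.assigned_gb = {}  # host name -> GB currently attached

    @property
    def free_gb(self) -> int:
        return self.capacity_gb - sum(self.assigned_gb.values())

    def attach(self, host: str, gb: int) -> None:
        """Assign `gb` of pool capacity to `host`, failing if the pool is short."""
        if gb > self.free_gb:
            raise MemoryError(f"pool has {self.free_gb} GB free, {host} asked for {gb}")
        self.assigned_gb[host] = self.assigned_gb.get(host, 0) + gb

    def detach(self, host: str, gb: int) -> None:
        """Return `gb` from `host` to the pool so another host can claim it."""
        held = self.assigned_gb.get(host, 0)
        if gb > held:
            raise ValueError(f"{host} holds only {held} GB")
        if gb == held:
            self.assigned_gb.pop(host, None)
        else:
            self.assigned_gb[host] = held - gb


pool = MemoryPool(capacity_gb=1024)
pool.attach("host-a", 768)       # memory-bound job grabs most of the pool
try:
    pool.attach("host-b", 512)   # not enough capacity left yet
    host_b_waited = False
except MemoryError:
    host_b_waited = True
pool.detach("host-a", 512)       # host-a's job shrinks and releases capacity
pool.attach("host-b", 512)       # the same bytes now serve host-b
print(pool.assigned_gb, "free:", pool.free_gb)
```

The point of the example is the last two calls: capacity released by one host becomes usable by another without touching either server's DIMMs, which is exactly what per-node DRAM cannot do.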

(c) RDMA far-memory / memory disaggregation

The most aggressive option treats another machine's DRAM as a memory pool, reached with one-sided RDMA reads/writes over InfiniBand or RoCE, with no remote CPU involvement on the data path. Classic far-memory systems (Infiniswap, AIFM, and friends) move memory in and out at page or object granularity, often behind a fault handler so the application sees ordinary memory. This buys the largest, most flexible pool (terabytes, decoupled entirely from GPU count) at the cost of microsecond latency and data movement in whole pages or objects rather than cache lines. It demands aggressive prefetch and asynchrony to stay off the critical path.
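
The fault-handler pattern can be sketched in a few lines. In the simulation below a small set of local frames is backed by a dictionary that stands in for remote DRAM; a miss counts as a page fault, evicts the least recently used page, and pulls the requested one in. This is a minimal sketch, not how Infiniswap or AIFM are implemented: the class name is hypothetical and no real RDMA verbs are issued.

```python
from collections import OrderedDict

PAGE_SIZE = 4096


class FarMemoryPager:
    """Page-granular far memory with LRU eviction (simulated remote side)."""

    def __init__(self, local_frames: int):
        self.local_frames = local_frames
        self.local = OrderedDict()  # page id -> bytes, least recently used first
        self.remote = {}            # stand-in for the remote machine's DRAM
        self.faults = 0

    def _ensure_resident(self, page: int) -> None:
        if page in self.local:
            self.local.move_to_end(page)   # refresh LRU position on a hit
            return
        self.faults += 1                   # the "page fault"
        if len(self.local) >= self.local_frames:
            victim, data = self.local.popitem(last=False)
            self.remote[victim] = data     # stands in for a one-sided RDMA write
        # Stands in for a one-sided RDMA read; untouched pages read as zeros.
        self.local[page] = self.remote.pop(page, bytes(PAGE_SIZE))

    def read(self, page: int) -> bytes:
        self._ensure_resident(page)
        return self.local[page]

    def write(self, page: int, data: bytes) -> None:
        self._ensure_resident(page)
        self.local[page] = data


pager = FarMemoryPager(local_frames=2)
for page, fill in enumerate((b"a", b"b", b"c")):
    pager.write(page, fill * PAGE_SIZE)  # third write evicts page 0 to the pool
first = pager.read(0)                    # faults page 0 back in, evicting page 1
print("faults:", pager.faults, "remote pages:", sorted(pager.remote))
```

The caller only ever sees `read` and `write`; every fault, however, sits on the critical path, which is why real systems pair this mechanism with prefetching.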

Approach	Latency	Capacity	Transparency	Granularity
Host offload (PCIe/C2C)	~1–2 µs over PCIe, ~0.5–1 µs over C2C	Large (host DRAM + NVMe)	Framework-managed	Tensor / block
CXL pooling	~300 ns–1 µs (lowest of the three)	Pool-sized, shared	Load/store, transparent	Cache line / byte
RDMA far-memory	~1–3 µs	Largest, most elastic	Page-fault or explicit	Page / object

Conceptually the GPU sits at the apex with its hot working set in HBM, and reaches outward through progressively slower, larger tiers — host DRAM beside it, a pooled CXL device on the local fabric, and a remote DRAM pool across the network:

flowchart LR
    GPU["GPU + HBM (hot working set)"]
    subgraph LOCAL["Local node"]
      HOST["Host DRAM (offload tier)"]
    end
    subgraph FABRIC["Shared memory fabric"]
      CXL["CXL memory (pooled)"]
      POOL["Remote DRAM pool"]
    end
    GPU -->|"a: PCIe / NVLink-C2C"| HOST
    GPU -->|"b: CXL load/store"| CXL
    HOST -->|"c: RDMA verbs"| POOL
    CXL --> POOL
    HOST --> CXL

Tiering & prefetching: hiding the latency gap

Pooled capacity is only useful if the GPU rarely waits on it. The runtime therefore behaves like an operating system's virtual-memory manager, with the same toolkit applied to HBM as the precious tier:

Hot/cold classification. Track recency and frequency of access per page/block. The working set — current-layer weights, the active region of the KV-cache, tensors needed this step — stays pinned in HBM; everything cold is demoted to CXL/host/remote. Getting this split right is the whole game: HBM should hold what is touched now, not what might be.
Prefetching to hide latency. Most ML access is gloriously predictable. Transformer inference walks weights layer by layer; attention reads KV in order. So issue the fetch for layer L+1 (or the next KV block) while layer L computes. If the prefetch lands before the kernel needs the data, the µs latency is completely masked.
Overlap transfer with compute. Use separate CUDA streams and the GPU's dedicated copy engines with double-buffering: one buffer feeds the running kernel while the next is filled from the pool. The DMA runs concurrently with the SMs, so movement costs throughput only if it exceeds compute time.
Page migration / promotion-demotion. When a remote page turns hot, promote it into HBM; when an HBM page goes cold, demote it to make room. Migration has a cost, so hysteresis avoids thrashing pages that oscillate around the threshold.
What to keep in HBM. The resident set should be the active working set plus a prefetch lookahead window — never the whole model if it does not fit. Cold KV, stale optimizer state, and inactive MoE experts belong in the pool.
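
The hot/cold classification and the hysteresis rule from the list above can be sketched as a pair of thresholds over decayed access counts. This is one simple policy among many, with assumed thresholds and a hypothetical class name; it ignores HBM capacity limits and migration cost, which a real tiering manager would also weigh.

```python
class HysteresisTiering:
    """Promote at a high threshold, demote at a low one, decay heat each epoch."""

    def __init__(self, promote_at: int = 4, demote_at: int = 1):
        if demote_at >= promote_at:
            raise ValueError("demote_at must be below promote_at to form a band")
        self.promote_at = promote_at
        self.demote_at = demote_at
        self.heat = {}       # page -> decayed access count
        self.in_hbm = set()  # pages currently resident in the fast tier

    def record_access(self, page) -> None:
        self.heat[page] = self.heat.get(page, 0) + 1

    def end_epoch(self) -> list:
        """Apply the thresholds, then halve every counter so old heat fades."""
        moves = []
        for page, heat in self.heat.items():
            if page not in self.in_hbm and heat >= self.promote_at:
                self.in_hbm.add(page)
                moves.append(("promote", page))
            elif page in self.in_hbm and heat <= self.demote_at:
                self.in_hbm.remove(page)
                moves.append(("demote", page))
        self.heat = {page: heat // 2 for page, heat in self.heat.items()}
        return moves


tiers = HysteresisTiering()
for _ in range(5):
    tiers.record_access("kv-block-7")
epoch1 = tiers.end_epoch()  # heat 5 >= 4: promoted, then heat decays to 2
tiers.record_access("kv-block-7")
epoch2 = tiers.end_epoch()  # heat 3 sits inside the band: no move, decays to 1
epoch3 = tiers.end_epoch()  # heat 1 <= 1: demoted
print(epoch1, epoch2, epoch3)
```

The gap between the two thresholds is the hysteresis band: a page whose heat hovers between them stays wherever it already is, so it does not bounce across the fabric every epoch.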

The access path below shows the fast HBM hit, the miss that pulls a page from the pool and demotes a cold one, and the crucial last step — prefetching the next page so the following access is a hit:

sequenceDiagram
    participant K as GPU Kernel
    participant H as Local HBM
    participant T as Tiering Manager
    participant R as CXL / Remote Pool
    K->>H: Access page P
    alt P resident in HBM
        H-->>K: Hit (fast path, ns)
    else P is cold (evicted)
        H->>T: Miss on P
        T->>R: Fetch P
        R-->>T: Page P bytes
        T->>H: Install P, demote cold page
        H-->>K: Resume kernel
    end
    Note over T,R: Overlap with compute
    T->>R: Prefetch P+1
    R-->>H: Stage next page
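
The final prefetch step in the diagram is where the latency disappears. The timeline model below is a sketch of double-buffering under simplifying assumptions: one serial copy engine, one serial compute queue, and fixed per-layer fetch and compute times. It computes finish times analytically and does not call CUDA; the function and parameter names are hypothetical.

```python
def pipeline_time(num_layers: int, fetch_s: float, compute_s: float,
                  buffers: int = 2) -> float:
    """Finish time of a layer sweep with `buffers` staging buffers in HBM.

    buffers=1 serializes fetch and compute; buffers=2 is double-buffering.
    """
    fetch_done = [0.0] * num_layers
    compute_done = [0.0] * num_layers
    for i in range(num_layers):
        # The copy engine handles one transfer at a time.
        copy_free = fetch_done[i - 1] if i > 0 else 0.0
        # Fetch i reuses the buffer of layer i - buffers, which is released
        # only when that layer's compute has finished.
        buffer_free = compute_done[i - buffers] if i >= buffers else 0.0
        fetch_done[i] = max(copy_free, buffer_free) + fetch_s
        # Compute needs its data and the previous layer's result.
        prev_compute = compute_done[i - 1] if i > 0 else 0.0
        compute_done[i] = max(fetch_done[i], prev_compute) + compute_s
    return compute_done[-1]


serial = pipeline_time(4, fetch_s=1.0, compute_s=1.0, buffers=1)
overlapped = pipeline_time(4, fetch_s=1.0, compute_s=1.0, buffers=2)
link_bound = pipeline_time(4, fetch_s=2.0, compute_s=1.0, buffers=2)
print(serial, overlapped, link_bound)
```

With equal fetch and compute times, the serial sweep takes 8 time units and the double-buffered one takes 5: only the first fetch is exposed. When the fetch takes twice as long as the compute, the sweep becomes link-bound at 9 units, which matches the rule that movement costs throughput only when it exceeds compute time.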

Predictability is the enabler

Disaggregation works for ML precisely because the access stream is regular — sequential layer sweeps and ordered KV reads — which makes prefetch accurate and overlap easy. The technique degrades sharply for irregular, data-dependent access (random embedding gather, pointer chasing) where the next address is unknown until the current load returns, leaving nothing to prefetch and the µs latency fully exposed.
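
A small simulation illustrates the contrast. It models HBM as an LRU cache of 64 pages in front of a 10,000-page pool, with a next-page prefetcher that is optimistically assumed to always land in time, and compares a sequential sweep with a seeded random trace. All sizes and names are assumptions chosen for illustration.

```python
import random
from collections import OrderedDict


def demand_miss_rate(trace, hbm_pages: int, prefetch_next: bool = True) -> float:
    """Fraction of accesses that miss HBM and must wait on the pool."""
    resident = OrderedDict()  # page -> True, least recently used first
    misses = 0

    def install(page):
        resident[page] = True
        resident.move_to_end(page)
        while len(resident) > hbm_pages:
            resident.popitem(last=False)  # evict the least recently used page

    for page in trace:
        if page in resident:
            resident.move_to_end(page)
        else:
            misses += 1
            install(page)
        if prefetch_next:
            install(page + 1)  # optimistic: the prefetch always arrives in time
    return misses / len(trace)


NUM_PAGES = 10_000
sequential = list(range(NUM_PAGES))
rng = random.Random(0)
scattered = [rng.randrange(NUM_PAGES) for _ in range(NUM_PAGES)]

seq_rate = demand_miss_rate(sequential, hbm_pages=64)
rand_rate = demand_miss_rate(scattered, hbm_pages=64)
print(f"sequential miss rate: {seq_rate:.4f}, random miss rate: {rand_rate:.4f}")
```

On the sequential sweep only the very first access misses; on the random trace the prefetcher guesses wrong almost every time and nearly every access pays the full pool latency.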

Use cases: where disaggregation pays off

KV-cache offload for long-context inference. The KV-cache grows linearly with sequence length and dominates HBM at long context (see the LLM KV-Cache Management page). Cold blocks for large or paused sessions spill to host/CXL/remote and are fetched back per layer during attention. Because KV access is append-mostly and read in order, it prefetches beautifully — letting a fixed HBM budget hold many more concurrent long-context sessions.
Optimizer-state offload in training. Adam keeps first/second moments plus an FP32 master copy, roughly 12 bytes per parameter or about 6× the FP16 weight bytes, and these are touched only once per step in the optimizer phase. Parking them in CPU/CXL memory (ZeRO-Offload/Infinity) frees enormous HBM and lets a small GPU count train a model that would otherwise never fit, since the states stream in only when needed.
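Large-embedding serving. Recommendation and retrieval models carry embedding tables of hundreds of gigabytes to terabytes. The pool holds the full table while HBM caches the hot rows; sparse gathers hit cache for popular IDs and fall through to the fabric for the long tail. This pays off only when ID popularity is skewed enough for the hot rows to absorb most lookups, because long-tail gathers are random and see the full fabric latency.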
MoE expert offload. Mixture-of-Experts activates only a few experts per token, so inactive experts can live in pooled memory and be streamed in on selection, trading a fetch for a large HBM saving.
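
For the KV-cache case the sizing is simple enough to compute directly. The sketch below assumes a hypothetical 70B-class configuration with grouped-query attention (80 layers, 8 KV heads, head dimension 128, FP16 cache) and an assumed 80 GiB of HBM set aside for KV. None of these numbers come from a specific deployment.

```python
GIB = 2**30


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """One key and one value vector per layer and per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_resident_sessions(hbm_budget_gib: int, context_tokens: int,
                          per_token_bytes: int) -> int:
    """Sessions whose entire KV-cache fits in the HBM budget at once."""
    return (hbm_budget_gib * GIB) // (context_tokens * per_token_bytes)


per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
per_session_gib = per_token * 131_072 / GIB  # one 128k-token session

print("KV bytes per token:", per_token)
print("GiB per 128k-token session:", per_session_gib)
print("resident 128k sessions in 80 GiB:", max_resident_sessions(80, 131_072, per_token))
print("resident 8k sessions in 80 GiB:", max_resident_sessions(80, 8_192, per_token))
```

With this configuration a single 128k-token session needs 40 GiB of KV, so an 80 GiB budget holds two such sessions fully resident. Any further concurrency at that context length has to come from spilling cold or paused sessions to the pool.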

Disaggregation helps when…	…and hurts when
Access is predictable / prefetchable (layer sweeps, ordered KV)	Access is random & data-dependent (sparse gather, pointer chase)
Reuse is low: data is touched once per step or session	Reuse is high but the hot set does not fit in HBM, so hot data thrashes across the fabric
Workload is capacity-bound and latency-tolerant (training, batch, long-context)	Latency-critical small-batch decode where every µs is on the critical path
Transfers overlap abundant compute	Steady state is already bandwidth-bound — no slack to hide movement

Bottlenecks & scaling

Bottleneck	Why it bites	Mitigation
Latency gap	HBM answers in hundreds of ns, the pool in µs: roughly a 10× gap (several hundred times at NVMe) that stalls SMs on any uncovered access	Prefetch ahead of use; overlap with compute; keep the working set resident; prefer CXL/C2C over RDMA for hot tiers
Interconnect bandwidth	PCIe/CXL/NIC links move tens of GB/s versus HBM's TB/s — a 50–100× cliff that caps sustained streaming	Move less data (good hot/cold split); compress/quantize; wider links (NVLink-C2C); stream in parallel with compute
Coherence overhead	Cache-coherent fabrics (CXL) pay snoop/directory traffic; shared pages across hosts add invalidation cost	Partition ownership; prefer read-mostly sharing; coarse-grained coherence; pin private hot data
Prefetch accuracy	A wrong prediction wastes bandwidth and still stalls on the real miss; irregular access defeats the predictor	Exploit known patterns (layer order, sequential KV); lookahead windows; fall back to demand-fetch gracefully
Cost & utilization	A pool only pays off if it is cheaper per GB and well utilized; idle pool capacity is just relocated stranding	Dynamic pooling/allocation across hosts; oversubscription with isolation; tier cold data to NVMe
Fault & tail domain	A pool node or link failure now affects many tenants; µs tiers add tail latency under contention	Replication/erasure for durable pools; bandwidth isolation/QoS; blast-radius limits; checkpoint critical state

Summary

Disaggregated memory turns HBM from a hard per-package ceiling into the top tier of a managed hierarchy. Keep the hot working set in HBM, extend it with host DRAM, CXL pools, and RDMA far-memory, and let a virtual-memory-style manager classify hot/cold, prefetch along predictable access streams, and overlap every transfer with compute. The win is real — terabyte pools, reclaimed stranded capacity, and models that no longer have to fit one GPU — but it is bounded by a single hard constant: the latency gap between nanoseconds and microseconds. Pick the tier that matches the access pattern (CXL for transparent low-latency expansion, RDMA for elastic capacity, host offload for cheap bulk), exploit the regularity of ML access to hide that gap, and disaggregation buys capacity and utilization at a fraction of the cost of adding GPUs — fail to hide it, and the pool becomes a stall machine.