Disaggregated GPU Memory
GPU
HBM is the scarcest, most expensive resource in the rack: a few hundred gigabytes per node, soldered to the package, and
almost always the thing that OOMs first. Models, optimizer states, and
KV-caches routinely exceed what one GPU holds, while across a fleet
memory is stranded — idle on half the nodes while the
other half is pinned at capacity.
Disaggregated memory answers this by treating
capacity as a fabric-attached pool: keep the hot working set in HBM
and transparently extend it with
tiered remote memory over
CXL (cache-coherent load/store) and RDMA
(one-sided far-memory). The hard part is not capacity — it is the
latency gap: HBM answers in hundreds of nanoseconds,
the pool in microseconds. The entire design is about hiding that gap
with tiering, prefetching, and overlap so the GPU never stalls.
The problem: HBM is scarce, expensive, and stranded
High-bandwidth memory delivers the multi-terabyte-per-second bandwidth
that keeps thousands of GPU cores fed, but it pays for that bandwidth
with
tight capacity and high cost. A flagship accelerator
ships with roughly 80–192 GB of HBM that is physically
stacked on the package and cannot be upgraded, and its
cost per gigabyte dwarfs that of ordinary DRAM. That creates three
structural problems at once:
- Workloads exceed one GPU's memory. A 70B model in FP16 is ≈ 140 GB of weights before a single token is served; training adds gradients and optimizer states (Adam keeps two FP32 moments per parameter); long-context inference grows the KV-cache linearly with sequence length until it, not the FLOPs, caps the batch. The data structure is simply larger than the package.
- Memory is stranded. GPUs and DRAM are bought in fixed ratios per server, but real jobs are lopsided: one is memory-bound and spilling, its neighbor is compute-bound with tens of free gigabytes. That idle capacity is trapped behind a node boundary and cannot be lent out.
- Capacity and compute scale on different curves. You should not have to buy another whole GPU just to get more bytes, yet without pooling that is exactly the trade you are forced into.
The goal is to pool memory across the domain and extend HBM with tiered remote memory, so capacity becomes an elastic, shareable resource instead of a per-package constant.
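The first bullet's arithmetic can be checked with a few lines. This is a rough sketch, not a profiler: it assumes mixed-precision Adam with FP16 weights and gradients, two FP32 moments, and an FP32 master copy, and it ignores activations and fragmentation. The function name `training_footprint_gb` is made up for illustration.

```python
GB = 1e9  # decimal gigabytes, matching the "70B in FP16 is about 140 GB" figure


def training_footprint_gb(params: float, weight_bytes: int = 2) -> dict:
    """Rough per-replica memory for mixed-precision Adam, ignoring activations."""
    weights = params * weight_bytes
    gradients = params * weight_bytes   # assumed: same dtype as the weights
    adam_moments = params * 4 * 2       # two FP32 moments per parameter
    master_copy = params * 4            # assumed: FP32 master weights are kept
    parts = {
        "weights": weights,
        "gradients": gradients,
        "adam_moments": adam_moments,
        "master_copy": master_copy,
    }
    parts["total"] = sum(parts.values())
    return {name: size / GB for name, size in parts.items()}


if __name__ == "__main__":
    for name, gb in training_footprint_gb(70e9).items():
        print(f"{name:>13}: {gb:8.1f} GB")
```

Under these assumptions the weights are only about an eighth of the training footprint, which is why the optimizer states are the first candidates for a slower tier.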
Functional requirements
- Transparently extend capacity beyond local HBM — allocate buffers larger than one GPU's memory without rewriting kernels.
- Share a memory pool across GPUs and nodes, so stranded capacity on one host is usable by another (dynamic attach / detach).
- Tier automatically between HBM, host DRAM, CXL, and remote memory based on access patterns, with the hot working set kept resident.
- Preserve correctness: consistent reads/writes and, where the fabric is coherent, ordinary load/store semantics.
Non-functional requirements
- Acceptable latency & bandwidth. Added access cost must be hideable behind compute; effective throughput should not collapse versus an all-HBM baseline for the targeted workloads.
- Capacity scaling. Grow the pool to terabytes independently of GPU count.
- Cost efficiency. Tiered DRAM/CXL must be materially cheaper per gigabyte than adding HBM-equipped GPUs, and pooling must raise utilization.
- Isolation & reliability. Pooled memory is a shared fault and noisy-neighbor domain; tenants need bandwidth/capacity isolation and graceful behavior when a pool node fails.
The memory hierarchy: latency & bandwidth tiers
Disaggregation only makes sense against the shape of the memory hierarchy. Each step away from the GPU buys more capacity at lower cost but pays in higher latency and lower bandwidth. The numbers below are order-of-magnitude — exact values depend on generation, link width, and topology — but the ratios are what drive the design.
| Tier | Typical latency | Typical bandwidth | Role |
|---|---|---|---|
| Local HBM | ~200–500 ns | ~3–8 TB/s | Weights, active KV, the hot working set |
| Host DRAM over PCIe Gen5 | ~1–2 µs | ~50–64 GB/s per x16 link | Software offload tier (weights, optimizer, KV) |
| Host DRAM over NVLink-C2C | ~0.5–1 µs | ~450–900 GB/s | Coherent host-memory extension (superchip) |
| CXL-attached memory | ~300 ns–1 µs | ~tens of GB/s per link | Cache-coherent pooled capacity expansion |
| Remote memory over RDMA | ~1–3 µs | ~tens of GB/s per NIC | Disaggregated far-memory pool |
| NVMe SSD | ~10–100 µs | ~3–7 GB/s | Cold spill / capacity of last resort |
Read the table top to bottom and the story is a
cliff, not a slope. HBM bandwidth is measured in
terabytes per second; every tier below it is in
gigabytes per second — a 50–100× drop. Latency moves the
other way: from hundreds of nanoseconds in HBM to single-digit
microseconds at the pool, a 10×–1000× gap. That gap
is the core challenge. A GPU streaming multiprocessor that stalls on a
microsecond-scale load idles for thousands of clock cycles, so remote
memory is only viable if accesses are
predictable enough to prefetch and
overlappable with compute. Note also that
CXL and RDMA are different beasts: CXL
exposes byte-addressable, cache-coherent load/store (the GPU
or CPU touches it like NUMA-far DRAM), while RDMA moves
pages/blocks with explicit one-sided verbs. NVLink-C2C is the
outlier — it is so wide that coherent host DRAM behaves almost like a
slow second HBM tier.
flowchart TD
GPU["GPU compute (SMs)"]
HBM["Local HBM (TB/s, ~100s ns)"]
DRAM["Host DRAM via PCIe / NVLink-C2C"]
CXL["CXL-attached memory (pooled)"]
RDMA["Remote memory over RDMA (~us)"]
NVME["NVMe SSD (~10-100 us, cold spill)"]
GPU --> HBM
HBM -->|"capacity miss"| DRAM
DRAM --> CXL
CXL --> RDMA
RDMA --> NVME
HBM -. "latency gap 10x-1000x" .-> RDMA
Bandwidth hides; latency stalls
Two different problems hide inside "remote is slow". Bandwidth limits sustained throughput — you fix it by moving less data (keep the working set resident) or by streaming in parallel with compute. Latency limits a single dependent access — you fix it by issuing the fetch early (prefetch) so the answer is already in HBM before the kernel asks. Disaggregation succeeds when both can be hidden, and fails the moment an access is random, on the critical path, and un-prefetchable.
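A simple cost model makes the split between the two problems visible: the time for one transfer is roughly a fixed latency plus size divided by bandwidth. The sketch below uses assumed midpoints of the order-of-magnitude ranges in the tier table, so the outputs are illustrative only. The names `TIERS`, `transfer_time_s`, and `latency_share` are hypothetical.

```python
# Assumed midpoints of the ranges in the tier table; real values vary by
# generation, link width, and topology.
TIERS = {
    "hbm":  {"latency_s": 300e-9, "bandwidth_Bps": 5e12},
    "pcie": {"latency_s": 1.5e-6, "bandwidth_Bps": 64e9},
    "cxl":  {"latency_s": 600e-9, "bandwidth_Bps": 32e9},
    "rdma": {"latency_s": 2e-6,   "bandwidth_Bps": 25e9},
}


def transfer_time_s(tier: str, size_bytes: int) -> float:
    """First-order cost of one transfer: fixed latency plus streaming time."""
    t = TIERS[tier]
    return t["latency_s"] + size_bytes / t["bandwidth_Bps"]


def latency_share(tier: str, size_bytes: int) -> float:
    """Fraction of the transfer time that is pure latency."""
    return TIERS[tier]["latency_s"] / transfer_time_s(tier, size_bytes)


if __name__ == "__main__":
    for size in (64, 4096, 1 << 30):
        for tier in TIERS:
            total_us = transfer_time_s(tier, size) * 1e6
            share = latency_share(tier, size) * 100
            print(f"{size:>11} B via {tier:<4}: {total_us:12.2f} us "
                  f"({share:5.1f}% latency)")
```

Small accesses are almost entirely latency, so only prefetching helps them; gigabyte-scale streams are almost entirely bandwidth, so only moving less data or overlapping with compute helps those.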
Approaches: offload, CXL pooling, and RDMA far-memory
Three families of techniques extend HBM, and they sit at different points on the triangle of latency vs capacity vs transparency. Real systems combine them into a tier stack.
(a) Host-memory offload / tiering
The framework-level approach: spill cold or not-yet-needed tensors
from HBM into
host DRAM (and onward to NVMe), then stream them back
just in time. This is what ZeRO-Offload and
ZeRO-Infinity do for training — parking
optimizer states, gradients, and even parameters
in CPU memory — and what KV-cache offload does for inference. It is
software-managed and fully transparent at the framework
boundary: the user calls the same API and the runtime decides what lives
where. Capacity is large (host DRAM is cheap and plentiful), but
bandwidth is gated by the PCIe link (tens of GB/s),
so it only works when transfers overlap compute. NVLink-C2C
dramatically widens this path on coherent superchips, turning offload
from a last resort into a routine tier.
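The condition "it only works when transfers overlap compute" can be sketched as a timing model. The sketch assumes a one-layer lookahead with two buffers, so the fetch of layer i+1 starts together with the compute of layer i. It is a back-of-the-envelope model rather than a measurement, and `serial_time` and `pipelined_time` are hypothetical helper names.

```python
def serial_time(compute, fetch):
    """No overlap: every fetch sits on the critical path."""
    return sum(compute) + sum(fetch)


def pipelined_time(compute, fetch):
    """One-deep lookahead: fetch of layer i+1 overlaps compute of layer i."""
    if len(compute) != len(fetch):
        raise ValueError("need one fetch time per layer")
    total = fetch[0]  # the first fetch has nothing to hide behind
    for i in range(len(compute)):
        next_fetch = fetch[i + 1] if i + 1 < len(fetch) else 0
        # The step ends when both the kernel and the next fetch are done.
        total += max(compute[i], next_fetch)
    return total


if __name__ == "__main__":
    layers = 4
    compute_ms = [10] * layers
    for fetch_each in (8, 15):
        fetch_ms = [fetch_each] * layers
        print(f"fetch={fetch_each} ms/layer: "
              f"serial={serial_time(compute_ms, fetch_ms)} ms, "
              f"pipelined={pipelined_time(compute_ms, fetch_ms)} ms")
```

With 8 ms fetches behind 10 ms of compute, only the first fetch is exposed. With 15 ms fetches the link becomes the bottleneck and every step waits on it, which is the case a wider path such as NVLink-C2C relieves.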
(b) CXL memory pooling
CXL (Compute Express Link) rides the PCIe physical layer
but adds cache-coherent load/store semantics. A CXL
memory device appears as an extra, byte-addressable NUMA node:
software reads and writes it with ordinary instructions —
no explicit copy, no app rewrite. CXL 2.0 introduces
pooling, where one memory appliance is carved up and
dynamically assigned to many hosts (so stranded capacity is
reclaimed), and CXL 3.0 adds multi-level switching and memory
sharing across hosts. The trade: latency sits above local
DRAM (an extra hop of a few hundred nanoseconds) and coherence traffic
has overhead, but it is the
lowest-latency, most transparent
way to add pooled capacity. GPUs reach it either through the host's
coherent fabric or, increasingly, more directly.
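On Linux, one way to see the "extra NUMA node" is to look for nodes that have memory but no CPUs. The sketch below assumes the CXL memory has been onlined as ordinary system RAM and that the standard sysfs layout under `/sys/devices/system/node` is present. A CPU-less node is only a hint, since other memory-only devices look the same. `cpuless_numa_nodes` is a hypothetical helper.

```python
from pathlib import Path


def cpuless_numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> dict:
    """Return {node_name: MemTotal in kB} for NUMA nodes with no CPUs."""
    found = {}
    for node_dir in sorted(Path(sysfs_root).glob("node[0-9]*")):
        cpulist = (node_dir / "cpulist").read_text().strip()
        if cpulist:
            continue  # node has CPUs, so it is ordinary socket-local memory
        mem_kb = 0
        for line in (node_dir / "meminfo").read_text().splitlines():
            # Lines look like: "Node 2 MemTotal:       134217728 kB"
            if "MemTotal:" in line:
                mem_kb = int(line.split()[-2])
        found[node_dir.name] = mem_kb
    return found
```

Calling `cpuless_numa_nodes()` on a host returns an empty dict when no such node exists. Once a node is identified, existing NUMA placement tools can steer cold allocations onto it without any change to application code.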
(c) RDMA far-memory / memory disaggregation
The most aggressive option treats
another machine's DRAM as a memory pool, reached with
one-sided RDMA reads/writes
over InfiniBand or RoCE — no remote CPU involvement on the data path.
Classic far-memory systems (Infiniswap, AIFM, and friends) page memory
in and out at
page or object granularity, often behind a fault
handler so the application sees ordinary memory. This buys the
largest, most flexible pool — terabytes, decoupled
entirely from GPU count — at the cost of
microsecond latency and
page-granular, explicit movement. It demands
aggressive prefetch and asynchrony to stay off the critical path.
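The page-granular behaviour can be illustrated with a toy demand-paged buffer. This is a simulation under stated assumptions, not an RDMA client: a local `bytearray` stands in for the remote pool, each fault stands in for one one-sided read, and there is no eviction. `FarBuffer` is a hypothetical class, not the API of any far-memory system named above.

```python
class FarBuffer:
    """Toy demand-paged view over a 'remote' byte pool."""

    def __init__(self, remote: bytearray, page_size: int = 4096):
        self.remote = remote
        self.page_size = page_size
        self.local = {}   # page number -> bytes already fetched
        self.faults = 0   # each fault models one one-sided remote read

    def _page(self, number: int) -> bytes:
        if number not in self.local:
            self.faults += 1
            start = number * self.page_size
            self.local[number] = bytes(self.remote[start:start + self.page_size])
        return self.local[number]

    def read(self, offset: int, length: int) -> bytes:
        if offset < 0 or length < 0 or offset + length > len(self.remote):
            raise ValueError("read outside the pool")
        out = bytearray()
        end = offset + length
        while offset < end:
            number, within = divmod(offset, self.page_size)
            chunk = self._page(number)[within:within + (end - offset)]
            out += chunk
            offset += len(chunk)
        return bytes(out)


if __name__ == "__main__":
    pool = bytearray(range(256)) * 64          # 16 KiB stand-in for remote DRAM
    buf = FarBuffer(pool)
    data = buf.read(4090, 12)                  # straddles a page boundary
    moved = buf.faults * buf.page_size
    print(f"read {len(data)} bytes, faulted {buf.faults} pages, moved {moved} bytes")
```

A 12-byte read that straddles a boundary pulls two full pages, which is the amplification that makes page-granular far memory a poor fit for small random accesses and a good fit for large sequential ones.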
| Approach | Latency | Capacity | Transparency | Granularity |
|---|---|---|---|---|
| Host offload (PCIe/C2C) | µs (C2C lower) | Large (host DRAM + NVMe) | Framework-managed | Tensor / block |
| CXL pooling | Lowest of the three | Pool-sized, shared | Load/store, transparent | Cache line / byte |
| RDMA far-memory | ~1–3 µs | Largest, most elastic | Page-fault or explicit | Page / object |
Conceptually the GPU sits at the apex with its hot working set in HBM, and reaches outward through progressively slower, larger tiers — host DRAM beside it, a pooled CXL device on the local fabric, and a remote DRAM pool across the network:
flowchart LR
GPU["GPU + HBM (hot working set)"]
subgraph LOCAL["Local node"]
HOST["Host DRAM (offload tier)"]
end
subgraph FABRIC["Shared memory fabric"]
CXL["CXL memory (pooled)"]
POOL["Remote DRAM pool"]
end
GPU -->|"a: PCIe / NVLink-C2C"| HOST
GPU -->|"b: CXL load/store"| CXL
HOST -->|"c: RDMA verbs"| POOL
CXL --> POOL
HOST --> CXL
Tiering & prefetching: hiding the latency gap
Pooled capacity is only useful if the GPU rarely waits on it. The runtime therefore behaves like an operating system's virtual-memory manager, with the same toolkit applied to HBM as the precious tier:
- Hot/cold classification. Track recency and frequency of access per page/block. The working set — current-layer weights, the active region of the KV-cache, tensors needed this step — stays pinned in HBM; everything cold is demoted to CXL/host/remote. Getting this split right is the whole game: HBM should hold what is touched now, not what might be.
- Prefetching to hide latency. Most ML access is gloriously predictable. Transformer inference walks weights layer by layer; attention reads KV in order. So issue the fetch for layer L+1 (or the next KV block) while layer L computes. If the prefetch lands before the kernel needs the data, the µs latency is completely masked.
- Overlap transfer with compute. Use separate CUDA streams and the GPU's dedicated copy engines with double-buffering: one buffer feeds the running kernel while the next is filled from the pool. The DMA runs concurrently with the SMs, so movement costs throughput only if it exceeds compute time.
- Page migration / promotion-demotion. When a remote page turns hot, promote it into HBM; when an HBM page goes cold, demote it to make room. Migration has a cost, so hysteresis avoids thrashing pages that oscillate around the threshold.
- What to keep in HBM. The resident set should be the active working set plus a prefetch lookahead window — never the whole model if it does not fit. Cold KV, stale optimizer state, and inactive MoE experts belong in the pool.
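The hot/cold classification and the hysteresis on promotion and demotion can be sketched together. This is one possible policy under stated assumptions: a decayed access count per page and two thresholds with a gap between them. The thresholds and the class name `HysteresisTier` are illustrative, not taken from any particular runtime.

```python
class HysteresisTier:
    """Decayed-count hot/cold tracker with separate promote/demote thresholds."""

    def __init__(self, promote_at: float = 8, demote_at: float = 2,
                 decay: float = 0.5):
        if demote_at >= promote_at:
            raise ValueError("demote_at must sit below promote_at")
        self.promote_at = promote_at
        self.demote_at = demote_at
        self.decay = decay
        self.heat = {}        # page -> decayed access count
        self.in_hbm = set()   # pages currently resident in the fast tier

    def touch(self, page) -> None:
        self.heat[page] = self.heat.get(page, 0.0) + 1.0

    def end_epoch(self):
        """Decide migrations for this epoch, then age every counter."""
        promoted, demoted = [], []
        for page in list(self.heat):
            h = self.heat[page]
            if page not in self.in_hbm and h >= self.promote_at:
                self.in_hbm.add(page)
                promoted.append(page)
            elif page in self.in_hbm and h <= self.demote_at:
                self.in_hbm.remove(page)
                demoted.append(page)
            self.heat[page] = h * self.decay
        return promoted, demoted


if __name__ == "__main__":
    tier = HysteresisTier()
    for _ in range(8):
        tier.touch("kv-block-7")
    for epoch in range(3):
        print(epoch, tier.end_epoch())
```

The gap between the two thresholds is what prevents thrashing: a page that goes quiet for one epoch stays resident, and it is demoted only after its decayed count falls all the way to the lower threshold.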
The access path below shows the fast HBM hit, the miss that pulls a page from the pool and demotes a cold one, and the crucial last step — prefetching the next page so the following access is a hit:
sequenceDiagram
participant K as GPU Kernel
participant H as Local HBM
participant T as Tiering Manager
participant R as CXL / Remote Pool
K->>H: Access page P
alt P resident in HBM
H-->>K: Hit (fast path, ns)
else P is cold (evicted)
H->>T: Miss on P
T->>R: Fetch P
R-->>T: Page P bytes
T->>H: Install P, demote cold page
H-->>K: Resume kernel
end
Note over T,R: Overlap with compute
T->>R: Prefetch P+1
R-->>H: Stage next page
Predictability is the enabler
Disaggregation works for ML precisely because the access stream is regular — sequential layer sweeps and ordered KV reads — which makes prefetch accurate and overlap easy. The technique degrades sharply for irregular, data-dependent access (random embedding gather, pointer chasing) where the next address is unknown until the current load returns, leaving nothing to prefetch and the µs latency fully exposed.
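The contrast between regular and irregular access shows up even in a toy simulation. The sketch below models HBM as a small LRU set of pages with an optional next-page prefetch, and it optimistically assumes every prefetch lands before it is needed. The page counts, the seed, and the function names are arbitrary choices for illustration.

```python
import random
from collections import OrderedDict


def _install(resident: OrderedDict, page: int, capacity: int) -> None:
    """Make a page resident and most-recently-used, evicting LRU pages."""
    resident[page] = None
    resident.move_to_end(page)
    while len(resident) > capacity:
        resident.popitem(last=False)


def hit_rate(accesses, hbm_pages: int, prefetch_next: bool) -> float:
    """Fraction of accesses that find their page already resident."""
    resident = OrderedDict()
    hits = 0
    for page in accesses:
        if page in resident:
            hits += 1
            resident.move_to_end(page)
        else:
            _install(resident, page, hbm_pages)       # demand fetch
        if prefetch_next:
            _install(resident, page + 1, hbm_pages)   # assumed to arrive in time
    return hits / len(accesses)


sequential_stream = list(range(1000))                 # layer sweep / ordered KV
_rng = random.Random(0)
random_stream = [_rng.randrange(100_000) for _ in range(1000)]  # sparse gather

if __name__ == "__main__":
    for name, stream in (("sequential", sequential_stream),
                         ("random", random_stream)):
        for prefetch in (False, True):
            rate = hit_rate(stream, hbm_pages=4, prefetch_next=prefetch)
            print(f"{name:<10} prefetch={prefetch!s:<5} hit rate={rate:.3f}")
```

With only four resident pages, the sequential stream goes from no hits to missing only its first access once prefetch is enabled, while the random stream gains almost nothing because the next page number carries no information.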
Use cases: where disaggregation pays off
- KV-cache offload for long-context inference. The KV-cache grows linearly with sequence length and dominates HBM at long context (see the LLM KV-Cache Management page). Cold blocks for large or paused sessions spill to host/CXL/remote and are fetched back per layer during attention. Because KV access is append-mostly and read in order, it prefetches beautifully — letting a fixed HBM budget hold many more concurrent long-context sessions.
- Optimizer-state offload in training. Adam keeps first/second moments plus an FP32 master copy, roughly 12 bytes per parameter in mixed-precision training and several times the FP16 weight bytes, and these are touched only once per step in the optimizer phase. Parking them in CPU/CXL memory (ZeRO-Offload/Infinity) frees enormous HBM and lets a small GPU count train a model that would otherwise never fit, since the states stream in only when needed.
- Large-embedding serving. Recommendation and retrieval models carry embedding tables of hundreds of gigabytes to terabytes. The pool holds the full table while HBM caches the hot rows; sparse gathers hit cache for popular IDs and fall through to the fabric for the long tail.
- MoE expert offload. Mixture-of-Experts activates only a few experts per token, so inactive experts can live in pooled memory and be streamed in on selection, trading a fetch for a large HBM saving.
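For the KV-cache case, the linear growth is easy to quantify. The sketch below assumes a hypothetical model shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128, FP16 cache) and a 40 GiB HBM budget reserved for KV; none of these numbers come from the text above.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV-cache one token adds: a K and a V vector per head per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def sessions_that_fit(hbm_budget_bytes: int, context_tokens: int,
                      per_token_bytes: int) -> int:
    """Whole sessions whose full KV-cache fits in the HBM budget."""
    return hbm_budget_bytes // (context_tokens * per_token_bytes)


if __name__ == "__main__":
    per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
    budget = 40 * 2**30  # assumed: 40 GiB of HBM left over for KV
    print(f"KV per token: {per_token / 1024:.0f} KiB")
    for context in (8_192, 32_768, 131_072):
        fit = sessions_that_fit(budget, context, per_token)
        print(f"context {context:>7}: {fit} resident sessions")
```

Under these assumptions each fourfold increase in context cuts the resident session count by four, so offloading the cold blocks of paused or very long sessions is what keeps concurrency from collapsing at long context.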
| Disaggregation helps when… | …and hurts when |
|---|---|
| Access is predictable / prefetchable (layer sweeps, ordered KV) | Access is random & data-dependent (sparse gather, pointer chase) |
| Reuse is low — data is touched once per step or session | Reuse is high & tight — hot data thrashes across the fabric |
| Workload is capacity-bound and latency-tolerant (training, batch, long-context) | Latency-critical small-batch decode where every µs is on the critical path |
| Transfers overlap abundant compute | Steady state is already bandwidth-bound — no slack to hide movement |
Bottlenecks & scaling
| Bottleneck | Why it bites | Mitigation |
|---|---|---|
| Latency gap | HBM answers in ~100s ns, the pool in µs — a 10×–1000× gap that stalls SMs on any uncovered access | Prefetch ahead of use; overlap with compute; keep the working set resident; prefer CXL/C2C over RDMA for hot tiers |
| Interconnect bandwidth | PCIe/CXL/NIC links move tens of GB/s versus HBM's TB/s — a 50–100× cliff that caps sustained streaming | Move less data (good hot/cold split); compress/quantize; wider links (NVLink-C2C); stream in parallel with compute |
| Coherence overhead | Cache-coherent fabrics (CXL) pay snoop/directory traffic; shared pages across hosts add invalidation cost | Partition ownership; prefer read-mostly sharing; coarse-grained coherence; pin private hot data |
| Prefetch accuracy | A wrong prediction wastes bandwidth and still stalls on the real miss; irregular access defeats the predictor | Exploit known patterns (layer order, sequential KV); lookahead windows; fall back to demand-fetch gracefully |
| Cost & utilization | A pool only pays off if it is cheaper per GB and well utilized; idle pool capacity is just relocated stranding | Dynamic pooling/allocation across hosts; oversubscription with isolation; tier cold data to NVMe |
| Fault & tail domain | A pool node or link failure now affects many tenants; µs tiers add tail latency under contention | Replication/erasure for durable pools; bandwidth isolation/QoS; blast-radius limits; checkpoint critical state |
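The latency-gap and prefetch-accuracy rows combine into one number: the average cost of an access given how often it is covered. The sketch below assumes 300 ns for HBM and 2 µs for the pool, both picked from inside the ranges in the tier table, and treats every uncovered access as a full stall. The function names are hypothetical.

```python
def effective_latency_ns(hit_rate: float, hbm_ns: float = 300.0,
                         pool_ns: float = 2000.0) -> float:
    """Average access latency when a fraction hit_rate is served from HBM."""
    return hit_rate * hbm_ns + (1.0 - hit_rate) * pool_ns


def required_hit_rate(target_ns: float, hbm_ns: float = 300.0,
                      pool_ns: float = 2000.0) -> float:
    """Hit rate needed to keep the average latency at or below target_ns."""
    return (pool_ns - target_ns) / (pool_ns - hbm_ns)


if __name__ == "__main__":
    for rate in (0.5, 0.9, 0.99):
        print(f"hit rate {rate:.2f}: {effective_latency_ns(rate):7.1f} ns average")
    target = 330.0  # 10% above the assumed HBM latency
    print(f"staying under {target:.0f} ns needs a hit rate of "
          f"{required_hit_rate(target):.3f}")
```

Under these assumptions, even a 90% covered stream averages more than one and a half times the HBM latency, which is why the mitigations focus on pushing coverage toward the high nineties rather than on shaving the pool's latency.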
Summary
Disaggregated memory turns HBM from a hard per-package ceiling into the top tier of a managed hierarchy. Keep the hot working set in HBM, extend it with host DRAM, CXL pools, and RDMA far-memory, and let a virtual-memory-style manager classify hot/cold, prefetch along predictable access streams, and overlap every transfer with compute. The win is real — terabyte pools, reclaimed stranded capacity, and models that no longer have to fit one GPU — but it is bounded by a single hard constant: the latency gap between nanoseconds and microseconds. Pick the tier that matches the access pattern (CXL for transparent low-latency expansion, RDMA for elastic capacity, host offload for cheap bulk), exploit the regularity of ML access to hide that gap, and disaggregation buys capacity and utilization at a fraction of the cost of adding GPUs — fail to hide it, and the pool becomes a stall machine.