AI / ML Infrastructure
GPU Cluster Network Topology
At 10,000+ nodes (80k–100k GPUs), the network
is the computer. Synchronous training is bulk-synchronous:
every step ends with a collective — an
all-reduce of gradients, an all-to-all for
experts — and that collective runs at the speed of the
slowest link on the worst path. Add GPUs and the arithmetic
scales linearly, but the communication does not, so training becomes
communication-bound and the interconnect — not
the FLOPs — sets the ceiling on
model-FLOPs-utilization (MFU). This page is about
designing that interconnect: the bandwidth hierarchy, the topology
choices (fat-tree vs dragonfly vs
torus), rail-optimized fabrics, RDMA and
congestion control, and keeping it all alive when optics fail by the
hour.
Requirements
Most system-design problems are dominated by storage or
request/response latency. A training fabric is different: the workload
is one giant bulk-synchronous job where thousands of
GPUs march in lockstep. After every micro-step they exchange gradients
and activations through
collective communication —
all-reduce, all-gather,
reduce-scatter, all-to-all. Those primitives
are all-to-all-ish traffic patterns that stress the whole
bisection of the network at once, and they are
synchronization barriers: the step cannot advance
until the slowest rank finishes. That is why the network, not the
matrix multiply, is the design problem.
Compute grows with model FLOPs; communication grows with model
size divided by link bandwidth. As you scale out, the second
term wins — a poorly built network leaves tens of thousands of
GPUs idle waiting on an all-reduce. The goal of the
fabric is to make communication
cheap, uniform, and predictable enough that
collectives hide behind compute.
Functional requirements
-
Connect 10,000+ nodes (
~80k–100kGPUs) into a single logical fabric that any rank can reach any other rank. -
Efficiently support the collective primitives NCCL
relies on (
all-reduce,all-gather,reduce-scatter,broadcast,all-to-all) — not just point-to-point reachability. -
Provide RDMA with
GPUDirectso a NIC can DMA straight into remoteGPU HBM, bypassing the kernel and CPU. - Expose the physical topology to the scheduler and to NCCL so jobs can be placed and collectives ringed along the fastest links.
Non-functional requirements
| Quality | Why it matters here | Target |
|---|---|---|
| High bisection bandwidth | Collectives saturate the worst-case network cut; this is the headline number | Full / non-blocking (1:1) within a pod |
| Low, predictable latency | Bulk-synchronous step = speed of the slowest rank; tail and jitter create stragglers | Single-digit microseconds, bounded tail |
| Scalability | Must reach 10k+ nodes (and grow) without re-architecting the fabric | Add pods/islands incrementally |
| Cost | Optics + switches are a huge slice of cluster capex; long cables dominate | Right-size oversubscription, minimize global optics |
| Fault tolerance | Hundreds of thousands of links; optics flap and switches die continuously | Reroute & degrade gracefully, no full outage |
Bisection bandwidth is the metric to design for
Bisection bandwidth is the aggregate bandwidth
across the worst-case cut that splits the cluster in half. A ring or
tree all-reduce of a model with P parameters pushes
roughly 2P bytes across the fabric per step, and every
rank participates simultaneously — so the collective runs only
as fast as that cut allows. A non-blocking (1:1)
fabric provides full bisection: any permutation of node pairs can
talk at line rate at once. The moment you oversubscribe,
the cut becomes the bottleneck and MFU drops.
Bandwidth hierarchy
There is no single “network speed” in a GPU cluster — there is a steep hierarchy, and every tier you descend costs roughly an order of magnitude of bandwidth and adds latency. Understanding this cliff is the foundation of every placement and topology decision: keep the chattiest traffic on the fastest tier, and cross the slow tiers as rarely as possible.
| Tier | Technology | Scope | Bandwidth (per GPU) | Latency |
|---|---|---|---|---|
| Intra-GPU | HBM3 / HBM3e stacks |
On-package memory | ~3–8 TB/s |
Nanoseconds |
| Intra-node (scale-up) | NVLink via NVSwitch |
8–72 GPUs in one node/domain | ~900 GB/s |
Sub-microsecond |
| Inter-node (scale-out) | InfiniBand NDR / 400G RoCE |
Across the whole cluster | ~50 GB/s (400 Gb/s NIC) |
Low microseconds |
The drop from NVLink (~900 GB/s) to a single
400 Gb/s NIC (~50 GB/s) is roughly
18×. That gap is the
cost cliff: bandwidth inside a node is plentiful and
(relatively) cheap; bandwidth between nodes is scarce and expensive,
because it is built from optical transceivers and switch ports. This
is the difference between scale-up — putting
more GPUs into one tightly-coupled NVLink domain (e.g. an
NVL72 rack of 72 GPUs) — and
scale-out, adding more nodes across the
InfiniBand / Ethernet fabric. Scale-up gives you the
best-connected GPUs you can buy; scale-out gives you count,
at the price of crossing the slow tier.
A single GPU node is itself a small network. The diagram below shows
the three tiers that matter for design: GPUs talk to each other
through an NVSwitch at NVLink speed, each node reaches
the fabric through one or more RDMA NICs, and those NICs
climb a switching hierarchy of leaf (top-of-rack) and
spine
(core) switches.
flowchart TD
subgraph ND["GPU node (8x GPU)"]
G0["GPU 0"]
G1["GPU 1"]
G7["GPU 7"]
SW["NVSwitch (~900 GB/s)"]
NIC["RDMA NIC (~400 Gb/s)"]
end
G0 --> SW
G1 --> SW
G7 --> SW
SW --> NIC
L1["Leaf switch (ToR)"]
L2["Leaf switch (ToR)"]
S1["Spine switch"]
S2["Spine switch"]
NIC --> L1
L1 --> S1
L1 --> S2
L2 --> S1
L2 --> S2
The bandwidth hierarchy in one picture: HBM inside each GPU, NVLink/NVSwitch within the node (~900 GB/s), then a step down to RDMA NICs (~400 Gb/s) that climb leaf and spine switches to reach the rest of the cluster.
The golden rule: chattiest traffic on the fastest tier
Tensor-parallel all-reduce fires every layer,
so it must stay inside the NVLink domain. Pipeline
activations are small and point-to-point, so they can ride the
inter-node fabric. Data-parallel all-reduce is large
but happens once per step and overlaps with the backward pass, so it
too can span nodes. Mapping a chatty axis onto the slow tier is the
fastest way to waste a cluster.
Topologies: fat-tree vs dragonfly vs torus
Once traffic leaves the node it enters a switching topology, and the choice of topology fixes the cluster’s bisection bandwidth, its worst-case hop count (diameter), its cost, and how far it scales. Three families dominate the conversation.
Fat-tree (folded Clos)
A fat-tree is a multi-tier Clos network: nodes hang off leaf (top-of-rack) switches, leaves connect up to a spine tier, and large clusters add a third core tier. If each switch dedicates equal bandwidth up and down (1:1), the fabric is non-blocking and delivers full bisection bandwidth — any node can reach any other at line rate. Latency is uniform because every leaf-to-leaf path is the same small number of hops (3–5). The price is hardware: a non-blocking Clos needs a lot of switches and a lot of long optical cables, and the spine grows with the cluster. This is the default for GPU training clusters, almost always in a rail-optimized variant.
flowchart TD
subgraph SPINE["Spine tier (core)"]
S1["Spine 1"]
S2["Spine 2"]
end
subgraph LEAF["Leaf tier (ToR)"]
L1["Leaf 1"]
L2["Leaf 2"]
L3["Leaf 3"]
end
S1 --> L1
S1 --> L2
S1 --> L3
S2 --> L1
S2 --> L2
S2 --> L3
L1 --> R1["Rack 1 nodes"]
L2 --> R2["Rack 2 nodes"]
L3 --> R3["Rack 3 nodes"]
A non-blocking fat-tree: every leaf connects to every spine. Because all leaf-to-leaf paths traverse the same two hops through the spine, latency is uniform and the bisection is full.
Dragonfly
A dragonfly arranges routers into groups that are fully connected internally, with a sparse mesh of global (long, expensive optical) links between groups. The win is cost at extreme scale: it needs far fewer long-haul cables than a Clos while keeping the diameter tiny (~3 global hops). The catch is that those global links are oversubscribed, so a naive shortest-path route creates hotspots; dragonfly therefore requires adaptive routing (e.g. UGAL, which may send a packet the “long way” through an intermediate group to dodge congestion), and latency is non-uniform. Common in HPC fabrics (Cray Slingshot).
Torus
A torus wires each node directly to its neighbors in
a 2D/3D mesh whose edges wrap around. With few or no switches it is
the cheapest to cable and scales wiring linearly,
which is excellent for nearest-neighbor communication. But
its diameter grows with the cube root of node count,
so distant pairs suffer many hops and high latency, and global
all-reduce at huge scale is poor unless the collective is
mapped carefully to the mesh. Google TPU pods use a 3D torus,
augmented with optical circuit switches for reconfiguration and fault
isolation.
| Topology | Bisection bandwidth | Diameter / hops | Cost (optics) | Routing | Best for |
|---|---|---|---|---|---|
| Fat-tree (Clos) | Full if non-blocking (1:1) | Low, uniform (3–5) | High — many switches + long cables | Static ECMP works | GPU training clusters (rail-optimized) |
| Dragonfly | High but global links oversubscribed | Very low (~3 global hops) | Lower — fewer global optics | Adaptive required (UGAL) | Very large HPC, cost-sensitive scale-out |
| Torus (2D/3D) | Low; depends on dimensions | Grows as N^(1/d), non-uniform |
Lowest — few/no switches | Dimension-order + fault rerouting | Nearest-neighbor patterns; TPU pods |
Oversubscription and network islands
Oversubscription is the ratio of downlink (to
nodes) to uplink (to spine) bandwidth. 1:1 is
non-blocking and full bisection; 2:1 or
4:1 is cheaper but throttles any collective that
crosses the spine. GPU fabrics therefore build
non-blocking pods / islands — a fat-tree unit
serving, say, 1k–4k GPUs at full bisection — and connect
islands with a higher, sometimes oversubscribed, tier. The
scheduling rule that follows is simple:
keep a job inside one island so its collectives
never pay the oversubscribed crossing.
Topology-aware collectives & placement
A fast fabric is necessary but not sufficient: the collective library
and the scheduler have to
map the communication pattern onto the physical wires. NCCL builds rings and
trees across the detected topology. A
ring all-reduce is bandwidth-optimal —
each GPU sends and receives only 2(N-1)/N of the data
— and is preferred for the large gradient buffers of
data-parallel training. A tree all-reduce is
latency-optimal (logarithmic depth) and wins for small
messages or very large scale. NCCL auto-detects NVLink, PCIe, and NICs
and lays its channels along the fastest links.
Rail alignment
The dominant GPU-cluster layout is the
rail-optimized fat-tree. Each of a node’s 8
GPUs has its own NIC wired to a distinct rail —
a dedicated leaf switch. All the GPU-0s across every node
share rail 0, all the GPU-1s share rail 1, and
so on. The payoff: a collective among same-rank GPUs stays
within a single rail (one leaf hop, no spine
crossing). Cross-rail reshuffles are handled
inside the node over NVLink. The result is a hierarchical
collective — reduce-scatter on NVLink,
all-reduce across the rail over the fabric,
all-gather back on NVLink — that keeps almost all
bytes on the fastest available tier.
flowchart TD
subgraph NA["Node A"]
A0["GPU 0"]
A1["GPU 1"]
end
subgraph NB["Node B"]
B0["GPU 0"]
B1["GPU 1"]
end
R0["Rail 0 leaf switch"]
R1["Rail 1 leaf switch"]
A0 --> R0
B0 --> R0
A1 --> R1
B1 --> R1
A0 --- A1
B0 --- B1
Rail-optimized fabric: same-index GPUs share a rail (leaf switch), so their all-reduce never touches the spine. NVLink (the dashed intra-node links) handles cross-rail shuffles, keeping bytes on the fast tier.
Placement and congestion
Because oversubscription bites only when traffic crosses tiers,
where a job lands is a first-class performance
decision. A topology-aware scheduler co-locates each data-parallel
all-reduce group on adjacent leaves within one island and
aligns ranks to rails; a job scattered across the fabric pays repeated
spine crossings and stalls on every collective. Even on a non-blocking
fabric, flow collisions hurt: ECMP hashes a few
elephant flows onto the same uplink, leaving other links idle.
Adaptive routing (per-packet or flowlet spraying)
spreads those flows, and in-network reduction (SHARP)
performs the sum inside the switch, roughly halving the bytes on the
wire. The scheduling side of this story — gang placement and
topology-aware bin-packing — is covered in the
GPU cluster scheduler.
Topology-aware scheduling is free performance
The same job, same GPUs, same model can differ by tens of percent of MFU depending only on placement. Packed into one rail-aligned island, its collectives are a single leaf hop; scattered across the cluster, they thrash an oversubscribed spine. No new hardware required — just teach the scheduler the fabric graph.
RDMA & congestion control
Moving gradients at 400 Gb/s with the CPU in the loop is
impossible — the copies and interrupts alone would dominate.
RDMA (Remote Direct Memory Access) lets a NIC read
and write remote memory directly, with kernel bypass and
zero copy, for microsecond latency at near-zero CPU cost.
GPUDirect RDMA goes one step further: the NIC DMAs
straight into and out of GPU HBM, skipping the bounce
through host memory entirely. Two fabrics deliver it.
| Dimension | InfiniBand | RoCEv2 (RDMA over Ethernet) |
|---|---|---|
| Losslessness | Lossless by design (credit-based flow control) | Ethernet is lossy — must be engineered lossless |
| Congestion control | Built-in; subnet manager programs routes |
PFC + ECN / DCQCN,
carefully tuned
|
| Ecosystem | Purpose-built, single-vendor, mature | Commodity Ethernet, multi-vendor, familiar |
| Cost | Higher per port | Lower — reuses Ethernet supply chain |
| Best when | Out-of-box performance & simplicity matter | Cost & vendor flexibility matter, with strong network team |
RDMA NICs historically tolerate packet loss badly (a single drop can trigger expensive retransmission and crater throughput), so the fabric must be effectively lossless. InfiniBand achieves this natively with credit-based flow control. RoCEv2 must build losslessness on Ethernet: PFC (Priority Flow Control) pauses a traffic class before a switch buffer overflows, while ECN marks packets under load and DCQCN reacts by throttling the sender’s rate to keep queues short. The art is keeping queues shallow enough that PFC rarely fires — because PFC pauses cause head-of-line blocking and can spread congestion (even deadlock) across the fabric if mistuned.
Incast: the many-to-one killer
Incast is the pathological pattern where many
senders converge on one receiver at the same instant — the
final stage of a reduce, or many ranks pushing to one
parameter shard. Their packets overflow the receiver’s switch
buffer, triggering drops or a storm of PFC pauses, and throughput
collapses. Defenses: aggressive ECN/DCQCN to hold queues short,
deeper switch buffers, in-network reduction so the
switch sums instead of forwarding every flow, and collective
algorithms (rings) that spread rather than concentrate traffic.
Fault tolerance
At 10,000+ nodes the fabric contains hundreds of thousands of optical links, and optics are the single largest source of failures. With that many components, something is always broken — a link flaps, a transceiver degrades, a switch dies. The fabric must survive this as routine, not as an incident, because a stalled fabric idles the entire (very expensive) job.
- Path redundancy. A multi-tier fat-tree offers many equal-cost paths (every leaf reaches every spine). Losing one spine or one uplink doesn’t cut connectivity — it shaves bisection bandwidth (lose 1 of 8 spines ≈ a 12.5% capacity hit), degrading gracefully instead of failing.
- Adaptive rerouting. A fabric manager (the InfiniBand subnet manager or an Ethernet controller) detects a link-down event, recomputes routes, and reprograms switch forwarding tables. Adaptive routing steers individual flows around the dead or congested link without restarting the job.
-
Gray failures. The dangerous case isn’t a
clean death — it’s a flaky optic with rising
CRC/ symbol-error counters that quietly slows one link. Because collectives run at the speed of the slowest rank, one sick link becomes a cluster-wide straggler. Detect it via per-link error counters and per-rank step-time, then quarantine the hardware before it tanks the job. - Draining a rail. When a leaf/rail switch degrades past tolerance, drain it: stop scheduling onto it, migrate or restart the affected ranks on healthy hardware, and flag it for repair. Paired with gang scheduling and checkpoint/restore, the job resumes from its last checkpoint on a clean island.
- Bounded blast radius. A core-switch failure touches far more traffic than a leaf. Build the fabric so the unit of failure is contained to a pod/island, and let the scheduler route new jobs around drained zones.
sequenceDiagram
participant N as Node NIC
participant L as Leaf switch
participant FM as Fabric manager
participant SCH as Scheduler
N->>L: send collective on rail 2
Note over L: spine uplink fails (CRC errors)
L->>FM: report link down
FM->>FM: recompute routes (adaptive)
FM-->>N: reroute via healthy spine
Note over FM,SCH: errors persist on the rail
FM->>SCH: drain rail 2, mark node suspect
SCH->>SCH: restart ranks on healthy island
Detect, reroute, drain: a failing uplink is rerouted in-fabric for transient errors; if errors persist, the rail is drained and the scheduler restarts affected ranks from the last checkpoint on a healthy island.
Bottlenecks & scaling
Scaling an interconnect from one rack to 10k+ nodes is a march from one bottleneck to the next. Each one shows up as a dip in MFU, and each mitigation trades cost, complexity, or bandwidth for predictability.
| Bottleneck | Symptom | Mitigation |
|---|---|---|
| Bisection bandwidth | Cross-cluster collectives stall; MFU falls as the job spreads | Non-blocking fat-tree, island packing, SHARP in-network reduction |
| Oversubscription | Cheaper spine throttles any traffic that crosses tiers | 1:1 within the pod, rail-optimized placement, keep jobs in one island |
| Congestion / incast | ECMP flow collisions; many-to-one buffer overflow; PFC pauses spread | Adaptive routing / packet spraying, ECN/DCQCN tuning, in-network reduction |
| Cost | Optics and switches dominate capex; long cables are the priciest part | Dragonfly to cut global optics, copper within rack, right-size oversubscription |
| Blast radius | Link/switch and gray failures gate whole collectives | Redundant paths, fabric-manager rerouting, rail draining, error-counter quarantine |
Summary
Designing the interconnect for a 10,000+ node GPU cluster is
fundamentally a
bandwidth-vs-cost-vs-blast-radius balancing act,
anchored on one metric: bisection bandwidth.
Respect the hierarchy — keep tensor-parallel
chatter on NVLink and cross the
InfiniBand/RoCE tier as little as
possible. Build a
non-blocking, rail-optimized fat-tree in pods,
reaching for dragonfly only when global-optics cost
forces the trade and a torus only for
nearest-neighbor patterns. Make the fabric
lossless with RDMA, PFC/ECN, and congestion control
to survive incast, then
place jobs topology-aware so collectives ring along
the fastest links. Finally, treat
failure as routine: redundant paths, adaptive
rerouting, and rail draining keep the fleet busy — which, at
this price per GPU-hour, is the entire game.