System Design Notes All designs

AI / ML Infrastructure

GPU Cluster Network Topology

At 10,000+ nodes (80k–100k GPUs), the network is the computer. Synchronous training is bulk-synchronous: every step ends with a collective — an all-reduce of gradients, an all-to-all for experts — and that collective runs at the speed of the slowest link on the worst path. Add GPUs and the arithmetic scales linearly, but the communication does not, so training becomes communication-bound and the interconnect — not the FLOPs — sets the ceiling on model-FLOPs-utilization (MFU). This page is about designing that interconnect: the bandwidth hierarchy, the topology choices (fat-tree vs dragonfly vs torus), rail-optimized fabrics, RDMA and congestion control, and keeping it all alive when optics fail by the hour.

Requirements

Most system-design problems are dominated by storage or request/response latency. A training fabric is different: the workload is one giant bulk-synchronous job where thousands of GPUs march in lockstep. After every micro-step they exchange gradients and activations through collective communicationall-reduce, all-gather, reduce-scatter, all-to-all. Those primitives are all-to-all-ish traffic patterns that stress the whole bisection of the network at once, and they are synchronization barriers: the step cannot advance until the slowest rank finishes. That is why the network, not the matrix multiply, is the design problem.

Compute grows with model FLOPs; communication grows with model size divided by link bandwidth. As you scale out, the second term wins — a poorly built network leaves tens of thousands of GPUs idle waiting on an all-reduce. The goal of the fabric is to make communication cheap, uniform, and predictable enough that collectives hide behind compute.

Functional requirements

Non-functional requirements

Quality Why it matters here Target
High bisection bandwidth Collectives saturate the worst-case network cut; this is the headline number Full / non-blocking (1:1) within a pod
Low, predictable latency Bulk-synchronous step = speed of the slowest rank; tail and jitter create stragglers Single-digit microseconds, bounded tail
Scalability Must reach 10k+ nodes (and grow) without re-architecting the fabric Add pods/islands incrementally
Cost Optics + switches are a huge slice of cluster capex; long cables dominate Right-size oversubscription, minimize global optics
Fault tolerance Hundreds of thousands of links; optics flap and switches die continuously Reroute & degrade gracefully, no full outage

Bisection bandwidth is the metric to design for

Bisection bandwidth is the aggregate bandwidth across the worst-case cut that splits the cluster in half. A ring or tree all-reduce of a model with P parameters pushes roughly 2P bytes across the fabric per step, and every rank participates simultaneously — so the collective runs only as fast as that cut allows. A non-blocking (1:1) fabric provides full bisection: any permutation of node pairs can talk at line rate at once. The moment you oversubscribe, the cut becomes the bottleneck and MFU drops.

Bandwidth hierarchy

There is no single “network speed” in a GPU cluster — there is a steep hierarchy, and every tier you descend costs roughly an order of magnitude of bandwidth and adds latency. Understanding this cliff is the foundation of every placement and topology decision: keep the chattiest traffic on the fastest tier, and cross the slow tiers as rarely as possible.

Tier Technology Scope Bandwidth (per GPU) Latency
Intra-GPU HBM3 / HBM3e stacks On-package memory ~3–8 TB/s Nanoseconds
Intra-node (scale-up) NVLink via NVSwitch 8–72 GPUs in one node/domain ~900 GB/s Sub-microsecond
Inter-node (scale-out) InfiniBand NDR / 400G RoCE Across the whole cluster ~50 GB/s (400 Gb/s NIC) Low microseconds

The drop from NVLink (~900 GB/s) to a single 400 Gb/s NIC (~50 GB/s) is roughly 18×. That gap is the cost cliff: bandwidth inside a node is plentiful and (relatively) cheap; bandwidth between nodes is scarce and expensive, because it is built from optical transceivers and switch ports. This is the difference between scale-up — putting more GPUs into one tightly-coupled NVLink domain (e.g. an NVL72 rack of 72 GPUs) — and scale-out, adding more nodes across the InfiniBand / Ethernet fabric. Scale-up gives you the best-connected GPUs you can buy; scale-out gives you count, at the price of crossing the slow tier.

A single GPU node is itself a small network. The diagram below shows the three tiers that matter for design: GPUs talk to each other through an NVSwitch at NVLink speed, each node reaches the fabric through one or more RDMA NICs, and those NICs climb a switching hierarchy of leaf (top-of-rack) and spine (core) switches.

flowchart TD
  subgraph ND["GPU node (8x GPU)"]
    G0["GPU 0"]
    G1["GPU 1"]
    G7["GPU 7"]
    SW["NVSwitch (~900 GB/s)"]
    NIC["RDMA NIC (~400 Gb/s)"]
  end
  G0 --> SW
  G1 --> SW
  G7 --> SW
  SW --> NIC
  L1["Leaf switch (ToR)"]
  L2["Leaf switch (ToR)"]
  S1["Spine switch"]
  S2["Spine switch"]
  NIC --> L1
  L1 --> S1
  L1 --> S2
  L2 --> S1
  L2 --> S2
      

The bandwidth hierarchy in one picture: HBM inside each GPU, NVLink/NVSwitch within the node (~900 GB/s), then a step down to RDMA NICs (~400 Gb/s) that climb leaf and spine switches to reach the rest of the cluster.

The golden rule: chattiest traffic on the fastest tier

Tensor-parallel all-reduce fires every layer, so it must stay inside the NVLink domain. Pipeline activations are small and point-to-point, so they can ride the inter-node fabric. Data-parallel all-reduce is large but happens once per step and overlaps with the backward pass, so it too can span nodes. Mapping a chatty axis onto the slow tier is the fastest way to waste a cluster.

Topologies: fat-tree vs dragonfly vs torus

Once traffic leaves the node it enters a switching topology, and the choice of topology fixes the cluster’s bisection bandwidth, its worst-case hop count (diameter), its cost, and how far it scales. Three families dominate the conversation.

Fat-tree (folded Clos)

A fat-tree is a multi-tier Clos network: nodes hang off leaf (top-of-rack) switches, leaves connect up to a spine tier, and large clusters add a third core tier. If each switch dedicates equal bandwidth up and down (1:1), the fabric is non-blocking and delivers full bisection bandwidth — any node can reach any other at line rate. Latency is uniform because every leaf-to-leaf path is the same small number of hops (3–5). The price is hardware: a non-blocking Clos needs a lot of switches and a lot of long optical cables, and the spine grows with the cluster. This is the default for GPU training clusters, almost always in a rail-optimized variant.

flowchart TD
  subgraph SPINE["Spine tier (core)"]
    S1["Spine 1"]
    S2["Spine 2"]
  end
  subgraph LEAF["Leaf tier (ToR)"]
    L1["Leaf 1"]
    L2["Leaf 2"]
    L3["Leaf 3"]
  end
  S1 --> L1
  S1 --> L2
  S1 --> L3
  S2 --> L1
  S2 --> L2
  S2 --> L3
  L1 --> R1["Rack 1 nodes"]
  L2 --> R2["Rack 2 nodes"]
  L3 --> R3["Rack 3 nodes"]
      

A non-blocking fat-tree: every leaf connects to every spine. Because all leaf-to-leaf paths traverse the same two hops through the spine, latency is uniform and the bisection is full.

Dragonfly

A dragonfly arranges routers into groups that are fully connected internally, with a sparse mesh of global (long, expensive optical) links between groups. The win is cost at extreme scale: it needs far fewer long-haul cables than a Clos while keeping the diameter tiny (~3 global hops). The catch is that those global links are oversubscribed, so a naive shortest-path route creates hotspots; dragonfly therefore requires adaptive routing (e.g. UGAL, which may send a packet the “long way” through an intermediate group to dodge congestion), and latency is non-uniform. Common in HPC fabrics (Cray Slingshot).

Torus

A torus wires each node directly to its neighbors in a 2D/3D mesh whose edges wrap around. With few or no switches it is the cheapest to cable and scales wiring linearly, which is excellent for nearest-neighbor communication. But its diameter grows with the cube root of node count, so distant pairs suffer many hops and high latency, and global all-reduce at huge scale is poor unless the collective is mapped carefully to the mesh. Google TPU pods use a 3D torus, augmented with optical circuit switches for reconfiguration and fault isolation.

Topology Bisection bandwidth Diameter / hops Cost (optics) Routing Best for
Fat-tree (Clos) Full if non-blocking (1:1) Low, uniform (3–5) High — many switches + long cables Static ECMP works GPU training clusters (rail-optimized)
Dragonfly High but global links oversubscribed Very low (~3 global hops) Lower — fewer global optics Adaptive required (UGAL) Very large HPC, cost-sensitive scale-out
Torus (2D/3D) Low; depends on dimensions Grows as N^(1/d), non-uniform Lowest — few/no switches Dimension-order + fault rerouting Nearest-neighbor patterns; TPU pods

Oversubscription and network islands

Oversubscription is the ratio of downlink (to nodes) to uplink (to spine) bandwidth. 1:1 is non-blocking and full bisection; 2:1 or 4:1 is cheaper but throttles any collective that crosses the spine. GPU fabrics therefore build non-blocking pods / islands — a fat-tree unit serving, say, 1k–4k GPUs at full bisection — and connect islands with a higher, sometimes oversubscribed, tier. The scheduling rule that follows is simple: keep a job inside one island so its collectives never pay the oversubscribed crossing.

Topology-aware collectives & placement

A fast fabric is necessary but not sufficient: the collective library and the scheduler have to map the communication pattern onto the physical wires. NCCL builds rings and trees across the detected topology. A ring all-reduce is bandwidth-optimal — each GPU sends and receives only 2(N-1)/N of the data — and is preferred for the large gradient buffers of data-parallel training. A tree all-reduce is latency-optimal (logarithmic depth) and wins for small messages or very large scale. NCCL auto-detects NVLink, PCIe, and NICs and lays its channels along the fastest links.

Rail alignment

The dominant GPU-cluster layout is the rail-optimized fat-tree. Each of a node’s 8 GPUs has its own NIC wired to a distinct rail — a dedicated leaf switch. All the GPU-0s across every node share rail 0, all the GPU-1s share rail 1, and so on. The payoff: a collective among same-rank GPUs stays within a single rail (one leaf hop, no spine crossing). Cross-rail reshuffles are handled inside the node over NVLink. The result is a hierarchical collective — reduce-scatter on NVLink, all-reduce across the rail over the fabric, all-gather back on NVLink — that keeps almost all bytes on the fastest available tier.

flowchart TD
  subgraph NA["Node A"]
    A0["GPU 0"]
    A1["GPU 1"]
  end
  subgraph NB["Node B"]
    B0["GPU 0"]
    B1["GPU 1"]
  end
  R0["Rail 0 leaf switch"]
  R1["Rail 1 leaf switch"]
  A0 --> R0
  B0 --> R0
  A1 --> R1
  B1 --> R1
  A0 --- A1
  B0 --- B1
      

Rail-optimized fabric: same-index GPUs share a rail (leaf switch), so their all-reduce never touches the spine. NVLink (the dashed intra-node links) handles cross-rail shuffles, keeping bytes on the fast tier.

Placement and congestion

Because oversubscription bites only when traffic crosses tiers, where a job lands is a first-class performance decision. A topology-aware scheduler co-locates each data-parallel all-reduce group on adjacent leaves within one island and aligns ranks to rails; a job scattered across the fabric pays repeated spine crossings and stalls on every collective. Even on a non-blocking fabric, flow collisions hurt: ECMP hashes a few elephant flows onto the same uplink, leaving other links idle. Adaptive routing (per-packet or flowlet spraying) spreads those flows, and in-network reduction (SHARP) performs the sum inside the switch, roughly halving the bytes on the wire. The scheduling side of this story — gang placement and topology-aware bin-packing — is covered in the GPU cluster scheduler.

Topology-aware scheduling is free performance

The same job, same GPUs, same model can differ by tens of percent of MFU depending only on placement. Packed into one rail-aligned island, its collectives are a single leaf hop; scattered across the cluster, they thrash an oversubscribed spine. No new hardware required — just teach the scheduler the fabric graph.

RDMA & congestion control

Moving gradients at 400 Gb/s with the CPU in the loop is impossible — the copies and interrupts alone would dominate. RDMA (Remote Direct Memory Access) lets a NIC read and write remote memory directly, with kernel bypass and zero copy, for microsecond latency at near-zero CPU cost. GPUDirect RDMA goes one step further: the NIC DMAs straight into and out of GPU HBM, skipping the bounce through host memory entirely. Two fabrics deliver it.

Dimension InfiniBand RoCEv2 (RDMA over Ethernet)
Losslessness Lossless by design (credit-based flow control) Ethernet is lossy — must be engineered lossless
Congestion control Built-in; subnet manager programs routes PFC + ECN / DCQCN, carefully tuned
Ecosystem Purpose-built, single-vendor, mature Commodity Ethernet, multi-vendor, familiar
Cost Higher per port Lower — reuses Ethernet supply chain
Best when Out-of-box performance & simplicity matter Cost & vendor flexibility matter, with strong network team

RDMA NICs historically tolerate packet loss badly (a single drop can trigger expensive retransmission and crater throughput), so the fabric must be effectively lossless. InfiniBand achieves this natively with credit-based flow control. RoCEv2 must build losslessness on Ethernet: PFC (Priority Flow Control) pauses a traffic class before a switch buffer overflows, while ECN marks packets under load and DCQCN reacts by throttling the sender’s rate to keep queues short. The art is keeping queues shallow enough that PFC rarely fires — because PFC pauses cause head-of-line blocking and can spread congestion (even deadlock) across the fabric if mistuned.

Incast: the many-to-one killer

Incast is the pathological pattern where many senders converge on one receiver at the same instant — the final stage of a reduce, or many ranks pushing to one parameter shard. Their packets overflow the receiver’s switch buffer, triggering drops or a storm of PFC pauses, and throughput collapses. Defenses: aggressive ECN/DCQCN to hold queues short, deeper switch buffers, in-network reduction so the switch sums instead of forwarding every flow, and collective algorithms (rings) that spread rather than concentrate traffic.

Fault tolerance

At 10,000+ nodes the fabric contains hundreds of thousands of optical links, and optics are the single largest source of failures. With that many components, something is always broken — a link flaps, a transceiver degrades, a switch dies. The fabric must survive this as routine, not as an incident, because a stalled fabric idles the entire (very expensive) job.

sequenceDiagram
  participant N as Node NIC
  participant L as Leaf switch
  participant FM as Fabric manager
  participant SCH as Scheduler
  N->>L: send collective on rail 2
  Note over L: spine uplink fails (CRC errors)
  L->>FM: report link down
  FM->>FM: recompute routes (adaptive)
  FM-->>N: reroute via healthy spine
  Note over FM,SCH: errors persist on the rail
  FM->>SCH: drain rail 2, mark node suspect
  SCH->>SCH: restart ranks on healthy island
      

Detect, reroute, drain: a failing uplink is rerouted in-fabric for transient errors; if errors persist, the rail is drained and the scheduler restarts affected ranks from the last checkpoint on a healthy island.

Bottlenecks & scaling

Scaling an interconnect from one rack to 10k+ nodes is a march from one bottleneck to the next. Each one shows up as a dip in MFU, and each mitigation trades cost, complexity, or bandwidth for predictability.

Bottleneck Symptom Mitigation
Bisection bandwidth Cross-cluster collectives stall; MFU falls as the job spreads Non-blocking fat-tree, island packing, SHARP in-network reduction
Oversubscription Cheaper spine throttles any traffic that crosses tiers 1:1 within the pod, rail-optimized placement, keep jobs in one island
Congestion / incast ECMP flow collisions; many-to-one buffer overflow; PFC pauses spread Adaptive routing / packet spraying, ECN/DCQCN tuning, in-network reduction
Cost Optics and switches dominate capex; long cables are the priciest part Dragonfly to cut global optics, copper within rack, right-size oversubscription
Blast radius Link/switch and gray failures gate whole collectives Redundant paths, fabric-manager rerouting, rail draining, error-counter quarantine

Summary

Designing the interconnect for a 10,000+ node GPU cluster is fundamentally a bandwidth-vs-cost-vs-blast-radius balancing act, anchored on one metric: bisection bandwidth. Respect the hierarchy — keep tensor-parallel chatter on NVLink and cross the InfiniBand/RoCE tier as little as possible. Build a non-blocking, rail-optimized fat-tree in pods, reaching for dragonfly only when global-optics cost forces the trade and a torus only for nearest-neighbor patterns. Make the fabric lossless with RDMA, PFC/ECN, and congestion control to survive incast, then place jobs topology-aware so collectives ring along the fastest links. Finally, treat failure as routine: redundant paths, adaptive rerouting, and rail draining keep the fleet busy — which, at this price per GPU-hour, is the entire game.