System Design Notes All designs

AI / ML Infrastructure

Distributed GPU Training Cluster

How do you train a 100B-1T parameter model across ~100,000 GPUs and keep them all busy? At this scale the model no longer fits on a single GPU, the network — not the arithmetic — becomes the limiter, and hardware fails every few hours. The job is a careful composition of three axes of parallelism, bandwidth-optimal collective communication, and relentless fault tolerance, all tuned to push model-FLOPs-utilization (MFU) as high as it will go. Idle GPUs burn money by the minute, so every decision trades memory against communication against compute.

Requirements

A training platform is a batch system, not a request/response service: there are no user-facing latency SLAs, but there is a brutal cost clock. The product is throughput per dollar — how many useful model-FLOPs you extract from the fleet before the next failure forces a restart.

Functional requirements

Non-functional requirements

Scale & back-of-the-envelope

Three numbers drive every later decision: the memory a model needs versus what a GPU has, the bandwidth hierarchy (HBM > NVLink > InfiniBand), and the failure rate of a 100k-GPU fleet. Assume H100-class accelerators throughout.

Quantity Figure Math / why it matters
Fleet size ~100,000 H100 SXM GPUs ~12,500 8-GPU nodes on a rail-optimized fabric.
Peak compute / GPU ~1 PFLOP/s BF16 (dense ~990 TFLOP/s) Fleet peak ~100 EFLOP/s. Real work is MFU x peak.
Effective compute MFU 35-55% At 40% MFU: ~40 EFLOP/s usable across the fleet.
HBM per GPU 80 GB HBM3 Model shard + activations + optimizer slice must all fit here.
HBM bandwidth ~3.35 TB/s (H100; ~2 TB/s on A100) The per-GPU "memory wall" that gates kernel throughput.
NVLink (intra-node) ~900 GB/s per GPU via NVSwitch ~25x faster than IB — host the chatty TP here.
InfiniBand (inter-node) ~400 Gb/s per GPU (NDR), ~3.2 Tb/s/node Host point-to-point PP and bursty DP here.
Model + optimizer state ~16 bytes / param bf16 weights (2) + bf16 grads (2) + fp32 master (4) + Adam m (4) + Adam v (4) = 16. 1T params -> 16 TB; 175B -> 2.8 TB.
Min GPUs to hold 1T model ~200 GPUs (state only) 16 TB / 80 GB ~ 200, before any activations -> you must shard.
Global batch ~4M tokens / step e.g. seq-len 8k x ~512 sequences; sized for convergence, not just HW.
Throughput ~5.8e11 tokens/day (~576B) C = 6 x N x D (~6 FLOPs/param/token). 4e19 FLOP/s / (6 x 1e12) ~ 6.7M tok/s.
Train 15T tokens ~26 days 15e12 / 5.76e11 ~ 26 days of wall-clock on a healthy fleet.
Checkpoint size ~16 TB (1T model, full state) Naive single-stream write = hours -> needs sharded async I/O.
Fleet MTBF node_MTBF / N 5e4 h per node / 1e5 nodes ~ 0.5 h -> a failure roughly every 30 minutes.

The two walls

A 1T-parameter model needs ~16 TB of state but a GPU has 80 GB — a 200x memory gap. And because a 100k-GPU fleet fails roughly every 30 minutes, you cannot run for 26 days without treating failure as routine. Sharding answers the first wall; checkpointing + elasticity answers the second.

Parallelism: the core deep dive

No single technique scales a 1T model to 100k GPUs. You compose three orthogonal axes of parallelism, then shard the leftover state with ZeRO/FSDP. The art is mapping each axis onto the matching tier of the network so its communication stays cheap.

The four techniques

3D parallelism: composing the axes

Total GPUs = TP x PP x DP. A concrete 100k-GPU mapping might be TP=8 (one node, NVLink) x PP=16 stages (across nodes, IB) x DP=781 replicas (across the rest, IB) = ~99,968 GPUs, with ZeRO sharding layered on the DP axis to fit the optimizer state.

flowchart TD
  C["Cluster: 100k GPUs = TP x PP x DP"]
  C --> DP["Data-Parallel axis (DP = 781 replicas)"]
  DP --> REP["One full model replica"]
  REP --> PP["Pipeline-Parallel axis (PP = 16 stages)"]
  PP --> ST["One stage = a slab of layers"]
  ST --> TP["Tensor-Parallel axis (TP = 8, one node)"]
  TP --> N0["GPU 0 (NVLink)"]
  TP --> N1["GPU 1 (NVLink)"]
  TP --> N7["GPU 7 (NVLink)"]
  ZeRO["ZeRO / FSDP shards optimizer+grads+params across DP"] -.-> DP
      

3D parallelism nests the axes: tensor-parallel inside a node, pipeline stages and data-parallel replicas across nodes, ZeRO sharding folded onto the DP axis.

Axis Splits Communication Network tier Main cost / limit
Data (DP) The batch (replicas) all-reduce of grads, once/step InfiniBand Global batch ceiling; full model must fit (unless ZeRO)
Tensor (TP) Inside each layer (matmuls) all-reduce every layer NVLink (intra-node) Very chatty -> keep degree at 8 (one node)
Pipeline (PP) Layers into stages Point-to-point activations InfiniBand Pipeline bubble; needs many micro-batches
ZeRO / FSDP Optimizer / grad / param shards all-gather + reduce-scatter NVLink + IB Extra gather per layer; must overlap with compute

Map parallelism to the network

The golden rule: put the chattiest communication on the fastest link. Tensor-parallel all-reduces hit every layer, so they live on NVLink inside a node. Pipeline activations are tiny and point-to-point, so they ride InfiniBand between nodes. Data-parallel all-reduce is large but happens once per step, so it can also span nodes — and overlaps with the backward pass. Mismatching an axis to the wrong tier is the fastest way to tank MFU.

Collective communication

Every parallelism axis ultimately reduces to a handful of collectives executed by NCCL (NVIDIA Collective Communications Library). Their cost, and how well they overlap with compute, sets the ceiling on scaling efficiency.

Ring vs tree all-reduce

Ring all-reduce is bandwidth-optimal: arrange ranks in a ring, and each sends 2 x (N-1)/N of the data — independent of fleet size for the bandwidth term, but latency grows with the ring length. Tree / double-binary-tree all-reduce is latency-optimal at O(log N) hops — better for small messages or very large N. NCCL auto-selects the algorithm by message size and topology, and can offload the reduction into the switch with SHARP (in-network compute).

flowchart LR
  A["GPU 0"] --> B["GPU 1"]
  B --> C["GPU 2"]
  C --> D["GPU 3"]
  D --> A
      

Ring all-reduce: data flows one direction around the ring as N-1 reduce-scatter steps then N-1 all-gather steps, saturating every link's bandwidth.

Hiding communication under compute

Why this dominates at scale

Compute per GPU is fixed, but data-parallel communication volume per step is roughly constant while the number of participants grows. As you scale DP, the all-reduce becomes the long pole — which is exactly why overlap, bucketing, and topology-aware NCCL are the difference between 50% and 20% MFU.

Fault tolerance

At 100k GPUs a hardware fault lands roughly every 30 minutes to a few hours: GPU ECC / Xid errors, NVLink or NVSwitch faults, NIC/optics flaps, host OOM, thermal throttling, network partitions, and the nastiest of all — silent data corruption (SDC) that produces wrong numbers without crashing. In synchronous training a single dead rank stalls the entire job, so recovery speed is throughput.

The toolbox

sequenceDiagram
  participant W as Worker Rank
  participant HM as Health Monitor
  participant SCH as Scheduler
  participant CK as Checkpoint Store
  W->>HM: heartbeat every 5s
  Note over W,HM: GPU Xid error, heartbeat times out
  HM->>SCH: report dead rank
  SCH->>SCH: pause collectives, swap in hot spare
  SCH->>CK: request latest sharded checkpoint
  CK-->>W: stream weights + optimizer shard
  W->>SCH: ready, rejoin process group
  SCH->>W: resume from last good step
      

Failure -> detect -> restore: heartbeat loss triggers the scheduler to swap a hot spare in and rehydrate its shard from the last checkpoint before the gang resumes.

Recovery time is throughput

If recovery takes 20 minutes and you fail every 2 hours, you lose ~17% of wall-clock to downtime — before counting the work redone since the last checkpoint. Async in-memory checkpointing plus hot-spare swaps shrink that window from tens of minutes to under a minute, which is worth more MFU than most kernel optimizations.

High-level design

The platform splits into a control plane (scheduling and health), a data pipeline (feeding tokens), the training fabric (the GPUs and their collectives), and shared storage (checkpoints) wrapped in observability.

flowchart TD
  subgraph CTRL["Control plane"]
    SCH["Gang scheduler (topology-aware)"]
    HM["Health monitor / heartbeats"]
    TRK["Experiment tracker"]
  end
  subgraph DATA["Data pipeline"]
    OBJ["Sharded dataset (object store)"]
    LD["Streaming dataloaders"]
  end
  subgraph FAB["Training fabric (100k GPUs)"]
    WG0["Worker group 0"]
    WG1["Worker group 1"]
    WGN["Worker group N"]
  end
  CKPT["Checkpoint store (parallel FS + object)"]
  MON["Metrics: MFU, loss, grad-norm, step-time"]

  SCH --> WG0
  SCH --> WG1
  SCH --> WGN
  OBJ --> LD
  LD --> WG0
  LD --> WG1
  LD --> WGN
  WG0 --> CKPT
  WG1 --> CKPT
  WGN --> CKPT
  WG0 --> MON
  WG1 --> MON
  WGN --> MON
  MON --> HM
  HM --> SCH
  SCH --> TRK
      

Control plane schedules and watches; dataloaders stream shards into worker groups; every group checkpoints to shared storage and emits metrics that feed health-based recovery.

Bottlenecks & scaling

Scaling from one node to 100k GPUs is a march from one bottleneck to the next. Each shows up as a dip in MFU; the mitigation usually trades one resource for another.

Bottleneck Symptom Mitigation
Communication overhead MFU falls as DP grows; GPUs idle on all-reduce Overlap comm with compute, gradient bucketing, hierarchical all-reduce, SHARP
Memory wall OOM; model + optimizer + activations exceed 80 GB ZeRO/FSDP sharding, TP, activation checkpointing, CPU/NVMe offload
Stragglers One slow rank gates every collective; uneven step-time Per-rank monitoring, quarantine bad HW, backup workers, rebalance
Checkpoint I/O Training stalls for minutes during writes Async + in-memory snapshot, sharded parallel writes, tuned interval
Network topology Oversubscribed switches; cross-rail traffic congests Rail-optimized / non-blocking fabric, topology-aware placement, locality
Pipeline bubble Stages idle during fill/drain; low PP efficiency More micro-batches, interleaved 1F1B schedule, balanced stage sizing

Summary

Training across ~100k GPUs is fundamentally a memory-vs-communication-vs-compute balancing act. Shard the 16-bytes-per-param state with ZeRO/FSDP and TP to beat the 80 GB memory wall; compose data, tensor, and pipeline parallelism and pin each axis to the right network tier (TP on NVLink, PP/DP on InfiniBand); hide NCCL collectives behind compute with overlap and bucketing to defend MFU; and treat failure as routine with async sharded checkpointing, hot spares, and elastic recovery. Get those four right and the fleet stays busy — which, at this price per GPU-hour, is the entire game.