AI / ML Infrastructure

Distributed GPU Training Cluster

How do you train a 100B-1T parameter model across ~100,000 GPUs and keep them all busy? At this scale the model no longer fits on a single GPU, the network — not the arithmetic — becomes the limiter, and hardware fails every few hours. The job is a careful composition of three axes of parallelism, bandwidth-optimal collective communication, and relentless fault tolerance, all tuned to push model-FLOPs-utilization (MFU) as high as it will go. Idle GPUs burn money by the minute, so every decision trades memory against communication against compute.

Requirements

A training platform is a batch system, not a request/response service: there are no user-facing latency SLAs, but there is a brutal cost clock. The product is throughput per dollar — how many useful model-FLOPs you extract from the fleet before the next failure forces a restart.

Functional requirements

Train very large models (dense 100B-1T+ params, or larger MoE) across ~100k accelerators as a single logical job.
Support 3D parallelism — data (DP), tensor (TP), and pipeline (PP) — composed with ZeRO / FSDP sharding of optimizer state, gradients, and parameters.
Gang scheduling: a job is all-or-nothing. Either every rank in the process group is placed (topology-aware) or none start.
Checkpoint and resume: persist weights, optimizer state, RNG, data-loader cursor, and step number; restart deterministically from the last good step.
Data pipeline: shard and stream a multi-trillion-token corpus to every rank with deterministic ordering for reproducibility.
Observability: per-step loss, gradient norm, throughput (tokens/s), MFU, and per-rank step-time for straggler detection; experiment tracking of config, code, and data versions.
Elastic membership: tolerate workers leaving (failure) and joining (hot spares) and re-form the process group without a full job resubmit.

Non-functional requirements

Throughput / MFU: keep effective utilization high — 35-55% MFU is considered good at the 10k-100k GPU scale; every point of MFU is millions of dollars of compute.
Fault tolerance: hardware failure is the steady state, not the exception. Detect, isolate, and recover in minutes — not hours.
Elasticity: survive the loss of a node or rail without throwing away all in-flight work; swap in spares transparently where possible.
Reproducibility: given the same seed, data order, and config, a run should reproduce (at least statistically; ideally bit-wise on identical hardware) — essential for debugging divergence and loss spikes.
Scaling efficiency: near-linear speedup as GPUs are added — which means keeping communication overlapped and off the critical path.
Cost / utilization: the fleet is the single most expensive line item in the company. An idle GPU is pure loss, so scheduling, placement, and recovery all optimize for "GPUs never wait".
Topology-aware networking: a non-blocking / rail-optimized fabric so that collectives are not bottlenecked by oversubscribed switches.

Scale & back-of-the-envelope

Three numbers drive every later decision: the memory a model needs versus what a GPU has, the bandwidth hierarchy (HBM > NVLink > InfiniBand), and the failure rate of a 100k-GPU fleet. Assume H100-class accelerators throughout.

Quantity	Figure	Math / why it matters
Fleet size	`~100,000` H100 SXM GPUs	~12,500 8-GPU nodes on a rail-optimized fabric.
Peak compute / GPU	`~1 PFLOP/s` BF16 (dense ~990 TFLOP/s)	Fleet peak ~`100 EFLOP/s`. Real work is MFU x peak.
Effective compute	MFU `35-55%`	At 40% MFU: `~40 EFLOP/s` usable across the fleet.
HBM per GPU	`80 GB` HBM3	Model shard + activations + optimizer slice must all fit here.
HBM bandwidth	`~3.35 TB/s` (H100; ~2 TB/s on A100)	The per-GPU "memory wall" that gates kernel throughput.
NVLink (intra-node)	`~900 GB/s` per GPU via NVSwitch	~25x faster than IB — host the chatty `TP` here.
InfiniBand (inter-node)	`~400 Gb/s` per GPU (NDR), ~3.2 Tb/s/node	Host point-to-point `PP` and bursty `DP` here.
Model + optimizer state	`~16 bytes / param`	bf16 weights (2) + bf16 grads (2) + fp32 master (4) + Adam m (4) + Adam v (4) = `16`. 1T params -> `16 TB`; 175B -> `2.8 TB`.
Min GPUs to hold 1T model	`~200` GPUs (state only)	16 TB / 80 GB ~ 200, before any activations -> you must shard.
Global batch	`~4M tokens / step`	e.g. seq-len 8k x ~512 sequences; sized for convergence, not just HW.
Throughput	`~5.8e11 tokens/day` (~576B)	`C = 6 x N x D` (~6 FLOPs/param/token). 4e19 FLOP/s / (6 x 1e12) ~ 6.7M tok/s.
Train 15T tokens	`~26 days`	15e12 / 5.76e11 ~ 26 days of wall-clock on a healthy fleet.
Checkpoint size	`~16 TB` (1T model, full state)	Naive single-stream write = hours -> needs sharded async I/O.
Fleet MTBF	`node_MTBF / N`	5e4 h per node / 1e5 nodes ~ `0.5 h` -> a failure roughly every 30 minutes.

The two walls

A 1T-parameter model needs ~16 TB of state but a GPU has 80 GB — a 200x memory gap. And because a 100k-GPU fleet fails roughly every 30 minutes, you cannot run for 26 days without treating failure as routine. Sharding answers the first wall; checkpointing + elasticity answers the second.

Parallelism: the core deep dive

No single technique scales a 1T model to 100k GPUs. You compose three orthogonal axes of parallelism, then shard the leftover state with ZeRO/FSDP. The art is mapping each axis onto the matching tier of the network so its communication stays cheap.

The four techniques

Data Parallel (DP) — replicate the whole model on each worker, split the batch. After the backward pass, all-reduce the gradients (2 x params bytes) so every replica applies the same update. Simple and scales the batch, but each replica must hold the entire model+optimizer, and the global batch cannot grow without hurting convergence.
Tensor Parallel (TP) — split inside each layer: shard the attention heads and the MLP matmuls column/row-wise across GPUs. Every forward and backward layer needs an all-reduce / all-gather within the TP group — extremely chatty — so keep the TP degree at 8 (one node) and run it over NVLink. Cuts both weight and activation memory.
Pipeline Parallel (PP) — split the layers into stages placed on different GPUs; micro-batches flow through like an assembly line. Communication is cheap point-to-point activation passing between adjacent stages — great across nodes. The cost is the pipeline bubble: idle time during fill/drain, roughly (stages - 1) / (micro_batches + stages - 1). Shrink it with many micro-batches and an interleaved 1F1B schedule.
ZeRO / FSDP sharding — instead of replicating across the DP axis, shard the optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3 / FSDP) across DP ranks. Per-GPU memory drops nearly linearly with the DP degree. The cost: an all-gather of parameters just before each layer runs, plus a reduce-scatter of gradients after — overlapped with compute so it largely hides.

3D parallelism: composing the axes

Total GPUs = TP x PP x DP. A concrete 100k-GPU mapping might be TP=8 (one node, NVLink) x PP=16 stages (across nodes, IB) x DP=781 replicas (across the rest, IB) = ~99,968 GPUs, with ZeRO sharding layered on the DP axis to fit the optimizer state.

flowchart TD
  C["Cluster: 100k GPUs = TP x PP x DP"]
  C --> DP["Data-Parallel axis (DP = 781 replicas)"]
  DP --> REP["One full model replica"]
  REP --> PP["Pipeline-Parallel axis (PP = 16 stages)"]
  PP --> ST["One stage = a slab of layers"]
  ST --> TP["Tensor-Parallel axis (TP = 8, one node)"]
  TP --> N0["GPU 0 (NVLink)"]
  TP --> N1["GPU 1 (NVLink)"]
  TP --> N7["GPU 7 (NVLink)"]
  ZeRO["ZeRO / FSDP shards optimizer+grads+params across DP"] -.-> DP

3D parallelism nests the axes: tensor-parallel inside a node, pipeline stages and data-parallel replicas across nodes, ZeRO sharding folded onto the DP axis.

Axis	Splits	Communication	Network tier	Main cost / limit
Data (DP)	The batch (replicas)	`all-reduce` of grads, once/step	InfiniBand	Global batch ceiling; full model must fit (unless ZeRO)
Tensor (TP)	Inside each layer (matmuls)	`all-reduce` every layer	NVLink (intra-node)	Very chatty -> keep degree at 8 (one node)
Pipeline (PP)	Layers into stages	Point-to-point activations	InfiniBand	Pipeline bubble; needs many micro-batches
ZeRO / FSDP	Optimizer / grad / param shards	`all-gather` + `reduce-scatter`	NVLink + IB	Extra gather per layer; must overlap with compute

Map parallelism to the network

The golden rule: put the chattiest communication on the fastest link. Tensor-parallel all-reduces hit every layer, so they live on NVLink inside a node. Pipeline activations are tiny and point-to-point, so they ride InfiniBand between nodes. Data-parallel all-reduce is large but happens once per step, so it can also span nodes — and overlaps with the backward pass. Mismatching an axis to the wrong tier is the fastest way to tank MFU.

Collective communication

Every parallelism axis ultimately reduces to a handful of collectives executed by NCCL (NVIDIA Collective Communications Library). Their cost, and how well they overlap with compute, sets the ceiling on scaling efficiency.

all-reduce — sum a tensor across all ranks; everyone ends with the total. This is the DP gradient sync. It decomposes into reduce-scatter then all-gather.
all-gather — each rank ends with the concatenation of every shard (FSDP parameter gather before a layer; TP output collection).
reduce-scatter — each rank ends with a reduced slice (FSDP/ZeRO gradient reduction).

Ring vs tree all-reduce

Ring all-reduce is bandwidth-optimal: arrange ranks in a ring, and each sends 2 x (N-1)/N of the data — independent of fleet size for the bandwidth term, but latency grows with the ring length. Tree / double-binary-tree all-reduce is latency-optimal at O(log N) hops — better for small messages or very large N. NCCL auto-selects the algorithm by message size and topology, and can offload the reduction into the switch with SHARP (in-network compute).

flowchart LR
  A["GPU 0"] --> B["GPU 1"]
  B --> C["GPU 2"]
  C --> D["GPU 3"]
  D --> A

Ring all-reduce: data flows one direction around the ring as N-1 reduce-scatter steps then N-1 all-gather steps, saturating every link's bandwidth.

Hiding communication under compute

Overlap: launch the gradient all-reduce on a separate CUDA stream so it runs while the backward pass of earlier layers is still computing. Done well, comm nearly disappears behind compute.
Gradient bucketing: coalesce gradients into ~25 MB buckets and fire the all-reduce the instant a bucket fills — rather than waiting for the whole backward pass. Big buckets amortize launch latency; too big and overlap suffers.
Hierarchical / 2-level all-reduce: reduce within a node over NVLink first, then across nodes over IB, then broadcast back down. Far fewer cross-node bytes.
Prefetch: FSDP all-gathers the next layer's parameters while the current layer computes, so weights arrive just in time.

Why this dominates at scale

Compute per GPU is fixed, but data-parallel communication volume per step is roughly constant while the number of participants grows. As you scale DP, the all-reduce becomes the long pole — which is exactly why overlap, bucketing, and topology-aware NCCL are the difference between 50% and 20% MFU.

Fault tolerance

At 100k GPUs a hardware fault lands roughly every 30 minutes to a few hours: GPU ECC / Xid errors, NVLink or NVSwitch faults, NIC/optics flaps, host OOM, thermal throttling, network partitions, and the nastiest of all — silent data corruption (SDC) that produces wrong numbers without crashing. In synchronous training a single dead rank stalls the entire job, so recovery speed is throughput.

The toolbox

Checkpoint / restart — periodically persist weights, optimizer state, RNG seeds, the data-loader cursor, and the step number. On any failure the gang restarts from the last good step. The classic recovery primitive.
Optimal interval — checkpoint too rarely and a crash wastes hours of work; too often and the writes themselves stall training. The Young/Daly rule puts the sweet spot near sqrt(2 x checkpoint_cost x MTBF).
In-memory / async checkpointing — snapshot state to host RAM or a peer GPU in seconds, then flush to durable storage in the background. Training stalls only for the fast in-memory copy, not the slow disk write.
Sharded checkpointing — every rank writes its own shard in parallel to a parallel filesystem or object store, so aggregate write bandwidth scales with the fleet and a 16 TB checkpoint lands in seconds, not hours.
Elastic training — frameworks (e.g. TorchElastic) re-rendezvous the process group at a new world size after a loss, continuing from a reshardable checkpoint instead of resubmitting the job.
Hot spares — keep a small pool of idle nodes; on failure, swap a spare into the dead rank's slot and reload only that shard, avoiding a full re-schedule.
Straggler mitigation — every all-reduce runs at the speed of the slowest rank. Track per-rank step-time, quarantine throttling/ECC-retrying hardware, and rebalance. Detect SDC with periodic deterministic recompute and checksums.

sequenceDiagram
  participant W as Worker Rank
  participant HM as Health Monitor
  participant SCH as Scheduler
  participant CK as Checkpoint Store
  W->>HM: heartbeat every 5s
  Note over W,HM: GPU Xid error, heartbeat times out
  HM->>SCH: report dead rank
  SCH->>SCH: pause collectives, swap in hot spare
  SCH->>CK: request latest sharded checkpoint
  CK-->>W: stream weights + optimizer shard
  W->>SCH: ready, rejoin process group
  SCH->>W: resume from last good step

Failure -> detect -> restore: heartbeat loss triggers the scheduler to swap a hot spare in and rehydrate its shard from the last checkpoint before the gang resumes.

Recovery time is throughput

If recovery takes 20 minutes and you fail every 2 hours, you lose ~17% of wall-clock to downtime — before counting the work redone since the last checkpoint. Async in-memory checkpointing plus hot-spare swaps shrink that window from tens of minutes to under a minute, which is worth more MFU than most kernel optimizations.

High-level design

The platform splits into a control plane (scheduling and health), a data pipeline (feeding tokens), the training fabric (the GPUs and their collectives), and shared storage (checkpoints) wrapped in observability.

flowchart TD
  subgraph CTRL["Control plane"]
    SCH["Gang scheduler (topology-aware)"]
    HM["Health monitor / heartbeats"]
    TRK["Experiment tracker"]
  end
  subgraph DATA["Data pipeline"]
    OBJ["Sharded dataset (object store)"]
    LD["Streaming dataloaders"]
  end
  subgraph FAB["Training fabric (100k GPUs)"]
    WG0["Worker group 0"]
    WG1["Worker group 1"]
    WGN["Worker group N"]
  end
  CKPT["Checkpoint store (parallel FS + object)"]
  MON["Metrics: MFU, loss, grad-norm, step-time"]

  SCH --> WG0
  SCH --> WG1
  SCH --> WGN
  OBJ --> LD
  LD --> WG0
  LD --> WG1
  LD --> WGN
  WG0 --> CKPT
  WG1 --> CKPT
  WGN --> CKPT
  WG0 --> MON
  WG1 --> MON
  WGN --> MON
  MON --> HM
  HM --> SCH
  SCH --> TRK

Control plane schedules and watches; dataloaders stream shards into worker groups; every group checkpoints to shared storage and emits metrics that feed health-based recovery.

Gang scheduler — places the whole job atomically with topology awareness (co-locate TP ranks on one NVLink domain; keep PP/DP within the same IB rail where possible), manages quotas and preemption, and coordinates elastic re-rendezvous.
Worker groups — the GPUs themselves, organized by the 3D parallel mapping. Each group runs the forward/backward loop and the NCCL collectives described above.
Data pipeline — the corpus is pre-tokenized and sharded in an object store; streaming dataloaders pull deterministically (by rank + step) so order is reproducible and no rank starves the GPUs.
Checkpoint store — a parallel filesystem or object store sized for sharded, high-bandwidth, async writes and fast parallel reads on restore.
Health monitor + metrics — heartbeats and per-rank step-time drive failure and straggler detection; the loop closes back to the scheduler for automatic recovery.

Bottlenecks & scaling

Scaling from one node to 100k GPUs is a march from one bottleneck to the next. Each shows up as a dip in MFU; the mitigation usually trades one resource for another.

Bottleneck	Symptom	Mitigation
Communication overhead	MFU falls as DP grows; GPUs idle on all-reduce	Overlap comm with compute, gradient bucketing, hierarchical all-reduce, SHARP
Memory wall	OOM; model + optimizer + activations exceed 80 GB	ZeRO/FSDP sharding, TP, activation checkpointing, CPU/NVMe offload
Stragglers	One slow rank gates every collective; uneven step-time	Per-rank monitoring, quarantine bad HW, backup workers, rebalance
Checkpoint I/O	Training stalls for minutes during writes	Async + in-memory snapshot, sharded parallel writes, tuned interval
Network topology	Oversubscribed switches; cross-rail traffic congests	Rail-optimized / non-blocking fabric, topology-aware placement, locality
Pipeline bubble	Stages idle during fill/drain; low PP efficiency	More micro-batches, interleaved 1F1B schedule, balanced stage sizing

Summary

Training across ~100k GPUs is fundamentally a memory-vs-communication-vs-compute balancing act. Shard the 16-bytes-per-param state with ZeRO/FSDP and TP to beat the 80 GB memory wall; compose data, tensor, and pipeline parallelism and pin each axis to the right network tier (TP on NVLink, PP/DP on InfiniBand); hide NCCL collectives behind compute with overlap and bucketing to defend MFU; and treat failure as routine with async sharded checkpointing, hot spares, and elastic recovery. Get those four right and the fleet stays busy — which, at this price per GPU-hour, is the entire game.