AI / ML Infrastructure
Distributed GPU Training Cluster
How do you train a 100B-1T parameter model across
~100,000 GPUs and keep them all busy? At this scale
the model no longer fits on a single GPU, the network — not
the arithmetic — becomes the limiter, and hardware fails every few
hours. The job is a careful composition of three axes of parallelism,
bandwidth-optimal collective communication, and relentless fault
tolerance, all tuned to push
model-FLOPs-utilization (MFU) as high as it will go.
Idle GPUs burn money by the minute, so every decision trades
memory against communication against
compute.
Requirements
A training platform is a batch system, not a request/response service: there are no user-facing latency SLAs, but there is a brutal cost clock. The product is throughput per dollar — how many useful model-FLOPs you extract from the fleet before the next failure forces a restart.
Functional requirements
-
Train very large models (dense
100B-1T+params, or larger MoE) across~100kaccelerators as a single logical job. - Support 3D parallelism — data (DP), tensor (TP), and pipeline (PP) — composed with ZeRO / FSDP sharding of optimizer state, gradients, and parameters.
- Gang scheduling: a job is all-or-nothing. Either every rank in the process group is placed (topology-aware) or none start.
- Checkpoint and resume: persist weights, optimizer state, RNG, data-loader cursor, and step number; restart deterministically from the last good step.
- Data pipeline: shard and stream a multi-trillion-token corpus to every rank with deterministic ordering for reproducibility.
- Observability: per-step loss, gradient norm, throughput (tokens/s), MFU, and per-rank step-time for straggler detection; experiment tracking of config, code, and data versions.
- Elastic membership: tolerate workers leaving (failure) and joining (hot spares) and re-form the process group without a full job resubmit.
Non-functional requirements
-
Throughput / MFU: keep effective utilization high —
35-55%MFU is considered good at the 10k-100k GPU scale; every point of MFU is millions of dollars of compute. - Fault tolerance: hardware failure is the steady state, not the exception. Detect, isolate, and recover in minutes — not hours.
- Elasticity: survive the loss of a node or rail without throwing away all in-flight work; swap in spares transparently where possible.
- Reproducibility: given the same seed, data order, and config, a run should reproduce (at least statistically; ideally bit-wise on identical hardware) — essential for debugging divergence and loss spikes.
- Scaling efficiency: near-linear speedup as GPUs are added — which means keeping communication overlapped and off the critical path.
- Cost / utilization: the fleet is the single most expensive line item in the company. An idle GPU is pure loss, so scheduling, placement, and recovery all optimize for "GPUs never wait".
- Topology-aware networking: a non-blocking / rail-optimized fabric so that collectives are not bottlenecked by oversubscribed switches.
Scale & back-of-the-envelope
Three numbers drive every later decision: the memory a model needs versus what a GPU has, the bandwidth hierarchy (HBM > NVLink > InfiniBand), and the failure rate of a 100k-GPU fleet. Assume H100-class accelerators throughout.
| Quantity | Figure | Math / why it matters |
|---|---|---|
| Fleet size | ~100,000 H100 SXM GPUs |
~12,500 8-GPU nodes on a rail-optimized fabric. |
| Peak compute / GPU | ~1 PFLOP/s BF16 (dense ~990 TFLOP/s) |
Fleet peak ~100 EFLOP/s. Real work is MFU x peak.
|
| Effective compute | MFU 35-55% |
At 40% MFU: ~40 EFLOP/s usable across the fleet.
|
| HBM per GPU | 80 GB HBM3 |
Model shard + activations + optimizer slice must all fit here. |
| HBM bandwidth | ~3.35 TB/s (H100; ~2 TB/s on A100) |
The per-GPU "memory wall" that gates kernel throughput. |
| NVLink (intra-node) | ~900 GB/s per GPU via NVSwitch |
~25x faster than IB — host the chatty TP here.
|
| InfiniBand (inter-node) | ~400 Gb/s per GPU (NDR), ~3.2 Tb/s/node |
Host point-to-point PP and bursty
DP here.
|
| Model + optimizer state | ~16 bytes / param |
bf16 weights (2) + bf16 grads (2) + fp32 master (4) + Adam m
(4) + Adam v (4) = 16. 1T params ->
16 TB; 175B -> 2.8 TB.
|
| Min GPUs to hold 1T model | ~200 GPUs (state only) |
16 TB / 80 GB ~ 200, before any activations -> you must shard. |
| Global batch | ~4M tokens / step |
e.g. seq-len 8k x ~512 sequences; sized for convergence, not just HW. |
| Throughput | ~5.8e11 tokens/day (~576B) |
C = 6 x N x D (~6 FLOPs/param/token). 4e19 FLOP/s
/ (6 x 1e12) ~ 6.7M tok/s.
|
| Train 15T tokens | ~26 days |
15e12 / 5.76e11 ~ 26 days of wall-clock on a healthy fleet. |
| Checkpoint size | ~16 TB (1T model, full state) |
Naive single-stream write = hours -> needs sharded async I/O. |
| Fleet MTBF | node_MTBF / N |
5e4 h per node / 1e5 nodes ~ 0.5 h -> a
failure roughly every 30 minutes.
|
The two walls
A 1T-parameter model needs ~16 TB of state but a GPU
has 80 GB — a 200x memory gap. And
because a 100k-GPU fleet fails roughly every 30 minutes, you cannot
run for 26 days without treating failure as routine.
Sharding answers the first wall;
checkpointing + elasticity answers the second.
Parallelism: the core deep dive
No single technique scales a 1T model to 100k GPUs. You compose three orthogonal axes of parallelism, then shard the leftover state with ZeRO/FSDP. The art is mapping each axis onto the matching tier of the network so its communication stays cheap.
The four techniques
-
Data Parallel (DP) — replicate the whole model on
each worker, split the batch. After the backward pass,
all-reducethe gradients (2 x paramsbytes) so every replica applies the same update. Simple and scales the batch, but each replica must hold the entire model+optimizer, and the global batch cannot grow without hurting convergence. -
Tensor Parallel (TP) — split inside each
layer: shard the attention heads and the MLP matmuls column/row-wise
across GPUs. Every forward and backward layer needs an
all-reduce/all-gatherwithin the TP group — extremely chatty — so keep the TP degree at 8 (one node) and run it over NVLink. Cuts both weight and activation memory. -
Pipeline Parallel (PP) — split the
layers into stages placed on different GPUs; micro-batches
flow through like an assembly line. Communication is cheap
point-to-point activation passing between adjacent
stages — great across nodes. The cost is the
pipeline bubble: idle time during fill/drain,
roughly
(stages - 1) / (micro_batches + stages - 1). Shrink it with many micro-batches and an interleaved1F1Bschedule. -
ZeRO / FSDP sharding — instead of replicating
across the DP axis, shard the optimizer states (ZeRO-1),
gradients (ZeRO-2), and parameters (ZeRO-3 / FSDP) across DP ranks.
Per-GPU memory drops nearly linearly with the DP degree. The cost:
an
all-gatherof parameters just before each layer runs, plus areduce-scatterof gradients after — overlapped with compute so it largely hides.
3D parallelism: composing the axes
Total GPUs = TP x PP x DP. A concrete 100k-GPU mapping
might be TP=8 (one node, NVLink) x
PP=16 stages (across nodes, IB) x
DP=781 replicas (across the rest, IB) =
~99,968 GPUs, with ZeRO sharding layered on the DP axis
to fit the optimizer state.
flowchart TD
C["Cluster: 100k GPUs = TP x PP x DP"]
C --> DP["Data-Parallel axis (DP = 781 replicas)"]
DP --> REP["One full model replica"]
REP --> PP["Pipeline-Parallel axis (PP = 16 stages)"]
PP --> ST["One stage = a slab of layers"]
ST --> TP["Tensor-Parallel axis (TP = 8, one node)"]
TP --> N0["GPU 0 (NVLink)"]
TP --> N1["GPU 1 (NVLink)"]
TP --> N7["GPU 7 (NVLink)"]
ZeRO["ZeRO / FSDP shards optimizer+grads+params across DP"] -.-> DP
3D parallelism nests the axes: tensor-parallel inside a node, pipeline stages and data-parallel replicas across nodes, ZeRO sharding folded onto the DP axis.
| Axis | Splits | Communication | Network tier | Main cost / limit |
|---|---|---|---|---|
| Data (DP) | The batch (replicas) | all-reduce of grads, once/step |
InfiniBand | Global batch ceiling; full model must fit (unless ZeRO) |
| Tensor (TP) | Inside each layer (matmuls) | all-reduce every layer |
NVLink (intra-node) | Very chatty -> keep degree at 8 (one node) |
| Pipeline (PP) | Layers into stages | Point-to-point activations | InfiniBand | Pipeline bubble; needs many micro-batches |
| ZeRO / FSDP | Optimizer / grad / param shards | all-gather + reduce-scatter |
NVLink + IB | Extra gather per layer; must overlap with compute |
Map parallelism to the network
The golden rule: put the chattiest communication on
the fastest link. Tensor-parallel all-reduces hit
every layer, so they live on NVLink inside a node.
Pipeline activations are tiny and point-to-point, so they ride
InfiniBand between nodes. Data-parallel all-reduce is
large but happens once per step, so it can also span nodes — and
overlaps with the backward pass. Mismatching an axis to the wrong
tier is the fastest way to tank MFU.
Collective communication
Every parallelism axis ultimately reduces to a handful of
collectives executed by NCCL (NVIDIA
Collective Communications Library). Their cost, and how well they
overlap with compute, sets the ceiling on scaling efficiency.
-
all-reduce — sum a tensor across all ranks;
everyone ends with the total. This is the DP gradient sync. It
decomposes into
reduce-scatterthenall-gather. - all-gather — each rank ends with the concatenation of every shard (FSDP parameter gather before a layer; TP output collection).
- reduce-scatter — each rank ends with a reduced slice (FSDP/ZeRO gradient reduction).
Ring vs tree all-reduce
Ring all-reduce is bandwidth-optimal:
arrange ranks in a ring, and each sends 2 x (N-1)/N of
the data — independent of fleet size for the bandwidth term, but
latency grows with the ring length.
Tree / double-binary-tree all-reduce is
latency-optimal at O(log N) hops — better for
small messages or very large N. NCCL auto-selects the
algorithm by message size and topology, and can offload the reduction
into the switch with SHARP (in-network compute).
flowchart LR
A["GPU 0"] --> B["GPU 1"]
B --> C["GPU 2"]
C --> D["GPU 3"]
D --> A
Ring all-reduce: data flows one direction around the ring as N-1 reduce-scatter steps then N-1 all-gather steps, saturating every link's bandwidth.
Hiding communication under compute
- Overlap: launch the gradient all-reduce on a separate CUDA stream so it runs while the backward pass of earlier layers is still computing. Done well, comm nearly disappears behind compute.
- Gradient bucketing: coalesce gradients into ~25 MB buckets and fire the all-reduce the instant a bucket fills — rather than waiting for the whole backward pass. Big buckets amortize launch latency; too big and overlap suffers.
- Hierarchical / 2-level all-reduce: reduce within a node over NVLink first, then across nodes over IB, then broadcast back down. Far fewer cross-node bytes.
- Prefetch: FSDP all-gathers the next layer's parameters while the current layer computes, so weights arrive just in time.
Why this dominates at scale
Compute per GPU is fixed, but data-parallel communication volume per
step is roughly constant while the number of participants grows. As
you scale DP, the all-reduce becomes the long pole — which is
exactly why overlap, bucketing, and topology-aware NCCL are the
difference between 50% and 20% MFU.
Fault tolerance
At 100k GPUs a hardware fault lands roughly every 30 minutes to a few hours: GPU ECC / Xid errors, NVLink or NVSwitch faults, NIC/optics flaps, host OOM, thermal throttling, network partitions, and the nastiest of all — silent data corruption (SDC) that produces wrong numbers without crashing. In synchronous training a single dead rank stalls the entire job, so recovery speed is throughput.
The toolbox
- Checkpoint / restart — periodically persist weights, optimizer state, RNG seeds, the data-loader cursor, and the step number. On any failure the gang restarts from the last good step. The classic recovery primitive.
-
Optimal interval — checkpoint too rarely and a
crash wastes hours of work; too often and the writes themselves
stall training. The Young/Daly rule puts the sweet spot near
sqrt(2 x checkpoint_cost x MTBF). - In-memory / async checkpointing — snapshot state to host RAM or a peer GPU in seconds, then flush to durable storage in the background. Training stalls only for the fast in-memory copy, not the slow disk write.
- Sharded checkpointing — every rank writes its own shard in parallel to a parallel filesystem or object store, so aggregate write bandwidth scales with the fleet and a 16 TB checkpoint lands in seconds, not hours.
- Elastic training — frameworks (e.g. TorchElastic) re-rendezvous the process group at a new world size after a loss, continuing from a reshardable checkpoint instead of resubmitting the job.
- Hot spares — keep a small pool of idle nodes; on failure, swap a spare into the dead rank's slot and reload only that shard, avoiding a full re-schedule.
- Straggler mitigation — every all-reduce runs at the speed of the slowest rank. Track per-rank step-time, quarantine throttling/ECC-retrying hardware, and rebalance. Detect SDC with periodic deterministic recompute and checksums.
sequenceDiagram
participant W as Worker Rank
participant HM as Health Monitor
participant SCH as Scheduler
participant CK as Checkpoint Store
W->>HM: heartbeat every 5s
Note over W,HM: GPU Xid error, heartbeat times out
HM->>SCH: report dead rank
SCH->>SCH: pause collectives, swap in hot spare
SCH->>CK: request latest sharded checkpoint
CK-->>W: stream weights + optimizer shard
W->>SCH: ready, rejoin process group
SCH->>W: resume from last good step
Failure -> detect -> restore: heartbeat loss triggers the scheduler to swap a hot spare in and rehydrate its shard from the last checkpoint before the gang resumes.
Recovery time is throughput
If recovery takes 20 minutes and you fail every 2 hours, you lose
~17% of wall-clock to downtime — before counting the
work redone since the last checkpoint. Async in-memory checkpointing
plus hot-spare swaps shrink that window from
tens of minutes to under a minute, which is worth
more MFU than most kernel optimizations.
High-level design
The platform splits into a control plane (scheduling and health), a data pipeline (feeding tokens), the training fabric (the GPUs and their collectives), and shared storage (checkpoints) wrapped in observability.
flowchart TD
subgraph CTRL["Control plane"]
SCH["Gang scheduler (topology-aware)"]
HM["Health monitor / heartbeats"]
TRK["Experiment tracker"]
end
subgraph DATA["Data pipeline"]
OBJ["Sharded dataset (object store)"]
LD["Streaming dataloaders"]
end
subgraph FAB["Training fabric (100k GPUs)"]
WG0["Worker group 0"]
WG1["Worker group 1"]
WGN["Worker group N"]
end
CKPT["Checkpoint store (parallel FS + object)"]
MON["Metrics: MFU, loss, grad-norm, step-time"]
SCH --> WG0
SCH --> WG1
SCH --> WGN
OBJ --> LD
LD --> WG0
LD --> WG1
LD --> WGN
WG0 --> CKPT
WG1 --> CKPT
WGN --> CKPT
WG0 --> MON
WG1 --> MON
WGN --> MON
MON --> HM
HM --> SCH
SCH --> TRK
Control plane schedules and watches; dataloaders stream shards into worker groups; every group checkpoints to shared storage and emits metrics that feed health-based recovery.
- Gang scheduler — places the whole job atomically with topology awareness (co-locate TP ranks on one NVLink domain; keep PP/DP within the same IB rail where possible), manages quotas and preemption, and coordinates elastic re-rendezvous.
- Worker groups — the GPUs themselves, organized by the 3D parallel mapping. Each group runs the forward/backward loop and the NCCL collectives described above.
- Data pipeline — the corpus is pre-tokenized and sharded in an object store; streaming dataloaders pull deterministically (by rank + step) so order is reproducible and no rank starves the GPUs.
- Checkpoint store — a parallel filesystem or object store sized for sharded, high-bandwidth, async writes and fast parallel reads on restore.
- Health monitor + metrics — heartbeats and per-rank step-time drive failure and straggler detection; the loop closes back to the scheduler for automatic recovery.
Bottlenecks & scaling
Scaling from one node to 100k GPUs is a march from one bottleneck to the next. Each shows up as a dip in MFU; the mitigation usually trades one resource for another.
| Bottleneck | Symptom | Mitigation |
|---|---|---|
| Communication overhead | MFU falls as DP grows; GPUs idle on all-reduce | Overlap comm with compute, gradient bucketing, hierarchical all-reduce, SHARP |
| Memory wall | OOM; model + optimizer + activations exceed 80 GB | ZeRO/FSDP sharding, TP, activation checkpointing, CPU/NVMe offload |
| Stragglers | One slow rank gates every collective; uneven step-time | Per-rank monitoring, quarantine bad HW, backup workers, rebalance |
| Checkpoint I/O | Training stalls for minutes during writes | Async + in-memory snapshot, sharded parallel writes, tuned interval |
| Network topology | Oversubscribed switches; cross-rail traffic congests | Rail-optimized / non-blocking fabric, topology-aware placement, locality |
| Pipeline bubble | Stages idle during fill/drain; low PP efficiency | More micro-batches, interleaved 1F1B schedule, balanced stage sizing |
Summary
Training across ~100k GPUs is fundamentally a
memory-vs-communication-vs-compute balancing act.
Shard the 16-bytes-per-param state with ZeRO/FSDP
and TP to beat the 80 GB memory wall; compose data,
tensor, and pipeline parallelism and pin each axis to the right
network tier (TP on NVLink, PP/DP on InfiniBand);
hide NCCL collectives behind compute with overlap
and bucketing to defend MFU; and treat
failure as routine
with async sharded checkpointing, hot spares, and elastic recovery.
Get those four right and the fleet stays busy — which, at this price
per GPU-hour, is the entire game.