System Design Notes

AI / ML Infrastructure

GPU Health Monitoring & Failure Detection

A modern training run spreads one tightly coupled computation across tens of thousands of GPUs, and the whole job moves at the speed of its sickest device. A single accelerator that throws an ECC error, falls off the PCIe bus, or simply runs slow can stall or crash a 100,000-GPU job, vaporizing hours of compute across the entire fleet. Worse, a GPU can return wrong answers without crashing. This is silent data corruption (SDC), and it can quietly poison a model for days. A GPU health system continuously collects telemetry from every device, detects both loud and silent failures, pinpoints the exact culprit, and automatically drains and remediates it so the job keeps moving. This is the immune-system role that tools such as NVIDIA DCGM play in DGX SuperPOD deployments and hyperscaler GPU fleets.

Requirements

Frame the system as a fleet immune system: it watches every GPU, recognizes the early symptoms of disease, isolates the infected node before it spreads, and heals it back into service — all without taking down the patient (the training job) it is trying to protect.

Why it matters

Distributed training is bulk-synchronous: every rank must reach each all-reduce barrier together, so one bad GPU stalls all 100k. A hard crash is actually the easy case — you notice immediately and restart from a checkpoint. The nightmare is silent data corruption: a GPU that quietly computes 2 + 2 = 5 produces no error, no crash, no log line — it just corrupts gradients for days until someone notices the loss curve is wrong and burns a week of fleet time bisecting which node lied. Silent failure is far worse than a clean crash.

| Functional | Non-functional |
| --- | --- |
| Collect telemetry from every GPU/host: temperature, power, clocks, ECC errors, XID events, NVLink/PCIe errors, SM utilization, HBM row-remap. | Fast detection / low MTTR: flag hard failures within seconds and complete remediation within minutes; MTTR dominates lost GPU-hours. |
| Detect hard failures (XID, double-bit ECC, fell-off-bus, NVLink down, thermal trip) and silent failures (SDC, stragglers). | Low overhead: the agent must use negligible GPU/CPU/network; it cannot steal training cycles or perturb the very metrics it measures. |
| Locate the culprit: pinpoint the exact GPU / host / NVLink / cable, not just "the job died". Root-cause to a serial number. | Low false-positive rate: a needless drain evicts a job and wastes ultra-expensive GPUs; flapping erodes trust in automation. |
| Drain & remediate: cordon the node, evacuate work, attempt GPU reset / reboot, escalate to RMA, and return healthy nodes to the pool. | High recall: a missed SDC silently corrupts the model; for the worst class of fault, misses cost more than false alarms. |
| Keep training alive: signal the scheduler/job to exclude the bad node, restart from checkpoint, and swap in a hot spare. | Scale to 100k GPUs: ingest and evaluate fleet-wide telemetry in near-real-time without a central choke point. |
| Dashboards, history & alerting: fleet health KPIs, per-node timelines, and on-call paging with deduplicated incidents. | Reliable & isolated: monitoring must survive the failures it watches; prefer out-of-band paths (BMC/IPMI) that work even when the host is wedged. |

Scale & estimates

Two numbers drive the design: the telemetry ingest rate (how much data per second the pipeline must absorb) and the fleet failure rate (how often something breaks). The first proves the pipeline must be a horizontally-scaled stream, not a database you poll. The second proves remediation must be automated — at this scale a human cannot keep up.

| Dimension | Estimate | How / notes |
| --- | --- | --- |
| GPUs in fleet | 100,000 | 12,500 hosts × 8 GPUs. Plus NICs, NVSwitches, optics, PSUs, all monitored. |
| Metrics per GPU per sample | ~30 fields | DCGM exposes temp, power, SM/mem util, SM clocks, ECC SBE/DBE counts, XID, NVLink bandwidth + error counters, PCIe replay, HBM remapped rows, throttle reasons. |
| Sample interval | 1-10 s | Fast (1 s) for error counters and step-time; slow (10 s) for slow-moving temp/power. Adaptive: speed up around an incident. |
| Sustained ingest | ~0.6-3 M points/s | 100k × 30 / 5 s ≈ 600k/s; at 1 s ≈ 3 M/s. ≈ 50-260 B points/day. |
| Raw write volume | ~1-8 TB/day | ~25 bytes/point on the wire; far less at rest after delta-of-delta + compression in the TSDB. |
| Hardware events (XID/ECC) | sparse but bursty | Most samples are boring; a failing GPU emits a storm. Events must be lossless and ordered, unlike gauges which can drop a sample. |
| Fleet failure interval | minutes to tens of minutes | Even at a per-accelerator MTBF of years, 100k devices means something fails every few minutes to tens of minutes. Automation, not pagers. |
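The ingest and volume rows can be reproduced with a few lines of arithmetic. This is a minimal sketch that takes the fleet size, field count, and ~25 bytes/point straight from the table; the function name `ingest` is illustrative only.

```python
GPUS = 100_000
FIELDS_PER_SAMPLE = 30
BYTES_PER_POINT = 25        # wire size assumed in the table above
SECONDS_PER_DAY = 86_400


def ingest(sample_interval_s: float) -> dict:
    """Points/s, points/day and raw TB/day for one fleet-wide sampling interval."""
    points_per_s = GPUS * FIELDS_PER_SAMPLE / sample_interval_s
    points_per_day = points_per_s * SECONDS_PER_DAY
    tb_per_day = points_per_day * BYTES_PER_POINT / 1e12
    return {
        "points_per_s": points_per_s,
        "points_per_day": points_per_day,
        "tb_per_day": tb_per_day,
    }


for interval in (5, 1):
    r = ingest(interval)
    print(f"{interval} s: {r['points_per_s']:,.0f} pts/s, "
          f"{r['points_per_day'] / 1e9:.1f} B pts/day, {r['tb_per_day']:.2f} TB/day")
```

Under these assumptions the raw volume works out to about 1.3 TB/day at a 5 s interval and about 6.5 TB/day at 1 s, which sits inside the table's ~1-8 TB/day range.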

Back-of-envelope: the MTBF wall

Meta's published Llama 3 405B run used 16,384 GPUs and hit 419 unexpected interruptions over 54 days, roughly one every 3 hours, with about 78% traced to hardware (GPUs and HBM dominating). Scale that rate linearly to 100k GPUs and the mean time between interruptions drops to roughly 30 minutes. The takeaway: at this scale failure is the steady state, so detection and remediation must be a fast, automated control loop, and jobs must checkpoint frequently enough that the lost work between failures stays small.
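The extrapolation from the Llama 3 figures can be checked directly. The sketch below assumes interruptions scale linearly with GPU count (independent failures at a constant per-GPU rate), which is a simplification; `interval_hours` is an illustrative name.

```python
REF_GPUS = 16_384       # GPUs in the Llama 3 405B run
INTERRUPTIONS = 419     # unexpected interruptions reported
DAYS = 54               # length of the reporting window


def interval_hours(gpus: int) -> float:
    """Mean hours between interruptions for a fleet of `gpus`, scaled linearly."""
    ref_interval = DAYS * 24 / INTERRUPTIONS   # hours between interruptions at REF_GPUS
    return ref_interval * REF_GPUS / gpus       # more GPUs, proportionally shorter interval


print(f"16,384 GPUs: one interruption every {interval_hours(16_384):.2f} h")
print(f"100,000 GPUs: one interruption every {interval_hours(100_000) * 60:.0f} min")
```

With these inputs the reference run sees an interruption about every 3.09 hours, and a 100k-GPU fleet about every 30 minutes.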

Failure signals & taxonomy

Good detection starts with knowing what a sick GPU looks like. Signals split into three buckets by how they announce themselves: hard failures shout (an error code, a dead link), silent data corruption says nothing but lies, and stragglers are alive and correct but slow. Each bucket needs a different detector and a different response.

flowchart TD
    S["GPU Signal"] --> H{"Hard, silent, or slow?"}
    H -->|hard| HF["XID / ECC DBE / Fell off bus / NVLink down / Thermal"]
    H -->|silent| SDC["Silent Data Corruption (wrong math)"]
    H -->|slow| STR["Straggler (slow step time)"]
    HF --> ACT1["Immediate cordon + drain"]
    SDC --> ACT2["Quarantine + verify (redundant compute)"]
    STR --> ACT3["Isolate + benchmark"]
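
The three-way triage in the flowchart can be sketched as a small classifier. The XID set comes from the signal table in this section; the `Observation` fields, the `slow_ratio` threshold of 1.2, and the action strings are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Bucket(Enum):
    HARD = "hard"
    SILENT = "silent"
    SLOW = "slow"


# Response per bucket, mirroring the flowchart.
ACTIONS = {
    Bucket.HARD: "immediate cordon + drain",
    Bucket.SILENT: "quarantine + verify",
    Bucket.SLOW: "isolate + benchmark",
}

HARD_XIDS = {48, 79, 94, 95}   # DBE and fell-off-bus codes from the signal table


@dataclass
class Observation:
    xid: Optional[int] = None         # last XID reported by the driver, if any
    checksum_mismatch: bool = False   # hypothetical flag set by a redundant-compute check
    step_time_ratio: float = 1.0      # this rank's step time / fleet median


def triage(obs: Observation, slow_ratio: float = 1.2) -> Optional[Bucket]:
    """Return the failure bucket for one observation, or None if it looks healthy."""
    if obs.xid in HARD_XIDS:
        return Bucket.HARD
    if obs.checksum_mismatch:
        return Bucket.SILENT
    if obs.step_time_ratio >= slow_ratio:
        return Bucket.SLOW
    return None


print(ACTIONS[triage(Observation(xid=79))])
```

Hard faults are checked first so that a GPU that is both slow and throwing XID 79 is drained rather than benchmarked.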
      
| Signal | What it means | Action |
| --- | --- | --- |
| XID 79: GPU fell off the bus | The driver lost PCIe contact with the GPU, typically dead silicon, a bad seating, or power fault. Fatal. | Cordon immediately, kill the job's tasks on that host, reset/reboot; escalate to RMA if it recurs. |
| XID 48 / 94 / 95: double-bit ECC (DBE) | An uncorrectable memory error in HBM. Data is already wrong; the GPU may halt the context. | Treat as hard fault: drain, run row-remap, re-test. Persistent DBEs → RMA the board. |
| Single-bit ECC (SBE) rate | Corrected automatically, but a rising SBE rate predicts an imminent DBE, making it a leading indicator. | Watch the trend; pre-emptively schedule drain at the next checkpoint before it becomes uncorrectable. |
| HBM row-remap pending / failure (XID 63/64) | Memory is retiring bad rows; a remap failure means it ran out of spare rows. | Remap-pending needs a reset to take effect; remap-failure → RMA. |
| NVLink / NVSwitch error or down | An intra-node GPU-to-GPU link degraded or dropped; all-reduce bandwidth craters or the collective hangs. | Diagnose link vs switch vs cable; if a link stays down, drain the node (the gang can't run fast). |
| PCIe replay / AER errors | Correctable bus errors; a rising replay rate signals a marginal slot, riser, or cable. | Trend + threshold; schedule maintenance before it becomes a fell-off-bus. |
| Thermal throttle / over-temp / power cap | The GPU is clocking down to protect itself, usually because of cooling (fan, pump, dust) or a hot rack, not the die. | Alert facilities; the GPU is healthy but the job is slow, often a straggler root cause. |
| Straggler: high step time on one rank | This GPU is correct but consistently the last to reach the barrier; it drags the whole job's throughput. | Compare per-rank step time; isolate, benchmark (NCCL/burn-in), and evict if it stays slow. |
| Silent data corruption (SDC) | The GPU computes wrong results with no error and no crash, the most dangerous failure mode. | Catch via redundant/duplicate computation, checksums, and periodic known-answer tests; quarantine on mismatch. |
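The known-answer test in the SDC row can be illustrated with a toy example. This sketch uses a small integer matrix multiply in pure Python so that results are bit-exact; a real check would run a GPU kernel and would need either deterministic kernels or a numeric tolerance. `reference_matmul`, `corrupting_matmul`, and the test matrices are all hypothetical stand-ins.

```python
import hashlib
import struct
from typing import Callable, List

Matrix = List[List[int]]


def reference_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain integer matrix multiply standing in for a trusted device."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def digest(m: Matrix) -> str:
    """SHA-256 over the packed result so one wrong element changes the checksum."""
    h = hashlib.sha256()
    for row in m:
        for v in row:
            h.update(struct.pack("<q", v))
    return h.hexdigest()


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
GOLDEN = digest(reference_matmul(A, B))   # computed once on known-good hardware


def known_answer_test(device_matmul: Callable[[Matrix, Matrix], Matrix]) -> bool:
    """True if the device reproduces the golden checksum for the fixed inputs."""
    return digest(device_matmul(A, B)) == GOLDEN


def corrupting_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Simulated faulty device: one element is silently off by one."""
    out = reference_matmul(a, b)
    out[0][0] += 1
    return out


print(known_answer_test(reference_matmul), known_answer_test(corrupting_matmul))
```

The faulty device raises no exception and logs nothing; only the checksum comparison exposes it, which is why the signal has to be manufactured.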

High-level design

The architecture is a classic observability control loop: a lightweight node agent on every host scrapes the driver, ships metrics and events through a streaming pipeline, a detection engine turns that firehose into incidents, and a remediation controller acts on them by driving the scheduler to cordon and drain. The time-series store feeds dashboards and alerting in parallel.

flowchart LR
    GPU["GPU Fleet (100k)"] --> AGENT["Node Agent (DCGM)"]
    AGENT -->|metrics + events| PIPE["Streaming Pipeline (Kafka)"]
    PIPE --> TSDB["Metrics Store (TSDB)"]
    PIPE --> DET["Detection / Anomaly Engine"]
    TSDB --> DET
    DET -->|incident| REM["Remediation Controller"]
    REM -->|cordon / drain| SCHED["Cluster Scheduler"]
    SCHED --> GPU
    DET --> ALERT["Alerting (on-call)"]
    TSDB --> DASH["Dashboards (Grafana)"]
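
On the agent side, the split between lossless events and droppable gauges can be sketched as two buffers with different overflow behaviour. This is an in-memory illustration only: `AgentBuffer` and its methods are hypothetical, and a production agent would also need durable storage and acknowledgements from the pipeline for the event path.

```python
from collections import deque
from typing import Any, Dict, List, Tuple


class AgentBuffer:
    """Buffers telemetry between scrapes and sends to the streaming pipeline."""

    def __init__(self, gauge_capacity: int = 1000) -> None:
        # Gauges: bounded; the oldest sample is discarded when the buffer is full.
        self.gauges: deque = deque(maxlen=gauge_capacity)
        # Events: unbounded and sequence-numbered so order survives batching.
        self.events: deque = deque()
        self.dropped_gauges = 0
        self._seq = 0

    def add_gauge(self, name: str, value: float) -> None:
        if len(self.gauges) == self.gauges.maxlen:
            self.dropped_gauges += 1   # count the sample that append() will evict
        self.gauges.append((name, value))

    def add_event(self, kind: str, detail: Any) -> None:
        self._seq += 1
        self.events.append((self._seq, kind, detail))

    def flush(self) -> Dict[str, List[Tuple]]:
        """Return everything buffered and clear both queues."""
        batch = {"events": list(self.events), "gauges": list(self.gauges)}
        self.events.clear()
        self.gauges.clear()
        return batch


buf = AgentBuffer(gauge_capacity=2)
for temp in (61.0, 62.0, 63.0):
    buf.add_gauge("gpu0.temp_c", temp)
buf.add_event("xid", 79)
print(buf.flush(), buf.dropped_gauges)
```

Dropping a temperature sample under back-pressure costs little, while dropping an XID event could hide the one signal that explains a crash.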
      

Deep dive: detection

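Detection layers four techniques, cheapest and loudest first: instant threshold rules on XID/ECC events for hard faults, peer and baseline anomaly detection (including per-rank straggler checks) for degradation, active NCCL/burn-in health checks before a node is scheduled, and redundant compute plus checksums for SDC. The goal is to catch hard faults instantly, catch slow and silent faults reliably, and keep the false-positive rate low enough that automated remediation is trusted.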
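As one example of a peer-comparison detector, stragglers can be flagged by comparing each rank's step time against the fleet median. The sketch below uses the median absolute deviation (MAD) as a robust spread estimate; the multiplier `k`, the `min_slowdown` floor, and the assumption that `step_times` holds per-rank compute time measured before the barrier (wall-clock step time is equalized by the collective) are all illustrative choices.

```python
from statistics import median
from typing import Dict, List


def find_stragglers(step_times: Dict[int, float],
                    k: float = 5.0,
                    min_slowdown: float = 1.05) -> List[int]:
    """Return ranks whose step time is far above the median of their peers."""
    med = median(step_times.values())
    mad = median(abs(t - med) for t in step_times.values())
    # Require both a statistical outlier (k * MAD) and a material slowdown,
    # so a very tight fleet does not flag ranks that are only marginally slow.
    threshold = max(med + k * mad, med * min_slowdown)
    return sorted(rank for rank, t in step_times.items() if t > threshold)


times = {0: 1.00, 1: 1.01, 2: 0.99, 3: 1.00, 4: 1.35, 5: 1.02, 6: 1.00, 7: 0.98}
print(find_stragglers(times))   # rank 4 is about 35% slower than its peers
```

Median and MAD are used instead of mean and standard deviation because a single slow rank would otherwise inflate the baseline it is being compared against.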

The end-to-end flow from a detected fault through cordon, drain, and remediation back into the pool:

sequenceDiagram
    participant A as Node Agent
    participant D as Detection Engine
    participant R as Remediation Controller
    participant S as Scheduler
    participant J as Training Job
    A->>D: Stream telemetry (XID, ECC, step time)
    D->>D: Threshold + anomaly + straggler check
    D->>R: Raise incident (GPU 7, node N)
    R->>S: Cordon node N (no new work)
    R->>J: Signal: exclude node N
    J-->>R: Checkpointed, rank released
    R->>R: Drain + GPU reset / reboot
    alt Health check passes
        R->>S: Return node to pool
    else Still failing
        R->>R: Open RMA, keep cordoned
    end
      

Detection latency vs false positives

The two failure modes of a detector pull in opposite directions. Page too eagerly and you drain healthy GPUs, wasting money and triggering needless checkpoint-restarts; wait too long and a sick GPU stalls the job or corrupts data. The resolution is to tier severity: act in milliseconds on unambiguous hard faults, but require confirmation (a second observation or an active benchmark) before evicting on soft anomalies and stragglers.
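The tiering rule can be sketched as a detector that acts on the first hard fault but makes soft anomalies accumulate before escalating. `TieredDetector`, the returned action strings, and the default of three consecutive observations are assumptions for illustration, not values from a real system.

```python
from collections import defaultdict
from typing import DefaultDict


class TieredDetector:
    """Immediate action on hard faults; confirmation required for soft anomalies."""

    def __init__(self, confirmations_needed: int = 3) -> None:
        self.needed = confirmations_needed
        self._soft_streak: DefaultDict[str, int] = defaultdict(int)

    def observe(self, gpu_id: str, kind: str) -> str:
        """`kind` is 'hard', 'soft', or 'healthy'; returns the action to take."""
        if kind == "hard":
            return "drain"                    # unambiguous: no confirmation step
        if kind == "healthy":
            self._soft_streak[gpu_id] = 0     # a clean sample resets the streak
            return "ok"
        self._soft_streak[gpu_id] += 1
        if self._soft_streak[gpu_id] >= self.needed:
            return "benchmark"                # confirm actively before evicting
        return "watch"


det = TieredDetector(confirmations_needed=2)
print(det.observe("gpu-7", "soft"), det.observe("gpu-7", "soft"), det.observe("gpu-3", "hard"))
```

Even after the streak is reached, the soft path escalates to an active benchmark rather than a drain, so eviction always rests on a second, independent signal.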

Deep dive: remediation

Detection is worthless without a fast, safe recovery loop. Remediation must do two jobs at once: heal the hardware (or get it out of the fleet) and keep the training job alive so a single bad GPU doesn't cost a full restart.

flowchart TD
    INC["Incident: unhealthy GPU"] --> COR["Cordon (block new work)"]
    COR --> DR["Drain running tasks"]
    DR --> RST{"Recoverable in place?"}
    RST -->|"soft: ECC SBE, hang"| GR["GPU reset / node reboot"]
    RST -->|"hard: DBE, off bus"| RMA["Open RMA ticket"]
    GR --> HC{"Active health check passes?"}
    HC -->|yes| POOL["Return to pool"]
    HC -->|no| RMA
    RMA --> TECH["Datacenter tech swaps HW"]
    TECH --> POOL
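
The remediation ladder in the flowchart can be written as a short function that returns the ordered steps for a fault. The step names and the `health_check` callback are hypothetical; in practice each step would call the scheduler, the driver, or a ticketing system.

```python
from typing import Callable, List


def remediate(fault: str, health_check: Callable[[], bool]) -> List[str]:
    """Return the remediation steps for a 'soft' or 'hard' fault, in order."""
    if fault not in ("soft", "hard"):
        raise ValueError(f"unknown fault class: {fault!r}")

    steps = ["cordon", "drain"]          # always block new work, then evacuate
    if fault == "soft":
        steps.append("reset")            # GPU reset or node reboot
        if health_check():               # active check must pass before reuse
            steps.append("return_to_pool")
            return steps
    # Hard faults, and soft faults that fail the post-reset check, go to RMA.
    steps += ["open_rma", "hw_swap", "return_to_pool"]
    return steps


print(remediate("soft", lambda: True))
print(remediate("hard", lambda: True))
```

A node only re-enters the pool through one of two gates: a passing active health check or a hardware swap.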
      

Trade-off: automated vs human-in-the-loop

Full automation is mandatory for the common, well-understood faults — at minutes-between-failures a human cannot keep up, and a cordon/drain/reset loop is safe and reversible. But automation should be conservative on destructive or ambiguous actions: auto-cordon freely, auto-reset readily, but gate mass drains (e.g., a whole rack flagged at once — likely a monitoring or network blip, not 64 dead GPUs) and RMA decisions behind confirmation or a human. The design principle: cheap and reversible → automate; expensive and irreversible → require a second signal or a person.
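The mass-drain gate can be sketched as a per-rack check on how much of the rack was flagged in the same window. The 50% cutoff, the function name `gate_drains`, and the decision strings are assumptions for illustration; the example uses 8 nodes per rack to match the 64-GPU rack mentioned above.

```python
from typing import Dict, Set


def gate_drains(flagged: Dict[str, Set[str]],
                nodes_per_rack: Dict[str, int],
                max_auto_fraction: float = 0.5) -> Dict[str, str]:
    """Decide per rack whether flagged nodes may be drained automatically."""
    decisions = {}
    for rack, nodes in flagged.items():
        fraction = len(nodes) / nodes_per_rack[rack]
        # Most of a rack failing at once is more likely a shared cause
        # (monitoring, network, power) than that many independent GPU faults.
        if fraction > max_auto_fraction:
            decisions[rack] = "hold_for_confirmation"
        else:
            decisions[rack] = "auto_drain"
    return decisions


flagged = {
    "rack-a": {f"node-{i}" for i in range(8)},   # whole rack flagged together
    "rack-b": {"node-3"},                        # isolated fault
}
print(gate_drains(flagged, {"rack-a": 8, "rack-b": 8}))
```

Cordoning the held rack can still happen immediately, since it is cheap and reversible; only the eviction waits for a second signal or a person.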

Bottlenecks & scaling

A health system rarely fails on raw throughput — it fails on signal quality: drowning in telemetry, crying wolf, reacting too slowly, or never seeing the silent faults at all. The mitigations push work to the edge, tier severity, and accept that SDC needs a fundamentally different (redundancy-based) approach.

| Bottleneck | Why it hurts | Mitigation |
| --- | --- | --- |
| Telemetry ingest volume | Millions of points/sec from 12,500 hosts can swamp the pipeline and the TSDB, and naive 1 s scraping wastes CPU/network. | Pre-aggregate at the agent; adaptive sampling (slow when healthy, fast around incidents); partition the stream by host; downsample + tier retention in the TSDB; keep events lossless but gauges droppable. |
| False positives vs missed failures | Over-eager drains waste expensive GPUs and trigger needless restarts; missed faults stall jobs or corrupt models. | Tier severity (instant on hard faults, confirm on soft); require a second observation or active benchmark before eviction; suppress/dedup correlated alerts; track precision/recall per rule and tune. |
| Detection latency | Every second a sick GPU runs, 99,999 others may be blocked at the barrier, so latency multiplies across the gang. | Evaluate hard-fault rules on the event stream (ms), not by polling the TSDB; push critical events from the agent immediately; co-locate detectors with the stream; keep heavy ML detectors off the hot path. |
| SDC is fundamentally hard | Silent corruption emits no metric to threshold; you cannot detect the absence of an error. | Manufacture a signal: redundant/duplicate compute + checksums, periodic known-answer tests, trainer-level NaN/grad-norm guards; accept sampling cost; quarantine + reconfirm on a known-good device before blaming hardware. |
| Monitoring blind during failures | A wedged host can't report that it's wedged; in-band agents die with the node they watch. | Out-of-band BMC/IPMI liveness + power/thermal; treat heartbeat loss as its own signal; run the control plane on separate infrastructure from the GPU fleet. |
| Correlated / cascading alerts | One failed NVSwitch or rack PDU lights up dozens of GPUs at once, burying the root cause in noise. | Topology-aware correlation (roll up GPU → node → NVSwitch → rack); alert on the cause, suppress the symptoms; gate mass actions behind confirmation. |
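Topology-aware correlation can be sketched as a roll-up over a parent map: when most children of a component are alerting, report the component and suppress the children. The 75% cutoff, the `roll_up` name, and the two-level example topology are assumptions; a real hierarchy would include the NVSwitch and PDU levels named in the table.

```python
from collections import defaultdict
from typing import Dict, Set


def roll_up(alerting: Set[str], parent: Dict[str, str],
            min_fraction: float = 0.75) -> Set[str]:
    """Replace groups of alerting siblings with their shared parent component."""
    children = defaultdict(set)
    for child, par in parent.items():
        children[par].add(child)

    causes = set(alerting)
    changed = True
    while changed:                     # repeat so roll-ups can climb several levels
        changed = False
        for par, kids in children.items():
            hit = kids & causes
            if par not in causes and len(kids) > 1 and len(hit) / len(kids) >= min_fraction:
                causes -= kids         # suppress the symptoms
                causes.add(par)        # alert on the likely cause
                changed = True
    return causes


parent = {f"g{i}": "node-a" for i in range(4)}
parent.update({f"g{i}": "node-b" for i in range(4, 8)})
parent.update({"node-a": "rack-1", "node-b": "rack-1"})

print(sorted(roll_up({"g0", "g1", "g2", "g3", "g5"}, parent)))
```

Here the four alerts under `node-a` collapse into one node-level incident, while the lone `g5` alert stays at GPU level and the rack is left alone because only one of its two nodes is implicated.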

Summary — what a staff answer nails

Lead with the defining constraint: training is all-or-nothing and bulk-synchronous, so one sick GPU stalls 100k, and silent data corruption is worse than a clean crash. Build an observability control loop — cheap DCGM-style agents stream telemetry and XID/ECC events through a partitioned log into a detection engine and a TSDB. Tier detection: instant threshold rules for hard faults, peer/baseline anomaly + straggler detection for degradation, active NCCL/burn-in health checks before scheduling, and redundant-compute / checksums for SDC. Close the loop with an automated cordon → drain → reset/reboot → RMA ladder that keeps the job alive via checkpoint-restart and hot spares. Tune for low MTTR and a low false-positive rate, automate the cheap/reversible actions, keep humans on the expensive/irreversible ones, and stay observable through out-of-band paths when the host itself goes dark.