AI / ML Infrastructure

GPU Health Monitoring & Failure Detection

A modern training run spreads one tightly coupled computation across tens of thousands of GPUs, and the whole job moves at the speed of its sickest device. A single accelerator that throws an ECC error, falls off the PCIe bus, or simply runs slow can stall or crash a 100,000-GPU job, wasting hours of compute across the entire fleet. Worse, a GPU can return wrong answers without crashing (silent data corruption, or SDC), quietly poisoning a model for days. A GPU health system continuously collects telemetry from every device, detects both loud and silent failures, pinpoints the exact culprit, and automatically drains and remediates it so the job keeps moving. It is the immune system of a large GPU fleet: the layer built on tooling such as NVIDIA DCGM and run in deployments like the DGX SuperPOD and hyperscaler clusters.

Requirements

Frame the system as a fleet immune system: it watches every GPU, recognizes the early symptoms of disease, isolates the infected node before it spreads, and heals it back into service — all without taking down the patient (the training job) it is trying to protect.

Why it matters

Distributed training is bulk-synchronous: every rank must reach each all-reduce barrier together, so one bad GPU stalls all 100k. A hard crash is actually the easy case — you notice immediately and restart from a checkpoint. The nightmare is silent data corruption: a GPU that quietly computes 2 + 2 = 5 produces no error, no crash, no log line — it just corrupts gradients for days until someone notices the loss curve is wrong and burns a week of fleet time bisecting which node lied. Silent failure is far worse than a clean crash.

Functional	Non-functional
Collect telemetry from every GPU/host — temperature, power, clocks, ECC errors, `XID` events, NVLink/PCIe errors, SM utilization, HBM row-remap.	Fast detection / low MTTR — flag hard failures within seconds and complete remediation within minutes; MTTR dominates lost GPU-hours.
Detect hard failures (XID, double-bit ECC, fell-off-bus, NVLink down, thermal trip) and silent failures (SDC, stragglers).	Low overhead — the agent must use negligible GPU/CPU/network; it cannot steal training cycles or perturb the very metrics it measures.
Locate the culprit — pinpoint the exact `GPU / host / NVLink / cable`, not just "the job died". Root-cause to a serial number.	Low false-positive rate — a needless drain evicts a job and wastes ultra-expensive GPUs; flapping erodes trust in automation.
Drain & remediate — cordon the node, evacuate work, attempt GPU reset / reboot, escalate to RMA, and return healthy nodes to the pool.	High recall — a missed SDC silently corrupts the model; for the worst class of fault, misses cost more than false alarms.
Keep training alive — signal the scheduler/job to exclude the bad node, restart from checkpoint, and swap in a hot spare.	Scale to 100k GPUs — ingest and evaluate fleet-wide telemetry in near-real-time without a central choke point.
Dashboards, history & alerting — fleet health KPIs, per-node timelines, and on-call paging with deduplicated incidents.	Reliable & isolated — monitoring must survive the failures it watches; prefer out-of-band paths (BMC/IPMI) that work even when the host is wedged.

Scale & estimates

Two numbers drive the design: the telemetry ingest rate (how much data per second the pipeline must absorb) and the fleet failure rate (how often something breaks). The first proves the pipeline must be a horizontally-scaled stream, not a database you poll. The second proves remediation must be automated — at this scale a human cannot keep up.

Dimension	Estimate	How / notes
GPUs in fleet	100,000	12,500 hosts × 8 GPUs. Plus NICs, NVSwitches, optics, PSUs — all monitored.
Metrics per GPU per sample	~30 fields	DCGM exposes temp, power, SM/mem util, SM clocks, ECC SBE/DBE counts, XID, NVLink bandwidth + error counters, PCIe replay, HBM remapped rows, throttle reasons.
Sample interval	1-10 s	Fast (1 s) for error counters and step-time; slow (10 s) for slow-moving temp/power. Adaptive: speed up around an incident.
Sustained ingest	~0.6-3 M points/s	100k GPUs × 30 fields / 5 s ≈ 600k points/s; at a uniform 1 s interval ≈ 3 M points/s. That is ≈ 50-260 B points/day.
Raw write volume	~1.3-6.5 TB/day	≈ 50-260 B points/day × ~25 bytes/point on the wire; far less at rest after delta-of-delta + compression in the TSDB.
Hardware events (XID/ECC)	sparse but bursty	Most samples are boring; a failing GPU emits a storm. Events must be lossless and ordered, unlike gauges which can drop a sample.
Fleet failure interval	tens of minutes	Even at a per-accelerator MTBF of several years, 100k devices means something fails a few times an hour (a 5-year MTBF gives one failure roughly every 26 minutes). Automation, not pagers.
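The ingest rows above are plain arithmetic, which a few lines of Python make easy to re-run with different assumptions. This is a minimal sketch using only the table's estimates (100k GPUs, ~30 fields, ~25 bytes per point); the function and constant names are illustrative, not part of any real tool.

```python
# Back-of-envelope telemetry sizing using the estimates from the table.
GPUS = 100_000            # 12,500 hosts x 8 GPUs
FIELDS_PER_SAMPLE = 30    # metrics per GPU per sample
BYTES_PER_POINT = 25      # approximate wire size per point
SECONDS_PER_DAY = 86_400


def ingest(sample_interval_s: float) -> dict:
    """Return points/s, points/day and wire TB/day for one sample interval."""
    points_per_s = GPUS * FIELDS_PER_SAMPLE / sample_interval_s
    points_per_day = points_per_s * SECONDS_PER_DAY
    return {
        "points_per_s": points_per_s,
        "points_per_day": points_per_day,
        "tb_per_day": points_per_day * BYTES_PER_POINT / 1e12,
    }


for interval in (5, 1):
    r = ingest(interval)
    print(f"{interval} s: {r['points_per_s']:,.0f} points/s, "
          f"{r['points_per_day'] / 1e9:.0f} B points/day, "
          f"{r['tb_per_day']:.1f} TB/day")
# 5 s: 600,000 points/s, 52 B points/day, 1.3 TB/day
# 1 s: 3,000,000 points/s, 259 B points/day, 6.5 TB/day
```

Changing `sample_interval_s` to 10 shows how much adaptive sampling saves when most of the fleet is healthy.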

Back-of-envelope: the MTBF wall

Meta's published Llama 3 405B run used 16,384 GPUs and hit 419 unexpected interruptions over 54 days, roughly one every 3 hours, with about 78% traced to hardware (GPUs and HBM dominating). Scale that rate linearly to 100k GPUs and the mean time between fleet interruptions collapses to roughly half an hour. The takeaway: at this scale failure is the steady state, so detection and remediation must be a fast, automated control loop, and jobs must checkpoint frequently enough that the work lost to each failure stays small.
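The scaling step can be checked directly. The sketch below takes the three figures quoted above and assumes, as a simplification, that the interruption rate grows linearly with GPU count; the variable and function names are illustrative.

```python
# Figures quoted above for the Llama 3 405B run.
GPUS_OBSERVED = 16_384
INTERRUPTIONS = 419
DAYS = 54

# Job-level interval between interruptions, in hours.
hours_between = DAYS * 24 / INTERRUPTIONS
# GPU-hours of operation per interruption (an implied per-GPU MTBF).
gpu_hours_per_failure = hours_between * GPUS_OBSERVED


def fleet_interval_minutes(gpus: int) -> float:
    """Expected minutes between interruptions, assuming linear scaling with GPU count."""
    return gpu_hours_per_failure / gpus * 60


print(f"observed: one interruption every {hours_between:.2f} h")
print(f"implied per-GPU MTBF: {gpu_hours_per_failure / (24 * 365):.1f} years")
print(f"at 100k GPUs: one every {fleet_interval_minutes(100_000):.1f} min")
```

Under that linear assumption the implied per-GPU MTBF is a little under six years, and a 100k-GPU fleet sees an interruption about every 30 minutes.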

Failure signals & taxonomy

Good detection starts with knowing what a sick GPU looks like. Signals split into three buckets by how they announce themselves: hard failures shout (an error code, a dead link), silent data corruption says nothing but lies, and stragglers are alive and correct but slow. Each bucket needs a different detector and a different response.

```mermaid
flowchart TD
    S["GPU Signal"] --> H{"Hard, silent, or slow?"}
    H -->|hard| HF["XID / ECC DBE / Fell off bus / NVLink down / Thermal"]
    H -->|silent| SDC["Silent Data Corruption (wrong math)"]
    H -->|slow| STR["Straggler (slow step time)"]
    HF --> ACT1["Immediate cordon + drain"]
    SDC --> ACT2["Quarantine + verify (redundant compute)"]
    STR --> ACT3["Isolate + benchmark"]
```

Signal	What it means	Action
`XID 79` — GPU fell off the bus	The driver lost PCIe contact with the GPU — typically dead silicon, a bad seating, or power fault. Fatal.	Cordon immediately, kill the job's tasks on that host, reset/reboot; escalate to RMA if it recurs.
`XID 48 / 94 / 95` — double-bit ECC (DBE)	An uncorrectable memory error in HBM. Data is already wrong; the GPU may halt the context.	Treat as hard fault: drain, run row-remap, re-test. Persistent DBEs → RMA the board.
Single-bit ECC (SBE) rate	Corrected automatically, but a rising SBE rate predicts an imminent DBE — a leading indicator.	Watch the trend; pre-emptively schedule drain at the next checkpoint before it becomes uncorrectable.
HBM row-remap pending / failure (`XID 63/64`)	Memory is retiring bad rows; a remap failure means it ran out of spare rows.	Remap-pending needs a reset to take effect; remap-failure → RMA.
NVLink / NVSwitch error or down	An intra-node GPU-to-GPU link degraded or dropped — all-reduce bandwidth craters or the collective hangs.	Diagnose link vs switch vs cable; if a link stays down, drain the node (the gang can't run fast).
PCIe replay / AER errors	Correctable bus errors; a rising replay rate signals a marginal slot, riser, or cable.	Trend + threshold; schedule maintenance before it becomes a fell-off-bus.
Thermal throttle / over-temp / power cap	The GPU is clocking down to protect itself — usually cooling (fan, pump, dust) or a hot rack, not the die.	Alert facilities; the GPU is healthy but the job is slow — often a straggler root cause.
Straggler — high step time on one rank	This GPU is correct but consistently the last to reach the barrier; it drags the whole job's throughput.	Compare per-rank step time; isolate, benchmark (NCCL/burn-in), and evict if it stays slow.
Silent data corruption (SDC)	The GPU computes wrong results with no error and no crash — the most dangerous failure mode.	Catch via redundant/duplicate computation, checksums, and periodic known-answer tests; quarantine on mismatch.
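The XID rows of the table translate naturally into a lookup that a hard-fault rule could consult. The sketch below is one possible encoding: the codes and meanings are copied from the table, while the `Action` names, the `classify` helper, and the default for unlisted codes are assumptions made for illustration. It is not an exhaustive XID catalogue.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CORDON_DRAIN = "cordon and drain now"
    RESET = "drain, then reset so the remap takes effect"
    RMA = "cordon and open RMA"
    OBSERVE = "record and watch the trend"


@dataclass(frozen=True)
class XidRule:
    meaning: str
    action: Action


# Codes and meanings taken from the table above; unlisted codes fall through.
XID_RULES = {
    79: XidRule("GPU fell off the bus", Action.CORDON_DRAIN),
    48: XidRule("double-bit ECC (DBE)", Action.CORDON_DRAIN),
    94: XidRule("double-bit ECC (DBE)", Action.CORDON_DRAIN),
    95: XidRule("double-bit ECC (DBE)", Action.CORDON_DRAIN),
    63: XidRule("HBM row-remap pending", Action.RESET),
    64: XidRule("HBM row-remap failure", Action.RMA),
}

# Assumed policy: anything not in the table is logged, not acted on.
DEFAULT_RULE = XidRule("unclassified XID", Action.OBSERVE)


def classify(xid: int) -> XidRule:
    """Map an XID code to the table's meaning and first response."""
    return XID_RULES.get(xid, DEFAULT_RULE)


for code in (79, 63, 64, 13):
    rule = classify(code)
    print(code, rule.meaning, "->", rule.action.value)
```

Keeping the mapping as data rather than branching logic makes it easy to review and to extend as new fault codes are triaged.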

High-level design

The architecture is a classic observability control loop: a lightweight node agent on every host scrapes the driver, ships metrics and events through a streaming pipeline, a detection engine turns that firehose into incidents, and a remediation controller acts on them by driving the scheduler to cordon and drain. The time-series store feeds dashboards and alerting in parallel.

```mermaid
flowchart LR
    GPU["GPU Fleet (100k)"] --> AGENT["Node Agent (DCGM)"]
    AGENT -->|metrics + events| PIPE["Streaming Pipeline (Kafka)"]
    PIPE --> TSDB["Metrics Store (TSDB)"]
    PIPE --> DET["Detection / Anomaly Engine"]
    TSDB --> DET
    DET -->|incident| REM["Remediation Controller"]
    REM -->|cordon / drain| SCHED["Cluster Scheduler"]
    SCHED --> GPU
    DET --> ALERT["Alerting (on-call)"]
    TSDB --> DASH["Dashboards (Grafana)"]
```

Node agent (DCGM-style). A daemon per host queries the GPU driver via DCGM/NVML for field groups, tails the kernel log for XID events, and runs lightweight in-band health checks. It exports metrics (Prometheus-style) and pushes critical events immediately. Crucially it is cheap and runs at low priority so it never competes with training. An out-of-band path via the host BMC/IPMI reports power, fans, and liveness even when the OS is hung.
Streaming pipeline. A partitioned log (Kafka) decouples 12,500 producers from the consumers and absorbs bursts when a rack lights up with errors. Gauges can be downsampled; hardware events are kept lossless and ordered per host (partition by host id).
Metrics store (TSDB). A horizontally-sharded time-series DB (VictoriaMetrics / Prometheus / Cortex-style) with delta-of-delta compression and tiered retention — high resolution for days, downsampled for months — powering dashboards and ad-hoc root-cause queries.
Detection / anomaly engine. Stream processors evaluate rules and models continuously: hard-fault rules fire in milliseconds off the event stream, while anomaly/straggler detectors compare a GPU against its own history and against its peers in the same job. Output is a deduplicated incident keyed by device, not a raw alert per sample.
Remediation controller. A state machine that owns the node lifecycle (healthy → suspect → cordoned → draining → remediating → healthy | RMA). It drives the scheduler to stop placing work and to evacuate the gang, then attempts recovery. It is the only component allowed to take disruptive action, which keeps policy in one auditable place.
Scheduler integration. Cordoning marks the node unschedulable; draining signals the running job to checkpoint and release the rank so a hot spare can take its place. (See the GPU cluster scheduler for the placement side.)
Dashboards & alerting. Fleet KPIs (healthy %, MTBF, MTTR, drains/day), per-node timelines, and on-call paging — with suppression so one failing NVSwitch doesn't page eight times.
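The node lifecycle owned by the remediation controller (healthy → suspect → cordoned → draining → remediating → healthy | RMA) can be sketched as an explicit transition table. This is a minimal illustration, not a production controller: the class and attribute names are hypothetical, and only the transitions named above are allowed.

```python
from enum import Enum


class NodeState(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"
    CORDONED = "cordoned"
    DRAINING = "draining"
    REMEDIATING = "remediating"
    RMA = "rma"


# Only the transitions named in the lifecycle above are permitted.
ALLOWED = {
    NodeState.HEALTHY: {NodeState.SUSPECT},
    NodeState.SUSPECT: {NodeState.CORDONED},
    NodeState.CORDONED: {NodeState.DRAINING},
    NodeState.DRAINING: {NodeState.REMEDIATING},
    NodeState.REMEDIATING: {NodeState.HEALTHY, NodeState.RMA},
    NodeState.RMA: set(),
}


class NodeLifecycle:
    """Tracks one node's state and keeps an audit trail of every transition."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.state = NodeState.HEALTHY
        self.history = [NodeState.HEALTHY]

    def transition(self, new_state: NodeState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.node_id}: {self.state.value} -> {new_state.value} not allowed"
            )
        self.state = new_state
        self.history.append(new_state)


node = NodeLifecycle("node-0042")
for step in (NodeState.SUSPECT, NodeState.CORDONED, NodeState.DRAINING,
             NodeState.REMEDIATING, NodeState.HEALTHY):
    node.transition(step)
print([s.value for s in node.history])
```

Rejecting undeclared transitions is what keeps disruptive policy in one auditable place: a node cannot jump from healthy straight to draining without passing through the cordon step.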

Deep dive: detection

Detection layers five techniques, cheapest and loudest first. The goal is to catch hard faults instantly, catch slow/silent faults reliably, and keep the false-positive rate low enough that automated remediation is trusted.

1. Threshold rules (hard faults). Deterministic and fast: any XID in the fatal set, a double-bit ECC, a fell-off-bus, an NVLink down, or a row-remap failure is an instant incident — no learning required. These fire off the event stream in milliseconds and cover the loud majority.
2. Anomaly detection (degradation). Many faults creep in: a slowly rising SBE rate, climbing PCIe replays, a GPU running hotter than its rack peers at equal load. Compare each device to its own baseline and to the distribution of its peers (robust z-score / EWMA / simple ML). Peer comparison is powerful because all ranks in a job do identical work, so an outlier is suspicious by construction.
3. Straggler detection. The job (or agent) reports per-rank step time. In a synchronous job every rank should take ~the same time; a rank that is consistently the last to the barrier is dragging global throughput even though it never errors. Flag the persistent tail of the step-time distribution, then confirm with an active benchmark before evicting.
4. Active health checks. Don't wait for production to find the fault: test before scheduling. When a node enters the pool (or after any remediation), run a short GPU burn-in and an NCCL all-reduce / bandwidth test across its 8 GPUs and NVLinks. A node only returns to the schedulable pool after it passes, which stops a marginal GPU from poisoning the next job.
5. SDC detection (the hard one). Silent corruption emits no signal, so you must manufacture one: redundant/duplicate computation (run a sampled op on two devices and compare, or recompute and checksum), periodic known-answer tests woven into idle cycles, and numerical guardrails in the trainer (NaN/Inf checks, gradient-norm spikes, loss divergence). On mismatch, quarantine the suspect and re-run on a known-good device to confirm before blaming it. This is expensive, so it's applied as sampling + on-suspicion, not to every FLOP.
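To make the SDC item concrete, here is a minimal sketch of a known-answer test plus a re-run against a known-good reference. It is a CPU-only simulation: the "devices" are ordinary Python functions, integer inputs are used so results compare exactly, and every name is hypothetical. A real check would run on the GPU and would need a tolerance for floating-point results.

```python
import hashlib
import struct


def reference_matmul(a, b):
    """Plain-Python integer matrix multiply, standing in for a known-good device."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def checksum(matrix):
    """Stable digest of an integer matrix so results can be compared cheaply."""
    digest = hashlib.sha256()
    for row in matrix:
        digest.update(struct.pack(f"<{len(row)}q", *row))
    return digest.hexdigest()


def known_answer_test(device_matmul, a, b, expected_digest):
    """Run a fixed input on a device and compare against the stored answer."""
    return checksum(device_matmul(a, b)) == expected_digest


def find_culprits(devices, golden, a, b):
    """Re-run on a known-good device and return the names that disagree with it."""
    expected = checksum(golden(a, b))
    return [name for name, fn in devices.items() if checksum(fn(a, b)) != expected]


def corrupting_matmul(a, b):
    """Simulated faulty device: one output element is silently wrong."""
    out = reference_matmul(a, b)
    out[0][0] += 1
    return out


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
EXPECTED = checksum(reference_matmul(A, B))

print(known_answer_test(reference_matmul, A, B, EXPECTED))   # True
print(known_answer_test(corrupting_matmul, A, B, EXPECTED))  # False
print(find_culprits({"gpu0": reference_matmul, "gpu1": corrupting_matmul},
                    reference_matmul, A, B))                 # ['gpu1']
```

Comparing digests rather than full tensors keeps the check cheap enough to sample during idle cycles, and the known-good re-run is what separates the faulty device from an innocent peer before anything is quarantined.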

The end-to-end flow from a detected fault through cordon, drain, and remediation back into the pool:

```mermaid
sequenceDiagram
    participant A as Node Agent
    participant D as Detection Engine
    participant R as Remediation Controller
    participant S as Scheduler
    participant J as Training Job
    A->>D: Stream telemetry (XID, ECC, step time)
    D->>D: Threshold + anomaly + straggler check
    D->>R: Raise incident (GPU 7, node N)
    R->>S: Cordon node N (no new work)
    R->>J: Signal: exclude node N
    J-->>R: Checkpointed, rank released
    R->>R: Drain + GPU reset / reboot
    alt Health check passes
        R->>S: Return node to pool
    else Still failing
        R->>R: Open RMA, keep cordoned
    end
```

Detection latency vs false positives

The two failure modes of a detector pull in opposite directions. Act too eagerly and you drain healthy GPUs, wasting money and triggering needless checkpoint-restarts; wait too long and a sick GPU stalls the job or corrupts data. The resolution is to tier severity: act in milliseconds on unambiguous hard faults, but require confirmation (a second observation or an active benchmark) before evicting on soft anomalies and stragglers.
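For the soft tier, one way to combine peer comparison with the confirmation requirement is a robust z-score on per-rank step time that flags a rank only after several consecutive outlier steps. The sketch below is illustrative: the threshold of 3.5, the three-step confirmation window, and the 1% noise floor are assumptions, not tuned values, and the class name is hypothetical.

```python
from collections import defaultdict
from statistics import median


def robust_z(step_times):
    """Per-rank robust z-score using the median and MAD of the peer group."""
    med = median(step_times.values())
    mad = median(abs(v - med) for v in step_times.values())
    # Assumed floor of 1% of the median so ordinary jitter is never flagged
    # when the peers are almost perfectly uniform.
    scale = max(1.4826 * mad, 0.01 * med)
    return {rank: (v - med) / scale for rank, v in step_times.items()}


class StragglerDetector:
    """Flags a rank only after it is a slow outlier for several steps in a row."""

    def __init__(self, z_threshold=3.5, confirm_steps=3):
        self.z_threshold = z_threshold
        self.confirm_steps = confirm_steps
        self.streak = defaultdict(int)

    def observe(self, step_times):
        """Feed one step's per-rank times; return ranks confirmed as stragglers."""
        confirmed = []
        for rank, score in robust_z(step_times).items():
            # Only slow outliers count; a fast rank is not a straggler.
            self.streak[rank] = self.streak[rank] + 1 if score > self.z_threshold else 0
            if self.streak[rank] >= self.confirm_steps:
                confirmed.append(rank)
        return confirmed


detector = StragglerDetector()
times = {rank: 1.0 + 0.002 * rank for rank in range(8)}
times[5] = 1.30  # rank 5 is consistently about 30% slower than its peers
for step in range(3):
    print(step, detector.observe(times))
```

The first two steps report nothing and the third reports rank 5, so a single slow step never triggers an eviction. In a fuller design the confirmed rank would then go to an active benchmark rather than straight to a drain.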

Deep dive: remediation

Detection is worthless without a fast, safe recovery loop. Remediation must do two jobs at once: heal the hardware (or get it out of the fleet) and keep the training job alive so a single bad GPU doesn't cost a full restart.

```mermaid
flowchart TD
    INC["Incident: unhealthy GPU"] --> COR["Cordon (block new work)"]
    COR --> DR["Drain running tasks"]
    DR --> RST{"Recoverable in place?"}
    RST -->|"soft: ECC SBE, hang"| GR["GPU reset / node reboot"]
    RST -->|"hard: DBE, off bus"| RMA["Open RMA ticket"]
    GR --> HC{"Active health check passes?"}
    HC -->|yes| POOL["Return to pool"]
    HC -->|no| RMA
    RMA --> TECH["Datacenter tech swaps HW"]
    TECH --> POOL
```

Cordon & drain. First, stop the bleeding: mark the node unschedulable so no new work lands, then evacuate the current task. Cordon is cheap and reversible — do it the instant a node looks suspect, even before you're sure.
Escalating recovery ladder. Try the cheapest fix that could work, escalating only on failure: GPU reset (clears device state and applies a pending row-remap) → node reboot (clears wedged host state and reloads the driver) → power-cycle via BMC for a truly hung host → RMA when the fault is in the hardware (persistent DBE, fell-off-bus, remap-failure). Each rung is gated by an active health check: a node only graduates back to the pool after it passes burn-in + NCCL.
RMA workflow. A hardware failure becomes a tracked ticket: capture diagnostics, label the device by serial, route to a datacenter tech for physical swap, and keep the slot cordoned until verified. Fleet-level analytics over RMAs surface bad batches, hot racks, and firmware regressions.
Keep training alive. This is what separates a fleet tool from a checkbox. Rather than killing a 16k-GPU job because one GPU died, the controller signals the job to exclude the bad node and either run elastically at reduced width or, more commonly, swap in a pre-warmed hot spare and restart from the latest checkpoint. Frequent checkpointing bounds the work lost per failure to at most one checkpoint interval, so that interval must stay well below the gap between failures, which at 100k GPUs is short.
Hot spares & capacity buffer. Keep a few percent of nodes idle, burned-in, and ready. When a rank fails, a spare slots in within the checkpoint window instead of forcing the gang to wait for a repair — trading a little standing capacity for a lot of recovered uptime.
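The escalating ladder described above can be sketched as a loop over rungs, each followed by the health-check gate. Everything here is simulated: the rung names mirror the text, but the actions, the health check, and the two example nodes are hypothetical stand-ins for real reset, reboot, and burn-in tooling.

```python
from typing import Callable, List, Tuple

Rung = Tuple[str, Callable[[str], None]]


def remediate(node: str, rungs: List[Rung], health_check: Callable[[str], bool],
              hardware_fault: bool = False) -> str:
    """Walk the ladder; return the rung that fixed the node, or "rma"."""
    if hardware_fault:
        # Persistent DBE, fell-off-bus, remap-failure: skip in-place fixes.
        return "rma"
    for name, action in rungs:
        action(node)
        if health_check(node):  # gate: burn-in + NCCL must pass
            return name
    return "rma"


# Simulated fleet: node-a recovers after a reboot, node-b never recovers.
_recovers_after = {"node-a": "node_reboot"}
_healthy = set()
attempts = []


def _make_action(name: str) -> Callable[[str], None]:
    def action(node: str) -> None:
        attempts.append((node, name))
        if _recovers_after.get(node) == name:
            _healthy.add(node)
    return action


LADDER = [(n, _make_action(n)) for n in ("gpu_reset", "node_reboot", "bmc_power_cycle")]


def burn_in_passes(node: str) -> bool:
    """Stand-in for the active health check that gates every rung."""
    return node in _healthy


outcome_a = remediate("node-a", LADDER, burn_in_passes)
outcome_b = remediate("node-b", LADDER, burn_in_passes)
print(outcome_a, outcome_b)  # node_reboot rma
```

Because the loop stops at the first rung that passes the health check, node-a never reaches the power-cycle step, while node-b exhausts the ladder and falls through to RMA.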

Trade-off: automated vs human-in-the-loop

Full automation is mandatory for the common, well-understood faults — at minutes-between-failures a human cannot keep up, and a cordon/drain/reset loop is safe and reversible. But automation should be conservative on destructive or ambiguous actions: auto-cordon freely, auto-reset readily, but gate mass drains (e.g., a whole rack flagged at once — likely a monitoring or network blip, not 64 dead GPUs) and RMA decisions behind confirmation or a human. The design principle: cheap and reversible → automate; expensive and irreversible → require a second signal or a person.
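The design principle can be written down as a small policy gate in front of the remediation controller. This is a sketch under stated assumptions: the action names, the `gate` function, and the threshold of four nodes in one rack counting as a "mass drain" are all illustrative choices, not values from any real system.

```python
from collections import Counter
from typing import Dict, List

# Cheap, reversible actions that may always run unattended.
REVERSIBLE = {"cordon", "gpu_reset"}


def gate(action: str, nodes: List[str], rack_of: Dict[str, str],
         confirmed: bool = False, mass_threshold: int = 4) -> str:
    """Return "auto" or "needs_confirmation" for a proposed action."""
    if confirmed or action in REVERSIBLE:
        return "auto"
    if action == "drain":
        per_rack = Counter(rack_of[n] for n in nodes)
        if max(per_rack.values(), default=0) < mass_threshold:
            return "auto"  # isolated drain: safe to automate
    # Mass drain, RMA, or anything unrecognized waits for a second signal.
    return "needs_confirmation"


rack_of = {f"n{i}": "rack-1" for i in range(8)}
print(gate("cordon", list(rack_of), rack_of))        # auto
print(gate("drain", ["n0"], rack_of))                # auto
print(gate("drain", list(rack_of), rack_of))         # needs_confirmation
print(gate("rma", ["n0"], rack_of))                  # needs_confirmation
print(gate("rma", ["n0"], rack_of, confirmed=True))  # auto
```

Defaulting unknown actions to `needs_confirmation` keeps the gate fail-safe: a new action type has to be explicitly classified as reversible before automation may run it unattended.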

Bottlenecks & scaling

A health system rarely fails on raw throughput — it fails on signal quality: drowning in telemetry, crying wolf, reacting too slowly, or never seeing the silent faults at all. The mitigations push work to the edge, tier severity, and accept that SDC needs a fundamentally different (redundancy-based) approach.

Bottleneck	Why it hurts	Mitigation
Telemetry ingest volume	Millions of points/sec from 12,500 hosts can swamp the pipeline and the TSDB, and naive 1 s scraping wastes CPU/network.	Pre-aggregate at the agent; adaptive sampling (slow when healthy, fast around incidents); partition the stream by host; downsample + tier retention in the TSDB; keep events lossless but gauges droppable.
False positives vs missed failures	Over-eager drains waste expensive GPUs and trigger needless restarts; missed faults stall jobs or corrupt models.	Tier severity (instant on hard faults, confirm on soft); require a second observation or active benchmark before eviction; suppress/dedup correlated alerts; track precision/recall per rule and tune.
Detection latency	Every second a sick GPU runs, 99,999 others may be blocked at the barrier — latency multiplies across the gang.	Evaluate hard-fault rules on the event stream (ms), not by polling the TSDB; push critical events from the agent immediately; co-locate detectors with the stream; keep heavy ML detectors off the hot path.
SDC is fundamentally hard	Silent corruption emits no metric to threshold; you cannot detect the absence of an error.	Manufacture a signal: redundant/duplicate compute + checksums, periodic known-answer tests, trainer-level NaN/grad-norm guards; accept sampling cost; quarantine + reconfirm on a known-good device before blaming hardware.
Monitoring blind during failures	A wedged host can't report that it's wedged; in-band agents die with the node they watch.	Out-of-band BMC/IPMI liveness + power/thermal; treat heartbeat loss as its own signal; run the control plane on separate infrastructure from the GPU fleet.
Correlated / cascading alerts	One failed NVSwitch or rack PDU lights up dozens of GPUs at once, burying the root cause in noise.	Topology-aware correlation (roll up GPU → node → NVSwitch → rack); alert on the cause, suppress the symptoms; gate mass actions behind confirmation.
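The topology-aware correlation in the last row can be sketched as a roll-up over a parent map: when most children of a component are alerting, report the parent instead. The example below is a simplification that models only GPU → node → rack (the NVSwitch level is omitted), and the 75% threshold, function name, and entity names are assumptions for illustration.

```python
from collections import defaultdict


def roll_up(alerting, parent_of, children_count, min_fraction=0.75):
    """Replace groups of alerting children with their parent, repeating up the tree."""
    current = set(alerting)
    while True:
        by_parent = defaultdict(set)
        for entity in current:
            parent = parent_of.get(entity)
            if parent is not None:
                by_parent[parent].add(entity)
        promoted = {
            parent: kids for parent, kids in by_parent.items()
            if len(kids) / children_count[parent] >= min_fraction
        }
        if not promoted:
            return current
        for parent, kids in promoted.items():
            current -= kids      # suppress the symptoms
            current.add(parent)  # alert on the likely cause


# Two nodes with 8 GPUs each, both in a rack that holds 8 nodes.
parent_of = {"n1": "rack-1", "n2": "rack-1"}
for node in ("n1", "n2"):
    for gpu in range(8):
        parent_of[f"{node}/gpu{gpu}"] = node
children_count = {"n1": 8, "n2": 8, "rack-1": 8}

alerts = {f"n1/gpu{g}" for g in range(8)} | {"n2/gpu3"}
print(sorted(roll_up(alerts, parent_of, children_count)))  # ['n1', 'n2/gpu3']
```

Nine raw GPU alerts collapse into two incidents: all of node n1 (likely a shared cause such as a switch or power fault) and one isolated GPU on n2. The rack is not blamed because only one of its eight nodes is affected.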

Summary — what a staff answer nails

Lead with the defining constraint: training is all-or-nothing and bulk-synchronous, so one sick GPU stalls 100k, and silent data corruption is worse than a clean crash. Build an observability control loop — cheap DCGM-style agents stream telemetry and XID/ECC events through a partitioned log into a detection engine and a TSDB. Tier detection: instant threshold rules for hard faults, peer/baseline anomaly + straggler detection for degradation, active NCCL/burn-in health checks before scheduling, and redundant-compute / checksums for SDC. Close the loop with an automated cordon → drain → reset/reboot → RMA ladder that keeps the job alive via checkpoint-restart and hot spares. Tune for low MTTR and a low false-positive rate, automate the cheap/reversible actions, keep humans on the expensive/irreversible ones, and stay observable through out-of-band paths when the host itself goes dark.