GPU Workload Observability: Metrics, Traces & Profiles

A 100k-GPU training fleet burns millions of dollars a day, yet the question its owners keep asking is embarrassingly hard to answer: "why is my job slow?" Concretely, that decomposes into three questions: why is MFU low? (are the GPUs even doing useful math), which rank is the straggler? (one slow GPU stalls all of them at the next all-reduce), and where is the time going? (compute vs. communication vs. data loading). Answering them needs all three observability pillars wired together: fleet metrics (DCGM/exporters), distributed traces correlated across thousands of ranks (OpenTelemetry), and on-demand kernel profiles (Nsight Systems, PyTorch profiler, Perfetto, Holistic Trace Analysis), at a scale and overhead budget where you cannot simply profile everything.

Requirements

GPU observability is not "Prometheus, but for GPUs." The workloads are bulk-synchronous distributed jobs: thousands of ranks march in lock-step and the slowest one sets the pace, so a per-host dashboard hides the real problem. The platform must let an engineer pivot from a fleet symptom (MFU dropped) to the guilty rank, to the guilty step, to the guilty CUDA kernel — and do it without adding meaningful overhead to the very jobs it watches.

The three questions every GPU observability stack must answer

1. Why is MFU low? MFU (model FLOPs utilization) = achieved model FLOP/s ÷ peak hardware FLOP/s. A run at 35% MFU leaves most of a $30k accelerator on the floor, but the cause could be a small batch, memory-bound kernels, failed compute/communication overlap, or input-pipeline starvation.
2. Which rank is the straggler? One throttling or ECC-degraded GPU forces every peer to wait at the next collective.
3. Where is the time going? Split each step into compute / communication / data and you know whether to fix kernels, the network, or the data loader.
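
The MFU ratio in question 1 can be made concrete with a short calculation. This is a minimal sketch, not a definitive formula: it assumes the widely used approximation of about 6 FLOPs per parameter per token for the forward plus backward pass of a dense transformer (attention FLOPs are ignored), and the job size, throughput, and peak figure are illustrative values that do not come from the text above.

```python
def mfu(params: float, tokens_per_sec: float, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization = achieved model FLOP/s / peak hardware FLOP/s.

    Assumption: ~6 FLOPs per parameter per token (forward + backward)
    for a dense transformer; attention FLOPs are ignored.
    """
    achieved_flops_per_sec = 6.0 * params * tokens_per_sec
    peak_flops_per_sec = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_sec / peak_flops_per_sec


# Hypothetical job: 70B parameters on 1,024 GPUs, each with an
# illustrative peak of 1e15 FLOP/s, processing 850k tokens/s in total.
example = mfu(params=70e9, tokens_per_sec=850_000,
              n_gpus=1024, peak_flops_per_gpu=1e15)
print(f"MFU = {example:.1%}")
```

Because the numerator counts only model FLOPs, recomputation or padding does not inflate the figure, which is what makes it a useful "are we doing useful math" signal.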

Functional	Non-functional
Collect metrics — per-GPU utilization, MFU, HBM use/bandwidth, power, NVLink/IB bandwidth, temperature, clocks, ECC/XID and throttle reasons.	Low overhead — the agent/exporter footprint must stay tiny (target a few % of one CPU core, near-zero GPU); never perturb the workload it measures.
Collect traces — distributed spans per step (forward, backward, optimizer, each collective) sharing a correlation key across all ranks of a job.	Scale to 100k GPUs — tens of millions of active series, millions of samples/sec, thousands of concurrent jobs, ingest that never drops a straggler signal.
Collect profiles — on-demand kernel-level timelines (Nsight Systems, PyTorch profiler/CUPTI) for selected ranks and a handful of steps.	High-cardinality, bounded — handle labels like `gpu_uuid`, `job`, `rank` while preventing a `kernel_name`/`step` cardinality blow-up.
Correlate across ranks — line up the same logical step across thousands of ranks despite clock skew; surface the slowest rank automatically.	Low query latency — fleet dashboards and "which rank is slow?" queries return in well under a second over hot data.
Dashboards & alerts — fleet/job/rank views plus alerting on MFU drop, straggler divergence, loss anomalies, and GPU faults.	Retention & cost — keep recent data full-resolution, downsample old data, and keep total telemetry spend a small fraction of compute spend.

Scale & estimates

Sizing for one large fleet: 100k GPUs = 12,500 nodes × 8 GPUs. The headline is that metrics are a firehose but tractable, traces are only affordable if you sample, and profiles are so large they never belong in the metrics path at all — they go to object storage and are fetched on demand.

Dimension	Estimate	How / notes
Series per GPU	~100	util, SM activity/occupancy, HBM used/bandwidth, power, temp, clocks, ECC, plus ~36 NVLink tx/rx counters.
Active time series	~15-20M	100k GPUs × 100 = 10M GPU series; + node/host (~6M), IB fabric (~1M), per-job framework counters (~1M).
Metrics ingest rate	~1.5M-15M samples/s	15M series at a 10s scrape = 1.5M/s; drop to 1s for fine-grained capture and it is ~15M/s.
Metrics volume	~0.1-0.3 TB/day	~130B samples/day at a 10s scrape × ~1-2 bytes/sample compressed (delta-of-delta timestamps + XOR float values) ≈ 130-260 GB/day.
Trace volume (naive)	infeasible	~30 spans/rank/step × 100k ranks = 3M spans/step; at one step every ~1-5s that is ~0.6-3M spans/sec, continuously. Must sample.
Profile size	~0.1-3 GB / rank	Nsight/PyTorch trace for a few steps of one rank. Profiling 256 ranks → tens-to-hundreds of GB per capture.
Cardinality hazards	unbounded	`gpu_uuid` (100k) is fine as a label; `kernel_name` (1000s) and `step` (ever-growing) are not — never put them on hot metrics.
Retention tiers	15d / 90d / 1-2y	raw hot 15d → 1-min rollup 90d → 1-hour rollup 1-2y. Traces 3-7d. Profiles kept per-incident in object store.
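
The arithmetic behind the table can be reproduced in a few lines, which makes it easy to re-run with different fleet sizes or scrape intervals. The inputs below are the table's own figures; the variable names are illustrative.

```python
# Inputs taken from the estimates table; everything else is derived.
GPUS = 100_000
SERIES_PER_GPU = 100
NODE_SERIES = 6_000_000      # node/host series
FABRIC_SERIES = 1_000_000    # IB fabric series
JOB_SERIES = 1_000_000       # per-job framework counters
SECONDS_PER_DAY = 86_400

gpu_series = GPUS * SERIES_PER_GPU
total_series = gpu_series + NODE_SERIES + FABRIC_SERIES + JOB_SERIES

# The table uses 15M series as its planning figure for ingest rate.
planning_series = 15_000_000


def samples_per_sec(series: int, scrape_interval_s: float) -> float:
    """Each series yields one sample per scrape interval."""
    return series / scrape_interval_s


rate_10s = samples_per_sec(planning_series, 10)
rate_1s = samples_per_sec(planning_series, 1)
samples_per_day = rate_10s * SECONDS_PER_DAY

# 1-2 bytes per compressed sample, expressed in GB/day.
gb_per_day_low = samples_per_day * 1 / 1e9
gb_per_day_high = samples_per_day * 2 / 1e9

# Naive tracing: 30 spans per rank per step, one step every 1-5 s.
spans_per_step = 30 * GPUS
span_rate_fast = spans_per_step / 1
span_rate_slow = spans_per_step / 5

print(f"active series: {total_series / 1e6:.0f}M")
print(f"ingest: {rate_10s / 1e6:.1f}M/s at 10s, {rate_1s / 1e6:.0f}M/s at 1s")
print(f"metrics volume: {gb_per_day_low:.0f}-{gb_per_day_high:.0f} GB/day")
print(f"naive spans: {span_rate_slow / 1e6:.1f}-{span_rate_fast / 1e6:.0f}M/s")
```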

The asymmetry that shapes the design

Metrics are small and continuous (always on, every GPU). Traces are medium and bursty (only sampled steps). Profiles are huge and rare (one engineer, one job, a few steps). Three data shapes → three storage backends, not one.

Three pillars for GPU workloads

The classic metrics/traces/logs triad re-specializes for accelerators. Logs still exist, but the load-bearing pillars are metrics (is the fleet healthy and busy?), traces (which rank/step is slow, and is it compute or comm?), and profiles (which exact kernel, and why?). Each answers a different question at a different cost and frequency.

Pillar	Signals	Tools	Answers
Metrics (always-on, every GPU)	SM utilization & occupancy, MFU, HBM used/bandwidth, power (W), NVLink/IB bandwidth, temperature, clocks, ECC/XID, throttle reasons.	DCGM-exporter, node-exporter, Prometheus / VictoriaMetrics / Mimir, framework counters (throughput, loss, grad-norm).	Is the fleet healthy and busy? Fleet/job MFU trend, hot/throttling/faulty GPUs, capacity.
Traces (sampled steps, across ranks)	Per-step spans: forward, backward, optimizer, and each collective (all-reduce / all-gather / reduce-scatter), tagged with `job`, `rank`, `step`.	OpenTelemetry SDK + framework hooks, PyTorch profiler trace export, distributed trace store (Tempo / Jaeger-style).	Which rank is the straggler? What fraction of the step is comm vs. compute vs. data?
Profiles (on-demand, few ranks/steps)	Kernel-level timelines: per-kernel duration, launch overhead, memcpy, stream gaps/bubbles, CPU-vs-GPU stalls, occupancy.	Nsight Systems/Compute, PyTorch profiler + CUPTI, Perfetto / Chrome trace UI, Holistic Trace Analysis (HTA).	Which exact kernel is slow and why? Where are the bubbles between kernels?

The pillars are linked by shared keys — job_id, rank, gpu_uuid, step — so a single click drills from an MFU dip (metric) to the slow step (trace) to the offending kernel (profile). Without those join keys you have three disconnected tools and a human doing the correlation by hand.

High-level design

On every node, lightweight exporters/agents emit metrics and trace spans; a profiler hook captures kernel timelines only when armed. Metrics and traces flow through a streaming ingest bus (Kafka) that decouples producers from stores and absorbs bursts, then fan out to a time-series DB (metrics) and a trace store. Profiles are far too large for the bus, so the hook writes them straight to object storage and only a small pointer (job/rank/step + URL) travels through ingest. A query + dashboard layer joins the three by shared keys, and an alerting path watches the streams for MFU drops, stragglers, and faults.

flowchart LR
    subgraph NODE["GPU Node (x12,500)"]
        DCGM["DCGM Exporter (metrics)"]
        OTEL["OTel Agent (trace spans)"]
        PROF["Profiler Hook (CUPTI/Nsight)"]
    end
    DCGM --> KAFKA["Streaming Ingest (Kafka)"]
    OTEL --> KAFKA
    PROF --> OBJ["Profile / Object Store"]
    KAFKA --> TSDB["Time-series DB (metrics)"]
    KAFKA --> TRACE["Trace Store"]
    KAFKA -->|pointer| OBJ
    TSDB --> QY["Query + Dashboards"]
    TRACE --> QY
    OBJ --> QY
    TSDB --> AL["Alerting"]
    TRACE --> AL
    QY --> ENG["SRE / ML Engineer"]
    AL --> ENG

Streaming ingest decouples and protects. Kafka lets a node burst telemetry without stalling the workload, buffers data so that a slow store does not push back-pressure onto producers, and provides a single place to sample, aggregate, and route. A storage hiccup never blocks the training loop.
Three stores for three data shapes. A TSDB (columnar, delta-compressed) for metrics; a span-indexed trace store for distributed traces; cheap object storage for multi-GB profiles. Forcing all three into one backend is how teams accidentally build a petabyte metrics bill.
Profiles bypass the hot path. Only a tiny descriptor flows through ingest; the bytes go node → object store directly. This keeps the pipeline cheap and makes "profile rank 12, steps 400-405" an O(few-GB) object write rather than a firehose.
The query layer is the join engine. Drilling from metric → trace → profile works because every signal carries job/rank/gpu_uuid/step; the dashboard resolves those keys across the three stores on demand.
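
The drill-down can be sketched as a pair of key lookups. This is a minimal in-memory illustration, not the platform's actual query engine: the `Span` and `ProfilePointer` types, the `drill_down` helper, and the `s3://` URL are all hypothetical. It also assumes that hot metrics carry no `step` label, so the hop from metric to trace goes through the job and a time window, while the hop from trace to profile uses the exact (job, rank, step) key.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Span:
    job_id: str
    rank: int
    step: int
    start_ts: float      # seconds
    duration_s: float


@dataclass(frozen=True)
class ProfilePointer:
    job_id: str
    rank: int
    step: int
    url: str             # location of the profile in object storage


def drill_down(
    job_id: str,
    window: Tuple[float, float],
    spans: Sequence[Span],
    pointers: Sequence[ProfilePointer],
) -> Tuple[Optional[Span], Optional[ProfilePointer]]:
    """Metric -> trace -> profile for one job and one alert time window."""
    lo, hi = window
    # Hop 1: metric (job + time window) -> spans recorded in that window.
    in_window = [s for s in spans
                 if s.job_id == job_id and lo <= s.start_ts <= hi]
    if not in_window:
        return None, None
    slowest = max(in_window, key=lambda s: s.duration_s)
    # Hop 2: trace (job, rank, step) -> profile pointer, if one was captured.
    index = {(p.job_id, p.rank, p.step): p for p in pointers}
    return slowest, index.get((slowest.job_id, slowest.rank, slowest.step))


spans = [
    Span("job-a", rank=3, step=400, start_ts=100.0, duration_s=1.0),
    Span("job-a", rank=12, step=400, start_ts=100.0, duration_s=1.6),
    Span("job-a", rank=12, step=900, start_ts=500.0, duration_s=9.0),
]
pointers = [ProfilePointer("job-a", 12, 400, "s3://profiles/job-a/r12-s400.json")]

slow_span, profile = drill_down("job-a", (90.0, 110.0), spans, pointers)
print(slow_span, profile)
```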

Deep dive: distributed tracing & profiling at scale

Tracing a single service is easy; tracing one training step across thousands of ranks is the hard problem. Every rank runs the same code at (almost) the same time, and the thing you care about is the skew between them. Three ideas make it work: a logical correlation key, aggressive sampling, and on-demand deep profiling.

Correlating a step across ranks

The correlation key is not wall-clock time — node clocks drift by milliseconds, which is an eternity at GPU speed. Instead, every rank tags its spans with (job_id, step) and the platform aligns them on the logical step boundary and on shared NCCL collective barriers, which are true synchronization points. Aligning on a barrier turns "30k noisy timelines" into a clean apples-to-apples comparison where the straggler pops out visually.

flowchart TD
    STEP["Training step N (all ranks)"] --> KEY["Correlation key = job + step N"]
    KEY --> SAMP{"Sample this step?"}
    SAMP -->|no| DROP["Metrics only (cheap path)"]
    SAMP -->|yes| SPAN["Per-rank spans: fwd, bwd, all-reduce"]
    SPAN --> ALIGN["Align on NCCL barrier (not wall clock)"]
    ALIGN --> VIEW["Timeline / flame view, slowest rank flagged"]
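
Barrier alignment can be sketched in a few lines. The example assumes, as a simplification, that an all-reduce completes at nearly the same true instant on every rank, so each rank's exit timestamp can serve as time zero regardless of that rank's clock offset. The `RankStep` record and both helpers are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class RankStep:
    rank: int
    step_start: float        # rank-local clock, seconds
    allreduce_enter: float   # rank-local clock, seconds
    allreduce_exit: float    # rank-local clock, seconds


def align_on_barrier(records: Sequence[RankStep]) -> Dict[int, Tuple[float, float]]:
    """Re-express (step_start, allreduce_enter) relative to the all-reduce exit.

    Subtracting two timestamps taken on the same clock cancels that
    clock's unknown offset, so the results are comparable across ranks.
    """
    return {
        r.rank: (r.step_start - r.allreduce_exit,
                 r.allreduce_enter - r.allreduce_exit)
        for r in records
    }


def find_straggler(records: Sequence[RankStep]) -> int:
    """The straggler arrives at the collective last, i.e. waits the least."""
    aligned = align_on_barrier(records)
    return max(aligned, key=lambda rank: aligned[rank][1])


# Ranks 0 and 1 compute for 0.80 s, rank 2 for 1.00 s; the collective
# exits 1.05 s after the step starts. Local clocks are skewed by
# 0 ms, +13 ms and -7 ms respectively.
records = [
    RankStep(0, step_start=0.000, allreduce_enter=0.800, allreduce_exit=1.050),
    RankStep(1, step_start=0.013, allreduce_enter=0.813, allreduce_exit=1.063),
    RankStep(2, step_start=-0.007, allreduce_enter=0.993, allreduce_exit=1.043),
]
print("straggler rank:", find_straggler(records))
```

After alignment, ranks 0 and 1 both show an arrival 0.25 s before the barrier releases despite their different clocks, while rank 2 arrives only 0.05 s before it and is flagged.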

You cannot profile every step or every GPU

Full tracing is millions of spans per second and detailed profiling steals real GPU cycles, so capture is layered:

Always-on metrics on every GPU — cheap, the baseline that triggers everything else.
Head sampling of traces — emit step-level spans for rank 0 plus a small random set of ranks, and only on 1 of N steps. Enough to watch comm/compute ratio trend without the firehose.
Tail / triggered sampling — when a step is anomalously slow or MFU dips, retain the spans for that window across all ranks so the straggler is never sampled away.
On-demand deep profiling — kernel-level Nsight/PyTorch capture, armed manually or by an alert, for a handful of ranks and a handful of steps, uploaded to object storage.
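
The head and tail layers above could be sketched as follows. This is one possible shape under stated assumptions, not a reference implementation: the function and class names are hypothetical, the head decision is made deterministic so that every rank agrees on the sampled steps without coordination, and the tail trigger is a simple "slower than a multiple of the recent median" rule.

```python
import hashlib
from collections import deque
from typing import Deque, List, Tuple


def rank_selected(job_id: str, rank: int, fraction: float) -> bool:
    """Rank 0 is always traced; other ranks are chosen by a stable hash."""
    if rank == 0:
        return True
    digest = hashlib.sha256(f"{job_id}:{rank}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < fraction


def head_sample(job_id: str, rank: int, step: int,
                every_n: int, rank_fraction: float) -> bool:
    """Emit spans only on 1 of N steps, and only on the selected ranks."""
    return step % every_n == 0 and rank_selected(job_id, rank, rank_fraction)


class TailBuffer:
    """Keeps the last few steps of spans in memory on each rank and
    releases them only when a step is anomalously slow."""

    def __init__(self, window_steps: int, slow_factor: float) -> None:
        self._recent: Deque[Tuple[int, float, List[str]]] = deque(maxlen=window_steps)
        self._slow_factor = slow_factor

    def observe(self, step: int, step_time_s: float, spans: List[str]) -> List[str]:
        """Buffer this step; return the buffered spans to export if the
        step is slow relative to the recent median, else an empty list."""
        history = [t for _, t, _ in self._recent]
        self._recent.append((step, step_time_s, spans))
        if len(history) >= 3:
            baseline = sorted(history)[len(history) // 2]
            if step_time_s > self._slow_factor * baseline:
                flushed = [s for _, _, batch in self._recent for s in batch]
                self._recent.clear()
                return flushed
        return []


buf = TailBuffer(window_steps=4, slow_factor=1.5)
for step in range(1, 5):
    buf.observe(step, 1.0, [f"s{step}"])
exported = buf.observe(5, 3.0, ["s5"])   # slow step triggers a flush
print(exported)
```

Because the job is bulk-synchronous, a slow step is slow on every rank (they all wait at the collective), so a purely local trigger like this one should fire on all ranks for the same step window.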

On-demand deep profiling flow

A controller arms the profiler hook on the selected ranks for a bounded step window, each rank uploads its trace chunk, and the platform stitches and indexes the chunks by step so they line up into one aligned timeline (Perfetto / HTA) instead of N unrelated files.

sequenceDiagram
    participant U as ML Engineer
    participant C as Profiler Controller
    participant R0 as Rank 0
    participant Rk as Rank k
    participant S as Object Store
    U->>C: Deep profile job for 5 steps
    C->>R0: Enable profiler steps N..N+4
    C->>Rk: Enable profiler steps N..N+4
    R0-->>S: Upload trace chunk
    Rk-->>S: Upload trace chunk
    C->>S: Stitch and index by step
    S-->>U: Aligned timeline ready
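
The controller's request and the small pointer each rank sends back can be modelled with plain data types. This is a sketch under assumptions: `ProfileRequest`, `chunk_pointer`, the bucket name, and the object-key layout are hypothetical, and the actual profiler start/stop calls are deliberately left out. The step window is half-open so that a request for 5 steps covers exactly 5.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Union


@dataclass(frozen=True)
class ProfileRequest:
    job_id: str
    ranks: FrozenSet[int]
    first_step: int
    num_steps: int

    def covers(self, rank: int, step: int) -> bool:
        """True while this rank should be profiling at this step.

        The window is [first_step, first_step + num_steps).
        """
        in_window = self.first_step <= step < self.first_step + self.num_steps
        return rank in self.ranks and in_window


def chunk_pointer(req: ProfileRequest, rank: int,
                  bucket: str = "s3://profiles") -> Dict[str, Union[str, int]]:
    """Small descriptor sent through ingest; the trace bytes themselves
    are uploaded to the object store directly by the rank."""
    last_step = req.first_step + req.num_steps - 1
    url = (f"{bucket}/{req.job_id}/rank{rank:05d}/"
           f"steps{req.first_step}-{last_step}.trace.json")
    return {
        "job_id": req.job_id,
        "rank": rank,
        "first_step": req.first_step,
        "last_step": last_step,
        "url": url,
    }


req = ProfileRequest("job-a", frozenset({0, 12}), first_step=400, num_steps=5)
for step in range(398, 407):
    if req.covers(12, step):
        print("rank 12 profiling step", step)
print(chunk_pointer(req, 12))
```

Indexing the pointers by (job, rank, first step) is what lets the stitching stage line the per-rank chunks up on a common step axis.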

The payoff is the flame/timeline view: per-rank kernel timelines stacked on a common step axis. A straggler shows as a rank whose compute lane is longer or whose all-reduce starts late; a data-starvation problem shows as GPU idle gaps waiting on the input pipeline; a comm problem shows as collectives dominating the step. Holistic Trace Analysis then aggregates many ranks into one verdict so you are not eyeballing 30k lanes.

Deep dive: controlling telemetry cost & cardinality

Observability that costs more than a few percent of compute is a failed design. Because metrics are always-on across 100k GPUs, the levers are about writing less, storing it cheaper as it ages, and refusing to explode cardinality — while still keeping the signals that fire the alerts that justify the whole platform.

Lever	What it does	Trade-off
Sampling	Trace only 1-of-N steps and a subset of ranks; profile on demand only.	Lower fidelity between samples; mitigate with triggered/tail sampling so anomalies are always kept.
Aggregation / rollups	Pre-aggregate per-job/per-node at ingest (e.g., job MFU, p99 step time) instead of storing every raw series forever.	Lose per-GPU detail in the rollup; keep raw hot for a short window for drill-down.
Downsampling old data	Raw → 1-min → 1-hour as data ages (15d / 90d / 1-2y tiers).	Old data answers "trend", not "what happened at 14:03:07"; acceptable for capacity/SLO history.
High-cardinality control	Keep `gpu_uuid`/`job`/`rank` as labels; move `kernel_name`/`step` into traces/profiles or exemplars; cap series per job.	Cannot slice hot metrics by kernel — but that is exactly what profiles are for.
Overhead budget	Cap agent at a few % of one core, near-zero GPU; profiling (which steals cycles) is off by default and time-boxed.	Deep detail is not continuously available — you arm it when you need it.
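
The high-cardinality lever can be enforced mechanically at ingest. The guard below is a minimal sketch: the class name, the forbidden-label set, and the cap value are assumptions chosen to mirror the table, and a production version would also need series expiry and shared state across ingest workers.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# Labels that must never reach hot metrics, per the table above.
FORBIDDEN_LABELS = frozenset({"kernel_name", "step"})

SeriesId = Tuple[str, Tuple[Tuple[str, str], ...]]


class CardinalityGuard:
    """Rejects samples with forbidden labels and caps series per job."""

    def __init__(self, max_series_per_job: int) -> None:
        self._max = max_series_per_job
        self._seen: Dict[str, Set[SeriesId]] = defaultdict(set)

    def admit(self, name: str, labels: Dict[str, str]) -> bool:
        if FORBIDDEN_LABELS.intersection(labels):
            return False                      # unbounded label: route to traces
        series: SeriesId = (name, tuple(sorted(labels.items())))
        known = self._seen[labels.get("job", "")]
        if series in known:
            return True                       # existing series, always accepted
        if len(known) >= self._max:
            return False                      # job is over its series budget
        known.add(series)
        return True


guard = CardinalityGuard(max_series_per_job=2)
base = {"job": "job-a", "rank": "0", "gpu_uuid": "GPU-0"}
print(guard.admit("gpu_util", base))                              # new series
print(guard.admit("gpu_util", {**base, "kernel_name": "gemm"}))   # forbidden
print(guard.admit("gpu_power_w", base))                           # new series
print(guard.admit("gpu_temp_c", base))                            # over the cap
```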

Alerting: the signals worth waking someone for

MFU drop: job MFU falls below its rolling baseline for several steps (silent regressions burn money without crashing).
Straggler: one rank's step time diverges from the cohort median by more than a threshold (the all-reduce makes everyone pay for it).
Loss anomaly: NaN/Inf or a loss spike (catch a diverging run before it wastes a day of GPUs).
Hardware: ECC/XID errors, NVLink down, or thermal/power throttling on any GPU.

Each alert deep-links straight to the trace/profile drill-down so the responder starts at the evidence, not the dashboard.
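
The MFU-drop and straggler rules might be expressed as follows. The thresholds, function names, and sample values are illustrative assumptions. One practical caveat: because every rank waits at the collective, end-to-end step time tends to equalize across ranks, so the per-rank input here is probably best taken as compute time excluding collective wait.

```python
from statistics import median
from typing import Dict, List, Sequence


def stragglers(per_rank_time_s: Dict[int, float], rel_threshold: float) -> List[int]:
    """Ranks whose time exceeds the cohort median by more than rel_threshold."""
    cohort = median(per_rank_time_s.values())
    limit = cohort * (1.0 + rel_threshold)
    return sorted(r for r, t in per_rank_time_s.items() if t > limit)


def mfu_dropped(history: Sequence[float], recent: Sequence[float],
                drop_fraction: float) -> bool:
    """True if every recent step is below the rolling baseline by at
    least drop_fraction; a single noisy step does not fire the alert."""
    baseline = median(history)
    limit = baseline * (1.0 - drop_fraction)
    return bool(recent) and all(v < limit for v in recent)


times = {0: 1.00, 1: 1.02, 2: 0.99, 3: 1.40}
print("stragglers:", stragglers(times, rel_threshold=0.20))

baseline_window = [0.42, 0.41, 0.43, 0.42]
print("sustained drop:", mfu_dropped(baseline_window, [0.33, 0.34, 0.32], 0.10))
print("single blip:", mfu_dropped(baseline_window, [0.33, 0.41, 0.32], 0.10))
```

Using the median rather than the mean keeps one extreme rank from dragging the baseline toward itself and masking its own divergence.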

Bottlenecks & scaling

Every part of this system has a failure mode that shows up only at fleet scale. The recurring theme: push work to the edge and to the right storage tier — pre-aggregate and sample at the agent/ingest layer, keep hot data small, and never let an unbounded label reach the TSDB.

Bottleneck	Why it happens	Mitigation
Metrics volume	15M+ series at 1Hz overwhelms a single TSDB and the write path.	Shard/cluster the TSDB; coarse default scrape (10s); pre-aggregate at ingest; raise resolution only when armed.
Cardinality explosion	A `kernel_name` or `step` label multiplies series into the billions and melts the index.	Forbid unbounded labels on hot metrics; route that detail to traces/profiles/exemplars; enforce per-job series caps.
Profile data size	Multi-GB profiles per rank would saturate the pipeline if shipped through it.	Write profiles node → object store directly; pass only pointers; capture on demand for few ranks/steps.
Query latency	Fleet-wide "which rank is slow?" scans tens of millions of series.	Downsampled rollups for wide ranges; pre-computed per-job/per-rank aggregates; hot/cold tiering; cache common dashboards.
Agent overhead	Heavy collection or always-on profiling steals CPU/GPU cycles from training.	Strict overhead budget; sampling; profiling off by default and time-boxed; batch & compress at the edge.
Ingest hot spots	A node burst or a giant job skews Kafka partitions and lags consumers.	Partition by job/node; let the log absorb bursts so slow consumers never stall producers; autoscale consumers; drop lowest-value samples first under load.
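
The "partition by job/node" mitigation can be illustrated without a Kafka client. The sketch below only models the key-to-partition mapping: `partition_for` is a hypothetical helper, SHA-256 is used purely as a stable, dependency-free hash, and the partition count and node names are made up.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Stable mapping from a record key to a partition index."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


NUM_PARTITIONS = 64
nodes = [f"node-{i:05d}" for i in range(1000)]   # one giant job on 1,000 nodes

# Keying by job alone sends the whole job to a single partition.
by_job = Counter(partition_for("job-giant", NUM_PARTITIONS) for _ in nodes)

# Keying by job + node spreads the same job across partitions, while
# each node still maps to one partition, so per-node ordering is kept.
by_job_node = Counter(partition_for(f"job-giant/{n}", NUM_PARTITIONS) for n in nodes)

print("partitions used, key=job:     ", len(by_job))
print("partitions used, key=job/node:", len(by_job_node))
print("busiest partition share:      ", max(by_job_node.values()) / len(nodes))
```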

Summary

GPU observability is the discipline of turning "why is my job slow?" into a precise chain of evidence — metric → trace → profile — across a fleet too large and too expensive to instrument naively. Wire the three pillars together with shared keys (job/rank/gpu_uuid/step), separate the three data shapes into a time-series DB, a trace store, and object storage, decouple producers with a streaming ingest bus, and spend your fidelity budget where it pays off: always-on cheap metrics, sampled traces, and on-demand deep profiles. Do that and you can answer why MFU is low, which rank is the straggler, and where the time goes — in seconds, at 100k GPUs, for a few percent of the compute bill.