AI / ML Infrastructure
Experiment Tracking & Hyperparameter Search
An experiment tracker is the system of record for every training run — the immutable lab notebook that logs the params, metrics, and artifacts of each experiment so teams can compare runs, reproduce results, and search for better hyperparameters. This is the control plane behind tools like MLflow, Weights & Biases, TensorBoard, Determined, Ray Tune, Optuna, and Vizier. Two hard problems collide here: a write-heavy telemetry firehose — metrics streaming from thousands of concurrent jobs that must never slow the GPU loop — and an active search controller that decides which configs to try next and kills bad trials early to keep an ultra-expensive fleet busy on the most promising runs.
Requirements
Treat the tracker as three workloads sharing one SDK: a write-heavy telemetry firehose (metrics streaming from live jobs), a read-heavy analytical surface (humans comparing runs, dashboards, leaderboards), and a control loop (the search) that both reads results and launches new runs. The defining constraint is that none of this may ever slow down training — GPU time is the expensive thing.
| Functional | Non-functional |
|---|---|
Log params, metrics & artifacts per run
via a thin SDK. Metrics are time series keyed by
step/wall_time; params are
immutable; artifacts are blobs (checkpoints, plots, media).
|
High-frequency ingestion — absorb metrics from thousands of concurrent runs logging every few steps without dropping points or stalling the training loop. |
| Compare runs side by side — overlay metric curves, diff hyperparameters, and group results into a leaderboard. |
Non-blocking client —
log() must return in microseconds; the SDK is
async and fault-tolerant so a tracker outage never crashes a
week-long job.
|
| Visualize metric curves, system charts (GPU util/mem), media (images/audio), and custom plots — even for very long runs. | Fast queries / dashboards — comparing tens-to-hundreds of runs loads in well under a second via server-side aggregation. |
| Group into experiments — organize runs into projects; tag, search, and filter by params, metrics, status, and tags. | Durable & consistent metadata — a finished run's params/artifacts/lineage are never lost or silently mutated. |
| Hyperparameter sweeps — define a search space + objective; the system suggests configs and orchestrates trials with early stopping. | Scalable & multi-tenant — grows with the fleet, isolates teams/quotas, and offers cheap long-term retention. |
| Reproducibility — capture code commit, data version, environment/container, and seeds so any run can be rebuilt and re-run. | Cost-efficient storage — compress time series and tier old data; dedup identical checkpoints; PB-scale artifacts stay affordable. |
Scale & estimates
Back-of-envelope to size the write path. The dominant load is metric points per second, not requests — each client flush carries many points. Assume a large org with active sweeps.
| Dimension | Estimate | Notes |
|---|---|---|
| Concurrent runs | ~10,000 | Across teams + big sweeps. A single sweep can be thousands of trials. |
| Series per run | ~100+ | ~30 user scalars (loss, lr, grad-norm, throughput) + ~10 system metrics × 8 GPUs (util/mem/temp/power). |
| Log cadence | flush every ~5-10 s | e.g. every 100 steps; batched — not one request per point. |
| Metric write rate | ~100k points/s sustained, ~500k/s peak | 10k runs × ~10 pts/s. The headline number that sizes the TSDB. |
| Raw point size | ~40 bytes |
(run, key, step, ts, value) → ~4 MB/s →
~350 GB/day before compression.
|
| After TS compression | ~1-2 bytes/point | Gorilla-style delta-of-delta + XOR float encoding → ~10-20× smaller. |
| Sweep size | 1k-10k trials each | Objective-driven; early stopping kills most before they finish. |
| Metadata writes | thousands/min | Run create/finish, param sets, tags — small, RDBMS-shaped, strongly consistent. |
| Artifacts | checkpoints 1-100 GB; media KB-MB | PB-scale over time on object storage; content-addressed dedup. |
| Retention | full-res 30-90 days; downsampled for years | Tiered; old runs rarely read at full resolution. |
| Query fan-out | 10-100 runs × N series / dashboard | Must aggregate + downsample server-side for sub-second loads. |
Takeaway: the write path is a time-series ingestion problem (~105 points/s) and the read path is an interactive analytics problem over millions of series. Treat them as two stores behind one API: a TSDB for metrics, an RDBMS for metadata, and object storage for artifacts — not one database trying to do all three.
Core entities
Six entities carry the whole model. An Experiment groups Runs; a Run owns immutable Params, emits a stream of Metrics, and produces Artifacts; a Sweep is the search job that spawns many Runs as trials. The critical split: metrics live in a TSDB, everything else in an RDBMS or blob store.
| Entity | What it is | Key fields |
|---|---|---|
| Experiment | A named project / logical grouping of related runs. |
experiment_id, name,
owner, created_at
|
| Run (Trial) | One training execution — the unit everything hangs off. |
run_id, experiment_id,
sweep_id?, status, times, provenance
|
| Param | Immutable hyperparameter of a run (set once at start). |
(run_id, key) → value,
type
|
| Metric | Append-only time-series point (the firehose). |
run_id, key, step,
wall_time, value
|
| Artifact | Content-addressed output blob (checkpoint, plot, table). |
digest, run_id, type,
storage_uri, size
|
| Sweep | A hyperparameter search job over a space + objective. |
sweep_id, strategy,
search_space, objective,
budget
|
A compact schema sketch — note that Metric is a
TSDB family, not an RDBMS row-per-insert:
Experiment
experiment_id PK, name, owner, description, created_at
UNIQUE (owner, name)
Run -- one training execution / trial
run_id PK, experiment_id FK -> Experiment,
sweep_id FK -> Sweep NULL,
status, -- running | finished | failed | killed
start_time, end_time,
git_commit, data_version, -- provenance (see Reproducibility)
container_image, seed,
tags JSON, created_by
Param -- immutable hyperparameters of a run
run_id FK -> Run, key, value, type
PK (run_id, key)
Metric -- append-only time series (hot write path)
series_id = hash(run_id, key), -- one contiguous curve per series
step, wall_time, value
-- stored in a columnar TSDB partitioned by (series_id, time),
-- NOT a row per insert in the RDBMS
Artifact -- content-addressed output blob
digest PK, -- sha256 of contents (dedup identical checkpoints)
run_id FK -> Run, type, -- checkpoint | image | table | file
storage_uri, size_bytes, created_at
Sweep -- hyperparameter search job
sweep_id PK, experiment_id FK -> Experiment,
strategy, -- grid | random | bayes | asha | bohb
search_space JSON, objective, -- e.g. minimize val_loss
budget, max_trials, status, config JSON
Immutability is the simplifying assumption
A finished run's
params, final metrics, and artifacts are immutable
— only status and tags mutate while
running. That means resolved runs and their metric history can be
cached forever, comparisons are stable, and the
record is auditable. Build the whole read side around this.
High-level design
Training jobs talk to a thin SDK that buffers and batches locally. Batches hit a stateless ingestion API that appends to a durable write buffer (log) and acks fast; consumers fan the data into the right store — metrics to a time-series store, run/param metadata to an RDBMS, and large blobs straight to an artifact store. A query + aggregation service powers the dashboard and compare UI.
flowchart LR
T1["Training job + SDK"] --> ING["Ingestion API (async, batched)"]
T2["Training job + SDK"] --> ING
ING --> Q["Write buffer / queue"]
Q --> TS["Time-series store (metrics)"]
Q --> META["Metadata DB (runs, params)"]
ING --> OBJ["Artifact / blob store"]
TS --> QRY["Query + aggregation service"]
META --> QRY
QRY --> UI["Dashboard / compare UI"]
OBJ --> UI
- Thin client SDK. Buffers points on a background thread, batches, and retries; logging is fire-and-forget so a tracker outage never crashes a long training job. It also captures provenance (git commit, data version, image, seeds) at run start.
-
Ingestion API (stateless, autoscaled).
Authenticates, validates, and appends batches to a partitioned
log (Kafka) keyed by
run_id, then acks. This decouples bursty producers from the storage write rate and absorbs a sweep launching 1,000 trials at once. - Three stores, one API. Metrics → columnar TSDB; experiment/run/param metadata → RDBMS (Postgres); artifacts → object storage (S3) fronted by a CDN/peer mesh for big checkpoints. Artifacts are content-addressed so identical checkpoints dedup.
- Query + aggregation service. Does server-side downsampling and roll-ups so the browser fetches a few hundred points per series, not millions, and caches immutable historical windows.
- Search controller. A separate control loop reads trial results and launches/kills runs through the GPU cluster scheduler — detailed in the Hyperparameter search section.
Deep dive: metric ingestion at scale
The firehose is the hard part. A run logging every step can emit thousands of points per minute; multiply by thousands of runs and you are at ~105 points/s. Two rules drive every decision: never block the training loop, and never lose a finished run's record. The client absorbs the first; the ingestion log and TSDB absorb the second.
sequenceDiagram
participant R as Training run SDK
participant B as Local buffer
participant API as Ingestion API
participant Q as Write buffer
participant TS as Time-series store
R->>B: log step + metrics, non-blocking
B->>B: Batch and coalesce points
B->>API: Flush batch every ~5s
API->>Q: Append and ack fast
Q-->>API: Buffered durably
API-->>B: 200 OK
Q->>TS: Bulk write + compress
Note over B,API: If network down, buffer locally and retry
-
Async, batched client.
log()appends to an in-memory ring buffer and returns immediately. A background thread coalesces points and flushes a compact batch every few seconds (or every N points). If the network is down it buffers locally (spilling to disk) and retries with backoff — training keeps running. On run end it flushes synchronously so the final state is durable. -
Ingestion decoupling. The API appends batches to a
partitioned log keyed by
run_id; consumers write to the TSDB at their own pace. This absorbs bursts and enables back-pressure: if consumers lag, producers slow their flush cadence — never the training loop. -
Time-series storage. Each curve is a series
series_id = hash(run_id, key), stored as(series_id, step/timestamp, value). A columnar TSDB applies delta-of-delta timestamp encoding + XOR float compression (Gorilla-style) for ~1-2 bytes/point. Partition by series and time; recent data in a hot tier (memory/SSD), older data tiered to object storage. Keeping each curve contiguous makes range scans cheap. - System vs user metrics. User metrics (loss, accuracy) come from training code at step cadence. System metrics (GPU util/mem/temp/ power, throughput, host CPU/RAM) are sampled by a sidecar/agent on a fixed wall-clock timer, independent of the loop — so they keep flowing even if the loop hangs, which is gold for debugging stalls. Keep them as separate series families: higher volume, lower long-term value → shorter retention.
- Downsampling for long runs. A week-long run logging every 100 steps is millions of points — a browser can't render that and shouldn't receive it. Maintain pre-aggregated roll-ups (min/max/avg per bucket) at write or compaction time, and serve the resolution that fits the viewport (~1-2k points). Preserve min/max so loss spikes aren't smoothed away; keep full resolution hot for a window, then down-sample and tier to cold storage.
-
Idempotency & ordering. Points carry
(run_id, key, step)so retries dedup; the store is append-only and tolerates out-of-order arrival (sort by step on read), with last-writer-wins per(series, step).
Trade-off: async batching vs durability
Async batching trades a few seconds of crash-window metrics for zero training-loop impact — the right call, because the expensive resource (GPU time) must never wait on the tracker. The durability guarantee therefore lives on params and artifacts, written/flushed deliberately, while raw metric points are treated as best-effort telemetry. Choosing where the strong guarantee sits is the key decision.
Deep dive: hyperparameter search
A sweep turns the tracker from a passive recorder into an active
optimizer. Given a search space and an objective
(e.g. minimize val_loss), the controller decides
which configs to try next and
which running trials to kill — and the second
is where the real GPU savings come from. The strategies trade
sample efficiency against
parallelism.
| Strategy | How it picks configs | When to use | Cost / trait |
|---|---|---|---|
| Grid search | Every combination of a discrete grid. | Few params (≤3); need exhaustive coverage. | Explodes combinatorially; wastes budget on dead dimensions. |
| Random search | Sample uniformly from the space. | Strong default; beats grid when only a few params matter. | Embarrassingly parallel; learns nothing from past trials. |
| Bayesian opt (TPE/GP) | Fit a surrogate of objective vs params; sample where expected improvement is high. | Expensive objectives; want the answer in few trials. | Sample-efficient but more sequential; harder to parallelize. |
| Hyperband / ASHA | Random configs + successive halving: run many cheaply, promote the top fraction to more budget. | Large parallel fleets with a meaningful early signal. | Best throughput; needs early metric to correlate with final. |
-
Successive halving / ASHA. Give a small budget
(epochs/steps) to many configs; at each rung keep
the top
1/η(e.g. top third) and grant survivorsη×more budget; repeat. Most bad configs die after minutes, concentrating GPUs on promising ones. ASHA (Asynchronous Successive Halving) makes promotion asynchronous — promote as soon as a rung has enough finished trials, so no trial blocks waiting on slow siblings — which is what scales to thousands of parallel workers. - The controller / scheduler. A central controller holds search state, suggests configs via the chosen strategy, and watches each trial's reported metric at rung boundaries. When a trial underperforms its rung cutoff it issues a kill and reclaims the GPUs. It is stateless-recoverable: search state is persisted in the metadata DB, so the controller can crash and resume without losing the sweep.
- Cross-ref: GPU scheduler. The HPO controller doesn't place GPUs itself — it submits and cancels trials to the GPU cluster scheduler, which gang-schedules each trial and reclaims freed GPUs for the next suggestion. Early-stopping plus the scheduler's backfill together keep the expensive fleet busy on the most promising configs.
flowchart TD
SP["Search space + strategy"] --> CTRL["HPO controller"]
CTRL --> SUG["Suggest next configs"]
SUG --> L1["Trial 1"]
SUG --> L2["Trial 2"]
SUG --> L3["Trial N"]
L1 --> EVAL["Report metric at rung"]
L2 --> EVAL
L3 --> EVAL
EVAL --> ASHA{"Promote or stop?"}
ASHA -->|top fraction| PROMO["More budget, next rung"]
ASHA -->|rest| KILL["Early-stop, free GPUs"]
PROMO --> CTRL
KILL --> SCHED["GPU scheduler reclaims GPUs"]
SCHED --> CTRL
Trade-off: parallelism vs sample efficiency
Bayesian optimization is sample-efficient but wants results before suggesting the next config (sequential); ASHA is parallel-friendly but assumes the early metric tracks final performance. BOHB combines Bayesian suggestion with Hyperband budgets to get both. The hidden risk: early stopping fails when early signal is misleading (e.g. LR warmup or cosine schedules) — guard it with a minimum budget per rung.
Reproducibility & governance
A metric is worthless if you can't reproduce the run that produced it. Reproducibility means capturing everything that influences the result at run start, automatically, so "rerun X" is one click and "why did this metric move?" is answerable. The tracker pins it all to the immutable Run record.
| Dimension | What's pinned | How |
|---|---|---|
| Code | Exact git commit + uncommitted diff. | SDK reads the git SHA at start; warns on and stores any dirty patch. |
| Data | Dataset version / snapshot id. | Content hash or dataset-version pointer (cross-ref the ML data pipeline / model registry). |
| Environment | Container image digest, lockfile, CUDA/driver versions. |
Pin the image by sha256 digest, never a mutable
tag.
|
| Config | Full resolved hyperparameters (immutable Params). | Logged at run start; never mutated thereafter. |
| Hardware | GPU type/count and topology. | From the scheduler; affects throughput and sometimes numerics. |
| Randomness | All seeds (framework, dataloader, CUDA) + determinism flags. | Logged; note exact bitwise determinism on GPU is best-effort. |
- Lineage / provenance. Each run links upstream (dataset version, parent run for fine-tunes, code commit) and downstream (artifacts, registered model versions). This DAG answers "what produced this model?" and "what's affected if this dataset is bad?", and feeds the model registry at promotion time.
- Comparison & leaderboards. The read surface: filter runs by params/tags, overlay metric curves, diff configs (which hyperparameter changed?), and rank by objective into a leaderboard per experiment. Parallel-coordinates plots reveal which params drive the metric. Because params and final metrics are immutable, these queries cache extremely well.
- Governance. Immutability + audit: a finished run is a permanent record of who ran what, when, and on which data. Per-experiment/team access control enables multi-tenancy; retention and quota policies bound cost.
The durability boundary
Params, artifacts, and lineage are the durable system of record — strongly consistent, immutable, audited. Raw metric points are best-effort telemetry — high-volume, compressed, downsampled, eventually tiered away. Different guarantees, different stores: that separation is what lets the same system be both trustworthy and cheap at 105 points/s.
Bottlenecks & scaling
A tracker rarely dies on raw QPS — it degrades through metric write volume, query fan-out over many runs, high-cardinality series, and artifact cost. The mitigations keep the firehose flowing while keeping dashboards interactive.
| Bottleneck | Why it hurts | Mitigation |
|---|---|---|
| Metric write volume | ~105+ points/s swamps a naive row-per-insert DB. | Batch on the client; ingest via a partitioned log; columnar TSDB with delta + XOR compression; shard by series. |
| Query latency over many runs | Comparing 100 runs × millions of points/series melts dashboards. | Server-side downsampling/roll-ups; pre-aggregate; resolution-by-viewport; cache immutable windows; cap series per chart. |
| High-cardinality series | Per-step, per-GPU, per-layer metrics explode series count → index blowup. | Curate/cap keys; separate system-metric family with shorter retention; tag instead of minting new series; reject runaway cardinality. |
| Artifact storage | Checkpoints (1-100 GB) × many runs → PB and real money. | Content-addressed dedup; tier hot → cold; TTL on non-promoted runs; presigned URLs + CDN/peer mesh for downloads. |
| Sweep orchestration | Thousands of trials thrash the controller + scheduler. | Async ASHA (no global barrier); persist search state; rate-limit suggestions; delegate placement to the GPU scheduler; backfill freed GPUs. |
| Hot run / write skew | A few giant runs (thousands of series) hotspot one partition. | Hash series across partitions; per-series shard key; isolate noisy tenants from shared infra. |
| Metadata contention | Frequent status/tag updates + heavy list/filter queries on one RDBMS. | Read replicas; index by experiment/tag/metric; split hot mutable status from immutable params; cache alias/resolve reads. |
Summary — what a staff answer nails
Frame it as two systems behind one SDK: a write-optimized telemetry firehose (async batched client → partitioned log → compressed columnar TSDB at ~105 points/s, downsampled and tiered) and a read-optimized analytics + control plane (metadata RDBMS + artifact store powering compare/leaderboards and a search controller). Make the client non-blocking so the tracker never steals GPU time, and put the durability guarantee on params/artifacts/lineage while treating raw metrics as best-effort. For HPO, lead with ASHA/Hyperband early-stopping to concentrate GPUs on promising trials, add Bayesian suggestion for sample efficiency, and delegate placement to the GPU scheduler. Close on reproducibility — pin code, data, container digest, and seeds — because an experiment you can't reproduce isn't an experiment.