AI / ML Infrastructure

Experiment Tracking & Hyperparameter Search

An experiment tracker is the system of record for every training run — the immutable lab notebook that logs the params, metrics, and artifacts of each experiment so teams can compare runs, reproduce results, and search for better hyperparameters. This is the control plane behind tools like MLflow, Weights & Biases, TensorBoard, Determined, Ray Tune, Optuna, and Vizier. Two hard problems collide here: a write-heavy telemetry firehose — metrics streaming from thousands of concurrent jobs that must never slow the GPU loop — and an active search controller that decides which configs to try next and kills bad trials early to keep an ultra-expensive fleet busy on the most promising runs.

Requirements

Treat the tracker as three workloads sharing one SDK: a write-heavy telemetry firehose (metrics streaming from live jobs), a read-heavy analytical surface (humans comparing runs, dashboards, leaderboards), and a control loop (the search) that both reads results and launches new runs. The defining constraint is that none of this may ever slow down training — GPU time is the expensive thing.

Functional	Non-functional
Log params, metrics & artifacts per run via a thin SDK. Metrics are time series keyed by `step`/`wall_time`; params are immutable; artifacts are blobs (checkpoints, plots, media).	High-frequency ingestion — absorb metrics from thousands of concurrent runs logging every few steps without dropping points or stalling the training loop.
Compare runs side by side — overlay metric curves, diff hyperparameters, and group results into a leaderboard.	Non-blocking client — `log()` must return in microseconds; the SDK is async and fault-tolerant so a tracker outage never crashes a week-long job.
Visualize metric curves, system charts (GPU util/mem), media (images/audio), and custom plots — even for very long runs.	Fast queries / dashboards — comparing tens-to-hundreds of runs loads in well under a second via server-side aggregation.
Group into experiments — organize runs into projects; tag, search, and filter by params, metrics, status, and tags.	Durable & consistent metadata — a finished run's params/artifacts/lineage are never lost or silently mutated.
Hyperparameter sweeps — define a search space + objective; the system suggests configs and orchestrates trials with early stopping.	Scalable & multi-tenant — grows with the fleet, isolates teams/quotas, and offers cheap long-term retention.
Reproducibility — capture code commit, data version, environment/container, and seeds so any run can be rebuilt and re-run.	Cost-efficient storage — compress time series and tier old data; dedup identical checkpoints; PB-scale artifacts stay affordable.

Scale & estimates

Back-of-envelope to size the write path. The dominant load is metric points per second, not requests — each client flush carries many points. Assume a large org with active sweeps.

Dimension	Estimate	Notes
Concurrent runs	~10,000	Across teams + big sweeps. A single sweep can be thousands of trials.
Series per run	~100+	~30 user scalars (loss, lr, grad-norm, throughput) + ~10 system metrics × 8 GPUs (util/mem/temp/power).
Log cadence	flush every ~5-10 s	e.g. every 100 steps; batched — not one request per point.
Metric write rate	~100k points/s sustained, ~500k/s peak	10k runs × ~10 pts/s. The headline number that sizes the TSDB.
Raw point size	~40 bytes	`(run, key, step, ts, value)` → ~4 MB/s → ~350 GB/day before compression.
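After TS compression	~1-2 bytes/point	Gorilla-style delta-of-delta + XOR float encoding → ~20-40× smaller than the 40-byte raw point, since the series identity is stored once per series rather than per point.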
Sweep size	1k-10k trials each	Objective-driven; early stopping kills most before they finish.
Metadata writes	thousands/min	Run create/finish, param sets, tags — small, RDBMS-shaped, strongly consistent.
Artifacts	checkpoints 1-100 GB; media KB-MB	PB-scale over time on object storage; content-addressed dedup.
Retention	full-res 30-90 days; downsampled for years	Tiered; old runs rarely read at full resolution.
Query fan-out	10-100 runs × N series / dashboard	Must aggregate + downsample server-side for sub-second loads.

Takeaway: the write path is a time-series ingestion problem (~10⁵ points/s) and the read path is an interactive analytics problem over millions of series. Back them with three stores behind one API: a TSDB for metrics, an RDBMS for metadata, and object storage for artifacts, rather than one database trying to do all three.
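
The arithmetic behind the table can be checked in a few lines. This is only a sketch that plugs in the estimates above; the constant names are illustrative, and the compressed figure assumes the ~1-2 bytes/point range holds for your data.

```python
# Back-of-envelope sizing for the metric write path, using the table's estimates.
CONCURRENT_RUNS = 10_000
POINTS_PER_RUN_PER_S = 10            # ~100 series, each logged roughly once per 10 s
RAW_BYTES_PER_POINT = 40             # (run, key, step, ts, value)
COMPRESSED_BYTES_PER_POINT = (1, 2)  # assumed range after TS compression
SECONDS_PER_DAY = 86_400

points_per_s = CONCURRENT_RUNS * POINTS_PER_RUN_PER_S
raw_mb_per_s = points_per_s * RAW_BYTES_PER_POINT / 1e6
raw_gb_per_day = raw_mb_per_s * SECONDS_PER_DAY / 1e3
compressed_gb_per_day = tuple(
    points_per_s * b * SECONDS_PER_DAY / 1e9 for b in COMPRESSED_BYTES_PER_POINT
)

print(f"{points_per_s:,} points/s")
print(f"{raw_mb_per_s:.1f} MB/s raw, {raw_gb_per_day:.1f} GB/day raw")
print(f"{compressed_gb_per_day[0]:.1f}-{compressed_gb_per_day[1]:.1f} GB/day compressed")
```

At these rates a day of sustained ingest fits in tens of gigabytes after compression, which is why the retention tiers are sized in months rather than days.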

Core entities

Six entities carry the whole model. An Experiment groups Runs; a Run owns immutable Params, emits a stream of Metrics, and produces Artifacts; a Sweep is the search job that spawns many Runs as trials. The critical split: metrics live in a TSDB, everything else in an RDBMS or blob store.

Entity	What it is	Key fields
Experiment	A named project / logical grouping of related runs.	`experiment_id`, `name`, `owner`, `created_at`
Run (Trial)	One training execution — the unit everything hangs off.	`run_id`, `experiment_id`, `sweep_id?`, `status`, times, provenance
Param	Immutable hyperparameter of a run (set once at start).	`(run_id, key)` → `value`, `type`
Metric	Append-only time-series point (the firehose).	`run_id`, `key`, `step`, `wall_time`, `value`
Artifact	Content-addressed output blob (checkpoint, plot, table).	`digest`, `run_id`, `type`, `storage_uri`, `size`
Sweep	A hyperparameter search job over a space + objective.	`sweep_id`, `strategy`, `search_space`, `objective`, `budget`

A compact schema sketch — note that Metric is a TSDB family, not an RDBMS row-per-insert:

Experiment
  experiment_id PK, name, owner, description, created_at
  UNIQUE (owner, name)

Run                                  -- one training execution / trial
  run_id PK, experiment_id FK -> Experiment,
  sweep_id FK -> Sweep NULL,
  status,                            -- running | finished | failed | killed
  start_time, end_time,
  git_commit, data_version,          -- provenance (see Reproducibility)
  container_image, seed,
  tags JSON, created_by

Param                                -- immutable hyperparameters of a run
  run_id FK -> Run, key, value, type
  PK (run_id, key)

Metric                               -- append-only time series (hot write path)
  series_id = hash(run_id, key),     -- one contiguous curve per series
  step, wall_time, value
  -- stored in a columnar TSDB partitioned by (series_id, time),
  -- NOT a row per insert in the RDBMS

Artifact                             -- content-addressed output blob
  digest,                            -- sha256 of contents (dedup identical checkpoints)
  run_id FK -> Run, type,            -- checkpoint | image | table | file
  storage_uri, size_bytes, created_at
  PK (run_id, digest)                -- one blob per digest, referenced by many runs

Sweep                                -- hyperparameter search job
  sweep_id PK, experiment_id FK -> Experiment,
  strategy,                          -- grid | random | bayes | asha | bohb
  search_space JSON, objective,      -- e.g. minimize val_loss
  budget, max_trials, status, config JSON
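
Content addressing is what makes the Artifact dedup work. A minimal sketch, assuming blobs are stored once per digest while each run keeps its own reference; `ArtifactIndex` and `content_digest` are hypothetical names, and the in-memory dicts stand in for object storage and the artifact table.

```python
import hashlib
import io


def content_digest(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large checkpoints never load fully into memory."""
    h = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return "sha256:" + h.hexdigest()


class ArtifactIndex:
    """In-memory stand-in for the blob store plus the artifact table."""

    def __init__(self):
        self.blobs = {}     # digest -> bytes (stored once per unique content)
        self.links = set()  # (run_id, digest) references

    def put(self, run_id, data):
        digest = content_digest(io.BytesIO(data))
        if digest not in self.blobs:      # identical content is uploaded only once
            self.blobs[digest] = data
        self.links.add((run_id, digest))  # every run still records its own link
        return digest
```

Two runs that save the same checkpoint end up with one stored blob and two links, so deleting one run never removes data the other still references.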

Immutability is the simplifying assumption

A finished run's params, final metrics, and artifacts are immutable. While a run is live, metrics are only appended, and only status and tags are updated in place. That means finished runs and their metric history can be cached forever, comparisons are stable, and the record is auditable. Build the whole read side around this.

High-level design

Training jobs talk to a thin SDK that buffers and batches locally. Batches hit a stateless ingestion API that appends to a durable write buffer (log) and acks fast; consumers fan the data into the right store — metrics to a time-series store, run/param metadata to an RDBMS, and large blobs straight to an artifact store. A query + aggregation service powers the dashboard and compare UI.

flowchart LR
    T1["Training job + SDK"] --> ING["Ingestion API (async, batched)"]
    T2["Training job + SDK"] --> ING
    ING --> Q["Write buffer / queue"]
    Q --> TS["Time-series store (metrics)"]
    Q --> META["Metadata DB (runs, params)"]
    ING --> OBJ["Artifact / blob store"]
    TS --> QRY["Query + aggregation service"]
    META --> QRY
    QRY --> UI["Dashboard / compare UI"]
    OBJ --> UI

Thin client SDK. Buffers points in memory; a background thread batches, flushes, and retries. Logging is fire-and-forget so a tracker outage never crashes a long training job. It also captures provenance (git commit, data version, image, seeds) at run start.
Ingestion API (stateless, autoscaled). Authenticates, validates, and appends batches to a partitioned log (Kafka) keyed by run_id, then acks. This decouples bursty producers from the storage write rate and absorbs a sweep launching 1,000 trials at once.
Three stores, one API. Metrics → columnar TSDB; experiment/run/param metadata → RDBMS (Postgres); artifacts → object storage (S3) fronted by a CDN/peer mesh for big checkpoints. Artifacts are content-addressed so identical checkpoints dedup.
Query + aggregation service. Does server-side downsampling and roll-ups so the browser fetches a few hundred points per series, not millions, and caches immutable historical windows.
Search controller. A separate control loop reads trial results and launches/kills runs through the GPU cluster scheduler — detailed in the Hyperparameter search section.
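
The server-side downsampling done by the query service can be sketched as a bucketed roll-up. This is an illustrative version, not a specific TSDB's implementation: `Bucket` and `downsample` are hypothetical names, buckets have equal step width, and the input is assumed to be sorted by step.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    step_start: int
    step_end: int
    vmin: float
    vmax: float
    vavg: float
    count: int


def downsample(points, max_buckets):
    """Roll (step, value) pairs, sorted by step, into at most max_buckets buckets."""
    if max_buckets < 1:
        raise ValueError("max_buckets must be >= 1")
    if not points:
        return []
    first, last = points[0][0], points[-1][0]
    span = last - first + 1
    width = -(-span // max_buckets)  # ceil(span / max_buckets), always >= 1
    grouped = {}
    for step, value in points:
        grouped.setdefault((step - first) // width, []).append((step, value))
    buckets = []
    for idx in sorted(grouped):
        steps = [s for s, _ in grouped[idx]]
        values = [v for _, v in grouped[idx]]
        buckets.append(
            Bucket(steps[0], steps[-1], min(values), max(values),
                   sum(values) / len(values), len(values))
        )
    return buckets
```

Returning min and max alongside the average lets the UI draw an envelope, so a single-step loss spike stays visible even when thousands of points collapse into one bucket.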

Deep dive: metric ingestion at scale

The firehose is the hard part. A run logging every step can emit thousands of points per minute; multiply by thousands of runs and you are at ~10⁵ points/s. Two rules drive every decision: never block the training loop, and never lose a finished run's record. The client absorbs the first; the ingestion log and TSDB absorb the second.

sequenceDiagram
    participant R as Training run SDK
    participant B as Local buffer
    participant API as Ingestion API
    participant Q as Write buffer
    participant TS as Time-series store
    R->>B: log step + metrics, non-blocking
    B->>B: Batch and coalesce points
    B->>API: Flush batch every ~5s
    API->>Q: Append and ack fast
    Q-->>API: Buffered durably
    API-->>B: 200 OK
    Q->>TS: Bulk write + compress
    Note over B,API: If network down, buffer locally and retry
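
The client half of this sequence can be sketched with a bounded queue and a daemon thread. This is a minimal illustration, not any particular tracker's SDK: `AsyncLogger` and the `send_batch` transport callable are hypothetical, a full buffer drops new points instead of overwriting old ones, and disk spill and backoff are omitted.

```python
import queue
import threading
import time


class AsyncLogger:
    """log() only enqueues; a background thread batches and flushes."""

    def __init__(self, send_batch, flush_interval_s=5.0, max_batch=1000,
                 max_buffer=100_000):
        self._send_batch = send_batch  # hypothetical transport: callable(list_of_points)
        self._flush_interval_s = flush_interval_s
        self._max_batch = max_batch
        self._queue = queue.Queue(maxsize=max_buffer)
        self._pending = []             # batch that failed to send, retried first
        self._stop = threading.Event()
        self.dropped = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def log(self, key, step, value):
        try:
            self._queue.put_nowait((key, step, value, time.time()))
        except queue.Full:
            self.dropped += 1          # shed telemetry rather than stall training

    def _drain(self):
        batch = []
        while len(batch) < self._max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def _flush_once(self):
        if not self._pending:
            self._pending = self._drain()
        if not self._pending:
            return False
        try:
            self._send_batch(self._pending)
        except Exception:
            return False               # keep the batch and retry on the next tick
        self._pending = []
        return True

    def _run(self):
        # Event.wait returns False on timeout, True once close() sets the flag.
        while not self._stop.wait(self._flush_interval_s):
            self._flush_once()

    def close(self, retries=3):
        """Synchronous final flush at run end so the last points are not lost."""
        self._stop.set()
        self._thread.join()
        failures = 0
        while (self._pending or not self._queue.empty()) and failures < retries:
            if not self._flush_once():
                failures += 1
```

A training loop would call `logger.log("loss", step, loss)` on each logging step and `logger.close()` once at exit; only `close()` ever waits on the network.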

Async, batched client. log() appends to an in-memory ring buffer and returns immediately. A background thread coalesces points and flushes a compact batch every few seconds (or every N points). If the network is down it buffers locally (spilling to disk) and retries with backoff — training keeps running. On run end it flushes synchronously so the final state is durable.
Ingestion decoupling. The API appends batches to a partitioned log keyed by run_id; consumers write to the TSDB at their own pace. This absorbs bursts and enables back-pressure: if consumers lag, producers slow their flush cadence — never the training loop.
Time-series storage. Each curve is a series series_id = hash(run_id, key), stored as (series_id, step/timestamp, value). A columnar TSDB applies delta-of-delta timestamp encoding + XOR float compression (Gorilla-style) for ~1-2 bytes/point. Partition by series and time; recent data in a hot tier (memory/SSD), older data tiered to object storage. Keeping each curve contiguous makes range scans cheap.
System vs user metrics. User metrics (loss, accuracy) come from training code at step cadence. System metrics (GPU util/mem/temp/power, throughput, host CPU/RAM) are sampled by a sidecar/agent on a fixed wall-clock timer, independent of the loop, so they keep flowing even if the loop hangs, which is gold for debugging stalls. Keep them as separate series families: higher volume, lower long-term value → shorter retention.
Downsampling for long runs. A week-long run logging ~100 series every 100 steps can accumulate millions of points, far more than a browser can render or should receive. Maintain pre-aggregated roll-ups (min/max/avg per bucket) at write or compaction time, and serve the resolution that fits the viewport (~1-2k points). Preserve min/max so loss spikes aren't smoothed away; keep full resolution hot for a window, then downsample and tier to cold storage.
Idempotency & ordering. Points carry (run_id, key, step) so retries dedup; the store is append-only and tolerates out-of-order arrival (sort by step on read), with last-writer-wins per (series, step).
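
The idempotency and ordering rule is small enough to show directly. A minimal in-memory sketch, assuming each point is keyed by `(run_id, key, step)`; `SeriesStore` is a hypothetical stand-in for the TSDB write path.

```python
from collections import defaultdict


class SeriesStore:
    """Append-only store with last-writer-wins per (series, step)."""

    def __init__(self):
        self._series = defaultdict(dict)  # (run_id, key) -> {step: value}

    def append(self, run_id, key, step, value):
        # A retried batch rewrites the same (series, step) slot, so it cannot duplicate.
        self._series[(run_id, key)][step] = value

    def read(self, run_id, key):
        # Arrival order is ignored; points are sorted by step at read time.
        return sorted(self._series.get((run_id, key), {}).items())
```

Because a retry lands on the same slot, the client can resend a whole batch after a timeout without tracking which points were already accepted.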

Trade-off: async batching vs durability

Async batching trades a few seconds of crash-window metrics for zero training-loop impact — the right call, because the expensive resource (GPU time) must never wait on the tracker. The durability guarantee therefore lives on params and artifacts, written/flushed deliberately, while raw metric points are treated as best-effort telemetry. Choosing where the strong guarantee sits is the key decision.

Deep dive: hyperparameter search

A sweep turns the tracker from a passive recorder into an active optimizer. Given a search space and an objective (e.g. minimize val_loss), the controller decides which configs to try next and which running trials to kill — and the second is where the real GPU savings come from. The strategies trade sample efficiency against parallelism.

Strategy	How it picks configs	When to use	Cost / trait
Grid search	Every combination of a discrete grid.	Few params (≤3); need exhaustive coverage.	Explodes combinatorially; wastes budget on dead dimensions.
Random search	Sample uniformly from the space.	Strong default; beats grid when only a few params matter.	Embarrassingly parallel; learns nothing from past trials.
Bayesian opt (TPE/GP)	Fit a surrogate of objective vs params; sample where expected improvement is high.	Expensive objectives; want the answer in few trials.	Sample-efficient but more sequential; harder to parallelize.
Hyperband / ASHA	Random configs + successive halving: run many cheaply, promote the top fraction to more budget.	Large parallel fleets with a meaningful early signal.	Best throughput; needs early metric to correlate with final.
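
Random search, the default baseline in the table, is little more than a sampler over the search space. The sketch below assumes a simple dict-based space format invented for illustration (`SEARCH_SPACE`, `sample_config`, and the `uniform` / `log_uniform` / `choice` tags are not any library's API). Learning rates are sampled log-uniformly so each decade is equally likely.

```python
import math
import random

# Hypothetical search-space format: name -> (kind, *args).
SEARCH_SPACE = {
    "lr": ("log_uniform", 1e-5, 1e-1),
    "dropout": ("uniform", 0.0, 0.5),
    "batch_size": ("choice", [32, 64, 128, 256]),
}


def sample_config(space, rng):
    """Draw one independent config; no state is shared between trials."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "uniform":
            config[name] = rng.uniform(spec[1], spec[2])
        elif kind == "log_uniform":
            low, high = math.log(spec[1]), math.log(spec[2])
            config[name] = math.exp(rng.uniform(low, high))
        elif kind == "choice":
            config[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"unknown distribution kind: {kind}")
    return config


rng = random.Random(0)  # seeded so the sweep itself is reproducible
trials = [sample_config(SEARCH_SPACE, rng) for _ in range(8)]
```

Since every draw is independent, all eight trials can launch at once, which is the "embarrassingly parallel" trait the table refers to.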

Successive halving / ASHA. Give a small budget (epochs/steps) to many configs; at each rung keep the top 1/η (e.g. the top third for η = 3) and grant survivors η× more budget; repeat. Most bad configs die after minutes, concentrating GPUs on promising ones. ASHA (Asynchronous Successive Halving Algorithm) makes promotion asynchronous: a trial is promoted as soon as it ranks in the top 1/η of the results recorded so far at its rung, so no trial blocks waiting on slow siblings. That is what scales to thousands of parallel workers.
The controller / scheduler. A central controller holds search state, suggests configs via the chosen strategy, and watches each trial's reported metric at rung boundaries. When a trial underperforms its rung cutoff it issues a kill and reclaims the GPUs. It is stateless-recoverable: search state is persisted in the metadata DB, so the controller can crash and resume without losing the sweep.
Cross-ref: GPU scheduler. The HPO controller doesn't place GPUs itself — it submits and cancels trials to the GPU cluster scheduler, which gang-schedules each trial and reclaims freed GPUs for the next suggestion. Early-stopping plus the scheduler's backfill together keep the expensive fleet busy on the most promising configs.
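
The asynchronous promotion rule can be sketched as follows. This is a simplified illustration of the idea, assuming a lower-is-better objective; `AshaScheduler` and its methods are hypothetical names, and a real controller would also persist this state and talk to the GPU scheduler.

```python
class AshaScheduler:
    """Promote a trial once it is in the top 1/eta of results seen so far at its rung."""

    def __init__(self, eta=3, min_budget=1, max_rung=3):
        self.eta = eta
        self.min_budget = min_budget
        self.max_rung = max_rung
        self.results = [dict() for _ in range(max_rung + 1)]  # rung -> {trial_id: loss}
        self.promoted = [set() for _ in range(max_rung + 1)]  # rung -> already promoted

    def budget(self, rung):
        """Budget (e.g. epochs) granted at a rung: min_budget * eta**rung."""
        return self.min_budget * self.eta ** rung

    def report(self, trial_id, rung, loss):
        self.results[rung][trial_id] = loss

    def next_promotion(self):
        """Return (trial_id, next_rung) if any trial qualifies right now, else None."""
        for rung in reversed(range(self.max_rung)):  # prefer promoting from higher rungs
            scores = self.results[rung]
            k = len(scores) // self.eta              # size of the top 1/eta so far
            for trial_id in sorted(scores, key=scores.get)[:k]:
                if trial_id not in self.promoted[rung]:
                    self.promoted[rung].add(trial_id)
                    return trial_id, rung + 1
        return None  # nothing to promote: the caller starts a new trial at rung 0
```

When `next_promotion()` returns `None`, a free worker starts a fresh config at rung 0 rather than idling; trials that never reach the top fraction simply receive no further budget, which is the early stop.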

flowchart TD
    SP["Search space + strategy"] --> CTRL["HPO controller"]
    CTRL --> SUG["Suggest next configs"]
    SUG --> L1["Trial 1"]
    SUG --> L2["Trial 2"]
    SUG --> L3["Trial N"]
    L1 --> EVAL["Report metric at rung"]
    L2 --> EVAL
    L3 --> EVAL
    EVAL --> ASHA{"Promote or stop?"}
    ASHA -->|top fraction| PROMO["More budget, next rung"]
    ASHA -->|rest| KILL["Early-stop, free GPUs"]
    PROMO --> CTRL
    KILL --> SCHED["GPU scheduler reclaims GPUs"]
    SCHED --> CTRL

Trade-off: parallelism vs sample efficiency

Bayesian optimization is sample-efficient but wants results before suggesting the next config (sequential); ASHA is parallel-friendly but assumes the early metric tracks final performance. BOHB combines Bayesian suggestion with Hyperband budgets to get both. The hidden risk: early stopping fails when the early signal is misleading (e.g. during LR warmup or under cosine schedules). Guard against it with a minimum budget before the first rung, long enough to get past warmup.

Reproducibility & governance

A metric is worthless if you can't reproduce the run that produced it. Reproducibility means capturing everything that influences the result at run start, automatically, so "rerun X" is one click and "why did this metric move?" is answerable. The tracker pins it all to the immutable Run record.

Dimension	What's pinned	How
Code	Exact git commit + uncommitted diff.	SDK reads the git SHA at start; warns on and stores any dirty patch.
Data	Dataset version / snapshot id.	Content hash or dataset-version pointer (cross-ref the ML data pipeline / model registry).
Environment	Container image digest, lockfile, CUDA/driver versions.	Pin the image by `sha256` digest, never a mutable tag.
Config	Full resolved hyperparameters (immutable Params).	Logged at run start; never mutated thereafter.
Hardware	GPU type/count and topology.	From the scheduler; affects throughput and sometimes numerics.
Randomness	All seeds (framework, dataloader, CUDA) + determinism flags.	Logged; note exact bitwise determinism on GPU is best-effort.

Lineage / provenance. Each run links upstream (dataset version, parent run for fine-tunes, code commit) and downstream (artifacts, registered model versions). This DAG answers "what produced this model?" and "what's affected if this dataset is bad?", and feeds the model registry at promotion time.
Comparison & leaderboards. The read surface: filter runs by params/tags, overlay metric curves, diff configs (which hyperparameter changed?), and rank by objective into a leaderboard per experiment. Parallel-coordinates plots reveal which params drive the metric. Because params and final metrics are immutable, these queries cache extremely well.
Governance. Immutability + audit: a finished run is a permanent record of who ran what, when, and on which data. Per-experiment/team access control enables multi-tenancy; retention and quota policies bound cost.
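
Capturing the pinned dimensions at run start might look like the sketch below. It is an assumption-laden illustration: `capture_provenance` and the `fingerprint` field are hypothetical, only a subset of the table is covered, and the git lookup degrades to `None` outside a repository rather than failing the run.

```python
import hashlib
import json
import platform
import subprocess


def git_commit(cwd="."):
    """Return the current commit SHA, or None if git or a repo is unavailable."""
    try:
        out = subprocess.run(["git", "rev-parse", "HEAD"], cwd=cwd,
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def capture_provenance(params, data_version, container_image, seed, commit=None):
    """Build the immutable provenance record attached to a Run at start."""
    if "@sha256:" not in container_image:
        # Mutable tags can be re-pointed later, so require a digest reference.
        raise ValueError("pin the container image by sha256 digest, not a tag")
    record = {
        "git_commit": commit if commit is not None else git_commit(),
        "data_version": data_version,
        "container_image": container_image,
        "seed": seed,
        "params": params,
        "python": platform.python_version(),
    }
    # Canonical JSON (sorted keys) so the same inputs always hash identically.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["fingerprint"] = hashlib.sha256(canonical.encode()).hexdigest()
    return record
```

Two runs with the same fingerprint were launched from the same code, data, image, config, and seed, which makes "is this a true re-run?" a single equality check.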

The durability boundary

Params, artifacts, and lineage are the durable system of record — strongly consistent, immutable, audited. Raw metric points are best-effort telemetry — high-volume, compressed, downsampled, eventually tiered away. Different guarantees, different stores: that separation is what lets the same system be both trustworthy and cheap at 10⁵ points/s.

Bottlenecks & scaling

A tracker rarely dies on raw QPS — it degrades through metric write volume, query fan-out over many runs, high-cardinality series, and artifact cost. The mitigations keep the firehose flowing while keeping dashboards interactive.

Bottleneck	Why it hurts	Mitigation
Metric write volume	~10⁵+ points/s swamps a naive row-per-insert DB.	Batch on the client; ingest via a partitioned log; columnar TSDB with delta + XOR compression; shard by series.
Query latency over many runs	Comparing 100 runs × millions of points/series melts dashboards.	Server-side downsampling/roll-ups; pre-aggregate; resolution-by-viewport; cache immutable windows; cap series per chart.
High-cardinality series	Per-step, per-GPU, per-layer metrics explode series count → index blowup.	Curate/cap keys; separate system-metric family with shorter retention; tag instead of minting new series; reject runaway cardinality.
Artifact storage	Checkpoints (1-100 GB) × many runs → PB and real money.	Content-addressed dedup; tier hot → cold; TTL on non-promoted runs; presigned URLs + CDN/peer mesh for downloads.
Sweep orchestration	Thousands of trials thrash the controller + scheduler.	Async ASHA (no global barrier); persist search state; rate-limit suggestions; delegate placement to the GPU scheduler; backfill freed GPUs.
Hot run / write skew	A few giant runs (thousands of series) hotspot one partition.	Hash series across partitions; per-series shard key; isolate noisy tenants from shared infra.
Metadata contention	Frequent status/tag updates + heavy list/filter queries on one RDBMS.	Read replicas; index by experiment/tag/metric; split hot mutable status from immutable params; cache reads of finished (immutable) runs.
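
The "reject runaway cardinality" mitigation can be enforced at ingestion with a per-run cap on distinct metric keys. A minimal sketch: `CardinalityGuard` is a hypothetical name and the default limit of 500 is an arbitrary illustrative value, not a recommendation from the text.

```python
class CardinalityGuard:
    """Admit points for known series; refuse new series past a per-run cap."""

    def __init__(self, max_series_per_run=500):
        self.max_series_per_run = max_series_per_run
        self._keys = {}  # run_id -> set of metric keys seen so far

    def admit(self, run_id, key):
        keys = self._keys.setdefault(run_id, set())
        if key in keys:
            return True   # existing series: always accept new points
        if len(keys) >= self.max_series_per_run:
            return False  # a new series would exceed the cap: reject it
        keys.add(key)
        return True
```

Rejecting only new series means a run that hits the cap keeps its existing curves intact; the ingestion API can surface the rejection as a warning so the author curates keys instead of minting one per layer or per step.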

Summary — what a staff answer nails

Frame it as two systems behind one SDK: a write-optimized telemetry firehose (async batched client → partitioned log → compressed columnar TSDB at ~10⁵ points/s, downsampled and tiered) and a read-optimized analytics + control plane (metadata RDBMS + artifact store powering compare/leaderboards and a search controller). Make the client non-blocking so the tracker never steals GPU time, and put the durability guarantee on params/artifacts/lineage while treating raw metrics as best-effort. For HPO, lead with ASHA/Hyperband early-stopping to concentrate GPUs on promising trials, add Bayesian suggestion for sample efficiency, and delegate placement to the GPU scheduler. Close on reproducibility — pin code, data, container digest, and seeds — because an experiment you can't reproduce isn't an experiment.