System Design Notes

AI / ML Infrastructure

Experiment Tracking & Hyperparameter Search

An experiment tracker is the system of record for every training run — the immutable lab notebook that logs the params, metrics, and artifacts of each experiment so teams can compare runs, reproduce results, and search for better hyperparameters. This is the control plane behind tools like MLflow, Weights & Biases, TensorBoard, Determined, Ray Tune, Optuna, and Vizier. Two hard problems collide here: a write-heavy telemetry firehose — metrics streaming from thousands of concurrent jobs that must never slow the GPU loop — and an active search controller that decides which configs to try next and kills bad trials early to keep an ultra-expensive fleet busy on the most promising runs.

Requirements

Treat the tracker as three workloads sharing one SDK: a write-heavy telemetry firehose (metrics streaming from live jobs), a read-heavy analytical surface (humans comparing runs, dashboards, leaderboards), and a control loop (the search) that both reads results and launches new runs. The defining constraint is that none of this may ever slow down training — GPU time is the expensive thing.

| Functional | Non-functional |
| --- | --- |
| Log params, metrics & artifacts per run via a thin SDK. Metrics are time series keyed by step/wall_time; params are immutable; artifacts are blobs (checkpoints, plots, media). | High-frequency ingestion: absorb metrics from thousands of concurrent runs logging every few steps without dropping points or stalling the training loop. |
| Compare runs side by side: overlay metric curves, diff hyperparameters, and group results into a leaderboard. | Non-blocking client: log() must return in microseconds; the SDK is async and fault-tolerant so a tracker outage never crashes a week-long job. |
| Visualize metric curves, system charts (GPU util/mem), media (images/audio), and custom plots, even for very long runs. | Fast queries / dashboards: comparing tens-to-hundreds of runs loads in well under a second via server-side aggregation. |
| Group into experiments: organize runs into projects; tag, search, and filter by params, metrics, status, and tags. | Durable & consistent metadata: a finished run's params/artifacts/lineage are never lost or silently mutated. |
| Hyperparameter sweeps: define a search space + objective; the system suggests configs and orchestrates trials with early stopping. | Scalable & multi-tenant: grows with the fleet, isolates teams/quotas, and offers cheap long-term retention. |
| Reproducibility: capture code commit, data version, environment/container, and seeds so any run can be rebuilt and re-run. | Cost-efficient storage: compress time series and tier old data; dedup identical checkpoints; PB-scale artifacts stay affordable. |

Scale & estimates

Back-of-envelope to size the write path. The dominant load is metric points per second, not requests — each client flush carries many points. Assume a large org with active sweeps.

| Dimension | Estimate | Notes |
| --- | --- | --- |
| Concurrent runs | ~10,000 | Across teams + big sweeps. A single sweep can be thousands of trials. |
| Series per run | ~100+ | ~30 user scalars (loss, lr, grad-norm, throughput) + ~10 system metrics × 8 GPUs (util/mem/temp/power). |
| Log cadence | flush every ~5-10 s | e.g. every 100 steps; batched, not one request per point. |
| Metric write rate | ~100k points/s sustained, ~500k/s peak | 10k runs × ~10 pts/s. The headline number that sizes the TSDB. |
| Raw point size | ~40 bytes | (run, key, step, ts, value) → ~4 MB/s → ~350 GB/day before compression. |
| After TS compression | ~1-2 bytes/point | Gorilla-style delta-of-delta + XOR float encoding → ~10-20× smaller. |
| Sweep size | 1k-10k trials each | Objective-driven; early stopping kills most before they finish. |
| Metadata writes | thousands/min | Run create/finish, param sets, tags: small, RDBMS-shaped, strongly consistent. |
| Artifacts | checkpoints 1-100 GB; media KB-MB | PB-scale over time on object storage; content-addressed dedup. |
| Retention | full-res 30-90 days; downsampled for years | Tiered; old runs rarely read at full resolution. |
| Query fan-out | 10-100 runs × N series / dashboard | Must aggregate + downsample server-side for sub-second loads. |
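The estimates above can be reproduced with a few lines of arithmetic. This is a sketch, not a capacity plan: the inputs come from the table, except `seconds_between_points`, which is an assumption (each series emitting roughly one point per 10 s flush interval) chosen to reconcile ~100 series per run with ~10 points/s per run.

```python
# Back-of-envelope sizing for the metric write path.
concurrent_runs = 10_000
series_per_run = 100             # ~30 user scalars + ~10 system metrics x 8 GPUs
seconds_between_points = 10      # ASSUMPTION: one point per series per flush interval
raw_bytes_per_point = 40         # (run, key, step, ts, value)
compressed_bytes_per_point = 2   # upper end of the ~1-2 byte range

points_per_run_per_s = series_per_run / seconds_between_points
points_per_s = concurrent_runs * points_per_run_per_s
raw_mb_per_s = points_per_s * raw_bytes_per_point / 1e6
raw_gb_per_day = raw_mb_per_s * 86_400 / 1e3
compressed_gb_per_day = points_per_s * compressed_bytes_per_point * 86_400 / 1e9

print(f"{points_per_s:,.0f} points/s")
print(f"{raw_mb_per_s:.1f} MB/s raw, {raw_gb_per_day:.1f} GB/day raw")
print(f"{compressed_gb_per_day:.2f} GB/day after compression")
```

Changing `concurrent_runs` or `seconds_between_points` shows how quickly the sustained rate approaches the ~500k/s peak figure.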

Takeaway: the write path is a time-series ingestion problem (~10^5 points/s) and the read path is an interactive analytics problem over millions of series. Treat them as three stores behind one API: a TSDB for metrics, an RDBMS for metadata, and object storage for artifacts, not one database trying to do all three.
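To see why a TSDB compresses metric series so well, here is a minimal sketch of the two transforms behind Gorilla-style encoding: delta-of-delta on steps or timestamps, and XOR on consecutive float values. The function names are illustrative, and the sketch stops before the variable-length bit packing a real TSDB applies to the resulting mostly-zero values.

```python
import struct

def encode_delta_of_delta(values):
    """Integers -> [first, first_delta, delta_of_delta, ...]."""
    out, prev, prev_delta = [], 0, 0
    for i, v in enumerate(values):
        if i == 0:
            out.append(v)
        else:
            delta = v - prev
            out.append(delta - prev_delta)  # 0 whenever the cadence is regular
            prev_delta = delta
        prev = v
    return out

def decode_delta_of_delta(encoded):
    out, value, delta = [], 0, 0
    for i, d in enumerate(encoded):
        if i == 0:
            value = d
        else:
            delta += d
            value += delta
        out.append(value)
    return out

def _bits(x):
    # Reinterpret a float64 as an unsigned 64-bit integer.
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def _float(bits):
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

def encode_xor(values):
    """Floats -> XOR of each value's bits with the previous value's bits."""
    out, prev = [], 0
    for v in values:
        b = _bits(v)
        out.append(b ^ prev)  # 0 when the value repeats exactly
        prev = b
    return out

def decode_xor(encoded):
    out, prev = [], 0
    for x in encoded:
        prev ^= x
        out.append(_float(prev))
    return out

steps = [100, 200, 300, 400, 500]
print(encode_delta_of_delta(steps))  # regular cadence collapses to zeros
```

A run that logs every 100 steps produces a delta-of-delta stream that is almost entirely zeros, which is what lets a bit-packed encoding spend about a bit per timestamp.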

Core entities

Six entities carry the whole model. An Experiment groups Runs; a Run owns immutable Params, emits a stream of Metrics, and produces Artifacts; a Sweep is the search job that spawns many Runs as trials. The critical split: metrics live in a TSDB, everything else in an RDBMS or blob store.

| Entity | What it is | Key fields |
| --- | --- | --- |
| Experiment | A named project / logical grouping of related runs. | experiment_id, name, owner, created_at |
| Run (Trial) | One training execution, the unit everything hangs off. | run_id, experiment_id, sweep_id?, status, times, provenance |
| Param | Immutable hyperparameter of a run (set once at start). | (run_id, key) → value, type |
| Metric | Append-only time-series point (the firehose). | run_id, key, step, wall_time, value |
| Artifact | Content-addressed output blob (checkpoint, plot, table). | digest, run_id, type, storage_uri, size |
| Sweep | A hyperparameter search job over a space + objective. | sweep_id, strategy, search_space, objective, budget |

A compact schema sketch — note that Metric is a TSDB family, not an RDBMS row-per-insert:

Experiment
  experiment_id PK, name, owner, description, created_at
  UNIQUE (owner, name)

Run                                  -- one training execution / trial
  run_id PK, experiment_id FK -> Experiment,
  sweep_id FK -> Sweep NULL,
  status,                            -- running | finished | failed | killed
  start_time, end_time,
  git_commit, data_version,          -- provenance (see Reproducibility)
  container_image, seed,
  tags JSON, created_by

Param                                -- immutable hyperparameters of a run
  run_id FK -> Run, key, value, type
  PK (run_id, key)

Metric                               -- append-only time series (hot write path)
  series_id = hash(run_id, key),     -- one contiguous curve per series
  step, wall_time, value
  -- stored in a columnar TSDB partitioned by (series_id, time),
  -- NOT a row per insert in the RDBMS

Artifact                             -- content-addressed output blob
  digest PK,                         -- sha256 of contents (dedup identical checkpoints)
  run_id FK -> Run, type,            -- checkpoint | image | table | file
  storage_uri, size_bytes, created_at

Sweep                                -- hyperparameter search job
  sweep_id PK, experiment_id FK -> Experiment,
  strategy,                          -- grid | random | bayes | asha | bohb
  search_space JSON, objective,      -- e.g. minimize val_loss
  budget, max_trials, status, config JSON

Immutability is the simplifying assumption

A finished run's params, final metrics, and artifacts are immutable; while a run is in progress, only its status and tags mutate (metrics are append-only). That means finished runs and their metric history can be cached forever, comparisons are stable, and the record is auditable. Build the whole read side around this.

High-level design

Training jobs talk to a thin SDK that buffers and batches locally. Batches hit a stateless ingestion API that appends to a durable write buffer (log) and acks fast; consumers fan the data into the right store — metrics to a time-series store, run/param metadata to an RDBMS, and large blobs straight to an artifact store. A query + aggregation service powers the dashboard and compare UI.

flowchart LR
    T1["Training job + SDK"] --> ING["Ingestion API (async, batched)"]
    T2["Training job + SDK"] --> ING
    ING --> Q["Write buffer / queue"]
    Q --> TS["Time-series store (metrics)"]
    Q --> META["Metadata DB (runs, params)"]
    ING --> OBJ["Artifact / blob store"]
    TS --> QRY["Query + aggregation service"]
    META --> QRY
    QRY --> UI["Dashboard / compare UI"]
    OBJ --> UI
      

Deep dive: metric ingestion at scale

The firehose is the hard part. A run logging every step can emit thousands of points per minute; multiply by thousands of runs and you are at ~10^5 points/s. Two rules drive every decision: never block the training loop, and never lose a finished run's record. The client absorbs the first; the ingestion log and TSDB absorb the second.

sequenceDiagram
    participant R as Training run SDK
    participant B as Local buffer
    participant API as Ingestion API
    participant Q as Write buffer
    participant TS as Time-series store
    R->>B: log step + metrics, non-blocking
    B->>B: Batch and coalesce points
    B->>API: Flush batch every ~5s
    API->>Q: Append and ack fast
    Q-->>API: Buffered durably
    API-->>B: 200 OK
    Q->>TS: Bulk write + compress
    Note over B,API: If network down, buffer locally and retry
      

Trade-off: async batching vs durability

Async batching trades a few seconds of crash-window metrics for zero training-loop impact — the right call, because the expensive resource (GPU time) must never wait on the tracker. The durability guarantee therefore lives on params and artifacts, written/flushed deliberately, while raw metric points are treated as best-effort telemetry. Choosing where the strong guarantee sits is the key decision.
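
A minimal sketch of the client side of this trade-off, assuming a hypothetical `MetricLogger` class and a caller-supplied `send_batch` callable standing in for the HTTP call to the ingestion API. `log()` only enqueues; a daemon thread batches and flushes; a failed flush keeps the batch for retry, and a bounded buffer sheds points rather than stalling training.

```python
import queue
import threading
import time

class MetricLogger:
    """Hypothetical SDK core: log() enqueues, a daemon thread batches and flushes."""

    def __init__(self, send_batch, flush_interval_s=5.0, max_buffered=100_000):
        self._send_batch = send_batch        # callable(list_of_points); may raise
        self._flush_interval_s = flush_interval_s
        self._max_buffered = max_buffered
        self._queue = queue.Queue(maxsize=max_buffered)
        self._pending = []                   # drained points not yet acknowledged
        self.dropped = 0                     # approximate count of shed points
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def log(self, key, value, step):
        """Never blocks and never raises into the training loop."""
        try:
            self._queue.put_nowait((key, step, time.time(), value))
        except queue.Full:
            self.dropped += 1                # shed load rather than stall training

    def _drain(self):
        while True:
            try:
                self._pending.append(self._queue.get_nowait())
            except queue.Empty:
                return

    def _flush(self):
        self._drain()
        if not self._pending:
            return
        try:
            self._send_batch(list(self._pending))
            self._pending.clear()
        except Exception:
            # Tracker outage: keep the batch for the next attempt, bounded in size.
            overflow = len(self._pending) - self._max_buffered
            if overflow > 0:
                del self._pending[:overflow]
                self.dropped += overflow

    def _run(self):
        # Event.wait returns False on timeout, so this flushes once per interval.
        while not self._stop.wait(self._flush_interval_s):
            self._flush()

    def close(self):
        """Stop the background thread and attempt one final flush."""
        self._stop.set()
        self._thread.join()
        self._flush()
```

The crash window described above is visible here: points still in the queue or in `_pending` when the process dies are lost, which is acceptable only because params and artifacts take a separate, deliberate write path.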

Deep dive: hyperparameter search

A sweep turns the tracker from a passive recorder into an active optimizer. Given a search space and an objective (e.g. minimize val_loss), the controller decides which configs to try next and which running trials to kill — and the second is where the real GPU savings come from. The strategies trade sample efficiency against parallelism.

| Strategy | How it picks configs | When to use | Cost / trait |
| --- | --- | --- | --- |
| Grid search | Every combination of a discrete grid. | Few params (≤3); need exhaustive coverage. | Explodes combinatorially; wastes budget on dead dimensions. |
| Random search | Sample uniformly from the space. | Strong default; beats grid when only a few params matter. | Embarrassingly parallel; learns nothing from past trials. |
| Bayesian opt (TPE/GP) | Fit a surrogate of objective vs params; sample where expected improvement is high. | Expensive objectives; want the answer in few trials. | Sample-efficient but more sequential; harder to parallelize. |
| Hyperband / ASHA | Random configs + successive halving: run many cheaply, promote the top fraction to more budget. | Large parallel fleets with a meaningful early signal. | Best throughput; needs early metric to correlate with final. |

flowchart TD
    SP["Search space + strategy"] --> CTRL["HPO controller"]
    CTRL --> SUG["Suggest next configs"]
    SUG --> L1["Trial 1"]
    SUG --> L2["Trial 2"]
    SUG --> L3["Trial N"]
    L1 --> EVAL["Report metric at rung"]
    L2 --> EVAL
    L3 --> EVAL
    EVAL --> ASHA{"Promote or stop?"}
    ASHA -->|top fraction| PROMO["More budget, next rung"]
    ASHA -->|rest| KILL["Early-stop, free GPUs"]
    PROMO --> CTRL
    KILL --> SCHED["GPU scheduler reclaims GPUs"]
    SCHED --> CTRL
      

Trade-off: parallelism vs sample efficiency

Bayesian optimization is sample-efficient but wants results before suggesting the next config (sequential); ASHA is parallel-friendly but assumes the early metric tracks final performance. BOHB combines Bayesian suggestion with Hyperband budgets to get both. The hidden risk: early stopping fails when early signal is misleading (e.g. LR warmup or cosine schedules) — guard it with a minimum budget per rung.
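
The promote-or-stop decision can be sketched as follows. This is a simplified, stopping-style variant of asynchronous successive halving written for illustration: the `AshaController` class and its method names are hypothetical, the objective is assumed to be minimized, and a trial continues by default until at least `eta` results exist at its rung. `min_budget` is the guard mentioned above: no trial is judged before it has consumed that much budget.

```python
def rung_budgets(min_budget, max_budget, eta=3):
    """Budgets at which trials are compared: min_budget, min_budget*eta, ..."""
    budgets, b = [], min_budget
    while b <= max_budget:
        budgets.append(b)
        b *= eta
    return budgets

class AshaController:
    """Hypothetical async controller: no global barrier, one decision per report."""

    def __init__(self, min_budget, max_budget, eta=3):
        self.eta = eta
        self.budgets = rung_budgets(min_budget, max_budget, eta)
        self.results = [dict() for _ in self.budgets]  # rung -> {trial_id: loss}

    def report(self, trial_id, rung, loss):
        """Record a trial's loss at a rung; return 'promote', 'stop', or 'done'."""
        self.results[rung][trial_id] = loss
        if rung == len(self.budgets) - 1:
            return "done"                    # top rung reached, nothing to promote to
        ranked = sorted(self.results[rung].values())
        if len(ranked) < self.eta:
            return "promote"                 # too few results to judge fairly
        cutoff = ranked[len(ranked) // self.eta - 1]
        # Keep only the top 1/eta of everything seen at this rung so far.
        return "promote" if loss <= cutoff else "stop"

ctrl = AshaController(min_budget=1, max_budget=9, eta=3)
for trial, loss in [("a", 0.9), ("b", 0.5), ("c", 0.7), ("d", 0.4)]:
    print(trial, ctrl.report(trial, rung=0, loss=loss))
```

Because each decision uses only the results recorded so far, early arrivals are judged leniently. That is the price of never waiting for a full rung to finish.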

Reproducibility & governance

A metric is worthless if you can't reproduce the run that produced it. Reproducibility means capturing everything that influences the result at run start, automatically, so "rerun X" is one click and "why did this metric move?" is answerable. The tracker pins it all to the immutable Run record.

| Dimension | What's pinned | How |
| --- | --- | --- |
| Code | Exact git commit + uncommitted diff. | SDK reads the git SHA at start; warns on and stores any dirty patch. |
| Data | Dataset version / snapshot id. | Content hash or dataset-version pointer (cross-ref the ML data pipeline / model registry). |
| Environment | Container image digest, lockfile, CUDA/driver versions. | Pin the image by sha256 digest, never a mutable tag. |
| Config | Full resolved hyperparameters (immutable Params). | Logged at run start; never mutated thereafter. |
| Hardware | GPU type/count and topology. | From the scheduler; affects throughput and sometimes numerics. |
| Randomness | All seeds (framework, dataloader, CUDA) + determinism flags. | Logged; note exact bitwise determinism on GPU is best-effort. |
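
A sketch of what the SDK might capture at run start, assuming a hypothetical `capture_provenance` helper. It shells out to standard git commands (`rev-parse HEAD`, `status --porcelain`, `diff HEAD`) and degrades to `None` when git or a repository is unavailable. The returned field names are illustrative; the first few mirror the provenance columns on Run.

```python
import platform
import subprocess
import sys

def _git(args, cwd=None):
    """Run a git command; return stripped stdout, or None if git/repo is unavailable."""
    try:
        out = subprocess.run(
            ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()

def capture_provenance(seed, data_version, container_image, cwd=None):
    """Collect the fields needed to rebuild a run (hypothetical record layout)."""
    if "@sha256:" not in container_image:
        # A tag such as ":latest" can be re-pointed later; require a digest.
        raise ValueError("container_image must be pinned by sha256 digest")
    status = _git(["status", "--porcelain"], cwd)
    dirty = bool(status)
    return {
        "git_commit": _git(["rev-parse", "HEAD"], cwd),
        "git_dirty": dirty,
        # Tracked changes only; untracked files are not part of `git diff HEAD`.
        "git_diff": _git(["diff", "HEAD"], cwd) if dirty else None,
        "data_version": data_version,
        "container_image": container_image,
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }

record = capture_provenance(
    seed=42,
    data_version="ds-v1",
    container_image="registry.example/train@sha256:" + "0" * 64,
)
print(sorted(record))
```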

The durability boundary

Params, artifacts, and lineage are the durable system of record: strongly consistent, immutable, audited. Raw metric points are best-effort telemetry: high-volume, compressed, downsampled, eventually tiered away. Different guarantees, different stores. That separation is what lets the same system be both trustworthy and cheap at ~10^5 points/s.

Bottlenecks & scaling

A tracker rarely dies on raw QPS — it degrades through metric write volume, query fan-out over many runs, high-cardinality series, and artifact cost. The mitigations keep the firehose flowing while keeping dashboards interactive.

| Bottleneck | Why it hurts | Mitigation |
| --- | --- | --- |
| Metric write volume | ~10^5+ points/s swamps a naive row-per-insert DB. | Batch on the client; ingest via a partitioned log; columnar TSDB with delta + XOR compression; shard by series. |
| Query latency over many runs | Comparing 100 runs × millions of points/series melts dashboards. | Server-side downsampling/roll-ups; pre-aggregate; resolution-by-viewport; cache immutable windows; cap series per chart. |
| High-cardinality series | Per-step, per-GPU, per-layer metrics explode series count → index blowup. | Curate/cap keys; separate system-metric family with shorter retention; tag instead of minting new series; reject runaway cardinality. |
| Artifact storage | Checkpoints (1-100 GB) × many runs → PB and real money. | Content-addressed dedup; tier hot → cold; TTL on non-promoted runs; presigned URLs + CDN/peer mesh for downloads. |
| Sweep orchestration | Thousands of trials thrash the controller + scheduler. | Async ASHA (no global barrier); persist search state; rate-limit suggestions; delegate placement to the GPU scheduler; backfill freed GPUs. |
| Hot run / write skew | A few giant runs (thousands of series) hotspot one partition. | Hash series across partitions; per-series shard key; isolate noisy tenants from shared infra. |
| Metadata contention | Frequent status/tag updates + heavy list/filter queries on one RDBMS. | Read replicas; index by experiment/tag/metric; split hot mutable status from immutable params; cache alias/resolve reads. |
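
The server-side downsampling mitigation can be sketched as a bucketed roll-up. The `downsample` function is illustrative: it reduces a series to at most `max_buckets` rows, sized to the chart's pixel width, and keeps min and max alongside the mean so that loss spikes survive the reduction. Integer steps are assumed.

```python
def downsample(points, max_buckets):
    """Reduce (step, value) points to at most max_buckets (step, min, mean, max) rows."""
    if max_buckets < 1:
        raise ValueError("max_buckets must be >= 1")
    if not points:
        return []
    points = sorted(points)
    first, last = points[0][0], points[-1][0]
    # Ceil-divide the step range so every point falls in one of max_buckets buckets.
    width = max(1, -(-(last - first + 1) // max_buckets))
    buckets = {}
    for step, value in points:
        b = (step - first) // width
        lo, total, hi, n = buckets.get(b, (value, 0.0, value, 0))
        buckets[b] = (min(lo, value), total + value, max(hi, value), n + 1)
    return [
        (first + b * width, lo, total / n, hi)
        for b, (lo, total, hi, n) in sorted(buckets.items())
    ]

curve = [(s, float(s)) for s in range(1000)]
rows = downsample(curve, max_buckets=10)
print(len(rows), rows[0])
```

Because finished runs are immutable, roll-ups like this can be computed once per resolution and cached rather than recomputed on every dashboard load.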

Summary — what a staff answer nails

Frame it as two systems behind one SDK: a write-optimized telemetry firehose (async batched client → partitioned log → compressed columnar TSDB at ~10^5 points/s, downsampled and tiered) and a read-optimized analytics + control plane (metadata RDBMS + artifact store powering compare/leaderboards and a search controller). Make the client non-blocking so the tracker never steals GPU time, and put the durability guarantee on params/artifacts/lineage while treating raw metrics as best-effort. For HPO, lead with ASHA/Hyperband early-stopping to concentrate GPUs on promising trials, add Bayesian suggestion for sample efficiency, and delegate placement to the GPU scheduler. Close on reproducibility (pin code, data, container digest, and seeds), because an experiment you can't reproduce isn't an experiment.