System Design Notes

AI / ML Infrastructure

Model Registry & Artifact Store

A model registry is the system of record for trained models: it answers “which exact bytes are running in prod, where did they come from, and how do I get them onto a thousand machines fast?” Two very different systems hide behind that one question — a small, strongly-consistent metadata store that versions models, tracks lineage, and moves promotion aliases; and a massive, immutable, content-addressed blob store whose hard problem is fanning a 200 GB artifact out to the entire serving fleet without melting the network. Get that split right and promotion, rollback, reproducibility, and deploy-time throughput all fall out of it.

Requirements

We are building the control plane for models — the catalog that sits between training and serving. Training pipelines write finished models into it; CI/CD and the serving fleet read from it to decide what to run and to pull the bytes. Training itself, experiment tracking, and the inference servers are adjacent systems, not in scope here.

| Functional | Non-functional |
| --- | --- |
| Register model versions + metadata: create a new immutable version of a model with framework, metrics, hyperparameters, tags, and the artifact digest. | High read throughput at deploy time: thousands of serving nodes resolve and pull the same model in a tight rollout window; the read/resolve path must not buckle. |
| Lineage: every version links back to the training run, the dataset version, and the exact code commit / container image that produced it. | Immutability & integrity: a version’s bytes never change; what you pull is verified by hash and is tamper-evident. Reproducibility depends on it. |
| Stage promotion: move a version through dev → staging → prod via named aliases (champion/challenger), independent of immutable version numbers. | Low-latency lookups: resolve alias → version → digest in single-digit milliseconds; this is on the critical path of every deploy. |
| Download artifacts: the serving fleet pulls the exact bytes for a version onto thousands of nodes (often via a presigned URL to a CDN/peer mesh). | Durability & availability: artifacts live on object storage with 11-nines durability; the resolve API is highly available (a stale-but-correct read beats an outage during a rollout). |
| Rollback: instantly re-point the prod alias to a previous known-good version; no rebuild, no re-upload, just a pointer flip. | Security & governance: authn/authz, full audit trail, approval gates on promotion, and signing/provenance so only vetted artifacts reach prod. |
| Search / list / compare: list versions of a model, filter by stage/metric, and diff metadata to pick a release candidate. | Cost efficiency: PB-scale artifacts demand dedup of shared weights and tiering of cold versions to cheap storage. |

The defining split

The metadata is tiny, mutable-by-pointer, and needs strong consistency (you must never read a half-promoted alias). The artifacts are huge, write-once, and need raw bandwidth. Treating them as one system is the classic mistake; the whole design is two stores joined by a content hash.

Scale & back-of-envelope

The numbers below show why metadata is the easy half. The catalog is at most about a gigabyte of rows; the pain is entirely in moving immutable gigabytes to many nodes at once.

| Dimension | Estimate | Implication |
| --- | --- | --- |
| Artifact size | tens of MB → 200+ GB | Small classifier ~10 MB; 7B fp16 ~14 GB; 70B ~140 GB; 405B ~810 GB. Forces chunking, ranged/resumable transfer, dedup. |
| Models × versions | ~5,000 × 50 ≈ 250k versions | Each metadata row is a few KB ⇒ the entire catalog is hundreds of MB to about 1 GB. A single RDBMS holds it comfortably. |
| Logical vs physical bytes | 250k × 5 GB ≈ 1.2 PB logical | Most versions share base weights; content-addressed dedup cuts physical storage 2–10×. |
| Deploy fan-out | 200 GB × 1,000 nodes = 200 TB / rollout | The headline problem. Naively served from one bucket, this takes hours and saturates the origin. |
| Per-node link | 25 Gbps ≈ 3.1 GB/s | One node pulls 200 GB in ~65 s. But 1,000 nodes wanting 3.1 GB/s each = ~3.1 TB/s aggregate; no single origin serves that. |
| Metadata QPS | writes: tens–hundreds/day; reads: bursty at rollout | Registrations/promotions are rare. Alias resolves spike during deploys but are highly cacheable: alias lookups behind a short TTL, immutable version reads indefinitely. |

Takeaway: a single origin pull is fine; the simultaneous pull is what hurts. If every node fetches from the object store at once, demand (~3.1 TB/s) dwarfs any bucket’s egress, so the rollout serializes behind the origin and drags on for hours. The fix is to serve the bytes once and let them replicate — CDN edge caching plus peer-to-peer distribution — bringing fleet-wide completion back down to a small multiple of a single-node download (minutes, not hours).
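The arithmetic behind the takeaway can be reproduced in a few lines. This is a rough sketch, not a capacity model: the 100 Gbps origin egress figure is an assumption chosen for illustration (the notes do not state one), and `fanout_estimate` is a hypothetical helper name.

```python
GB = 10**9  # decimal gigabyte, matching the units used in the table

def fanout_estimate(artifact_bytes, nodes, node_link_gbps, origin_egress_gbps):
    """Return (single_node_s, aggregate_demand_bytes_per_s, origin_bound_s)."""
    node_bw = node_link_gbps * 1e9 / 8        # bytes/s one node can receive
    origin_bw = origin_egress_gbps * 1e9 / 8  # bytes/s the origin can send (assumed)
    single_node_s = artifact_bytes / node_bw
    aggregate_demand = nodes * node_bw        # what the fleet asks for at once
    # If every node pulls from the origin, all N copies share the origin's egress.
    origin_bound_s = artifact_bytes * nodes / origin_bw
    return single_node_s, aggregate_demand, origin_bound_s

single_s, demand, origin_s = fanout_estimate(200 * GB, 1000, 25, 100)
print(f"single node:          {single_s:.0f} s")
print(f"aggregate demand:     {demand / 1e12:.3f} TB/s")
print(f"origin-bound rollout: {origin_s / 3600:.1f} h")
```

With the unrounded 3.125 GB/s link rate the single-node pull comes out at 64 s, in line with the ~65 s quoted above; under the assumed origin limit the all-at-once rollout lands in the multi-hour range.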

Core entities

Five entities carry the model. The crucial design choice: a ModelVersion is immutable and points at an Artifact by content hash, while a Stage/Alias is a movable pointer. Promotion and rollback are therefore just alias updates — cheap, atomic, and auditable — never re-uploads.

| Entity | Key fields | Notes |
| --- | --- | --- |
| Model | model_id, name, owner, task_type, description, created_at | Registry-level container. Mutable metadata (owner, docs). Names a family of versions. |
| ModelVersion | version_id, model_id, version_no, artifact_digest, run_id, framework, metrics, params, created_by | Immutable once registered. Holds the hash of its bytes and a link to its lineage. The unit you promote. |
| Artifact (blob) | digest (sha256, PK), size_bytes, media_type, layers[], storage_url | Content-addressed, write-once. A manifest of layer hashes so shared weights dedup across versions. |
| Run / Lineage | run_id, dataset_version, code_commit, container_image, hyperparams, started_at, finished_at | The provenance of a version: what produced these bytes. Enables reproducibility and audit. |
| Stage / Alias | alias (e.g. prod), model_id, current_version_id, updated_by, updated_at | A movable pointer. Promotion = set current_version_id; rollback = set it back. Atomic + audited. |

A compact relational sketch — small, strongly-consistent, RDBMS-shaped:

Model
  model_id PK, name UNIQUE, owner, task_type, description, created_at

ModelVersion                       -- immutable; one row per registered version
  version_id PK, model_id FK -> Model,
  version_no,                       -- monotonic per model (1,2,3, ...)
  artifact_digest -> Artifact.digest,
  run_id FK -> Run,
  framework, metrics JSON, params JSON,
  status, created_by, created_at
  UNIQUE (model_id, version_no)

Artifact                           -- content-addressed blob / manifest
  digest PK,                        -- sha256 of the manifest (the model's identity)
  size_bytes, media_type,
  layers JSON,                      -- [ sha256 of each shared chunk/file ]
  storage_url, created_at

Run                                -- training provenance
  run_id PK, dataset_version, code_commit,   -- git sha
  container_image,                           -- image digest
  hyperparams JSON, started_at, finished_at

Alias                              -- movable promotion pointer
  alias, model_id FK -> Model,     -- PK (model_id, alias)
  current_version_id FK -> ModelVersion,
  updated_by, updated_at
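To make the sketch concrete, here is a minimal runnable version using Python's built-in `sqlite3`. It is an illustration under assumptions, not the registry's actual schema: the Run table and most columns are omitted for brevity, the snake_case table names and the `resolve` helper are hypothetical, and a production deployment would use a server RDBMS rather than SQLite.

```python
import sqlite3

SCHEMA = """
CREATE TABLE model (model_id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE artifact (digest TEXT PRIMARY KEY, size_bytes INTEGER NOT NULL);
CREATE TABLE model_version (
    version_id INTEGER PRIMARY KEY,
    model_id INTEGER NOT NULL REFERENCES model(model_id),
    version_no INTEGER NOT NULL,
    artifact_digest TEXT NOT NULL REFERENCES artifact(digest),
    UNIQUE (model_id, version_no)
);
CREATE TABLE alias (
    model_id INTEGER NOT NULL REFERENCES model(model_id),
    alias TEXT NOT NULL,
    current_version_id INTEGER NOT NULL REFERENCES model_version(version_id),
    PRIMARY KEY (model_id, alias)
);
"""

def resolve(conn, model_name, alias):
    """alias -> version -> digest in one join; returns (version_no, digest) or None."""
    return conn.execute(
        """SELECT v.version_no, v.artifact_digest
           FROM alias a
           JOIN model m ON m.model_id = a.model_id
           JOIN model_version v ON v.version_id = a.current_version_id
           WHERE m.name = ? AND a.alias = ?""",
        (model_name, alias),
    ).fetchone()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO model VALUES (1, 'ranker')")
conn.execute("INSERT INTO artifact VALUES ('sha256:aaaa', 1000)")
conn.execute("INSERT INTO model_version VALUES (10, 1, 41, 'sha256:aaaa')")
conn.execute("INSERT INTO alias VALUES (1, 'prod', 10)")
print(resolve(conn, "ranker", "prod"))
```

The `UNIQUE (model_id, version_no)` constraint is what stops a version number from being reused, and the composite primary key on the alias table guarantees one target per (model, alias) pair.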

API design

Writes (register, promote) are rare and gated; reads (resolve, get, download URL) are hot and cacheable. The download path hands back a presigned/expiring URL so bytes never flow through the registry service itself.

Register an immutable version

POST /v1/models/{name}/versions
{
  "run_id":          "run_8f1c...",
  "artifact_digest": "sha256:9f3a...",      // bytes already uploaded to blob store
  "framework":       "pytorch",
  "metrics":         { "eval_acc": 0.921, "f1": 0.88 },
  "params":          { "lr": 3e-4, "epochs": 3 }
}

201 Created  → { "version": 42, "version_id": "mv_...", "status": "registered" }
409 Conflict → this digest is already registered (idempotent re-register)

Get a specific version (immutable ⇒ cache forever)

GET /v1/models/{name}/versions/42
200 OK → { version, artifact_digest, framework, metrics, run_id, created_at, ... }
Cache-Control: public, max-age=31536000, immutable

Resolve an alias / stage (hot read path at deploy)

GET /v1/models/{name}/aliases/prod
200 OK → { "alias": "prod", "version": 42, "artifact_digest": "sha256:9f3a..." }
Cache-Control: public, max-age=10        // short TTL; invalidated on promote
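The short-TTL-plus-invalidation behaviour described for alias resolves can be sketched as a small in-process cache. This is one possible shape, not the registry's implementation: `ResolveCache`, its `loader` callback, and the injectable `clock` are hypothetical names, and a real deployment would more likely rely on an edge cache or a shared cache tier.

```python
import time

class ResolveCache:
    """Caches alias lookups for a short TTL; promote calls invalidate() explicitly."""

    def __init__(self, loader, ttl_s=10.0, clock=time.monotonic):
        self._loader = loader    # function (model, alias) -> value from the metadata DB
        self._ttl = ttl_s
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._entries = {}       # (model, alias) -> (expires_at, value)

    def resolve(self, model, alias):
        key = (model, alias)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                      # fresh entry: no database read
        value = self._loader(model, alias)     # miss or expired: read through
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, model, alias):
        self._entries.pop((model, alias), None)

backing = {("ranker", "prod"): 41}   # stand-in for the metadata DB
calls = []

def load(model, alias):
    calls.append((model, alias))
    return backing[(model, alias)]

cache = ResolveCache(load, ttl_s=10.0)
cache.resolve("ranker", "prod")        # miss: goes to the loader
cache.resolve("ranker", "prod")        # hit: served from the cache
backing[("ranker", "prod")] = 42       # promotion lands in the metadata store
cache.invalidate("ranker", "prod")     # promote path drops the cached entry
print(cache.resolve("ranker", "prod"), len(calls))
```

Even without the explicit invalidation, the TTL bounds staleness to ten seconds, which is the trade the notes accept in exchange for keeping the resolve path available during a rollout.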

Promote / set stage or alias (gated + audited)

PUT /v1/models/{name}/aliases/prod
{ "version": 42, "reason": "passed canary; +1.2% f1" }

200 OK        → alias now points at 42  (atomic; previous target retained for rollback)
403 Forbidden → caller lacks promote permission, or approval gate not satisfied

List / compare versions

GET /v1/models/{name}/versions?stage=staging&sort=-metrics.f1&limit=50
200 OK → { "versions": [ ... ], "next_cursor": "..." }

Get a download URL for the artifact (presigned, edge-served)

GET /v1/artifacts/{digest}/download?ttl=600
200 OK → {
  "url":        "https://cdn.example.net/blobs/sha256/9f/3a/9f3a...",
  "scheme":     "p2p+https",            // fall back to https if no peer mesh
  "expires_at": "2026-06-16T00:10:00Z",
  "size_bytes": 214748364800
}
# Client verifies sha256 of the downloaded bytes == digest before loading.
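The client-side check mentioned in the comment above might look like the following. It is a minimal sketch: `verify_artifact` is a hypothetical helper, the `sha256:<hex>` digest format is taken from the examples in these notes, and the file is hashed in blocks so a 200 GB artifact never has to fit in memory.

```python
import hashlib
import hmac
import os
import tempfile

def verify_artifact(path, expected_digest, chunk_size=8 * 1024 * 1024):
    """Stream the file through sha256 and compare against 'sha256:<hex>'."""
    algo, _, expected_hex = expected_digest.partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algo!r}")
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(h.hexdigest(), expected_hex.lower())

# Demo on a small temporary file standing in for a downloaded artifact.
payload = b"model weights" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
digest = "sha256:" + hashlib.sha256(payload).hexdigest()
ok = verify_artifact(tmp.name, digest, chunk_size=4096)
bad = verify_artifact(tmp.name, "sha256:" + "0" * 64, chunk_size=4096)
os.unlink(tmp.name)
print(ok, bad)
```

A loader built on this would refuse to load the weights whenever the function returns False, which is the fail-closed behaviour the design calls for.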

Rollback is not a special endpoint — it is just PUT .../aliases/prod { "version": 41 }. Because every prior version’s bytes still exist immutably, recovery is a pointer flip measured in milliseconds.
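The pointer-flip semantics can be illustrated with an in-memory stand-in for the alias table. This is a sketch under assumptions: `AliasStore`, `AuditEntry`, and the optional `expected` guard (a compare-and-set against concurrent promoters) are illustrative names, and a real registry would do the same thing as a single-row update inside a database transaction.

```python
import threading
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AuditEntry:
    model: str
    alias: str
    old_version: Optional[int]
    new_version: int
    actor: str
    reason: str

class AliasStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._aliases = {}   # (model, alias) -> version
        self.audit = []      # append-only trail of every pointer move

    def get(self, model, alias):
        with self._lock:
            return self._aliases.get((model, alias))

    def set(self, model, alias, version, actor, reason, expected=None):
        """Atomically repoint the alias; refuse if it moved away from `expected`."""
        with self._lock:
            old = self._aliases.get((model, alias))
            if expected is not None and old != expected:
                raise RuntimeError(f"alias moved: expected {expected}, found {old}")
            self._aliases[(model, alias)] = version
            self.audit.append(AuditEntry(model, alias, old, version, actor, reason))
            return old

store = AliasStore()
store.set("ranker", "prod", 41, "ci", "initial release")
store.set("ranker", "prod", 42, "ci", "passed canary", expected=41)
store.set("ranker", "prod", 41, "oncall", "rollback: latency regression", expected=42)
print(store.get("ranker", "prod"), len(store.audit))
```

Promotion and rollback go through the same call, so the audit trail records both with the previous target attached, and no artifact bytes move in either direction.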

High-level design

Producers (training jobs, CI) write metadata + bytes; the registry service owns the strongly-consistent metadata DB and issues presigned URLs into the blob store. The serving fleet resolves an alias, then pulls bytes not from the registry but from a CDN + peer-distribution layer fronting the blob store — that is what survives the deploy fan-out.

flowchart LR
    TR["Training Jobs (register)"] --> REG["Registry Service (API)"]
    CI["CI / Promotion Pipeline"] --> REG
    REG --> MDB[("Metadata DB (versions, lineage, aliases)")]
    REG --> CACHE[("Resolve Cache")]
    REG --> BLOB["Blob / Object Store (write-once)"]
    BLOB --> CDN["CDN / Edge Cache"]
    BLOB --> P2P["Peer Distribution (Dragonfly / Kraken)"]
    CDN --> FLEET["Serving Fleet (1000s of nodes)"]
    P2P --> FLEET
    FLEET -->|resolve alias| REG
      

Deep dive: artifact storage, dedup & the fan-out problem

Artifacts are stored content-addressed: the key is the sha256 of the bytes. This gives three properties for free — immutability (the name is the content), integrity (the client re-hashes and verifies on pull), and dedup (identical bytes are stored once, no matter how many versions reference them).

Content-addressed dedup via layered manifests

A model is rarely one monolithic blob; it is a manifest of layers (base weights, adapters, tokenizer, config), each addressed by its own hash. Fine-tunes and re-exports share the heavy base layer, so a new version often costs only the bytes that actually changed:

# Manifest for ModelVersion 42  (sha256:9f3a...)
sha256:9f3a...  model-v42.manifest
  ├─ sha256:aaaa...  base_weights.safetensors   14.0 GB   ← shared by v40–v44
  ├─ sha256:bbbb...  lora_adapter.safetensors    120 MB   ← unique to v42
  ├─ sha256:cccc...  tokenizer.json              2.1 MB   ← shared
  └─ sha256:dddd...  config.json                   4 KB

# Registering v43 that only swaps the adapter stores ~120 MB, not 14 GB.
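The dedup accounting in the manifest above can be checked with a short sketch. Layer sizes are taken from the example in decimal units; the `sha256:eeee` adapter digest for v43 and the assumption that v43 reuses the same tokenizer and config are illustrative, as are the helper names.

```python
def new_bytes_for(manifest, stored):
    """Bytes a registration must upload: layers whose digest is not stored yet."""
    return sum(size for digest, size in manifest.items() if digest not in stored)

def register(manifest, stored):
    """Record the manifest's layers in the blob index and return bytes uploaded."""
    uploaded = new_bytes_for(manifest, stored)
    stored.update(manifest)   # content-addressed: same digest, same bytes, stored once
    return uploaded

v42 = {
    "sha256:aaaa": 14_000_000_000,  # base_weights.safetensors
    "sha256:bbbb": 120_000_000,     # lora_adapter.safetensors (v42)
    "sha256:cccc": 2_100_000,       # tokenizer.json
    "sha256:dddd": 4_000,           # config.json
}
# v43 swaps only the adapter (hypothetical digest) and reuses everything else.
v43 = {**v42}
del v43["sha256:bbbb"]
v43["sha256:eeee"] = 120_000_000

blob_index = {}                     # digest -> size, stand-in for the blob store
first = register(v42, blob_index)
second = register(v43, blob_index)
print(first, second)
```

The second registration uploads only the new adapter, and the logical size of the two versions (about 28 GB) is backed by roughly 14.2 GB of physical storage.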

Because layers are immutable, a garbage collector can safely reclaim any blob no longer referenced by a manifest (mark-and-sweep over Artifact.layers), and cold versions can be tiered to cheaper/colder storage classes.
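A mark-and-sweep pass over the manifests might be sketched as follows. The `protected` set is an added assumption that the notes do not mention: without some grace mechanism, a collector could delete a layer that was just uploaded but whose manifest has not been registered yet.

```python
def mark(manifests):
    """Collect every digest reachable from a registered manifest."""
    live = set()
    for manifest_digest, layers in manifests.items():
        live.add(manifest_digest)
        live.update(layers)
    return live

def sweep(stored, live, protected=frozenset()):
    """Return digests that are safe to delete: stored, unreferenced, not protected."""
    return sorted(d for d in stored if d not in live and d not in protected)

manifests = {
    "sha256:m42": ["sha256:aaaa", "sha256:bbbb"],
    "sha256:m43": ["sha256:aaaa", "sha256:eeee"],
}
stored = {
    "sha256:m42", "sha256:m43",
    "sha256:aaaa", "sha256:bbbb", "sha256:eeee",
    "sha256:old1",   # layer of a version that was deleted earlier
    "sha256:new1",   # uploaded moments ago, manifest not registered yet
}
garbage = sweep(stored, mark(manifests), protected={"sha256:new1"})
print(garbage)
```

Shared layers such as `sha256:aaaa` stay alive for as long as any manifest references them, which is why deleting one version never endangers its siblings.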

The deploy fan-out problem

Promoting a 200 GB model and rolling it to 1,000 nodes is 200 TB of egress in one window. Served from a single origin it is hopeless; the answer is to serve one copy and let it replicate. Edge seeders/super-peers warm from the origin, then nodes share chunks with each other so origin egress stays near one copy while fleet completion approaches a small multiple of a single-node download:

flowchart LR
    ORIG["Origin Object Store (serves ~1 copy)"] --> SEED["Edge Seeders / Super-peers"]
    SEED --> N1["Node 1"]
    SEED --> N2["Node 2"]
    N1 --> N3["Node 3"]
    N1 --> N4["Node 4"]
    N2 --> N5["Node 5"]
    N2 --> N6["Node 6"]
    N3 --> N7["Node 7"]
    N4 --> N8["Node 8"]
      
| Strategy | How it works | Trade-off |
| --- | --- | --- |
| Direct from object store | Each node fetches the blob over HTTPS. | Simple, but origin egress is the wall; 1,000 concurrent pulls serialize behind it ⇒ hours. |
| CDN / edge cache | First pull warms an edge POP; the rest hit the cache. | Great for many small/medium artifacts; still O(N) edge egress for one giant blob within a rack/zone. |
| P2P (Dragonfly / Kraken) | BitTorrent-style chunk swarm; nodes seed to each other; origin serves ~1 copy. | Best fleet-wide throughput; adds a peer agent + tracker and intra-cluster traffic to manage. |
| Lazy / streaming load | Pull only the chunks needed to start (lazy-pull images, mmap, demand paging of weights). | Slashes time-to-ready; risk of stalls on cold reads if the network is slow mid-inference. |
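Why peer distribution finishes in a small multiple of one download can be seen with a deliberately simple model: in each round, every machine that already holds the full file uploads one copy to a machine that does not. This is an idealization, not how Dragonfly or Kraken schedule transfers; real swarms share chunks before a download completes, so this whole-file model is, if anything, pessimistic. The function name and the 64 s per-round figure (200 GB at 25 Gbps) are assumptions for illustration.

```python
def doubling_rounds(nodes, seeders=1):
    """Rounds until all `nodes` hold the file if each holder uploads one copy per round."""
    holders = seeders   # warm edge seeders plus nodes that have finished
    done = 0            # serving nodes that hold the full file
    rounds = 0
    while done < nodes:
        newly = min(holders, nodes - done)   # each holder serves one new node
        done += newly
        holders += newly
        rounds += 1
    return rounds

SINGLE_NODE_SECONDS = 64   # one full-speed pull of 200 GB over a 25 Gbps link
for seeders in (1, 4):
    r = doubling_rounds(1000, seeders=seeders)
    print(f"{seeders} seeder(s): {r} rounds, about {r * SINGLE_NODE_SECONDS / 60:.0f} min")
```

Because the number of holders roughly doubles every round, the round count grows with the logarithm of the fleet size, so going from 1,000 to 2,000 nodes adds about one more round rather than doubling the rollout time.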

Why immutability makes fan-out tractable

Every distribution trick above — CDN caching, P2P swarming, lazy paging — relies on bytes never changing under a hash. A chunk fetched from any peer can be trusted after a local sha256 check, so caches can be shared globally and a corrupt/poisoned chunk is detected and refetched rather than served.
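The trust model described here, where any peer is acceptable because the hash is checked locally, can be sketched as a fetch loop that falls through to the next source on a mismatch. The function and the callable-based source list are hypothetical; a real peer agent would add timeouts, peer scoring, and an origin fallback.

```python
import hashlib

def fetch_verified(chunk_digest, sources):
    """Try sources in order; return the first payload whose sha256 hex matches."""
    for fetch in sources:
        data = fetch(chunk_digest)
        if data is not None and hashlib.sha256(data).hexdigest() == chunk_digest:
            return data          # verified locally, so the source need not be trusted
    raise IOError(f"no source returned valid bytes for {chunk_digest}")

good = b"weights-chunk-0"
wanted = hashlib.sha256(good).hexdigest()

def poisoned_peer(digest):
    return b"weights-chunk-0-tampered"   # wrong bytes under the right name

def missing_peer(digest):
    return None                          # peer does not have this chunk

def healthy_peer(digest):
    return good

data = fetch_verified(wanted, [poisoned_peer, missing_peer, healthy_peer])
print(data == good)
```

The poisoned response is discarded rather than served, which is what lets caches and peers be shared across trust boundaries.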

Deep dive: lineage, reproducibility & governance

A registry that only stores bytes is a liability; auditors and on-callers need to answer “exactly what is in prod, and can we rebuild it?” Lineage threads each version back to the run that made it, and that run back to its dataset version, code commit, and container image — a fully pinned recipe. Governance then guards the path from a registered version to prod.
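A fail-closed promotion gate along these lines might be sketched as below. Two caveats apply: HMAC with a shared key is used only as a standard-library stand-in for signing (a real deployment would more plausibly use asymmetric signatures), and the record fields, key, and function names are hypothetical.

```python
import hashlib
import hmac
import json

def sign_manifest(manifest_bytes, key):
    """Stand-in signature: HMAC-SHA256 over the manifest bytes."""
    return hmac.new(key, manifest_bytes, hashlib.sha256).hexdigest()

def may_promote(manifest_bytes, expected_digest, signature, key,
                approvals, required_approvals=1):
    """Fail closed: digest, signature, and approval count must all check out."""
    digest_ok = hmac.compare_digest(
        hashlib.sha256(manifest_bytes).hexdigest(), expected_digest)
    sig_ok = hmac.compare_digest(sign_manifest(manifest_bytes, key), signature)
    return digest_ok and sig_ok and len(set(approvals)) >= required_approvals

# The signed record pins the lineage: dataset, code commit, and container image.
record = {"model": "ranker", "version": 42, "dataset_version": "ds-v7",
          "code_commit": "0123abc", "container_image": "sha256:feed"}
manifest = json.dumps(record, sort_keys=True).encode()
key = b"demo-signing-key"
digest = hashlib.sha256(manifest).hexdigest()
signature = sign_manifest(manifest, key)

approved = may_promote(manifest, digest, signature, key, approvals=["alice"])
tampered = may_promote(manifest + b" ", digest, signature, key, approvals=["alice"])
print(approved, tampered)
```

Serializing with `sort_keys=True` keeps the bytes stable for a given record, so the digest and signature do not depend on dictionary ordering.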

flowchart LR
    DS["Dataset vN"] --> RUN["Training Run"]
    CODE["Code commit (git sha)"] --> RUN
    IMG["Container image digest"] --> RUN
    HP["Hyperparams"] --> RUN
    RUN --> MV["ModelVersion 42"]
    MV --> SIGN["Signed manifest (provenance)"]
    SIGN --> GATE{"Approval gate"}
    GATE -->|approved| PROD["alias: prod"]
    GATE -->|rejected| STG["stays in staging"]
      

Bottlenecks & scaling

A model registry almost never falls over on metadata QPS — the catalog is tiny. It degrades on bytes in motion and on a few hot keys during rollouts. The mitigations keep the consistent metadata core while pushing the heavy, immutable data out to caches and peers.

| Bottleneck | Why it hurts | Mitigation |
| --- | --- | --- |
| Deploy fan-out bandwidth | 200 GB × 1,000 nodes = 200 TB from one origin ⇒ the rollout serializes for hours. | CDN edge caching + P2P swarm (Dragonfly/Kraken) so origin serves ~1 copy; dedup so only changed layers move; lazy/streaming load to start before full pull. |
| Metadata hot key | Every node resolves the same prod alias in a tight window ⇒ read hotspot. | Short-TTL edge/cache on resolve + invalidate-on-promote; immutable version reads cache forever; read replicas for the metadata DB. |
| Large blobs | Single 200 GB transfers stall on any blip; one slow byte kills the whole pull. | Chunk + content-address each layer; multipart, ranged, resumable transfer; parallel chunk fetch; verify per-chunk hashes. |
| Integrity / tampering | A corrupt or poisoned artifact reaching prod is catastrophic and silent. | Immutable write-once store; sha256 verify on pull; signing + provenance; fail-closed if signature/digest mismatches. |
| Thundering herd on promote | Flipping the alias makes the whole fleet stampede for the new (cold) bytes at once. | Pre-warm CDN/peer mesh and seeders before the flip; staged/canary rollout; request coalescing at the edge. |
| Storage growth & cost | 250k versions × multi-GB ⇒ petabytes, much of it cold and redundant. | Layer dedup; mark-and-sweep GC of unreferenced blobs; tier cold versions to cheaper classes; retention policies. |
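For the large-blob row, the ranged and resumable transfer can be sketched as a planner that turns an artifact size into HTTP `Range` header values and skips parts that already landed. The function name and the 64 MiB part size are assumptions; the 214748364800-byte size is the one shown in the download response earlier.

```python
def plan_ranges(size_bytes, part_size, completed=frozenset()):
    """Return [(part_index, 'bytes=start-end')] for parts not yet downloaded."""
    if size_bytes < 0 or part_size <= 0:
        raise ValueError("size_bytes must be >= 0 and part_size > 0")
    plan = []
    for index, start in enumerate(range(0, size_bytes, part_size)):
        if index in completed:
            continue                                   # resume: skip finished parts
        end = min(start + part_size, size_bytes) - 1   # HTTP ranges are inclusive
        plan.append((index, f"bytes={start}-{end}"))
    return plan

small = plan_ranges(10, 4)
resumed = plan_ranges(10, 4, completed={0, 2})
big = plan_ranges(214748364800, 64 * 1024 * 1024)
print(small)
print(resumed)
print(len(big), big[-1][1])
```

Each planned part can be fetched in parallel and verified against its own chunk hash, so a network blip costs one part rather than the whole 200 GB pull.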

Summary — what a staff answer nails

Frame it as two systems joined by a content hash: a small, strongly-consistent metadata store that versions models, records lineage (run → dataset → code → image), and moves promotion via movable aliases so rollback is a pointer flip; and a huge, immutable, content-addressed blob store that gets integrity, immutability, and dedup for free. Keep the registry off the data path (presigned URLs) and solve the real problem — the deploy fan-out — with CDN + peer-to-peer distribution, lazy loading, and pre-warming, so a 200 GB model reaches a thousand nodes in minutes instead of hours. Guard prod with approval gates, audit, and signed provenance.