AI / ML Infrastructure

Model Registry & Artifact Store

A model registry is the system of record for trained models: it answers “which exact bytes are running in prod, where did they come from, and how do I get them onto a thousand machines fast?” Two very different systems hide behind that one question — a small, strongly-consistent metadata store that versions models, tracks lineage, and moves promotion aliases; and a massive, immutable, content-addressed blob store whose hard problem is fanning a 200 GB artifact out to the entire serving fleet without melting the network. Get that split right and promotion, rollback, reproducibility, and deploy-time throughput all fall out of it.

Requirements

We are building the control plane for models — the catalog that sits between training and serving. Training pipelines write finished models into it; CI/CD and the serving fleet read from it to decide what to run and to pull the bytes. Training itself, experiment tracking, and the inference servers are adjacent systems, not in scope here.

Functional	Non-functional
Register model versions + metadata — create a new immutable version of a model with framework, metrics, hyperparameters, tags, and the artifact digest.	High read throughput at deploy time — thousands of serving nodes resolve and pull the same model in a tight rollout window; the read/resolve path must not buckle.
Lineage — every version links back to the training run, the dataset version, and the exact code commit / container image that produced it.	Immutability & integrity — a version’s bytes never change; what you pull is verified by hash and is tamper-evident. Reproducibility depends on it.
Stage promotion — move a version through `dev` → `staging` → `prod` via named aliases (champion/challenger), independent of immutable version numbers.	Low-latency lookups — resolve `alias → version → digest` in single-digit milliseconds; this is on the critical path of every deploy.
Download artifacts — the serving fleet pulls the exact bytes for a version onto thousands of nodes (often a presigned URL to a CDN/peer mesh).	Durability & availability — artifacts on 11-nines object storage; the resolve API is highly available (a stale-but-correct read beats an outage during a rollout).
Rollback — instantly re-point the `prod` alias to a previous known-good version; no rebuild, no re-upload, just a pointer flip.	Security & governance — authn/authz, full audit trail, approval gates on promotion, and signing/provenance so only vetted artifacts reach prod.
Search / list / compare — list versions of a model, filter by stage/metric, and diff metadata to pick a release candidate.	Cost efficiency — PB-scale artifacts demand dedup of shared weights and tiering of cold versions to cheap storage.

The defining split

The metadata is tiny, mutable-by-pointer, and needs strong consistency (you must never read a half-promoted alias). The artifacts are huge, write-once, and need raw bandwidth. Treating them as one system is the classic mistake; the whole design is two stores joined by a content hash.

Scale & back-of-envelope

The numbers below show why metadata is the easy half. The catalog is a few hundred thousand small rows; the pain is entirely in moving hundreds of immutable gigabytes to many nodes at once.

Dimension	Estimate	Implication
Artifact size	tens of MB → hundreds of GB	Small classifier ~10 MB; at fp16 (2 bytes per parameter), 7B ~14 GB, 70B ~140 GB, 405B ~810 GB. Forces chunking, ranged/resumable transfer, dedup.
Models × versions	~5,000 × 50 ≈ 250k versions	Each metadata row is a few KB ⇒ the entire catalog is on the order of 1 GB (250k × ~4 KB). A single RDBMS holds it comfortably.
Logical vs physical bytes	250k × 5 GB ≈ 1.2 PB logical	Most versions share base weights; content-addressed dedup cuts physical storage 2–10×.
Deploy fan-out	200 GB × 1,000 nodes = 200 TB / rollout	The headline problem. Naively served from one bucket, this takes hours and saturates the origin.
Per-node link	25 Gbps ≈ 3.1 GB/s	One node pulls 200 GB in ~65 s. But 1,000 nodes wanting 3.1 GB/s each = ~3.1 TB/s aggregate — no single origin serves that.
Metadata QPS	writes: tens–hundreds/day; reads: bursty at rollout	Registrations/promotions are rare. Alias resolves spike during deploys but cache well: version reads are immutable, and alias reads tolerate a short TTL.

Takeaway: a single origin pull is fine; the simultaneous pull is what hurts. If every node fetches from the object store at once, demand (~3.1 TB/s) dwarfs any bucket’s egress, so the rollout serializes behind the origin and drags on for hours. The fix is to serve the bytes once and let them replicate — CDN edge caching plus peer-to-peer distribution — bringing fleet-wide completion back down to a small multiple of a single-node download (minutes, not hours).
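The arithmetic behind that takeaway can be reproduced in a few lines. This is a back-of-envelope sketch, not a capacity model: the artifact size, node count, and per-node link come from the table above, while `ORIGIN_EGRESS_GBPS` is a hypothetical figure picked only to make the "hours" claim concrete.

```python
# Back-of-envelope fan-out arithmetic.
# ORIGIN_EGRESS_GBPS is a hypothetical figure for illustration only;
# the other inputs are the estimates from the table above.
ARTIFACT_GB = 200
NODES = 1_000
NODE_LINK_GBPS = 25
ORIGIN_EGRESS_GBPS = 100  # assumption, not a measured bucket limit


def gb_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


# One node pulling alone is limited only by its own link.
single_node_s = ARTIFACT_GB / gb_per_s(NODE_LINK_GBPS)

# Every node pulling at line rate at the same moment.
aggregate_demand_tb_per_s = NODES * gb_per_s(NODE_LINK_GBPS) / 1_000

# Total bytes that must leave the origin if nothing is cached or shared.
total_egress_tb = ARTIFACT_GB * NODES / 1_000

# Rollout time when the origin's egress is the only bottleneck.
origin_bound_hours = (ARTIFACT_GB * NODES) / gb_per_s(ORIGIN_EGRESS_GBPS) / 3_600

print(f"single node pull:  {single_node_s:.0f} s")
print(f"aggregate demand:  {aggregate_demand_tb_per_s:.3f} TB/s")
print(f"total egress:      {total_egress_tb:.0f} TB")
print(f"origin-bound time: {origin_bound_hours:.1f} h")
```

Under the assumed 100 Gbps origin, the same 200 TB that one node could pull in about a minute takes roughly four and a half hours fleet-wide, which is the gap that caching and peer distribution have to close.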

Core entities

Five entities carry the model. The crucial design choice: a ModelVersion is immutable and points at an Artifact by content hash, while a Stage/Alias is a movable pointer. Promotion and rollback are therefore just alias updates — cheap, atomic, and auditable — never re-uploads.

Entity	Key fields	Notes
Model	`model_id`, `name`, `owner`, `task_type`, `description`, `created_at`	Registry-level container. Mutable metadata (owner, docs). Names a family of versions.
ModelVersion	`version_id`, `model_id`, `version_no`, `artifact_digest`, `run_id`, `framework`, `metrics`, `params`, `created_by`	Immutable once registered. Holds the hash of its bytes and a link to its lineage. The unit you promote.
Artifact (blob)	`digest` (sha256, PK), `size_bytes`, `media_type`, `layers[]`, `storage_url`	Content-addressed, write-once. A manifest of layer hashes so shared weights dedup across versions.
Run / Lineage	`run_id`, `dataset_version`, `code_commit`, `container_image`, `hyperparams`, `started_at`, `finished_at`	The provenance of a version: what produced these bytes. Enables reproducibility and audit.
Stage / Alias	`alias` (e.g. `prod`), `model_id`, `current_version_id`, `updated_by`, `updated_at`	A movable pointer. Promotion = set `current_version_id`; rollback = set it back. Atomic + audited.

A compact relational sketch — small, strongly-consistent, RDBMS-shaped:

Model
  model_id PK, name UNIQUE, owner, task_type, description, created_at

ModelVersion                       -- immutable; one row per registered version
  version_id PK, model_id FK -> Model,
  version_no,                       -- monotonic per model (1,2,3, ...)
  artifact_digest -> Artifact.digest,
  run_id FK -> Run,
  framework, metrics JSON, params JSON,
  status, created_by, created_at
  UNIQUE (model_id, version_no)

Artifact                           -- content-addressed blob / manifest
  digest PK,                        -- sha256 of the manifest (the model's identity)
  size_bytes, media_type,
  layers JSON,                      -- [ sha256 of each shared chunk/file ]
  storage_url, created_at

Run                                -- training provenance
  run_id PK, dataset_version, code_commit,   -- git sha
  container_image,                           -- image digest
  hyperparams JSON, started_at, finished_at

Alias                              -- movable promotion pointer
  alias, model_id FK -> Model,     -- PK (model_id, alias)
  current_version_id FK -> ModelVersion,
  updated_by, updated_at
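To make the "immutable version, movable alias" split concrete, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for whichever RDBMS backs the registry. The tables are trimmed to the columns needed for the demonstration, and the model name `ranker`, the actors, and the placeholder digests are all hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE model_version (
    version_id      INTEGER PRIMARY KEY,
    model_id        TEXT    NOT NULL,
    version_no      INTEGER NOT NULL,
    artifact_digest TEXT    NOT NULL,
    UNIQUE (model_id, version_no)
);
-- Versions are immutable: reject any UPDATE after registration.
CREATE TRIGGER model_version_immutable
BEFORE UPDATE ON model_version
BEGIN
    SELECT RAISE(ABORT, 'model_version rows are immutable');
END;
CREATE TABLE alias (
    model_id           TEXT    NOT NULL,
    alias              TEXT    NOT NULL,
    current_version_id INTEGER NOT NULL REFERENCES model_version(version_id),
    updated_by         TEXT    NOT NULL,
    PRIMARY KEY (model_id, alias)
);
"""


def set_alias(conn, model_id, alias, version_no, actor):
    """Point an alias at a version; promotion and rollback are the same write."""
    with conn:  # commits the pointer write, or rolls it back on error
        row = conn.execute(
            "SELECT version_id FROM model_version "
            "WHERE model_id = ? AND version_no = ?",
            (model_id, version_no),
        ).fetchone()
        if row is None:
            raise LookupError(f"{model_id} has no version {version_no}")
        conn.execute(
            "INSERT OR REPLACE INTO alias VALUES (?, ?, ?, ?)",
            (model_id, alias, row[0], actor),
        )


def resolve(conn, model_id, alias):
    """alias -> (version_no, artifact_digest): the hot read path."""
    return conn.execute(
        "SELECT v.version_no, v.artifact_digest FROM alias a "
        "JOIN model_version v ON v.version_id = a.current_version_id "
        "WHERE a.model_id = ? AND a.alias = ?",
        (model_id, alias),
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
with conn:
    conn.executemany(
        "INSERT INTO model_version (model_id, version_no, artifact_digest) "
        "VALUES (?, ?, ?)",
        [("ranker", 41, "sha256:placeholder-41"),
         ("ranker", 42, "sha256:placeholder-42")],
    )

set_alias(conn, "ranker", "prod", 42, "ci-bot")   # promote
promoted = resolve(conn, "ranker", "prod")
set_alias(conn, "ranker", "prod", 41, "on-call")  # rollback: the same write
rolled_back = resolve(conn, "ranker", "prod")
print(promoted, rolled_back)
```

Neither call touches the version rows or the blob store; only the single alias row changes, which is why promotion and rollback stay cheap.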

API design

Writes (register, promote) are rare and gated; reads (resolve, get, download URL) are hot and cacheable. The download path hands back a presigned/expiring URL so bytes never flow through the registry service itself.

Register an immutable version

POST /v1/models/{name}/versions
{
  "run_id":          "run_8f1c...",
  "artifact_digest": "sha256:9f3a...",      // bytes already uploaded to blob store
  "framework":       "pytorch",
  "metrics":         { "eval_acc": 0.921, "f1": 0.88 },
  "params":          { "lr": 3e-4, "epochs": 3 }
}

201 Created  → { "version": 42, "version_id": "mv_...", "status": "registered" }
409 Conflict → this digest is already registered (idempotent re-register)

Get a specific version (immutable ⇒ cache forever)

GET /v1/models/{name}/versions/42
200 OK → { version, artifact_digest, framework, metrics, run_id, created_at, ... }
Cache-Control: public, max-age=31536000, immutable

Resolve an alias / stage (hot read path at deploy)

GET /v1/models/{name}/aliases/prod
200 OK → { "alias": "prod", "version": 42, "artifact_digest": "sha256:9f3a..." }
Cache-Control: public, max-age=10        // short TTL; invalidated on promote

Promote / set stage or alias (gated + audited)

PUT /v1/models/{name}/aliases/prod
{ "version": 42, "reason": "passed canary; +1.2% f1" }

200 OK        → alias now points at 42  (atomic; previous target retained for rollback)
403 Forbidden → caller lacks promote permission, or approval gate not satisfied

List / compare versions

GET /v1/models/{name}/versions?stage=staging&sort=-metrics.f1&limit=50
200 OK → { "versions": [ ... ], "next_cursor": "..." }

Get a download URL for the artifact (presigned, edge-served)

GET /v1/artifacts/{digest}/download?ttl=600
200 OK → {
  "url":        "https://cdn.example.net/blobs/sha256/9f/3a/9f3a...",
  "scheme":     "p2p+https",            // fall back to https if no peer mesh
  "expires_at": "2026-06-16T00:10:00Z",
  "size_bytes": 214748364800
}
# Client verifies sha256 of the downloaded bytes == digest before loading.
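The client-side check mentioned in the last comment could look roughly like the sketch below. It assumes digests are formatted as `sha256:<hex>`, as in the examples above; the function name `verify_digest` and the sample payload are illustrative, not part of any real client library.

```python
import hashlib
import hmac
import io
from typing import BinaryIO


def verify_digest(stream: BinaryIO, expected: str, chunk_size: int = 1 << 20) -> bool:
    """Hash a stream in fixed-size chunks and compare against 'sha256:<hex>'."""
    algorithm, _, expected_hex = expected.partition(":")
    if algorithm != "sha256" or not expected_hex:
        raise ValueError(f"unsupported digest format: {expected!r}")
    hasher = hashlib.sha256()
    # Read incrementally so a 200 GB artifact never has to fit in memory.
    while chunk := stream.read(chunk_size):
        hasher.update(chunk)
    return hmac.compare_digest(hasher.hexdigest(), expected_hex.lower())


payload = b"pretend these are model weights"
digest = "sha256:" + hashlib.sha256(payload).hexdigest()

assert verify_digest(io.BytesIO(payload), digest)             # intact bytes pass
assert not verify_digest(io.BytesIO(payload + b"!"), digest)  # any change fails
```

In practice the stream would be the downloaded file opened in binary mode, and a loader would refuse to proceed when the function returns `False`.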

Rollback is not a special endpoint — it is just PUT .../aliases/prod { "version": 41 }. Because every prior version’s bytes still exist immutably, recovery is a pointer flip measured in milliseconds.

High-level design

Producers (training jobs, CI) write metadata + bytes; the registry service owns the strongly-consistent metadata DB and issues presigned URLs into the blob store. The serving fleet resolves an alias, then pulls bytes not from the registry but from a CDN + peer-distribution layer fronting the blob store — that is what survives the deploy fan-out.

flowchart LR
    TR["Training Jobs (register)"] --> REG["Registry Service (API)"]
    CI["CI / Promotion Pipeline"] --> REG
    REG --> MDB[("Metadata DB (versions, lineage, aliases)")]
    REG --> CACHE[("Resolve Cache")]
    REG --> BLOB["Blob / Object Store (write-once)"]
    BLOB --> CDN["CDN / Edge Cache"]
    BLOB --> P2P["Peer Distribution (Dragonfly / Kraken)"]
    CDN --> FLEET["Serving Fleet (1000s of nodes)"]
    P2P --> FLEET
    FLEET -->|resolve alias| REG

Two stores, one hash. The metadata DB is small and consistent; the blob store is huge and immutable. A ModelVersion row glues them by artifact_digest, so the catalog can be replicated and cached freely while the bytes live wherever is cheapest.
The registry is off the data path. Bytes flow blob store → CDN/peer mesh → client. The service only resolves metadata and mints presigned URLs, so a 200 GB pull never touches application servers.
Resolve is cacheable. Version metadata is immutable (cache forever); alias resolution uses a short TTL plus invalidate-on-promote, so a rollout’s read storm is absorbed at the edge.
Upload before register. Producers push bytes to the blob store first (content-addressed), then register the version referencing the digest — registration is metadata-only and fast, and re-registering the same digest is idempotent.
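The "short TTL plus invalidate-on-promote" behavior can be sketched as a small in-process cache. This is an illustration under assumptions: the class name `ResolveCache`, the loader callback, and the injected clock are hypothetical, and a real deployment would more likely sit at an edge or shared cache than inside one process.

```python
import time
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]     # (model name, alias)
Target = Tuple[int, str]  # (version number, artifact digest)


class ResolveCache:
    """Caches alias lookups for a short TTL; promote() callers invalidate."""

    def __init__(self, load: Callable[[str, str], Target], ttl_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[Key, Tuple[float, Target]] = {}

    def resolve(self, model: str, alias: str) -> Target:
        key = (model, alias)
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and entry[0] > now:
            return entry[1]                   # fresh hit: no database read
        target = self._load(model, alias)     # miss or expired: go to the DB
        self._entries[key] = (now + self._ttl_s, target)
        return target

    def invalidate(self, model: str, alias: str) -> None:
        """Called right after a promote so the next read sees the new target."""
        self._entries.pop((model, alias), None)


# Demonstration with a fake metadata DB and a hand-driven clock.
backing = {("ranker", "prod"): (41, "sha256:placeholder-41")}
stats = {"db_reads": 0}
now = [0.0]


def load_from_db(model: str, alias: str) -> Target:
    stats["db_reads"] += 1
    return backing[(model, alias)]


cache = ResolveCache(load_from_db, ttl_s=10.0, clock=lambda: now[0])
first = cache.resolve("ranker", "prod")
second = cache.resolve("ranker", "prod")   # served from cache

backing[("ranker", "prod")] = (42, "sha256:placeholder-42")  # promote
cache.invalidate("ranker", "prod")
after_promote = cache.resolve("ranker", "prod")
```

The TTL bounds staleness for any cache node that misses the invalidation, while the explicit invalidate keeps the common case immediate.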

Deep dive: artifact storage, dedup & the fan-out problem

Artifacts are stored content-addressed: the key is the sha256 of the bytes. This gives three properties for free — immutability (the name is the content), integrity (the client re-hashes and verifies on pull), and dedup (identical bytes are stored once, no matter how many versions reference them).

Content-addressed dedup via layered manifests

A model is rarely one monolithic blob; it is a manifest of layers (base weights, adapters, tokenizer, config), each addressed by its own hash. Fine-tunes and re-exports share the heavy base layer, so a new version often costs only the bytes that actually changed:

# Manifest for ModelVersion 42  (sha256:9f3a...)
sha256:9f3a...  model-v42.manifest
  ├─ sha256:aaaa...  base_weights.safetensors   14.0 GB   ← shared by v40–v44
  ├─ sha256:bbbb...  lora_adapter.safetensors    120 MB   ← unique to v42
  ├─ sha256:cccc...  tokenizer.json              2.1 MB   ← shared
  └─ sha256:dddd...  config.json                   4 KB

# Registering v43 that only swaps the adapter stores ~120 MB, not 14 GB.

Because layers are immutable, a garbage collector can safely reclaim any blob no longer referenced by a manifest (mark-and-sweep over Artifact.layers), and cold versions can be tiered to cheaper/colder storage classes.
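Both the dedup saving and the mark-and-sweep collector fit in a short sketch. The layer digests below are readable placeholders rather than real hashes, the sizes mirror the example manifest, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set

GB, MB, KB = 10**9, 10**6, 10**3

# Placeholder digests and sizes mirroring the example manifest.
LAYER_SIZES: Dict[str, int] = {
    "sha256:base": 14 * GB,
    "sha256:adapter-v42": 120 * MB,
    "sha256:adapter-v43": 120 * MB,
    "sha256:tokenizer": 2_100 * KB,
    "sha256:config": 4 * KB,
}

manifests: Dict[str, List[str]] = {
    "v42": ["sha256:base", "sha256:adapter-v42", "sha256:tokenizer", "sha256:config"],
}
stored: Set[str] = set(manifests["v42"])


def bytes_to_upload(layers: Iterable[str], stored: Set[str]) -> int:
    """Only layers the store has never seen cost new bytes."""
    return sum(LAYER_SIZES[digest] for digest in set(layers) - stored)


def unreferenced(manifests: Dict[str, List[str]], stored: Set[str]) -> Set[str]:
    """Mark every layer reachable from a manifest, then sweep the rest."""
    live: Set[str] = set()
    for layers in manifests.values():
        live.update(layers)
    return stored - live


# Registering v43 swaps only the adapter layer.
v43 = ["sha256:base", "sha256:adapter-v43", "sha256:tokenizer", "sha256:config"]
v43_cost = bytes_to_upload(v43, stored)
manifests["v43"] = v43
stored.update(v43)

# Retiring v42 leaves only its adapter without a referencing manifest.
del manifests["v42"]
garbage = unreferenced(manifests, stored)
print(v43_cost // MB, "MB uploaded;", "collectable:", sorted(garbage))
```

The shared base, tokenizer, and config survive the sweep because v43 still references them, which is the safety property that makes collection possible without coordination between versions.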

The deploy fan-out problem

Promoting a 200 GB model and rolling it to 1,000 nodes is 200 TB of egress in one window. Served from a single origin it is hopeless; the answer is to serve one copy and let it replicate. Edge seeders/super-peers warm from the origin, then nodes share chunks with each other so origin egress stays near one copy while fleet completion approaches a small multiple of a single-node download:

flowchart LR
    ORIG["Origin Object Store (serves ~1 copy)"] --> SEED["Edge Seeders / Super-peers"]
    SEED --> N1["Node 1"]
    SEED --> N2["Node 2"]
    N1 --> N3["Node 3"]
    N1 --> N4["Node 4"]
    N2 --> N5["Node 5"]
    N2 --> N6["Node 6"]
    N3 --> N7["Node 7"]
    N4 --> N8["Node 8"]

Strategy	How it works	Trade-off
Direct from object store	Each node fetches the blob over HTTPS.	Simple, but origin egress is the wall; 1,000 concurrent pulls serialize behind it ⇒ hours.
CDN / edge cache	First pull warms an edge POP; the rest hit the cache.	Great for many small/medium artifacts; still O(N) edge egress for one giant blob within a rack/zone.
P2P (Dragonfly / Kraken)	BitTorrent-style chunk swarm; nodes seed to each other; origin serves ~1 copy.	Best fleet-wide throughput; adds a peer agent + tracker and intra-cluster traffic to manage.
Lazy / streaming load	Pull only the chunks needed to start (lazy-pull images, mmap, demand paging of weights).	Slashes time-to-ready; risk of stalls on cold reads if the network is slow mid-inference.
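A rough way to compare the first and third rows is an idealized timing model. The sketch below makes strong assumptions that real fleets will not meet exactly: full-duplex links, every peer uploading at line rate, no tracker or scheduling overhead, a hypothetical 100 Gbps origin, and a hypothetical 64 MB chunk size. It treats the swarm as a pipelined broadcast in which the set of chunk holders doubles each step.

```python
import math

SIZE_MB = 200_000         # 200 GB artifact
NODES = 1_000
LINK_MB_PER_S = 3_125     # 25 Gbps per node
ORIGIN_MB_PER_S = 12_500  # hypothetical 100 Gbps origin egress
CHUNK_MB = 64             # hypothetical chunk size


def doubling_rounds(nodes: int) -> int:
    """Rounds until seed + nodes all hold data if holders double each round."""
    return math.ceil(math.log2(nodes + 1))


def direct_seconds(size_mb: int, nodes: int, origin_mb_per_s: int) -> float:
    """Every node pulls its own copy through the origin's egress."""
    return size_mb * nodes / origin_mb_per_s


def store_and_forward_seconds(size_mb: int, nodes: int, link_mb_per_s: int) -> float:
    """Whole-file relay: a node forwards only after it holds the full artifact."""
    return doubling_rounds(nodes) * size_mb / link_mb_per_s


def pipelined_swarm_seconds(size_mb: int, nodes: int, link_mb_per_s: int,
                            chunk_mb: int) -> float:
    """Chunked relay: nodes forward each chunk as soon as it arrives."""
    chunks = -(-size_mb // chunk_mb)  # ceiling division
    steps = chunks + doubling_rounds(nodes) - 1
    return steps * chunk_mb / link_mb_per_s


direct = direct_seconds(SIZE_MB, NODES, ORIGIN_MB_PER_S)
relay = store_and_forward_seconds(SIZE_MB, NODES, LINK_MB_PER_S)
swarm = pipelined_swarm_seconds(SIZE_MB, NODES, LINK_MB_PER_S, CHUNK_MB)
print(f"direct: {direct / 3600:.1f} h, whole-file relay: {relay:.0f} s, "
      f"chunked swarm: {swarm:.1f} s")
```

Under these assumptions, chunking is what matters: relaying whole files costs one full download per doubling round, while a chunked swarm pays the doubling cost only once per chunk and so lands close to a single-node download time.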

Why immutability makes fan-out tractable

Every distribution trick above — CDN caching, P2P swarming, lazy paging — relies on bytes never changing under a hash. A chunk fetched from any peer can be trusted after a local sha256 check, so caches can be shared globally and a corrupt/poisoned chunk is detected and refetched rather than served.
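The "trust after a local check" rule can be sketched as a fetch loop that accepts a chunk from any source only if its hash matches. The callables standing in for peers and the origin are hypothetical; a real peer agent would add timeouts, retries, and peer scoring.

```python
import hashlib
from typing import Callable, Iterable


def fetch_verified_chunk(expected_hex: str,
                         sources: Iterable[Callable[[], bytes]]) -> bytes:
    """Try each source in order; return the first chunk whose sha256 matches."""
    for fetch in sources:
        data = fetch()
        if hashlib.sha256(data).hexdigest() == expected_hex:
            return data
        # Mismatch: the bytes are corrupt or poisoned, so drop them and move on.
    raise ValueError("no source returned bytes matching the expected digest")


chunk = b"chunk 17 of the base weights"
chunk_hex = hashlib.sha256(chunk).hexdigest()

bad_peer = lambda: b"tampered bytes"  # untrusted peer serving wrong data
good_peer = lambda: chunk             # any honest peer, edge cache, or origin

recovered = fetch_verified_chunk(chunk_hex, [bad_peer, good_peer])
```

Because the digest is fixed by the manifest, the order and identity of the sources do not affect correctness, only latency.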

Deep dive: lineage, reproducibility & governance

A registry that only stores bytes is a liability; auditors and on-callers need to answer “exactly what is in prod, and can we rebuild it?” Lineage threads each version back to the run that made it, and that run back to its dataset version, code commit, and container image — a fully pinned recipe. Governance then guards the path from a registered version to prod.

flowchart LR
    DS["Dataset vN"] --> RUN["Training Run"]
    CODE["Code commit (git sha)"] --> RUN
    IMG["Container image digest"] --> RUN
    HP["Hyperparams"] --> RUN
    RUN --> MV["ModelVersion 42"]
    MV --> SIGN["Signed manifest (provenance)"]
    SIGN --> GATE{"Approval gate"}
    GATE -->|approved| PROD["alias: prod"]
    GATE -->|rejected| STG["stays in staging"]

Reproducibility = pinned inputs. Storing the dataset version, the git SHA, the image digest, and the hyperparameters means a run can be re-executed to (near) bit-identical weights. Floating tags (:latest, mutable branches) break this, so every input is referenced by immutable digest.
Promotion gates. Moving to staging or prod can require passing evals, a canary, and a human/automated approval. The alias write is rejected unless the gate is satisfied, so policy is enforced at the system-of-record, not in a wiki.
Audit trail. Every register and promote is an append-only event (who, when, from-version → to-version, reason). During an incident you can answer “what changed and who changed it” in seconds, and reconstruct the exact state at any past time.
Signing & provenance (supply-chain security). Artifacts are signed (e.g. Sigstore-style) and carry attestations binding bytes to the build that produced them. Serving nodes verify the signature and digest before loading, so a tampered or unknown artifact is refused — defending against model poisoning and registry compromise.
Access control. Reading (resolve and download URLs), registering, and promoting are authorized separately; promotion to prod is the most privileged action and is always recorded.
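A promotion gate that rejects the alias write and records an append-only audit event might be sketched as follows. The specific checks, field names, and the `ProdAlias` class are assumptions chosen to mirror the points above, and signature verification is reduced to a boolean that a real system would derive from its signing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    version: int
    evals_passed: bool
    canary_passed: bool
    signature_valid: bool
    approved_by: Optional[str]


def gate_failures(candidate: Candidate) -> List[str]:
    """Return every unmet requirement; an empty list means the gate is open."""
    checks = [
        (candidate.evals_passed, "evals not passed"),
        (candidate.canary_passed, "canary not passed"),
        (candidate.signature_valid, "signature not verified"),
        (candidate.approved_by is not None, "no approver recorded"),
    ]
    return [reason for ok, reason in checks if not ok]


@dataclass
class ProdAlias:
    current: Optional[int] = None
    audit_log: List[dict] = field(default_factory=list)

    def promote(self, candidate: Candidate, actor: str, reason: str) -> None:
        failures = gate_failures(candidate)
        if failures:
            # Fail closed: the pointer does not move unless every check passes.
            raise PermissionError("; ".join(failures))
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "from_version": self.current,
            "to_version": candidate.version,
            "reason": reason,
        })
        self.current = candidate.version


prod = ProdAlias(current=41)
prod.promote(Candidate(42, True, True, True, "release-manager"),
             actor="ci-bot", reason="passed canary")

try:
    prod.promote(Candidate(43, True, False, True, None),
                 actor="ci-bot", reason="skip canary")
except PermissionError as err:
    rejection = str(err)
```

Keeping the check inside the same call that moves the pointer is what makes the policy enforceable at the system of record rather than by convention.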

Bottlenecks & scaling

A model registry almost never falls over on metadata QPS — the catalog is tiny. It degrades on bytes in motion and on a few hot keys during rollouts. The mitigations keep the consistent metadata core while pushing the heavy, immutable data out to caches and peers.

Bottleneck	Why it hurts	Mitigation
Deploy fan-out bandwidth	200 GB × 1,000 nodes = 200 TB from one origin ⇒ the rollout serializes for hours.	CDN edge caching + P2P swarm (Dragonfly/Kraken) so origin serves ~1 copy; dedup so only changed layers move; lazy/streaming load to start before full pull.
Metadata hot key	Every node resolves the same `prod` alias in a tight window ⇒ read hotspot.	Short-TTL edge/cache on resolve + invalidate-on-promote; immutable version reads cache forever; read replicas for the metadata DB.
Large blobs	Single 200 GB transfers stall on any blip; without resume, one dropped connection restarts the whole pull.	Chunk + content-address each layer; multipart, ranged, resumable transfer; parallel chunk fetch; verify per-chunk hashes.
Integrity / tampering	A corrupt or poisoned artifact reaching prod is catastrophic and silent.	Immutable write-once store; `sha256` verify on pull; signing + provenance; fail-closed if signature/digest mismatches.
Thundering herd on promote	Flipping the alias makes the whole fleet stampede for the new (cold) bytes at once.	Pre-warm CDN/peer mesh and seeders before the flip; staged/canary rollout; request coalescing at the edge.
Storage growth & cost	250k versions × multi-GB ⇒ petabytes, much of it cold and redundant.	Layer dedup; mark-and-sweep GC of unreferenced blobs; tier cold versions to cheaper classes; retention policies.
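For the "Large blobs" row, resumable ranged transfer reduces to planning which byte ranges are still missing. The sketch below only computes the HTTP `Range` header values (inclusive byte offsets); the function name and the idea of tracking completed chunk indexes in a set are illustrative assumptions.

```python
from typing import List, Set, Tuple


def plan_ranges(size_bytes: int, chunk_bytes: int,
                done: Set[int]) -> List[Tuple[int, str]]:
    """Return (chunk index, Range header value) for every chunk not yet fetched."""
    if size_bytes < 0 or chunk_bytes <= 0:
        raise ValueError("size must be >= 0 and chunk size must be > 0")
    plan = []
    for index, start in enumerate(range(0, size_bytes, chunk_bytes)):
        if index in done:
            continue  # already downloaded and verified before the interruption
        end = min(start + chunk_bytes, size_bytes) - 1  # Range ends are inclusive
        plan.append((index, f"bytes={start}-{end}"))
    return plan


# A 10-byte blob in 4-byte chunks, resuming after chunk 1 already completed.
remaining = plan_ranges(size_bytes=10, chunk_bytes=4, done={1})
print(remaining)
```

Each planned range can be fetched in parallel and checked against its own chunk hash, so an interruption costs at most the chunks that were in flight.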

Summary — what a staff answer nails

Frame it as two systems joined by a content hash: a small, strongly-consistent metadata store that versions models, records lineage (run → dataset → code → image), and moves promotion via movable aliases so rollback is a pointer flip; and a huge, immutable, content-addressed blob store that gets integrity, immutability, and dedup for free. Keep the registry off the data path (presigned URLs) and solve the real problem — the deploy fan-out — with CDN + peer-to-peer distribution, lazy loading, and pre-warming, so a 200 GB model reaches a thousand nodes in minutes instead of hours. Guard prod with approval gates, audit, and signed provenance.