AI / ML Infrastructure
Model Registry & Artifact Store
A model registry is the system of record for trained models: it answers “which exact bytes are running in prod, where did they come from, and how do I get them onto a thousand machines fast?” Two very different systems hide behind that one question — a small, strongly-consistent metadata store that versions models, tracks lineage, and moves promotion aliases; and a massive, immutable, content-addressed blob store whose hard problem is fanning a 200 GB artifact out to the entire serving fleet without melting the network. Get that split right and promotion, rollback, reproducibility, and deploy-time throughput all fall out of it.
Requirements
We are building the control plane for models — the catalog that sits between training and serving. Training pipelines write finished models into it; CI/CD and the serving fleet read from it to decide what to run and to pull the bytes. Training itself, experiment tracking, and the inference servers are adjacent systems, not in scope here.
| Functional | Non-functional |
|---|---|
| Register model versions + metadata — create a new immutable version of a model with framework, metrics, hyperparameters, tags, and the artifact digest. | High read throughput at deploy time — thousands of serving nodes resolve and pull the same model in a tight rollout window; the read/resolve path must not buckle. |
| Lineage — every version links back to the training run, the dataset version, and the exact code commit / container image that produced it. | Immutability & integrity — a version’s bytes never change; what you pull is verified by hash and is tamper-evident. Reproducibility depends on it. |
| Stage promotion: move a version through dev → staging → prod via named aliases (champion/challenger), independent of immutable version numbers. | Low-latency lookups: resolve alias → version → digest in single-digit milliseconds; this is on the critical path of every deploy. |
| Download artifacts — the serving fleet pulls the exact bytes for a version onto thousands of nodes (often a presigned URL to a CDN/peer mesh). | Durability & availability — artifacts on 11-nines object storage; the resolve API is highly available (a stale-but-correct read beats an outage during a rollout). |
| Rollback: instantly re-point the prod alias to a previous known-good version; no rebuild, no re-upload, just a pointer flip. | Security & governance: authn/authz, full audit trail, approval gates on promotion, and signing/provenance so only vetted artifacts reach prod. |
| Search / list / compare — list versions of a model, filter by stage/metric, and diff metadata to pick a release candidate. | Cost efficiency — PB-scale artifacts demand dedup of shared weights and tiering of cold versions to cheap storage. |
The defining split
The metadata is tiny, mutable-by-pointer, and needs strong consistency (you must never read a half-promoted alias). The artifacts are huge, write-once, and need raw bandwidth. Treating them as one system is the classic mistake; the whole design is two stores joined by a content hash.
Scale & back-of-envelope
The numbers below show why metadata is the easy half. The catalog is about a gigabyte of rows at most; the pain is entirely in moving immutable gigabytes to many nodes at once.
| Dimension | Estimate | Implication |
|---|---|---|
| Artifact size | tens of MB → 200+ GB | Small classifier ~10 MB; 7B fp16 ~14 GB; 70B ~140 GB; 405B ~810 GB. Forces chunking, ranged/resumable transfer, dedup. |
| Models × versions | ~5,000 × 50 ≈ 250k versions | Each metadata row is a few KB ⇒ the entire catalog is roughly 0.5–1 GB (250k × 2–4 KB). A single RDBMS holds it comfortably. |
| Logical vs physical bytes | 250k × 5 GB ≈ 1.2 PB logical | Most versions share base weights; content-addressed dedup cuts physical storage 2–10×. |
| Deploy fan-out | 200 GB × 1,000 nodes = 200 TB / rollout | The headline problem. Naively served from one bucket, this takes hours and saturates the origin. |
| Per-node link | 25 Gbps ≈ 3.1 GB/s | One node pulls 200 GB in ~65 s. But 1,000 nodes wanting 3.1 GB/s each = ~3.1 TB/s aggregate — no single origin serves that. |
| Metadata QPS | writes: tens–hundreds/day; reads: bursty at rollout | Registrations/promotions are rare. Alias resolves spike during deploys but are almost entirely cacheable: version metadata is immutable, and the alias itself tolerates a short TTL. |
Takeaway: a single origin pull is fine; the simultaneous pull is what hurts. If every node fetches from the object store at once, demand (~3.1 TB/s) dwarfs any bucket’s egress, so the rollout serializes behind the origin and drags on for hours. The fix is to serve the bytes once and let them replicate — CDN edge caching plus peer-to-peer distribution — bringing fleet-wide completion back down to a small multiple of a single-node download (minutes, not hours).
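The arithmetic behind these estimates can be reproduced in a few lines. The sketch below is illustrative only: the 100 Gbps origin egress cap is an assumed figure, not one taken from the table, and decimal units (1 GB = 10^9 bytes) are used throughout.

```python
# Back-of-envelope fan-out arithmetic. All inputs are assumptions for illustration.
ARTIFACT_GB = 200          # artifact size used in the table above
NODES = 1_000              # serving nodes in one rollout
LINK_GBPS = 25             # per-node link, gigabits per second
ORIGIN_EGRESS_GBPS = 100   # hypothetical origin egress cap, gigabits per second


def gbps_to_gb_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


def transfer_seconds(total_gb: float, gbps: float) -> float:
    """Time to move total_gb over a link of the given bit rate."""
    return total_gb / gbps_to_gb_per_s(gbps)


single_node_s = transfer_seconds(ARTIFACT_GB, LINK_GBPS)
aggregate_demand_tb_s = NODES * gbps_to_gb_per_s(LINK_GBPS) / 1_000
total_tb = ARTIFACT_GB * NODES / 1_000
origin_bound_s = transfer_seconds(ARTIFACT_GB * NODES, ORIGIN_EGRESS_GBPS)

print(f"single-node pull:  {single_node_s:.0f} s")
print(f"aggregate demand:  {aggregate_demand_tb_s:.3f} TB/s")
print(f"total bytes moved: {total_tb:.0f} TB")
print(f"origin-bound time: {origin_bound_s / 3600:.1f} h")
```

With the unrounded 3.125 GB/s link rate the single-node pull comes out at 64 s, which the table rounds to ~65 s. Under the assumed origin cap, the full rollout takes several hours if every byte comes from the origin, which matches the "hours" estimate above.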
Core entities
Five entities carry the model. The crucial design choice: a ModelVersion is immutable and points at an Artifact by content hash, while a Stage/Alias is a movable pointer. Promotion and rollback are therefore just alias updates — cheap, atomic, and auditable — never re-uploads.
| Entity | Key fields | Notes |
|---|---|---|
| Model | model_id, name, owner, task_type, description, created_at | Registry-level container. Mutable metadata (owner, docs). Names a family of versions. |
| ModelVersion | version_id, model_id, version_no, artifact_digest, run_id, framework, metrics, params, created_by | Immutable once registered. Holds the hash of its bytes and a link to its lineage. The unit you promote. |
| Artifact (blob) | digest (sha256, PK), size_bytes, media_type, layers[], storage_url | Content-addressed, write-once. A manifest of layer hashes so shared weights dedup across versions. |
| Run / Lineage | run_id, dataset_version, code_commit, container_image, hyperparams, started_at, finished_at | The provenance of a version: what produced these bytes. Enables reproducibility and audit. |
| Stage / Alias | alias (e.g. prod), model_id, current_version_id, updated_by, updated_at | A movable pointer. Promotion = set current_version_id; rollback = set it back. Atomic + audited. |
A compact relational sketch — small, strongly-consistent, RDBMS-shaped:
Model
model_id PK, name UNIQUE, owner, task_type, description, created_at
ModelVersion -- immutable; one row per registered version
version_id PK, model_id FK -> Model,
version_no, -- monotonic per model (1,2,3, ...)
artifact_digest -> Artifact.digest,
run_id FK -> Run,
framework, metrics JSON, params JSON,
status, created_by, created_at
UNIQUE (model_id, version_no)
Artifact -- content-addressed blob / manifest
digest PK, -- sha256 of the manifest (the model's identity)
size_bytes, media_type,
layers JSON, -- [ sha256 of each shared chunk/file ]
storage_url, created_at
Run -- training provenance
run_id PK, dataset_version, code_commit, -- git sha
container_image, -- image digest
hyperparams JSON, started_at, finished_at
Alias -- movable promotion pointer
alias, model_id FK -> Model, -- PK (model_id, alias)
current_version_id FK -> ModelVersion,
updated_by, updated_at
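To make the "promotion is just an alias update" point concrete, here is a minimal sketch using SQLite from the Python standard library. It covers only a subset of the columns above; the alias_audit table, the set_alias helper, and the sample IDs are hypothetical names introduced for illustration, and a production registry would more likely sit on a server RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
PRAGMA foreign_keys = ON;
CREATE TABLE model_version (
    version_id      TEXT PRIMARY KEY,
    model_id        TEXT NOT NULL,
    version_no      INTEGER NOT NULL,
    artifact_digest TEXT NOT NULL,
    UNIQUE (model_id, version_no)
);
CREATE TABLE alias (
    model_id           TEXT NOT NULL,
    alias              TEXT NOT NULL,
    current_version_id TEXT NOT NULL REFERENCES model_version(version_id),
    updated_by         TEXT NOT NULL,
    PRIMARY KEY (model_id, alias)
);
CREATE TABLE alias_audit (  -- append-only history (hypothetical table)
    seq             INTEGER PRIMARY KEY AUTOINCREMENT,
    model_id        TEXT, alias TEXT,
    from_version_id TEXT, to_version_id TEXT,
    actor           TEXT, reason TEXT
);
""")


def set_alias(conn, model_id, alias, version_id, actor, reason):
    """Move an alias and record the change in a single transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        row = conn.execute(
            "SELECT current_version_id FROM alias WHERE model_id=? AND alias=?",
            (model_id, alias),
        ).fetchone()
        previous = row[0] if row else None
        conn.execute(
            "INSERT INTO alias (model_id, alias, current_version_id, updated_by) "
            "VALUES (?, ?, ?, ?) "
            "ON CONFLICT (model_id, alias) DO UPDATE SET "
            "current_version_id = excluded.current_version_id, "
            "updated_by = excluded.updated_by",
            (model_id, alias, version_id, actor),
        )
        conn.execute(
            "INSERT INTO alias_audit "
            "(model_id, alias, from_version_id, to_version_id, actor, reason) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (model_id, alias, previous, version_id, actor, reason),
        )
    return previous


# Usage: promote v41, promote v42, then roll back to v41 without touching any bytes.
conn.executemany("INSERT INTO model_version VALUES (?, ?, ?, ?)", [
    ("mv_41", "m1", 41, "sha256:aaaa"),
    ("mv_42", "m1", 42, "sha256:bbbb"),
])
set_alias(conn, "m1", "prod", "mv_41", "ci", "initial release")
set_alias(conn, "m1", "prod", "mv_42", "ci", "passed canary")
previous = set_alias(conn, "m1", "prod", "mv_41", "oncall", "rollback")
```

The rollback is the same statement as the promotion with an older version_id, and the audit row is written in the same transaction, so the pointer and its history cannot drift apart.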
API design
Writes (register, promote) are rare and gated; reads (resolve, get, download URL) are hot and cacheable. The download path hands back a presigned/expiring URL so bytes never flow through the registry service itself.
Register an immutable version
POST /v1/models/{name}/versions
{
"run_id": "run_8f1c...",
"artifact_digest": "sha256:9f3a...", // bytes already uploaded to blob store
"framework": "pytorch",
"metrics": { "eval_acc": 0.921, "f1": 0.88 },
"params": { "lr": 3e-4, "epochs": 3 }
}
201 Created → { "version": 42, "version_id": "mv_...", "status": "registered" }
409 Conflict → this digest is already registered (idempotent re-register)
Get a specific version (immutable ⇒ cache forever)
GET /v1/models/{name}/versions/42
200 OK → { version, artifact_digest, framework, metrics, run_id, created_at, ... }
Cache-Control: public, max-age=31536000, immutable
Resolve an alias / stage (hot read path at deploy)
GET /v1/models/{name}/aliases/prod
200 OK → { "alias": "prod", "version": 42, "artifact_digest": "sha256:9f3a..." }
Cache-Control: public, max-age=10 // short TTL; invalidated on promote
Promote / set stage or alias (gated + audited)
PUT /v1/models/{name}/aliases/prod
{ "version": 42, "reason": "passed canary; +1.2% f1" }
200 OK → alias now points at 42 (atomic; previous target retained for rollback)
403 Forbidden → caller lacks promote permission, or approval gate not satisfied
List / compare versions
GET /v1/models/{name}/versions?stage=staging&sort=-metrics.f1&limit=50
200 OK → { "versions": [ ... ], "next_cursor": "..." }
Get a download URL for the artifact (presigned, edge-served)
GET /v1/artifacts/{digest}/download?ttl=600
200 OK → {
"url": "https://cdn.example.net/blobs/sha256/9f/3a/9f3a...",
"scheme": "p2p+https", // fall back to https if no peer mesh
"expires_at": "2026-06-16T00:10:00Z",
"size_bytes": 214748364800
}
# Client verifies sha256 of the downloaded bytes == digest before loading.
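The client-side check mentioned in the last line can be sketched as a streaming hash, so a 200 GB file never has to fit in memory. The function name verify_artifact and the 8 MiB read size are assumptions for illustration; the digest format follows the "sha256:<hex>" convention used in the examples above.

```python
import hashlib
import hmac


def verify_artifact(path: str, expected_digest: str,
                    chunk_size: int = 8 * 1024 * 1024) -> bool:
    """Return True if the file at `path` hashes to `expected_digest`.

    `expected_digest` is expected in "sha256:<hex>" form.
    """
    algo, _, expected_hex = expected_digest.partition(":")
    if algo != "sha256" or not expected_hex:
        raise ValueError(f"unsupported digest format: {expected_digest!r}")

    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size pieces so memory use stays flat for huge files.
        while chunk := f.read(chunk_size):
            hasher.update(chunk)

    # Constant-time comparison; either order of arguments gives the same result.
    return hmac.compare_digest(hasher.hexdigest(), expected_hex.lower())
```

A serving node would call this after the download completes and refuse to load the model if it returns False.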
Rollback is not a special endpoint — it is just
PUT .../aliases/prod { "version": 41 }. Because every
prior version’s bytes still exist immutably, recovery is a
pointer flip measured in milliseconds.
High-level design
Producers (training jobs, CI) write metadata + bytes; the registry service owns the strongly-consistent metadata DB and issues presigned URLs into the blob store. The serving fleet resolves an alias, then pulls bytes not from the registry but from a CDN + peer-distribution layer fronting the blob store — that is what survives the deploy fan-out.
flowchart LR
TR["Training Jobs (register)"] --> REG["Registry Service (API)"]
CI["CI / Promotion Pipeline"] --> REG
REG --> MDB[("Metadata DB (versions, lineage, aliases)")]
REG --> CACHE[("Resolve Cache")]
REG --> BLOB["Blob / Object Store (write-once)"]
BLOB --> CDN["CDN / Edge Cache"]
BLOB --> P2P["Peer Distribution (Dragonfly / Kraken)"]
CDN --> FLEET["Serving Fleet (1000s of nodes)"]
P2P --> FLEET
FLEET -->|resolve alias| REG
- Two stores, one hash. The metadata DB is small and consistent; the blob store is huge and immutable. A ModelVersion row glues them by artifact_digest, so the catalog can be replicated and cached freely while the bytes live wherever is cheapest.
- The registry is off the data path. Bytes flow blob store → CDN/peer mesh → client. The service only resolves metadata and mints presigned URLs, so a 200 GB pull never touches application servers.
- Resolve is cacheable. Version metadata is immutable (cache forever); alias resolution uses a short TTL plus invalidate-on-promote, so a rollout’s read storm is absorbed at the edge.
- Upload before register. Producers push bytes to the blob store first (content-addressed), then register the version referencing the digest — registration is metadata-only and fast, and re-registering the same digest is idempotent.
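The "short TTL plus invalidate-on-promote" behaviour from the list above can be sketched as a small in-process cache. The class name ResolveCache, the loader callback, and the injectable clock are assumptions made for illustration; a real deployment would more likely use an edge cache or a shared cache tier.

```python
import time
from typing import Callable, Dict, Tuple


class ResolveCache:
    """Caches alias -> digest lookups for a short TTL."""

    def __init__(self, load: Callable[[str, str], str],
                 ttl_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load = load          # falls through to the metadata DB
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[float, str]] = {}

    def resolve(self, model: str, alias: str) -> str:
        key = (model, alias)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                      # fresh hit, no DB read
        digest = self._load(model, alias)        # miss or expired entry
        self._entries[key] = (now + self._ttl, digest)
        return digest

    def invalidate(self, model: str, alias: str) -> None:
        """Called by the promote path so readers see the new target at once."""
        self._entries.pop((model, alias), None)
```

The TTL bounds staleness if an invalidation is lost, while the explicit invalidate keeps a successful promote visible without waiting for expiry.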
Deep dive: artifact storage, dedup & the fan-out problem
Artifacts are stored content-addressed: the key is
the sha256 of the bytes. This gives three properties for
free — immutability (the name is the
content), integrity (the client re-hashes and
verifies on pull), and dedup (identical bytes are
stored once, no matter how many versions reference them).
Content-addressed dedup via layered manifests
A model is rarely one monolithic blob; it is a manifest of layers (base weights, adapters, tokenizer, config), each addressed by its own hash. Fine-tunes and re-exports share the heavy base layer, so a new version often costs only the bytes that actually changed:
# Manifest for ModelVersion 42 (sha256:9f3a...)
sha256:9f3a... model-v42.manifest
├─ sha256:aaaa... base_weights.safetensors 14.0 GB ← shared by v40–v44
├─ sha256:bbbb... lora_adapter.safetensors 120 MB ← unique to v42
├─ sha256:cccc... tokenizer.json 2.1 MB ← shared
└─ sha256:dddd... config.json 4 KB
# Registering v43 that only swaps the adapter stores ~120 MB, not 14 GB.
Because layers are immutable, a garbage collector can
safely reclaim any blob no longer referenced by a manifest
(mark-and-sweep over Artifact.layers), and cold versions
can be tiered to cheaper/colder storage classes.
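A toy in-memory version of this scheme shows both the dedup and the mark-and-sweep pass. The BlobStore class and its method names are hypothetical, and the manifest encoding (sorted JSON of layer name to digest) is an assumption chosen only to make the manifest digest deterministic.

```python
import hashlib
import json
from typing import Dict, List, Set


def sha256_digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


class BlobStore:
    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}            # layer digest -> bytes
        self.manifests: Dict[str, List[str]] = {}    # manifest digest -> layer digests

    def put_layer(self, data: bytes) -> str:
        digest = sha256_digest(data)
        self.blobs.setdefault(digest, data)          # identical bytes stored once
        return digest

    def put_manifest(self, layers: Dict[str, bytes]) -> str:
        entries = {name: self.put_layer(data) for name, data in layers.items()}
        body = json.dumps(entries, sort_keys=True).encode()
        digest = sha256_digest(body)                 # identity of the whole model
        self.manifests[digest] = sorted(entries.values())
        return digest

    def physical_bytes(self) -> int:
        return sum(len(b) for b in self.blobs.values())

    def gc(self, live_manifests: Set[str]) -> int:
        """Mark layers reachable from live manifests, sweep the rest."""
        marked: Set[str] = set()
        for manifest in live_manifests:
            marked.update(self.manifests[manifest])
        dead = [d for d in self.blobs if d not in marked]
        for d in dead:
            del self.blobs[d]
        for manifest in list(self.manifests):
            if manifest not in live_manifests:
                del self.manifests[manifest]
        return len(dead)


# Usage: two versions share the base layer, so it is stored once.
store = BlobStore()
base = b"B" * 1000
v42 = store.put_manifest({"base_weights": base, "adapter": b"a" * 10})
v43 = store.put_manifest({"base_weights": base, "adapter": b"b" * 10})
bytes_before_gc = store.physical_bytes()
removed = store.gc(live_manifests={v43})
```

In a real registry the live set would come from the manifests still referenced by ModelVersion rows under the retention policy, and the sweep would need to guard against racing with in-flight uploads.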
The deploy fan-out problem
Promoting a 200 GB model and rolling it to 1,000 nodes is 200 TB of egress in one window. Served from a single origin it is hopeless; the answer is to serve one copy and let it replicate. Edge seeders/super-peers warm from the origin, then nodes share chunks with each other so origin egress stays near one copy while fleet completion approaches a small multiple of a single-node download:
flowchart LR
ORIG["Origin Object Store (serves ~1 copy)"] --> SEED["Edge Seeders / Super-peers"]
SEED --> N1["Node 1"]
SEED --> N2["Node 2"]
N1 --> N3["Node 3"]
N1 --> N4["Node 4"]
N2 --> N5["Node 5"]
N2 --> N6["Node 6"]
N3 --> N7["Node 7"]
N4 --> N8["Node 8"]
| Strategy | How it works | Trade-off |
|---|---|---|
| Direct from object store | Each node fetches the blob over HTTPS. | Simple, but origin egress is the wall; 1,000 concurrent pulls serialize behind it ⇒ hours. |
| CDN / edge cache | First pull warms an edge POP; the rest hit the cache. | Great for many small/medium artifacts; still O(N) edge egress for one giant blob within a rack/zone. |
| P2P (Dragonfly / Kraken) | BitTorrent-style chunk swarm; nodes seed to each other; origin serves ~1 copy. | Best fleet-wide throughput; adds a peer agent + tracker and intra-cluster traffic to manage. |
| Lazy / streaming load | Pull only the chunks needed to start (lazy-pull images, mmap, demand paging of weights). | Slashes time-to-ready; risk of stalls on cold reads if the network is slow mid-inference. |
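A deliberately crude model helps show why peer distribution lands at "a small multiple of a single-node download". It assumes whole-file rounds in which every node that already holds the artifact serves exactly one new node, so the number of holders doubles per round. Real swarms exchange chunks and pipeline far better than this, so treat the result as a rough upper bound; the function names are hypothetical.

```python
def doubling_rounds(nodes: int, seeders: int = 1) -> int:
    """Rounds until `nodes` receivers all hold the artifact, if holders double each round."""
    if nodes < 0 or seeders < 1:
        raise ValueError("nodes must be >= 0 and seeders >= 1")
    holders, rounds = seeders, 0
    while holders < seeders + nodes:
        holders *= 2
        rounds += 1
    return rounds


def swarm_seconds(nodes: int, single_pull_s: float, seeders: int = 1) -> float:
    """Fleet completion time under the doubling model (seeder warm-up excluded)."""
    return doubling_rounds(nodes, seeders) * single_pull_s


# Usage: 1,000 nodes, one seeder, and a 64 s single-node pull (200 GB at 25 Gbps).
rounds = doubling_rounds(1_000)
fleet_s = swarm_seconds(1_000, 64.0)
print(f"{rounds} rounds, about {fleet_s / 60:.1f} minutes")
```

Even this pessimistic model finishes in minutes rather than hours, because completion time grows with the logarithm of the fleet size instead of linearly.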
Why immutability makes fan-out tractable
Every distribution trick above — CDN caching, P2P swarming,
lazy paging — relies on bytes
never changing under a hash. A chunk fetched from
any peer can be trusted after a local sha256 check, so
caches can be shared globally and a corrupt/poisoned chunk is
detected and refetched rather than served.
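The "detect and refetch" behaviour can be sketched as a loop over untrusted sources that accepts only bytes matching the expected hash. The fetcher callables stand in for peers, the CDN, and the origin; they and the ChunkUnavailable exception are hypothetical names, not the API of Dragonfly or Kraken.

```python
import hashlib
from typing import Callable, Iterable

# A fetcher takes a chunk's hex sha256 and returns candidate bytes.
Fetcher = Callable[[str], bytes]


class ChunkUnavailable(Exception):
    """No source returned bytes matching the requested hash."""


def fetch_verified_chunk(chunk_hex: str, sources: Iterable[Fetcher]) -> bytes:
    """Try each source in order; return the first response that hashes correctly."""
    for fetch in sources:
        try:
            data = fetch(chunk_hex)
        except OSError:
            continue  # source unreachable: move on to the next one
        if hashlib.sha256(data).hexdigest() == chunk_hex:
            return data
        # Hash mismatch: corrupt or poisoned chunk, discard and keep trying.
    raise ChunkUnavailable(chunk_hex)
```

Because acceptance depends only on the hash, the ordering of sources is purely a performance choice: nearby peers first, the origin last.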
Deep dive: lineage, reproducibility & governance
A registry that only stores bytes is a liability; auditors and
on-callers need to answer
“exactly what is in prod, and can we rebuild
it?”
Lineage threads each version back to the run that
made it, and that run back to its dataset version,
code commit, and
container image — a fully pinned recipe.
Governance then guards the path from a registered version to
prod.
flowchart LR
DS["Dataset vN"] --> RUN["Training Run"]
CODE["Code commit (git sha)"] --> RUN
IMG["Container image digest"] --> RUN
HP["Hyperparams"] --> RUN
RUN --> MV["ModelVersion 42"]
MV --> SIGN["Signed manifest (provenance)"]
SIGN --> GATE{"Approval gate"}
GATE -->|approved| PROD["alias: prod"]
GATE -->|rejected| STG["stays in staging"]
- Reproducibility = pinned inputs. Storing the dataset version, the git SHA, the image digest, and the hyperparameters means a run can be re-executed to (near) bit-identical weights. Floating tags (:latest, mutable branches) break this, so everything is referenced by immutable digest.
- Promotion gates. Moving to staging or prod can require passing evals, a canary, and a human/automated approval. The alias write is rejected unless the gate is satisfied, so policy is enforced at the system-of-record, not in a wiki.
- Audit trail. Every register and promote is an append-only event (who, when, from-version → to-version, reason). During an incident you can answer “what changed and who changed it” in seconds, and reconstruct the exact state at any past time.
- Signing & provenance (supply-chain security). Artifacts are signed (e.g. Sigstore-style) and carry attestations binding bytes to the build that produced them. Serving nodes verify the signature and digest before loading, so a tampered or unknown artifact is refused, which defends against model poisoning and registry compromise.
- Access control. Reading a resolve/download URL, registering, and promoting are separately authorized; promotion to prod is the most privileged action and is always recorded.
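The gate and audit bullets above can be sketched together: an alias write that is refused unless the caller is authorized and the version is approved, plus a replay of the append-only log to recover past state. GatedAliases, AliasEvent, and the logical timestamps are hypothetical simplifications; a real system would persist the log and use wall-clock time.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


class PromotionDenied(Exception):
    """The alias write failed authorization or the approval gate."""


@dataclass(frozen=True)
class AliasEvent:
    at: int                      # logical timestamp (position in the log)
    actor: str
    alias: str
    from_version: Optional[int]
    to_version: int
    reason: str


class GatedAliases:
    def __init__(self, promoters: Set[str], approved_for_prod: Set[int]):
        self.promoters = promoters
        self.approved_for_prod = approved_for_prod
        self.current: Dict[str, int] = {}
        self.log: List[AliasEvent] = []   # append-only

    def promote(self, actor: str, alias: str, version: int, reason: str) -> None:
        if actor not in self.promoters:
            raise PromotionDenied("caller lacks promote permission")
        if alias == "prod" and version not in self.approved_for_prod:
            raise PromotionDenied("approval gate not satisfied")
        self.log.append(AliasEvent(len(self.log), actor, alias,
                                   self.current.get(alias), version, reason))
        self.current[alias] = version

    def state_at(self, at: int) -> Dict[str, int]:
        """Replay the log up to and including logical time `at`."""
        state: Dict[str, int] = {}
        for event in self.log:
            if event.at > at:
                break
            state[event.alias] = event.to_version
        return state
```

Denied attempts leave both the alias and the log untouched in this sketch; whether rejected attempts should also be recorded is a policy decision the source does not specify.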
Bottlenecks & scaling
A model registry almost never falls over on metadata QPS — the catalog is tiny. It degrades on bytes in motion and on a few hot keys during rollouts. The mitigations keep the consistent metadata core while pushing the heavy, immutable data out to caches and peers.
| Bottleneck | Why it hurts | Mitigation |
|---|---|---|
| Deploy fan-out bandwidth | 200 GB × 1,000 nodes = 200 TB from one origin ⇒ the rollout serializes for hours. | CDN edge caching + P2P swarm (Dragonfly/Kraken) so origin serves ~1 copy; dedup so only changed layers move; lazy/streaming load to start before full pull. |
| Metadata hot key | Every node resolves the same prod alias in a tight window ⇒ read hotspot. | Short-TTL edge/cache on resolve + invalidate-on-promote; immutable version reads cache forever; read replicas for the metadata DB. |
| Large blobs | Single 200 GB transfers stall on any blip; one slow byte kills the whole pull. | Chunk + content-address each layer; multipart, ranged, resumable transfer; parallel chunk fetch; verify per-chunk hashes. |
| Integrity / tampering | A corrupt or poisoned artifact reaching prod is catastrophic and silent. | Immutable write-once store; sha256 verify on pull; signing + provenance; fail-closed if signature/digest mismatches. |
| Thundering herd on promote | Flipping the alias makes the whole fleet stampede for the new (cold) bytes at once. | Pre-warm CDN/peer mesh and seeders before the flip; staged/canary rollout; request coalescing at the edge. |
| Storage growth & cost | 250k versions × multi-GB ⇒ petabytes, much of it cold and redundant. | Layer dedup; mark-and-sweep GC of unreferenced blobs; tier cold versions to cheaper classes; retention policies. |
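For the "Large blobs" row, the ranged, resumable transfer reduces to planning byte ranges and skipping the ones already on disk. The sketch below produces standard HTTP Range header values (inclusive byte offsets); the function name plan_ranges and the idea of tracking completed chunk indexes in a set are assumptions for illustration.

```python
from typing import AbstractSet, Iterator, Tuple


def plan_ranges(size_bytes: int, chunk_bytes: int,
                done: AbstractSet[int] = frozenset()) -> Iterator[Tuple[int, str]]:
    """Yield (chunk index, Range header value) for every chunk still missing."""
    if size_bytes < 0 or chunk_bytes <= 0:
        raise ValueError("size_bytes must be >= 0 and chunk_bytes > 0")
    for index, start in enumerate(range(0, size_bytes, chunk_bytes)):
        if index in done:
            continue  # already downloaded and verified: skip on resume
        end = min(start + chunk_bytes, size_bytes) - 1  # Range ends are inclusive
        yield index, f"bytes={start}-{end}"


# Usage: a 10-byte object in 4-byte chunks, resuming after chunks 0 and 1.
fresh_plan = list(plan_ranges(10, 4))
resume_plan = list(plan_ranges(10, 4, done={0, 1}))
```

Each planned range can be fetched in parallel and checked against its per-chunk hash, so a network blip costs one chunk rather than the whole transfer.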
Summary — what a staff answer nails
Frame it as two systems joined by a content hash: a small, strongly-consistent metadata store that versions models, records lineage (run → dataset → code → image), and moves promotion via movable aliases so rollback is a pointer flip; and a huge, immutable, content-addressed blob store that gets integrity, immutability, and dedup for free. Keep the registry off the data path (presigned URLs) and solve the real problem — the deploy fan-out — with CDN + peer-to-peer distribution, lazy loading, and pre-warming, so a 200 GB model reaches a thousand nodes in minutes instead of hours. Guard prod with approval gates, audit, and signed provenance.