AI / ML Infrastructure

ML-Optimized Distributed File System

A training cluster is one of the most expensive machines ever built, and it spends its whole life waiting on a filesystem. The storage layer has to serve two workloads that pull in opposite directions: a sustained, sequential read flood as thousands of GPUs stream a petabyte-scale dataset every epoch, and a bursty, perfectly synchronized write storm when every rank dumps a multi-terabyte checkpoint at the same instant. Layered on top is the many-small-files problem — a billion 100 KB images will melt any metadata server long before bandwidth is the issue. A good design attacks all three: stripe each file across many data servers for aggregate TB/s, separate the metadata control plane from the bulk-data plane and sharded it, tier and cache hot data on local NVMe so the slow object store is touched as rarely as possible, and commit checkpoints atomically and durably so a crash never resumes from a half-written mess.

Requirements

The shape of the problem comes straight from the I/O patterns of a training job. Unlike a web backend with millions of small independent requests, a training cluster generates a tiny number of enormous, highly correlated I/O patterns — and that correlation is exactly what makes it hard. There are three distinct workloads, and they conflict.

Bursty checkpoint writes (all ranks at once) — at the checkpoint step, every rank in the job snapshots its shard of the model and optimizer state simultaneously. There is no smooth arrival curve: the filesystem sees idle, idle, idle, then a synchronized multi-terabyte write burst, then idle again. The system must be provisioned for that peak even though the average is low.
High-throughput sequential dataset reads — every epoch the job streams the full training set. Reads are large, sequential, and read-only, but the aggregate rate across thousands of GPUs is enormous and sustained for the entire run. Here the enemy is total bandwidth and stragglers, not latency of any single read.
The many-small-files problem — vision and multimodal datasets are often billions of individual files (one image, one audio clip per file). Opening each file is a metadata operation; a billion opens per epoch is a metadata workload that no single namespace server can survive, and the tiny per-file reads waste all the striping the data plane offers.

Functional requirements

Shared global namespace — every client (every GPU host) sees the same directory tree and the same files, so any rank can read any dataset shard and write any checkpoint shard.
Parallel read & write — many clients read and write the same large file concurrently, each touching a disjoint byte range, so a single file's bandwidth scales past one server.
POSIX-ish or object API — support a POSIX-like interface (open/read/write/rename) for frameworks that expect files, and/or an object API (PUT/ GET of large objects) for checkpoint and shard storage.
Atomic checkpoint publish — a checkpoint becomes visible to readers only when it is complete; partial or in-flight writes are never observable.
Tiered placement — let callers (or policy) place data on a fast NVMe tier or a cheap capacity tier, and move data between tiers.

Non-functional requirements

Aggregate bandwidth — multiple TB/s of read throughput sustained, and TB/s-class write throughput during the checkpoint burst. Bandwidth must scale by adding data servers.
Scalable metadata — millions of metadata ops/sec, scaling out across many metadata servers rather than bottlenecking on one.
Durability — survive disk, node, and rack loss without losing a committed checkpoint; replication or erasure coding underneath.
Cost tiers — petabytes are unaffordable on all-NVMe, so blend NVMe, HDD, and object/cold storage and keep only the hot working set on the expensive tier.
Predictable tail latency — one slow data server should not stall a collective read across thousands of GPUs; the slowest stripe sets the step time.

The filesystem's only job is to keep GPUs fed

Every second a GPU waits on a read() for the next batch, or stalls inside a synchronous checkpoint, is burned money on the most expensive hardware in the building. So the design target is not "fast files" in the abstract — it is never be the reason a GPU is idle: hide checkpoint writes behind compute, and serve the dataset fast enough that data loading is never the bottleneck.

Scale & back-of-the-envelope

Two numbers drive the whole design: the checkpoint write burst (how many bytes land how fast when every rank saves at once) and the sustained dataset read rate (how fast the cluster drains the dataset every epoch). A third — metadata ops/sec — is what quietly kills you on small-file datasets. Checkpoint footprint follows the mixed-precision Adam rule of ~14 bytes per parameter (bf16 weights + fp32 master + Adam m + Adam v).

Quantity	Figure	Math / why it matters
Checkpoint footprint / param	`~14 bytes`	bf16 weights (2) + fp32 master (4) + Adam m (4) + Adam v (4).
1T-param checkpoint	`~14 TB`	1e12 x 14 — the full state written in one burst.
Ranks writing at once	`~8,000–10,000`	All TP x PP x DP ranks snapshot simultaneously (de-dup DP replicas first).
Write window (to hide in compute)	`~10–20 s`	Async stage to NVMe, then drain; the burst must clear before the next save.
Required aggregate write BW	`~0.7–1.4 TB/s`	14 TB / 10–20 s — the burst the data tier must absorb.
Data servers for the burst	`~150–280`	1.4 TB/s / ~5 GB/s per server — striping is mandatory, one server can't.
Dataset size	`~1–10 PB`	Web-scale text or billions of images/clips; far past any single node.
Sustained read BW	`~2–4 TB/s`	~1,000 nodes x ~2–4 GB/s each, every epoch, for the whole run.
Uncached epoch I/O time	`~42 min`	5 PB / 2 TB/s — why hot shards must be cached on NVMe, not re-read cold.
Small-file dataset	`1e9 files x 100 KB`	~100 TB of bytes but a billion inodes / opens per epoch.
Metadata ops for one epoch	`~5.5 h @ 50k ops/s`	1e9 opens / 5e4 — a single MDS spends hours just opening files.
After packing into shards	`~1e6 shards`	1,000 files per WebDataset tar -> 1000x fewer metadata ops, big sequential reads.

The punchline of the table is that no single server does any of this. A 1.4 TB/s write burst needs hundreds of data servers in parallel; a 2–4 TB/s sustained read needs the hot set cached on NVMe near the GPUs; and a billion-file dataset needs the files packed away before the metadata server ever sees them. Each of the sections below is the mechanism for one of those three pressures.

Architecture: a parallel filesystem

The foundational idea — shared by Lustre, IBM GPFS / Spectrum Scale, and BeeGFS, and mirrored by object-store designs like DAOS and WekaFS — is to split the control plane from the data plane. A metadata server (MDS) owns the namespace: directories, file names, permissions, and crucially each file's layout (which data servers hold it, the stripe size, the stripe count). Separate data servers (Lustre calls them OSS hosts managing OST targets) own only opaque byte ranges and do the bulk I/O.

Striping is where the bandwidth comes from

A single file is striped across many data servers: the file is cut into fixed-size stripes (say 1 MB) distributed round-robin over N targets. A client reads stripe 0 from server A, stripe 1 from server B, stripe 2 from server C, and so on — all in parallel. The file's bandwidth is therefore N x a single server's bandwidth, capped only by the client's network and the number of stripes. A 14 TB checkpoint striped over 200 servers is 200 independent flows, not one.

The open path vs the I/O path

This separation makes the hot path cheap. On open, the client makes one request to the MDS and gets back the layout. After that, all the heavy read/write traffic goes directly from client to the data servers — the metadata server is never in the data path. This is client-side striping: the client library (or a kernel mount) scatters writes and gathers reads across the targets itself, so there is no central proxy to bottleneck on. Control-plane requests are small and infrequent; data-plane bytes are huge and parallel.

flowchart TD
  C0["Client 0 (GPU host)"]
  C1["Client 1 (GPU host)"]
  CN["Client N (GPU host)"]
  MDS["Metadata Server (namespace + file layout)"]
  D0["Data server 0 (OST)"]
  D1["Data server 1 (OST)"]
  D2["Data server 2 (OST)"]
  D3["Data server 3 (OST)"]
  C0 -->|"1. open: get layout"| MDS
  C1 --> MDS
  CN --> MDS
  C0 -->|"2. parallel stripe I/O"| D0
  C0 --> D1
  C0 --> D2
  C0 --> D3
  C1 --> D0
  C1 --> D2
  CN --> D1
  CN --> D3

Clients hit the metadata server once to fetch a file's layout, then stream bytes directly and in parallel to every data server that holds a stripe. The metadata server stays out of the bulk-data path, so aggregate bandwidth scales with the number of data servers.

Separate the plane that scales by bytes from the plane that scales by ops

Bulk I/O scales by bandwidth (add data servers and stripes); the namespace scales by operations (opens, creates, renames). Bolting them together means one workload starves the other. Keeping them apart lets you grow a 1.4 TB/s data tier and a million-op/s metadata tier independently — and is why both layers get their own deep dive below.

Tiering & caching: the burst buffer

Petabytes on all-NVMe is unaffordable, and a cold object store is too slow to feed GPUs directly. The resolution is a storage hierarchy with a fast tier in front of a cheap one, plus aggressive caching of whatever the job actually touches. The fast tier is often a burst buffer: local or near-compute NVMe (Cray DataWarp, Lustre PCC, DAOS, WekaFS, or an Alluxio-style cache) that absorbs spikes and serves hot data.

Write-back buffering for checkpoints

The checkpoint burst is the textbook case for write-back. Ranks write their shards to the NVMe burst buffer at local speed — that clears the synchronized burst in seconds — and a background drainer asynchronously copies the checkpoint down to the durable object store. Training resumes the instant the fast write lands; the slow durable write happens off the critical path. The burst buffer turns a TB/s spike that the object store could never absorb into a smooth trickle it can.

Read caching of hot dataset shards

Dataset reads have strong reuse: the same shards are read once per epoch, epoch after epoch. Caching the working set on NVMe near the GPUs means after the first epoch most reads are local — no repeated cold pulls from the object store at 2 TB/s. Prefetching the next shards while the current batch trains hides latency entirely. The cache is keyed by shard, and only the hot set needs to fit; cold shards stream on demand.

Data tiering: NVMe to HDD to object/cold

Behind the cache, data ages down a tier ladder by policy. The active checkpoint and the current dataset working set live on NVMe; older milestones and less-used datasets move to HDD; archives and cold checkpoints land in cheap object/cold storage. Each tier trades cost for latency and bandwidth, and data flows down on age and back up on demand (prefetch).

flowchart LR
  GPU["GPU / training client"]
  BB["Burst buffer (local NVMe cache)"]
  HOT["Hot tier (NVMe parallel FS)"]
  WARM["Warm tier (HDD)"]
  COLD["Cold / capacity (object store)"]
  GPU -->|"checkpoint write-back"| BB
  BB -->|"async drain"| HOT
  HOT -->|"age out"| WARM
  WARM -->|"tier down"| COLD
  COLD -->|"prefetch hot shards"| BB
  BB -->|"cached reads"| GPU

Checkpoints write back to the NVMe burst buffer and drain asynchronously; dataset shards are prefetched up into the cache and served locally. Data ages down the NVMe to HDD to object ladder, keeping only the hot working set on the expensive tier.

The fast tier exists so you rarely touch the slow one

Both the checkpoint burst and the epoch read flood are absorbed by NVMe near the compute; the durable object store sees only the smoothed residue — background drains and first-touch misses. Size the burst buffer to hold one or two checkpoints plus the dataset working set, and the cluster runs at NVMe speed while paying mostly HDD/object prices for capacity.

Metadata scaling & the small-file problem

Bandwidth scales by adding data servers — but the namespace does not come along for free. Every open, create, stat, and rename is a metadata operation, and a single metadata server tops out at roughly tens of thousands of ops/sec. The dataset workload can demand millions. Two things have to be true: metadata must scale out, and the workload must stop generating so much of it.

Distributed / sharded metadata

Scale the namespace across many metadata servers. The common strategies:

Hash partitioning — shard by a hash of the path or parent inode, spreading load evenly but scattering directory locality (Lustre DNE, BeeGFS).
Dynamic subtree partitioning — assign whole subtrees to metadata servers and rebalance hot subtrees adaptively, preserving locality (CephFS).
Object-store backends — collapse the namespace into a key/value or object layer (DAOS, S3-style) where "metadata" is just keys, sidestepping a classic POSIX MDS entirely.

Why millions of small files kill the metadata server

A billion-file dataset is a metadata workload disguised as a storage workload. Each file is an inode plus a directory entry plus permission bits; reading it costs a metadata round trip before a single data byte moves. A billion opens per epoch at 50k ops/sec is hours of pure metadata. Worse, each read is tiny — a 100 KB file does not even fill one stripe, so all the data-plane parallelism is wasted on micro-I/Os, and the disks thrash on random seeks instead of streaming.

The fix: stop having small files

The decisive move is to pack many small samples into a few large shards before training ever starts — WebDataset tar archives, TFRecord, MosaicML MDS, or RecordIO. A thousand images become one multi-hundred-MB shard. Now the dataset is ~1e6 large files instead of 1e9 tiny ones: 1000x fewer metadata ops, large sequential reads that actually use striping, and a randomized sample order recovered with an in-memory shuffle buffer rather than random file access.

Approach	Metadata cost	Trade-off
One file per sample	1 op per sample (billions)	Simple, true random access; destroys the MDS and wastes striping.
Packed shards (WebDataset/TFRecord)	1 op per ~1,000 samples	Huge sequential reads; needs a shuffle buffer, costly to rewrite/edit.
Sharded MDS (DNE / subtree)	Spreads ops over many servers	Scales the namespace; cross-shard renames and locality get harder.
Object-store backend	Keys, not inodes	Massive flat scale; weaker POSIX semantics (no cheap rename/append).

Pack the files, then scale what's left

Sharding the metadata server is necessary but not sufficient — the bigger win is never generating a billion opens in the first place. Pack small samples into large shards so the data plane streams instead of seeks, and reserve the distributed MDS for the checkpoints, manifests, and shard files that remain.

Consistency & durability

Training tolerates relaxed general-purpose consistency but demands strict guarantees in exactly two places: a checkpoint must commit atomically, and a committed checkpoint must never be lost. The art is relaxing everything else for speed while keeping those two ironclad.

Write atomicity for checkpoints

A checkpoint is many shards written by many ranks; readers must see either the whole thing or none of it. The standard recipe is write-to-temp then atomic publish: every rank writes its shard into a temporary path, a barrier ensures all shards are durable, and only then does a single atomic rename / manifest commit make checkpoint N visible. A reader always resolves the newest complete checkpoint; an interrupted save is just an orphaned temp dir that is garbage-collected. This is the same invariant as the companion Distributed Checkpointing design — the filesystem provides the atomic-rename primitive it relies on.

sequenceDiagram
  participant R as Ranks (parallel writers)
  participant FS as Parallel filesystem
  participant MDS as Metadata / manifest
  R->>FS: write disjoint shards to temp path
  FS-->>R: stripes durable (replicated / erasure-coded)
  R->>R: barrier - all ranks finished
  R->>MDS: commit manifest (atomic rename)
  MDS-->>R: checkpoint N now visible
  Note over R,MDS: readers only ever see complete checkpoints

Each rank writes a disjoint shard to a temp location; after a barrier confirms every stripe is durable, a single atomic manifest commit publishes the checkpoint. There is no window in which a partial checkpoint is visible.

Durability: replication vs erasure coding

Under the data servers, bytes survive hardware loss via redundancy:

Replication (e.g. 3x) — simple, fast to read and rebuild, but 200% storage overhead. Good for hot, small, latency-sensitive data.
Erasure coding (e.g. 8+3) — ~37% overhead for similar fault tolerance, at the cost of read-modify-write penalties on partial writes and heavier CPU/network on rebuild. Ideal for large, write-once checkpoint and dataset shards.

Because checkpoints and packed dataset shards are large and write-once, erasure coding is the natural fit — full-stripe writes avoid the read-modify-write tax, and the capacity savings at petabyte scale are decisive.

Concurrent writers & relaxed POSIX

Strict POSIX byte-range coherency across thousands of clients would require constant lock traffic and crush throughput. Training does not need it: ranks write disjoint byte ranges or separate shard files, so there is no true write-write conflict. The filesystem can therefore relax to flush-on-close / commit-on-rename semantics, use O_DIRECT to bypass coherent client caches on the write path, and skip cross-client cache invalidation — trading POSIX strictness the workload never exercises for raw bandwidth.

Failure recovery

A dead disk or data server is rebuilt from replicas or parity in the background while reads transparently fall back to surviving copies. A failed metadata server fails over to a standby that replays its journal, so the namespace is consistent on restart. A job that crashes mid-write loses only the uncommitted temp data; the last published checkpoint is untouched, so recovery is always to a clean, complete point.

Relax everything except the commit

The design trades away strict cross-client POSIX coherency — which training never uses — to win bandwidth, while holding two invariants absolutely: a checkpoint publishes atomically, and a committed checkpoint is durable against node and rack loss. Everything fast is built on top of those two things being slow and certain.

Bottlenecks & scaling

Each mechanism above relieves one pressure and eventually exposes the next. The recurring pattern is the same as the rest of ML infrastructure: spend network and fast storage to convert a synchronized burst into a smooth, parallel flow, and keep the slow durable tier off the critical path.

Bottleneck	Symptom	Mitigation
Metadata server	Opens/stats throttle; small-file epochs spend hours in metadata	Shard the MDS (hash / subtree), pack samples into large shards, object-store backend
Write bandwidth bursts	All ranks write at once; object store can't absorb the TB/s spike	NVMe burst buffer + write-back, stripe across many data servers, de-dup DP replicas
Small files	Tiny random I/Os waste striping; disks seek instead of stream	WebDataset / TFRecord packing, shuffle buffer for randomization, large sequential reads
Tail latency / stragglers	One slow data server stalls a collective read; slowest stripe sets step time	Wide striping, replicas for hot data, hedged reads, drain stragglers, balanced placement
Read re-fetch	Same dataset shards pulled cold from object store every epoch	NVMe read cache of the hot working set, prefetch next shards, keep cache warm across epochs
Cost / capacity	Petabytes on all-NVMe is unaffordable	Tier NVMe to HDD to object; keep only hot data fast, erasure-code cold capacity
Rebuild / durability	Disk loss triggers heavy erasure rebuild; partial writes pay read-modify-write	Full-stripe write-once layout, replication for hot/small data, background rebuild

Summary

An ML-optimized filesystem exists to keep the world's most expensive hardware fed through two opposite storms. Stripe every file across many data servers and separate that data plane from a sharded metadata control plane, so bandwidth scales to TB/s and the namespace scales to millions of ops/sec independently. Tier and cache on an NVMe burst buffer — write-back to smooth the checkpoint burst, read-cache and prefetch the hot dataset shards — so the cheap object store sees only smoothed residue. Pack billions of small samples into a few large shards so the data plane streams instead of thrashing and the metadata server survives. And commit checkpoints atomically over erasure-coded, durable storage, with relaxed POSIX everywhere the workload doesn't need it. Get those right and storage disappears as a concern — the GPUs stay busy, and the filesystem is never the reason a step is slow.