AI / ML Infrastructure
ML-Optimized Distributed File System
A training cluster is one of the most expensive machines ever built,
and it spends its whole life waiting on a filesystem. The storage
layer has to serve two workloads that pull in opposite directions: a
sustained, sequential read flood as thousands of GPUs
stream a petabyte-scale dataset every epoch, and a
bursty, perfectly synchronized write storm when every
rank dumps a multi-terabyte checkpoint at the same
instant. Layered on top is the
many-small-files problem — a billion 100 KB
images will melt any metadata server long before bandwidth is the
issue. A good design attacks all three: stripe each
file across many data servers for aggregate TB/s,
separate the metadata control plane from the
bulk-data plane and sharded it, tier and cache hot
data on local NVMe so the slow object store is touched as rarely as
possible, and commit checkpoints
atomically and durably so a crash never resumes from
a half-written mess.
Requirements
The shape of the problem comes straight from the I/O patterns of a training job. Unlike a web backend with millions of small independent requests, a training cluster generates a tiny number of enormous, highly correlated I/O patterns — and that correlation is exactly what makes it hard. There are three distinct workloads, and they conflict.
- Bursty checkpoint writes (all ranks at once) — at the checkpoint step, every rank in the job snapshots its shard of the model and optimizer state simultaneously. There is no smooth arrival curve: the filesystem sees idle, idle, idle, then a synchronized multi-terabyte write burst, then idle again. The system must be provisioned for that peak even though the average is low.
- High-throughput sequential dataset reads — every epoch the job streams the full training set. Reads are large, sequential, and read-only, but the aggregate rate across thousands of GPUs is enormous and sustained for the entire run. Here the enemy is total bandwidth and stragglers, not latency of any single read.
- The many-small-files problem — vision and multimodal datasets are often billions of individual files (one image, one audio clip per file). Opening each file is a metadata operation; a billion opens per epoch is a metadata workload that no single namespace server can survive, and the tiny per-file reads waste all the striping the data plane offers.
Functional requirements
- Shared global namespace — every client (every GPU host) sees the same directory tree and the same files, so any rank can read any dataset shard and write any checkpoint shard.
- Parallel read & write — many clients read and write the same large file concurrently, each touching a disjoint byte range, so a single file's bandwidth scales past one server.
-
POSIX-ish or object API — support a POSIX-like
interface
(
open/read/write/rename) for frameworks that expect files, and/or an object API (PUT/GETof large objects) for checkpoint and shard storage. - Atomic checkpoint publish — a checkpoint becomes visible to readers only when it is complete; partial or in-flight writes are never observable.
- Tiered placement — let callers (or policy) place data on a fast NVMe tier or a cheap capacity tier, and move data between tiers.
Non-functional requirements
-
Aggregate bandwidth — multiple
TB/sof read throughput sustained, andTB/s-class write throughput during the checkpoint burst. Bandwidth must scale by adding data servers. - Scalable metadata — millions of metadata ops/sec, scaling out across many metadata servers rather than bottlenecking on one.
- Durability — survive disk, node, and rack loss without losing a committed checkpoint; replication or erasure coding underneath.
- Cost tiers — petabytes are unaffordable on all-NVMe, so blend NVMe, HDD, and object/cold storage and keep only the hot working set on the expensive tier.
- Predictable tail latency — one slow data server should not stall a collective read across thousands of GPUs; the slowest stripe sets the step time.
The filesystem's only job is to keep GPUs fed
Every second a GPU waits on a read() for the next
batch, or stalls inside a synchronous checkpoint, is burned money on
the most expensive hardware in the building. So the design target is
not "fast files" in the abstract — it is
never be the reason a GPU is idle: hide checkpoint
writes behind compute, and serve the dataset fast enough that data
loading is never the bottleneck.
Scale & back-of-the-envelope
Two numbers drive the whole design: the
checkpoint write burst
(how many bytes land how fast when every rank saves at once) and the
sustained dataset read rate (how fast the cluster
drains the dataset every epoch). A third —
metadata ops/sec — is what quietly kills you on
small-file datasets. Checkpoint footprint follows the mixed-precision
Adam rule of ~14 bytes per parameter (bf16 weights +
fp32 master + Adam m + Adam v).
| Quantity | Figure | Math / why it matters |
|---|---|---|
| Checkpoint footprint / param | ~14 bytes |
bf16 weights (2) + fp32 master (4) + Adam m (4) + Adam v (4). |
| 1T-param checkpoint | ~14 TB |
1e12 x 14 — the full state written in one burst. |
| Ranks writing at once | ~8,000–10,000 |
All TP x PP x DP ranks snapshot simultaneously (de-dup DP replicas first). |
| Write window (to hide in compute) | ~10–20 s |
Async stage to NVMe, then drain; the burst must clear before the next save. |
| Required aggregate write BW | ~0.7–1.4 TB/s |
14 TB / 10–20 s — the burst the data tier must absorb. |
| Data servers for the burst | ~150–280 |
1.4 TB/s / ~5 GB/s per server — striping is mandatory, one server can't. |
| Dataset size | ~1–10 PB |
Web-scale text or billions of images/clips; far past any single node. |
| Sustained read BW | ~2–4 TB/s |
~1,000 nodes x ~2–4 GB/s each, every epoch, for the whole run. |
| Uncached epoch I/O time | ~42 min |
5 PB / 2 TB/s — why hot shards must be cached on NVMe, not re-read cold. |
| Small-file dataset | 1e9 files x 100 KB |
~100 TB of bytes but a billion inodes / opens per epoch. |
| Metadata ops for one epoch | ~5.5 h @ 50k ops/s |
1e9 opens / 5e4 — a single MDS spends hours just opening files. |
| After packing into shards | ~1e6 shards |
1,000 files per WebDataset tar -> 1000x fewer metadata ops, big sequential reads. |
The punchline of the table is that no single server does any of this. A 1.4 TB/s write burst needs hundreds of data servers in parallel; a 2–4 TB/s sustained read needs the hot set cached on NVMe near the GPUs; and a billion-file dataset needs the files packed away before the metadata server ever sees them. Each of the sections below is the mechanism for one of those three pressures.
Architecture: a parallel filesystem
The foundational idea — shared by Lustre, IBM GPFS / Spectrum Scale, and BeeGFS, and mirrored by object-store designs like DAOS and WekaFS — is to split the control plane from the data plane. A metadata server (MDS) owns the namespace: directories, file names, permissions, and crucially each file's layout (which data servers hold it, the stripe size, the stripe count). Separate data servers (Lustre calls them OSS hosts managing OST targets) own only opaque byte ranges and do the bulk I/O.
Striping is where the bandwidth comes from
A single file is striped across many data servers:
the file is cut into fixed-size stripes (say 1 MB)
distributed round-robin over N targets. A client reads
stripe 0 from server A, stripe 1 from server B, stripe 2 from server
C, and so on — all in parallel. The file's bandwidth is
therefore N x a single server's bandwidth, capped only by
the client's network and the number of stripes. A 14 TB
checkpoint striped over 200 servers is 200 independent flows, not one.
The open path vs the I/O path
This separation makes the hot path cheap. On open, the
client makes one request to the MDS and gets back the
layout. After that, all the heavy read/write
traffic goes directly from client to the data servers
— the metadata server is never in the data path. This is
client-side striping: the client library (or a kernel
mount) scatters writes and gathers reads across the targets itself, so
there is no central proxy to bottleneck on. Control-plane requests are
small and infrequent; data-plane bytes are huge and parallel.
flowchart TD
C0["Client 0 (GPU host)"]
C1["Client 1 (GPU host)"]
CN["Client N (GPU host)"]
MDS["Metadata Server (namespace + file layout)"]
D0["Data server 0 (OST)"]
D1["Data server 1 (OST)"]
D2["Data server 2 (OST)"]
D3["Data server 3 (OST)"]
C0 -->|"1. open: get layout"| MDS
C1 --> MDS
CN --> MDS
C0 -->|"2. parallel stripe I/O"| D0
C0 --> D1
C0 --> D2
C0 --> D3
C1 --> D0
C1 --> D2
CN --> D1
CN --> D3
Clients hit the metadata server once to fetch a file's layout, then stream bytes directly and in parallel to every data server that holds a stripe. The metadata server stays out of the bulk-data path, so aggregate bandwidth scales with the number of data servers.
Separate the plane that scales by bytes from the plane that scales by ops
Bulk I/O scales by bandwidth (add data servers and stripes); the namespace scales by operations (opens, creates, renames). Bolting them together means one workload starves the other. Keeping them apart lets you grow a 1.4 TB/s data tier and a million-op/s metadata tier independently — and is why both layers get their own deep dive below.
Tiering & caching: the burst buffer
Petabytes on all-NVMe is unaffordable, and a cold object store is too slow to feed GPUs directly. The resolution is a storage hierarchy with a fast tier in front of a cheap one, plus aggressive caching of whatever the job actually touches. The fast tier is often a burst buffer: local or near-compute NVMe (Cray DataWarp, Lustre PCC, DAOS, WekaFS, or an Alluxio-style cache) that absorbs spikes and serves hot data.
Write-back buffering for checkpoints
The checkpoint burst is the textbook case for
write-back. Ranks write their shards to the
NVMe burst buffer at local speed — that clears the
synchronized burst in seconds — and a background drainer
asynchronously copies the checkpoint down to the durable object store.
Training resumes the instant the fast write lands; the slow durable
write happens off the critical path. The burst buffer turns a
TB/s spike that the object store could never absorb into
a smooth trickle it can.
Read caching of hot dataset shards
Dataset reads have strong reuse: the same shards are read once per epoch, epoch after epoch. Caching the working set on NVMe near the GPUs means after the first epoch most reads are local — no repeated cold pulls from the object store at 2 TB/s. Prefetching the next shards while the current batch trains hides latency entirely. The cache is keyed by shard, and only the hot set needs to fit; cold shards stream on demand.
Data tiering: NVMe to HDD to object/cold
Behind the cache, data ages down a tier ladder by policy. The active checkpoint and the current dataset working set live on NVMe; older milestones and less-used datasets move to HDD; archives and cold checkpoints land in cheap object/cold storage. Each tier trades cost for latency and bandwidth, and data flows down on age and back up on demand (prefetch).
flowchart LR
GPU["GPU / training client"]
BB["Burst buffer (local NVMe cache)"]
HOT["Hot tier (NVMe parallel FS)"]
WARM["Warm tier (HDD)"]
COLD["Cold / capacity (object store)"]
GPU -->|"checkpoint write-back"| BB
BB -->|"async drain"| HOT
HOT -->|"age out"| WARM
WARM -->|"tier down"| COLD
COLD -->|"prefetch hot shards"| BB
BB -->|"cached reads"| GPU
Checkpoints write back to the NVMe burst buffer and drain asynchronously; dataset shards are prefetched up into the cache and served locally. Data ages down the NVMe to HDD to object ladder, keeping only the hot working set on the expensive tier.
The fast tier exists so you rarely touch the slow one
Both the checkpoint burst and the epoch read flood are absorbed by NVMe near the compute; the durable object store sees only the smoothed residue — background drains and first-touch misses. Size the burst buffer to hold one or two checkpoints plus the dataset working set, and the cluster runs at NVMe speed while paying mostly HDD/object prices for capacity.
Metadata scaling & the small-file problem
Bandwidth scales by adding data servers — but the namespace does not
come along for free. Every open, create,
stat, and rename is a metadata operation,
and a single metadata server tops out at roughly tens of thousands of
ops/sec. The dataset workload can demand millions. Two things
have to be true: metadata must scale out, and the workload must stop
generating so much of it.
Distributed / sharded metadata
Scale the namespace across many metadata servers. The common strategies:
- Hash partitioning — shard by a hash of the path or parent inode, spreading load evenly but scattering directory locality (Lustre DNE, BeeGFS).
- Dynamic subtree partitioning — assign whole subtrees to metadata servers and rebalance hot subtrees adaptively, preserving locality (CephFS).
- Object-store backends — collapse the namespace into a key/value or object layer (DAOS, S3-style) where "metadata" is just keys, sidestepping a classic POSIX MDS entirely.
Why millions of small files kill the metadata server
A billion-file dataset is a metadata workload disguised as a storage workload. Each file is an inode plus a directory entry plus permission bits; reading it costs a metadata round trip before a single data byte moves. A billion opens per epoch at 50k ops/sec is hours of pure metadata. Worse, each read is tiny — a 100 KB file does not even fill one stripe, so all the data-plane parallelism is wasted on micro-I/Os, and the disks thrash on random seeks instead of streaming.
The fix: stop having small files
The decisive move is to
pack many small samples into a few large shards
before training ever starts — WebDataset tar
archives, TFRecord, MosaicML MDS, or
RecordIO. A thousand images become one multi-hundred-MB shard. Now the
dataset is ~1e6 large files instead of 1e9 tiny ones:
1000x fewer metadata ops, large sequential reads that
actually use striping, and a randomized sample order recovered with an
in-memory shuffle buffer rather than random file
access.
| Approach | Metadata cost | Trade-off |
|---|---|---|
| One file per sample | 1 op per sample (billions) | Simple, true random access; destroys the MDS and wastes striping. |
| Packed shards (WebDataset/TFRecord) | 1 op per ~1,000 samples | Huge sequential reads; needs a shuffle buffer, costly to rewrite/edit. |
| Sharded MDS (DNE / subtree) | Spreads ops over many servers | Scales the namespace; cross-shard renames and locality get harder. |
| Object-store backend | Keys, not inodes | Massive flat scale; weaker POSIX semantics (no cheap rename/append). |
Pack the files, then scale what's left
Sharding the metadata server is necessary but not sufficient — the bigger win is never generating a billion opens in the first place. Pack small samples into large shards so the data plane streams instead of seeks, and reserve the distributed MDS for the checkpoints, manifests, and shard files that remain.
Consistency & durability
Training tolerates relaxed general-purpose consistency but demands strict guarantees in exactly two places: a checkpoint must commit atomically, and a committed checkpoint must never be lost. The art is relaxing everything else for speed while keeping those two ironclad.
Write atomicity for checkpoints
A checkpoint is many shards written by many ranks; readers must see
either the whole thing or none of it. The standard recipe is
write-to-temp then atomic publish: every rank writes
its shard into a temporary path, a barrier ensures all shards
are durable, and only then does a single
atomic rename / manifest commit make checkpoint
N visible. A reader always resolves the newest
complete checkpoint; an interrupted save is just an orphaned
temp dir that is garbage-collected. This is the same invariant as the
companion
Distributed Checkpointing
design — the filesystem provides the atomic-rename primitive it relies
on.
sequenceDiagram
participant R as Ranks (parallel writers)
participant FS as Parallel filesystem
participant MDS as Metadata / manifest
R->>FS: write disjoint shards to temp path
FS-->>R: stripes durable (replicated / erasure-coded)
R->>R: barrier - all ranks finished
R->>MDS: commit manifest (atomic rename)
MDS-->>R: checkpoint N now visible
Note over R,MDS: readers only ever see complete checkpoints
Each rank writes a disjoint shard to a temp location; after a barrier confirms every stripe is durable, a single atomic manifest commit publishes the checkpoint. There is no window in which a partial checkpoint is visible.
Durability: replication vs erasure coding
Under the data servers, bytes survive hardware loss via redundancy:
- Replication (e.g. 3x) — simple, fast to read and rebuild, but 200% storage overhead. Good for hot, small, latency-sensitive data.
- Erasure coding (e.g. 8+3) — ~37% overhead for similar fault tolerance, at the cost of read-modify-write penalties on partial writes and heavier CPU/network on rebuild. Ideal for large, write-once checkpoint and dataset shards.
Because checkpoints and packed dataset shards are large and write-once, erasure coding is the natural fit — full-stripe writes avoid the read-modify-write tax, and the capacity savings at petabyte scale are decisive.
Concurrent writers & relaxed POSIX
Strict POSIX byte-range coherency across thousands of clients would
require constant lock traffic and crush throughput. Training does not
need it: ranks write
disjoint byte ranges or separate shard files, so
there is no true write-write conflict. The filesystem can therefore
relax to flush-on-close / commit-on-rename semantics,
use O_DIRECT to bypass coherent client caches on the
write path, and skip cross-client cache invalidation — trading POSIX
strictness the workload never exercises for raw bandwidth.
Failure recovery
A dead disk or data server is rebuilt from replicas or parity in the background while reads transparently fall back to surviving copies. A failed metadata server fails over to a standby that replays its journal, so the namespace is consistent on restart. A job that crashes mid-write loses only the uncommitted temp data; the last published checkpoint is untouched, so recovery is always to a clean, complete point.
Relax everything except the commit
The design trades away strict cross-client POSIX coherency — which training never uses — to win bandwidth, while holding two invariants absolutely: a checkpoint publishes atomically, and a committed checkpoint is durable against node and rack loss. Everything fast is built on top of those two things being slow and certain.
Bottlenecks & scaling
Each mechanism above relieves one pressure and eventually exposes the next. The recurring pattern is the same as the rest of ML infrastructure: spend network and fast storage to convert a synchronized burst into a smooth, parallel flow, and keep the slow durable tier off the critical path.
| Bottleneck | Symptom | Mitigation |
|---|---|---|
| Metadata server | Opens/stats throttle; small-file epochs spend hours in metadata | Shard the MDS (hash / subtree), pack samples into large shards, object-store backend |
| Write bandwidth bursts | All ranks write at once; object store can't absorb the TB/s spike | NVMe burst buffer + write-back, stripe across many data servers, de-dup DP replicas |
| Small files | Tiny random I/Os waste striping; disks seek instead of stream | WebDataset / TFRecord packing, shuffle buffer for randomization, large sequential reads |
| Tail latency / stragglers | One slow data server stalls a collective read; slowest stripe sets step time | Wide striping, replicas for hot data, hedged reads, drain stragglers, balanced placement |
| Read re-fetch | Same dataset shards pulled cold from object store every epoch | NVMe read cache of the hot working set, prefetch next shards, keep cache warm across epochs |
| Cost / capacity | Petabytes on all-NVMe is unaffordable | Tier NVMe to HDD to object; keep only hot data fast, erasure-code cold capacity |
| Rebuild / durability | Disk loss triggers heavy erasure rebuild; partial writes pay read-modify-write | Full-stripe write-once layout, replication for hot/small data, background rebuild |
Summary
An ML-optimized filesystem exists to keep the world's most expensive
hardware fed through two opposite storms.
Stripe every file across many data servers and
separate that data plane from a
sharded metadata control plane, so bandwidth scales
to TB/s and the namespace scales to millions of ops/sec
independently. Tier and cache on an NVMe burst
buffer — write-back to smooth the checkpoint burst, read-cache and
prefetch the hot dataset shards — so the cheap object store sees
only smoothed residue. Pack
billions of small samples into a few large shards so the data plane
streams instead of thrashing and the metadata server survives. And
commit checkpoints
atomically over
erasure-coded, durable storage, with relaxed POSIX
everywhere the workload doesn't need it. Get those right and storage
disappears as a concern — the GPUs stay busy, and the filesystem is
never the reason a step is slow.