Storage & Media
Dropbox — File Storage & Sync
Upload, download, and automatically sync files across devices. The hard parts: moving very large files (up to 50 GB) reliably over flaky networks, keeping every device eventually consistent, and doing it with low latency. The key move is splitting the data plane (bytes in S3, uploaded directly via presigned URLs) from the control plane (metadata in DynamoDB).
Requirements
Functional
- Upload a file; download a file; automatically sync files across devices.
Non-functional
- Availability >> consistency — always-on and convergent; sync is eventually consistent.
- Low-latency uploads/downloads; support files up to 50 GB → resumable uploads are mandatory.
- High data integrity / sync accuracy — device A must faithfully match device B.
- Durability via S3 (11 nines); we don't roll our own blob store.
Scale & back-of-the-envelope
- A 50 GB file @ 100 Mbps ≈ 72 min of transfer — it will be interrupted, so a monolithic POST is unusable.
- Chunk size 5 MB → a 50 GB file = 10,000 chunks, each tracked, fingerprinted, independently retryable.
- The app tier must never proxy 50 GB bodies → direct-to-S3 via presigned URLs.
- Naive change-polling (100M clients every 30 s) ≈ 3.3M req/s → adaptive polling + push.
- Content-hash dedup recovers a large fraction of multi-exabyte raw storage.
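The estimates above can be re-derived in a few lines. This is a rough sketch: the function and parameter names are illustrative, and it computes both the decimal (GB/MB) and binary (GiB/MiB) readings, because the 72-minute figure matches the binary reading while the 10,000-chunk figure matches the decimal one.

```python
def estimates(file_bytes, chunk_bytes, link_bps=100e6,
              clients=100_000_000, poll_s=30):
    """Back-of-the-envelope numbers for one large file and naive polling."""
    return {
        "transfer_min": file_bytes * 8 / link_bps / 60,
        "chunks": -(-file_bytes // chunk_bytes),   # ceiling division
        "poll_req_per_s": clients / poll_s,
    }


decimal = estimates(50 * 10**9, 5 * 10**6)   # 50 GB file, 5 MB chunks
binary = estimates(50 * 2**30, 5 * 2**20)    # 50 GiB file, 5 MiB chunks

for label, est in (("decimal", decimal), ("binary", binary)):
    print(f"{label}: {est['transfer_min']:.1f} min, "
          f"{est['chunks']} chunks, "
          f"{est['poll_req_per_s'] / 1e6:.2f}M req/s")
```

Either reading leads to the same conclusion: the transfer takes over an hour, and the chunk count is on the order of ten thousand.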
API design
POST /files # register metadata, get presigned chunk URLs
PUT {presignedUrl} # client -> S3 directly, one per 5MB chunk
POST /files/:id/chunks/:cid # mark a chunk completed after S3 PUT
POST /files/:id/commit # finalize: status started -> completed
GET /files/:id # metadata + presigned GET URL
GET /changes?since={cursor} # delta sync: returns fileIds[]
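Why presigned URLs? The 50 GB body never traverses our servers; S3 absorbs the upload bandwidth and verifies integrity, while the client retries any chunk that fails; each URL is short-lived and scoped to a single object and operation, so authorization decisions stay in the control plane.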
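To make "short-lived and scoped" concrete, here is a minimal sketch of the idea using an HMAC over the method, path, and expiry. It is a simplified stand-in, not S3's actual signing scheme (AWS Signature Version 4), and the secret, host name, and function names are all hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical key known only to the signer and the store


def _signature(method: str, path: str, expires: int) -> str:
    message = f"{method}\n{path}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def presign(method: str, path: str, ttl_s: int, now: Optional[float] = None) -> str:
    """Return a URL that authorizes one method on one path until it expires."""
    issued = time.time() if now is None else now
    expires = int(issued + ttl_s)
    query = urlencode({"expires": expires, "sig": _signature(method, path, expires)})
    return f"https://blobs.example.com{path}?{query}"


def verify(method: str, url: str, now: Optional[float] = None) -> bool:
    """The check a blob store would run before accepting the request."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if current > expires:
        return False  # link has expired
    return hmac.compare_digest(sig, _signature(method, parsed.path, expires))


url = presign("PUT", "/chunks/abc", ttl_s=900, now=1000)
print(verify("PUT", url, now=1500))   # same method, before expiry
print(verify("GET", url, now=1500))   # different method is rejected
print(verify("PUT", url, now=2000))   # after expiry is rejected
```

The point of the design is that the store can validate the request without calling back into the control plane, which is why the bytes can bypass the app tier.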
High-level design
The client (with a local DB + watched folder) talks to the gateway (auth, rate limiting, routing). The File Service issues presigned URLs and writes metadata to DynamoDB; the Sync Service computes "what changed since cursor." Bytes move directly between client and S3.
flowchart LR
subgraph Device["Client Device"]
App["Client App"]
LF["Local Folder"]
end
App --> GW["LB and API Gateway"]
GW -->|upload and getFile| FS["File Service"]
GW -->|getChanges| SS["Sync Service"]
FS --> SS
FS -->|presigned URL| S3["Blob Store (S3)"]
FS -->|write metadata| MD["File Metadata (DynamoDB)"]
SS --> MD
App -.->|direct bytes| S3
S3 -.-> App
Deep dive · chunking + deduplication
The client splits files into 5 MB chunks, fingerprints each (hash(bytes)), and uploads directly to S3. Because a chunk's identity is its content hash, dedup and delta sync come from the same mechanism: overwriting 1 byte in place in a 50 GB file changes only one chunk.
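A small sketch of the client-side step described above. SHA-256 is an assumption (the text only says hash(bytes)), the helper names are illustrative, and the chunk size is shrunk to 4 bytes so the result is easy to read.

```python
import hashlib


def split_chunks(data: bytes, chunk_size: int) -> list:
    """Cut data into fixed-size pieces; the last piece may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def fingerprints(data: bytes, chunk_size: int) -> list:
    """One content hash per chunk, in file order."""
    return [hashlib.sha256(c).hexdigest() for c in split_chunks(data, chunk_size)]


def changed_chunks(old: list, new: list) -> list:
    """Indices in `new` whose fingerprint differs from `old` or is brand new."""
    return [i for i, fp in enumerate(new) if i >= len(old) or old[i] != fp]


CHUNK = 4                          # tiny for readability; the design uses 5 MB
original = b"aaaabbbbccccdddd"
edited = b"aaaabbXbccccdddd"       # one byte overwritten in place

delta = changed_chunks(fingerprints(original, CHUNK), fingerprints(edited, CHUNK))
print(delta)  # [1]
```

Only the chunk at index 1 needs to be uploaded again; the other three fingerprints are unchanged and can be reused.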
flowchart TD
F["File (up to 50GB)"] --> Split["Split into 5MB chunks"]
Split --> H["Fingerprint each chunk hash(bytes)"]
H --> Q{"Fingerprint already in S3?"}
Q -->|Yes| Dedup["Skip upload, reuse object"]
Q -->|No| Up["PUT chunk directly to S3"]
Up --> Mark["Mark chunk completed"]
Dedup --> Mark
Mark --> Commit{"All chunks done?"}
Commit -->|No| Q
Commit -->|Yes| Done["status started to completed"]
Why 5 MB? Small enough that a failed chunk is cheap to retry; large enough that a 50 GB file is "only" 10,000 chunks; and it matches S3's 5 MiB minimum multipart part size. Trade-off: fixed-size chunking suffers the boundary-shift problem on mid-file inserts, where every chunk after the insertion point gets a new fingerprint. Content-defined (Rabin) chunking fixes this at higher CPU cost; fixed-size wins for replace/append workloads.
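The boundary-shift trade-off can be seen in a short experiment. This is a sketch under assumptions: the content-defined chunker below uses a CRC32 over a sliding window as a simple stand-in for a Rabin fingerprint, and the window, mask, and chunk sizes are arbitrary illustrative values.

```python
import hashlib
import random
import zlib


def fp(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


def fixed_chunks(data: bytes, size: int) -> list:
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list:
    """Cut wherever a checksum of the previous `window` bytes has its low bits zero.

    Boundaries depend only on nearby content, so they move with the data
    when bytes are inserted earlier in the file.
    """
    chunks, start = [], 0
    for i in range(window, len(data)):
        if zlib.crc32(data[i - window:i]) & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks


def shared(a: list, b: list) -> int:
    """Number of distinct fingerprints present in both chunk lists."""
    return len({fp(c) for c in a} & {fp(c) for c in b})


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(20_000))
inserted = b"!" + original          # one byte inserted at the very front

fixed_shared = shared(fixed_chunks(original, 1000), fixed_chunks(inserted, 1000))
cdc_old = content_defined_chunks(original)
cdc_shared = shared(cdc_old, content_defined_chunks(inserted))

print(f"fixed-size: {fixed_shared} chunks reusable after the insert")
print(f"content-defined: {cdc_shared} of {len(cdc_old)} chunks reusable")
```

With fixed-size chunks a one-byte insert shifts every boundary, so no fingerprint survives; with content-defined boundaries everything after the first chunk is reusable. The per-byte checksum work is the CPU cost the trade-off refers to.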
Deep dive · metadata vs blob split
The most important decision: never store bytes in the database, never store queryable metadata in the blob store.
| Concern | Blob Store (S3) | Metadata DB (DynamoDB) |
|---|---|---|
| Holds | Raw chunk bytes | FileId, chunk list, name, size, status |
| Access | Large sequential blob R/W | Tiny key-value lookups, frequent updates |
| Scale | Exabytes, cheap/GB | Millions of hot items, single-digit ms |
| Client path | Direct via presigned URL | Via File Service |
DynamoDB fits because access is key-based, the workload needs predictable low latency at high QPS, and its tunable read consistency (eventually consistent by default, strongly consistent on request) matches "availability >> consistency."
Deep dive · sync & conflict resolution
Two change paths: remote changed → pull & replace; local changed → upload. Local changes are detected with native OS file-watch APIs (FSEvents on macOS, FileSystemWatcher on Windows), so the client never busy-polls the disk.
sequenceDiagram
participant W as OS Watcher
participant C as Client App
participant SS as Sync Service
participant MD as Metadata DB
W->>C: File changed
C->>C: Diff to find changed chunks
C->>SS: Upload only changed chunks
loop Adaptive polling
C->>SS: GET /changes?since=cursor
SS->>MD: Query fileIds after cursor
MD-->>SS: fileIds[]
SS-->>C: fileIds[] + new cursor
end
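The polling loop in the diagram can be sketched with an in-memory change log. Treating the cursor as a monotonically increasing sequence number is an assumption, as are the class and function names; the real Sync Service would back this with a query against the metadata store.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeLog:
    """Append-only log for one folder; a cursor is the last sequence number seen."""
    entries: list = field(default_factory=list)   # (seq, file_id) tuples

    def record(self, file_id: str) -> int:
        seq = len(self.entries) + 1
        self.entries.append((seq, file_id))
        return seq

    def changes_since(self, cursor: int):
        """Model of GET /changes?since=cursor: returns (fileIds, new cursor)."""
        newer = [(seq, fid) for seq, fid in self.entries if seq > cursor]
        file_ids = list(dict.fromkeys(fid for _, fid in newer))  # dedupe, keep order
        new_cursor = newer[-1][0] if newer else cursor
        return file_ids, new_cursor


def sync_once(log: ChangeLog, cursor: int, apply) -> int:
    """Apply every change, then return the advanced cursor.

    If apply() raises, the exception propagates and the caller keeps the
    old cursor, so the same batch is fetched again on the next poll.
    """
    file_ids, new_cursor = log.changes_since(cursor)
    for fid in file_ids:
        apply(fid)
    return new_cursor


log = ChangeLog()
for name in ("a", "b", "a"):
    log.record(name)

print(log.changes_since(0))   # (['a', 'b'], 3)
print(log.changes_since(3))   # ([], 3)
```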
- Fast: adaptive polling (frequent when active, back off when idle) + delta sync of only changed chunks.
- Consistent: a cursor on the folder ("seen up to X"), advanced only after apply; a periodic reconciliation pass compares fingerprint manifests to self-heal missed events.
- Conflicts: don't block — keep both and surface a "conflicted copy". Last-writer-wins is simplest but destroys data; the conflicted-copy approach preserves integrity.
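The first and last bullets above can be sketched as two small decision functions. The interval bounds, the doubling rule, and the action names are illustrative assumptions; the conflict check compares both sides against the last fingerprint they agreed on.

```python
def next_poll_interval(current_s: float, saw_changes: bool,
                       min_s: float = 5.0, max_s: float = 300.0) -> float:
    """Poll fast right after activity; otherwise double the wait up to a ceiling."""
    if saw_changes:
        return min_s
    return min(current_s * 2, max_s)


def plan_sync(base_fp: str, local_fp: str, remote_fp: str) -> str:
    """Three-way comparison against the last fingerprint both sides agreed on."""
    if local_fp == remote_fp:
        return "in_sync"
    if local_fp == base_fp:
        return "pull_remote"        # only the remote copy changed
    if remote_fp == base_fp:
        return "push_local"         # only the local copy changed
    return "conflicted_copy"        # both changed: keep both versions


interval = 5.0
for _ in range(3):                  # three idle polls in a row
    interval = next_poll_interval(interval, saw_changes=False)
print(interval)                     # 40.0

print(plan_sync("v1", "v1", "v2"))  # pull_remote
print(plan_sync("v1", "v2", "v3"))  # conflicted_copy
```

Last-writer-wins would collapse the final case into "pull_remote" or "push_local" and silently discard one side's edits, which is the data loss the conflicted-copy approach avoids.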
Deep dive · large-file resumable upload
The file carries status: started and a chunks[] array of { fingerprint, status, s3Link }. Resume = ask the server which chunks are already completed and upload only the rest; a crash costs only the chunks that were in flight (a single 5 MB chunk when uploading serially).
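The resume logic can be illustrated with a simulated flaky store. This is a sketch under assumptions: the "pending" chunk status, the class names, and the serial upload loop are hypothetical, and the store is an in-memory stand-in for S3.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    fingerprint: str
    status: str = "pending"        # assumed values: "pending" -> "completed"


def pending(chunks: list) -> list:
    return [c for c in chunks if c.status != "completed"]


def upload(chunks: list, put_chunk) -> str:
    """Upload only chunks not yet completed; return the resulting file status."""
    for chunk in pending(chunks):
        put_chunk(chunk.fingerprint)   # may raise, e.g. on a dropped connection
        chunk.status = "completed"     # recorded only after the PUT succeeds
    return "completed" if not pending(chunks) else "started"


class FlakyStore:
    """In-memory blob store that fails once, on the Nth PUT."""

    def __init__(self, fail_on_call: int):
        self.calls = 0
        self.fail_on_call = fail_on_call
        self.stored = []

    def put(self, fingerprint: str) -> None:
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise ConnectionError("network dropped")
        self.stored.append(fingerprint)


chunks = [Chunk(f"fp{i}") for i in range(5)]
store = FlakyStore(fail_on_call=3)

try:
    upload(chunks, store.put)          # first attempt dies on the third chunk
except ConnectionError:
    pass

after_crash = [c.fingerprint for c in pending(chunks)]
status = upload(chunks, store.put)     # resume: only the remaining chunks are sent

print(after_crash)   # ['fp2', 'fp3', 'fp4']
print(status)        # completed
```

Six PUT attempts are made for five chunks: the only wasted work is the one chunk that was in flight when the connection dropped.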
sequenceDiagram
participant C as Client App
participant FS as File Service
participant MD as Metadata DB
participant S3 as Blob Store
C->>FS: POST /files (metadata + fingerprints)
FS->>MD: Write metadata status=started
FS->>FS: Sign presigned URLs (missing chunks only)
FS-->>C: Presigned URLs per pending chunk
loop For each pending chunk
C->>S3: PUT 5MB chunk bytes (direct)
S3-->>C: 200 OK (ETag)
C->>FS: Mark chunk completed
end
C->>FS: Commit upload
FS->>MD: status started to completed
Exposing per-chunk completion state in our own metadata means resumability, parallelism, dedup, and delta sync all share one mechanism. CDN download: because chunks are content-addressed, they are immutable and therefore highly cacheable at the edge; access control is preserved by requiring a short-lived signed URL, minted after gateway auth, on every edge fetch.
Data model
FileMetadata (DynamoDB) # PK FileId; GSI on OwnerId, FolderId
FileId, FolderId, Name, MimeType, Size, OwnerId, S3Link
Status: started -> completed
Chunks: [ { id=fingerprint, status, s3Link, updatedAt } ] # dedup key
Folder { Cursor } # drives GET /changes?since=cursor
User { UserId, ... }
A 50 GB file = 10,000 chunk entries. Assuming a 64-character hex fingerprint (e.g. SHA-256), the fingerprints alone take about 640 KB, which is past DynamoDB's 400 KB item limit, so large files must spill Chunks[] into a child table keyed by (FileId, chunkIndex). A block_ref(fingerprint → s3Link, refcount) table enables cross-user dedup + safe garbage collection.
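A minimal in-memory sketch of how a block_ref table could drive dedup and garbage collection. The class and method names are hypothetical, and a real table would need conditional or transactional updates so concurrent writers cannot corrupt the count.

```python
from typing import Optional


class BlockRefs:
    """fingerprint -> {s3_link, refcount}; stand-in for the block_ref table."""

    def __init__(self):
        self.refs = {}

    def add_ref(self, fingerprint: str, s3_link: str) -> bool:
        """Register one more file using this chunk.

        Returns True when this is the first reference, meaning the bytes
        still have to be uploaded; False means an existing object is reused.
        """
        entry = self.refs.get(fingerprint)
        if entry is None:
            self.refs[fingerprint] = {"s3_link": s3_link, "refcount": 1}
            return True
        entry["refcount"] += 1
        return False

    def release(self, fingerprint: str) -> Optional[str]:
        """Drop one reference; return the s3_link to delete if none remain."""
        entry = self.refs[fingerprint]
        entry["refcount"] -= 1
        if entry["refcount"] == 0:
            del self.refs[fingerprint]
            return entry["s3_link"]
        return None


table = BlockRefs()
print(table.add_ref("abc", "s3://chunks/abc"))   # True: first user uploads it
print(table.add_ref("abc", "s3://chunks/abc"))   # False: second user reuses it
print(table.release("abc"))                      # None: one reference remains
print(table.release("abc"))                      # s3://chunks/abc: safe to delete
```

Deleting a file only decrements counts; the blob itself is removed once no file in any account references its fingerprint.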
Why it scales
Data plane (S3 + CDN) and control plane (gateway → services → DynamoDB) scale independently. Chunking is the unifying primitive delivering resumability, delta sync, dedup, parallelism, and cache-friendliness all at once.