Storage & Media
YouTube — Video Platform
Upload large videos (resumable, chunked, direct-to-blob) and watch them — transcoded into multiple renditions and delivered via adaptive bitrate streaming over a CDN. Targets: files up to 256 GB, time-to-first-frame < 500 ms even on low bandwidth, 1M uploads/day and 100M views.
Requirements
Functional
- Upload videos; watch / stream videos. (Comments, subscriptions, recommendations, search are extensions.)
Non-functional
- Availability >> consistency for uploads — accept the write, process asynchronously.
- Support large videos up to 256 GB; low-latency streaming (< 500 ms TTFF) even on low bandwidth.
- Scale to 1M uploads/day and 100M views.
Scale & back-of-the-envelope
- 1M uploads/day ≈ ~12/s avg, ~40–60/s peak. Metadata ≈ 1M × 1 KB × 365 ≈ 0.35 TB/yr (the challenge is read QPS, not size).
- Media storage ≈ ~1.5 PB/day with renditions → tiered/cold storage + transcode-on-demand for the long tail.
- Streaming egress ≈ ~2M concurrent streams × ~3 Mbps ≈ ~6–20 Tbps (~67 PB/day) → a CDN is mandatory; the origin must be shielded.
API design
POST /videos # initiate: create record, return videoId + presigned part URLs
PUT {presignedUrl} # upload one ~10MB part directly to S3 (resumable)
POST /videos/:id/complete # finalize multipart upload (ETag list)
GET /videos/:id # metadata + manifest (master playlist) URL
GET /videos/:id/status # pending | uploaded | chunking | chunks_complete
GET {cdn}/videos/:id/master.m3u8 # HLS/DASH manifest from CDN
High-level design
The client requests presigned URLs and uploads ~10 MB chunks directly to S3; an S3 event triggers a Chunker that splits the source into 2–10 s clips, which fan out to per-resolution Transcoders. Rendered segments return to S3, notifications update the metadata DB, and the CDN serves popular segments.
flowchart LR
Client["Client (10MB chunks)"]
AG["API Gateway"]
VS["Video Service"]
DB[("Metadata DB (Postgres)")]
S3[("S3 Blob Store")]
CH["Chunker (2-10s clips)"]
T240["Transcoder 240p"]
T720["Transcoder 720p"]
T1080["Transcoder 1080p"]
CDN(("CDN edge"))
Viewer["Viewer / Player"]
Client -->|"ask presigned URLs"| AG
AG --> VS
VS -->|"presigned multipart URLs"| S3
VS -->|"store metadata"| DB
Client -->|"multipart upload direct"| S3
S3 -->|"s3 notification"| CH
CH --> T240 & T720 & T1080
T240 --> S3
T720 --> S3
T1080 --> S3
S3 -->|"s3 notification (status)"| DB
S3 --> CDN
CDN -->|"segments (cache)"| Viewer
Deep dive · resumable chunked upload
The client splits the (up to 256 GB) file into ~10 MB parts uploaded
directly to S3 via presigned multipart URLs — the app
tier never proxies terabytes. On failure it re-PUTs only the missing
parts (tracked via ETags / S3 ListParts).
sequenceDiagram
participant C as Client
participant VS as Video Service
participant DB as Metadata DB
participant S3 as S3
C->>VS: POST /videos (metadata)
VS->>DB: create Video status=pending
VS->>S3: init multipart, get presigned part URLs
VS-->>C: videoId + presigned URLs
loop each 10MB part
C->>S3: PUT part (presigned)
S3-->>C: 200 + ETag
end
C->>VS: POST /videos/:id/complete (ETags)
S3->>DB: s3 notification status=uploaded
Status machine:
pending → uploaded → chunking → chunks_complete. The API
acks immediately; transcoding is async.
Trade-off: direct-to-S3 scales infinitely but loses
inline validation/AV-scanning — mitigate with presigned-URL scoping +
async scanning.
Deep dive · transcoding pipeline (parallel DAG)
One machine transcoding a 256 GB video to 4 renditions takes hours. Instead, split first, then transcode segments in parallel — each 2–10 s clip is GOP/keyframe-aligned and independent, giving near-linear speedup.
flowchart TD
U["Uploaded source in S3"] -->|"s3 notification"| CH["Chunker (probe + GOP-align)"]
CH -->|"2-10s segments"| Q["Job Queue (segment x rendition)"]
Q --> W240["240p worker"]
Q --> W720["720p worker"]
Q --> W1080["1080p worker"]
W240 --> S3O[("S3 rendered segments")]
W720 --> S3O
W1080 --> S3O
S3O --> PKG["Package + concat per rendition"]
PKG --> MAN["Manifest builder (HLS/DASH)"]
MAN -->|"update status"| DB[("Metadata DB")]
Why a queue (SQS/Kafka): back-pressure, retries,
autoscale workers on queue depth, spot/GPU for heavy codecs. Jobs are
idempotent keyed by (videoId, segmentId, rendition).
Segment size 2–10 s is the sweet spot and conveniently equals the ABR
segment size.
Deep dive · adaptive bitrate streaming
Each source becomes multiple renditions, each cut into 2–10 s
segments, described by a manifest (HLS
.m3u8 / DASH .mpd). The player measures
bandwidth + buffer and switches rendition per segment — enabling fast
start and low-bandwidth playback.
sequenceDiagram
participant P as Player
participant CDN as CDN
participant S3 as S3 Origin
P->>CDN: GET master manifest (HLS/DASH)
CDN-->>P: variant playlists (240p..4k)
loop every 2-10s segment
P->>CDN: GET segment (chosen bitrate)
alt cache hit
CDN-->>P: segment
else cache miss
CDN->>S3: fetch segment
S3-->>CDN: segment
CDN-->>P: segment
end
P->>P: measure bandwidth, switch rendition
end
Segment-based delivery enables ABR switching, CDN cacheability of immutable segments, parallel transcoding, and fast start. HLS vs DASH: HLS is ubiquitous/native on Apple; DASH is codec-agnostic but not native on Safari — CMAF/fMP4 packs once and plays everywhere.
Deep dive · view-count aggregation
100M+ views cannot be UPDATE views = views + 1 on one row
(hot-row contention). Player emits view events →
Kafka → stream processor (Flink/Spark) windowed
counts → periodic DB flush (Lambda style: approximate real-time +
exact batch). Use sharded Redis INCR and
HyperLogLog for unique viewers; define a "view" (e.g.
>30 s) to avoid inflation. Counts are eventually consistent —
acceptable for display.
Data model
Video
id, name, description, uploaderId, durationSec, thumbnailUrl
status: pending | uploaded | chunking | chunks_complete
FullS3Url # source object
ChunksByResolution { 240p:[segs], 720p:[segs], 1080p:[segs], 4k:[segs] } # the manifest
visibility: public | unlisted | private
VideoMetadata { videoId, title, tags[], viewCount, likeCount }
User { id, name, email }
PostgreSQL for metadata — ACID, MVCC, JSON/JSONB for
the ChunksByResolution document and status transitions.
Shard by videoId with a time sort key; add read replicas
+ Redis for hot metadata.
Bottleneck summary
Upload → presigned direct-to-S3 multipart. Transcoding → split + parallel segment workers, queue-driven autoscale. Metadata → shard + replicas + cache. Egress (~15–20 Tbps) → CDN offload + immutable segments + ABR. Storage (PB/day) → lifecycle to Glacier + transcode-on-demand for the long tail.