System Design Notes All designs

Storage & Media

YouTube — Video Platform

Upload large videos (resumable, chunked, direct-to-blob) and watch them — transcoded into multiple renditions and delivered via adaptive bitrate streaming over a CDN. Targets: files up to 256 GB, time-to-first-frame < 500 ms even on low bandwidth, 1M uploads/day and 100M views.

Requirements

Functional

Non-functional

Scale & back-of-the-envelope

API design

POST /videos                      # initiate: create record, return videoId + presigned part URLs
PUT  {presignedUrl}               # upload one ~10MB part directly to S3 (resumable)
POST /videos/:id/complete         # finalize multipart upload (ETag list)
GET  /videos/:id                  # metadata + manifest (master playlist) URL
GET  /videos/:id/status           # pending | uploaded | chunking | chunks_complete
GET  {cdn}/videos/:id/master.m3u8 # HLS/DASH manifest from CDN

High-level design

The client requests presigned URLs and uploads ~10 MB chunks directly to S3; an S3 event triggers a Chunker that splits the source into 2–10 s clips, which fan out to per-resolution Transcoders. Rendered segments return to S3, notifications update the metadata DB, and the CDN serves popular segments.

flowchart LR
    Client["Client (10MB chunks)"]
    AG["API Gateway"]
    VS["Video Service"]
    DB[("Metadata DB (Postgres)")]
    S3[("S3 Blob Store")]
    CH["Chunker (2-10s clips)"]
    T240["Transcoder 240p"]
    T720["Transcoder 720p"]
    T1080["Transcoder 1080p"]
    CDN(("CDN edge"))
    Viewer["Viewer / Player"]
    Client -->|"ask presigned URLs"| AG
    AG --> VS
    VS -->|"presigned multipart URLs"| S3
    VS -->|"store metadata"| DB
    Client -->|"multipart upload direct"| S3
    S3 -->|"s3 notification"| CH
    CH --> T240 & T720 & T1080
    T240 --> S3
    T720 --> S3
    T1080 --> S3
    S3 -->|"s3 notification (status)"| DB
    S3 --> CDN
    CDN -->|"segments (cache)"| Viewer
      

Deep dive · resumable chunked upload

The client splits the (up to 256 GB) file into ~10 MB parts uploaded directly to S3 via presigned multipart URLs — the app tier never proxies terabytes. On failure it re-PUTs only the missing parts (tracked via ETags / S3 ListParts).

sequenceDiagram
    participant C as Client
    participant VS as Video Service
    participant DB as Metadata DB
    participant S3 as S3
    C->>VS: POST /videos (metadata)
    VS->>DB: create Video status=pending
    VS->>S3: init multipart, get presigned part URLs
    VS-->>C: videoId + presigned URLs
    loop each 10MB part
        C->>S3: PUT part (presigned)
        S3-->>C: 200 + ETag
    end
    C->>VS: POST /videos/:id/complete (ETags)
    S3->>DB: s3 notification status=uploaded
      

Status machine: pending → uploaded → chunking → chunks_complete. The API acks immediately; transcoding is async. Trade-off: direct-to-S3 scales infinitely but loses inline validation/AV-scanning — mitigate with presigned-URL scoping + async scanning.

Deep dive · transcoding pipeline (parallel DAG)

One machine transcoding a 256 GB video to 4 renditions takes hours. Instead, split first, then transcode segments in parallel — each 2–10 s clip is GOP/keyframe-aligned and independent, giving near-linear speedup.

flowchart TD
    U["Uploaded source in S3"] -->|"s3 notification"| CH["Chunker (probe + GOP-align)"]
    CH -->|"2-10s segments"| Q["Job Queue (segment x rendition)"]
    Q --> W240["240p worker"]
    Q --> W720["720p worker"]
    Q --> W1080["1080p worker"]
    W240 --> S3O[("S3 rendered segments")]
    W720 --> S3O
    W1080 --> S3O
    S3O --> PKG["Package + concat per rendition"]
    PKG --> MAN["Manifest builder (HLS/DASH)"]
    MAN -->|"update status"| DB[("Metadata DB")]
      

Why a queue (SQS/Kafka): back-pressure, retries, autoscale workers on queue depth, spot/GPU for heavy codecs. Jobs are idempotent keyed by (videoId, segmentId, rendition). Segment size 2–10 s is the sweet spot and conveniently equals the ABR segment size.

Deep dive · adaptive bitrate streaming

Each source becomes multiple renditions, each cut into 2–10 s segments, described by a manifest (HLS .m3u8 / DASH .mpd). The player measures bandwidth + buffer and switches rendition per segment — enabling fast start and low-bandwidth playback.

sequenceDiagram
    participant P as Player
    participant CDN as CDN
    participant S3 as S3 Origin
    P->>CDN: GET master manifest (HLS/DASH)
    CDN-->>P: variant playlists (240p..4k)
    loop every 2-10s segment
        P->>CDN: GET segment (chosen bitrate)
        alt cache hit
            CDN-->>P: segment
        else cache miss
            CDN->>S3: fetch segment
            S3-->>CDN: segment
            CDN-->>P: segment
        end
        P->>P: measure bandwidth, switch rendition
    end
      

Segment-based delivery enables ABR switching, CDN cacheability of immutable segments, parallel transcoding, and fast start. HLS vs DASH: HLS is ubiquitous/native on Apple; DASH is codec-agnostic but not native on Safari — CMAF/fMP4 packs once and plays everywhere.

Deep dive · view-count aggregation

100M+ views cannot be UPDATE views = views + 1 on one row (hot-row contention). Player emits view events → Kafka → stream processor (Flink/Spark) windowed counts → periodic DB flush (Lambda style: approximate real-time + exact batch). Use sharded Redis INCR and HyperLogLog for unique viewers; define a "view" (e.g. >30 s) to avoid inflation. Counts are eventually consistent — acceptable for display.

Data model

Video
  id, name, description, uploaderId, durationSec, thumbnailUrl
  status: pending | uploaded | chunking | chunks_complete
  FullS3Url                         # source object
  ChunksByResolution { 240p:[segs], 720p:[segs], 1080p:[segs], 4k:[segs] }  # the manifest
  visibility: public | unlisted | private

VideoMetadata { videoId, title, tags[], viewCount, likeCount }
User          { id, name, email }

PostgreSQL for metadata — ACID, MVCC, JSON/JSONB for the ChunksByResolution document and status transitions. Shard by videoId with a time sort key; add read replicas + Redis for hot metadata.

Bottleneck summary

Upload → presigned direct-to-S3 multipart. Transcoding → split + parallel segment workers, queue-driven autoscale. Metadata → shard + replicas + cache. Egress (~15–20 Tbps) → CDN offload + immutable segments + ABR. Storage (PB/day) → lifecycle to Glacier + transcode-on-demand for the long tail.