System Design Notes All designs

Social & Feeds

Facebook Live Comments

Any viewer posts a comment and every other viewer of that live video sees it in ~200 ms; newcomers scroll back through earlier comments. The defining challenge is fan-out — one comment on a hot stream pushed to up to a million connections, across millions of simultaneous videos.

Requirements

Functional

Non-functional

Scale & back-of-the-envelope

A hot video can have ~1M concurrent connections and ~100 comments/sec. Each comment is delivered ~1M times → ~100M sends/sec at ~48 B each ≈ ~4.8 GB/s egress for one hot video.

1M viewers × 100 comments/s × 48 bytes ≈ 4.8 GB/s outbound for ONE video

The bottleneck is outbound fan-out + connection count, not DB throughput — hence a dedicated fleet of ~1000 connection servers and a pub/sub layer, with a push-only broadcast path (never read the DB to broadcast).

API design

POST /comments/:liveVideoId          { content }   -> 200    # ack writer fast
GET  /comments/:liveVideoId?cursor&direction={before|after}&pageSize=50
GET  /comments/:liveVideoId          Accept: text/event-stream   # SSE live stream

Realtime delivery is SSE (server→viewer); posting is a normal POST. SSE has native auto-reconnect via Last-Event-ID to resume from the last delivered comment.

High-level design

A stateless Comment Service (writes + history) is separated from a horizontally-scaled fleet of stateful Realtime Comment Service servers that own the SSE connections, glued by a pub/sub dispatcher.

flowchart LR
    C["Client browser"]
    GW["API Gateway / LB"]
    CS["Comment Service"]
    DB[("Comments DB: Cassandra")]
    PS["Pub/Sub Dispatcher"]
    RT["Realtime Comment Service x1000"]
    ZK["ZooKeeper / etcd"]
    C -->|"POST /comments/:videoId"| GW
    GW -->|"write path"| CS
    CS -->|"persist"| DB
    CS -->|"publish comment"| PS
    PS -->|"topic = hash(videoId) % N"| RT
    RT -->|"SSE push"| GW
    GW -->|"SSE events"| C
    RT -.->|"heartbeat / conn counts"| ZK
      

Each realtime server keeps an in-memory { videoId: [connections] } map; when pub/sub delivers a comment it pushes to exactly those local sockets — no DB read on the broadcast path.

Deep dive · SSE vs WebSocket vs long-poll

Delivery is one-directional, so SSE wins — it rides on plain HTTP, so existing gateways and proxies handle it without an upgrade handshake.

Aspect Long-poll WebSocket SSE (chosen)
Direction Pseudo-push Full duplex Server → client
Protocol HTTP Custom upgrade HTTP (streamed)
Infra friendliness High Low (upgrade everywhere) High
Reconnect Manual App-level Native + Last-Event-ID
Fit Wasteful Overkill Best fit

Deep dive · the pub/sub dispatcher

A comment for video1 has viewers spread across many of the ~1000 servers. The dispatcher must deliver only to servers holding connections for that video — not broadcast to all 1000.

sequenceDiagram
    participant V as Viewer A
    participant CS as Comment Service
    participant DB as Cassandra
    participant PS as Pub/Sub
    participant RT as Realtime Svc
    participant V2 as Viewers B..N
    V->>CS: POST comment for videoId
    CS->>DB: persist comment
    CS-->>V: 200 OK
    CS->>PS: publish to topic hash(videoId) % N
    PS->>RT: deliver new comment
    RT->>RT: lookup conns for videoId
    RT-->>V2: SSE event new comment
      

Routing by topic hashing (hash(videoId) % N): a server subscribes only to topics for the videos it currently holds, so inbound traffic is proportional to its own connections. Pub/sub beats an explicit ZooKeeper registry for routing (looser coupling, broker absorbs churn) — ZK still tracks health + connection counts. At-least-once delivery means clients dedupe by commentId; per-video ordering holds by keeping a video on one topic.

Deep dive · scaling the connection servers

Each server holds a bounded number of SSE sockets (memory, file descriptors, ephemeral ports), so ~1M connections on a hot video spread across many of the ~1000 servers.

sequenceDiagram
    participant V as Viewer
    participant GW as API Gateway
    participant ZK as ZooKeeper
    participant RT as Realtime Svc
    V->>GW: Open SSE for videoId
    GW->>ZK: pick healthy server for videoId
    ZK-->>GW: server1 (least loaded)
    GW->>RT: route connection
    RT->>RT: add conn to map videoId -> [conn]
    RT->>ZK: heartbeat conn count + status
    RT-->>V: SSE stream open
      

On server loss, clients auto-reconnect (SSE native) with jittered backoff to avoid a reconnect storm; a server unsubscribes from a video's topic when its last connection for it drops. Stateful sockets make deploys disruptive → graceful drain + staggered rollouts.

Data model

Cassandra — write-optimized, partition-friendly for the videoId access pattern, aligned with availability > consistency.

CREATE TABLE comments_by_video (
  video_id   text,        -- partition key (shard / fan-out unit)
  created_at timeuuid,    -- clustering key, DESC for recency + back-paging
  comment_id uuid,        -- client dedupe key (at-least-once delivery)
  author_id  uuid,
  content    text,
  PRIMARY KEY ((video_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);

Bottleneck summary

Hot-video egress (~4.8 GB/s) → push-only broadcast, spread connections, regional edge fan-out. Hot Cassandra partition → time-bucket videoId + hour + cache recent window. Dispatch amplification → hash(videoId) % N topics. Reconnect storms → native SSE + jittered backoff + graceful drain.