Facebook Live Comments
Any viewer posts a comment and every other viewer of that live video sees it in ~200 ms; newcomers scroll back through earlier comments. The defining challenge is fan-out — one comment on a hot stream pushed to up to a million connections, across millions of simultaneous videos.
Requirements
Functional
- Post comments; see all comments in near real time; see comments posted before you joined (scroll-back).
Non-functional
- Scale to millions of videos, thousands of comments/sec; availability > consistency for posting.
- ~200 ms broadcast latency.
Scale & back-of-the-envelope
A hot video can have ~1M concurrent connections and ~100 comments/sec. Each comment is delivered ~1M times → ~100M sends/sec at ~48 B each ≈ ~4.8 GB/s egress for one hot video.
1M viewers × 100 comments/s × 48 bytes ≈ 4.8 GB/s outbound for ONE video
The bottleneck is outbound fan-out + connection count, not DB throughput — hence a dedicated fleet of ~1000 connection servers and a pub/sub layer, with a push-only broadcast path (never read the DB to broadcast).
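The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function name is made up for illustration, and spreading one hot video evenly over all ~1000 servers is a simplifying assumption, not something the design guarantees.

```python
# Back-of-the-envelope fan-out for one hot video (figures taken from the text).
VIEWERS = 1_000_000        # concurrent SSE connections on the hot video
COMMENTS_PER_SEC = 100     # comments posted per second on that video
BYTES_PER_COMMENT = 48     # approximate payload size per delivered comment
SERVERS = 1_000            # connection servers; an even spread is assumed


def fanout_estimate(viewers, comments_per_sec, bytes_per_comment, servers):
    """Return sends/sec and egress, in total and per connection server."""
    sends_per_sec = viewers * comments_per_sec
    egress_bytes_per_sec = sends_per_sec * bytes_per_comment
    return {
        "sends_per_sec": sends_per_sec,
        "egress_gb_per_sec": egress_bytes_per_sec / 1e9,
        "conns_per_server": viewers / servers,
        "egress_mb_per_sec_per_server": egress_bytes_per_sec / servers / 1e6,
    }


estimate = fanout_estimate(VIEWERS, COMMENTS_PER_SEC, BYTES_PER_COMMENT, SERVERS)
print(estimate)
```

Under the even-spread assumption each server carries about 1000 of the video's connections and a few MB/s of its egress, which is why the per-server cost is manageable even though the aggregate is not.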
API design
POST /comments/:liveVideoId { content } -> 200 # ack writer fast
GET /comments/:liveVideoId?cursor&direction={before|after}&pageSize=50
GET /comments/:liveVideoId Accept: text/event-stream # SSE live stream
Realtime delivery uses SSE (server→viewer); posting is a normal POST. The browser's EventSource reconnects automatically and sends the Last-Event-ID header on reconnect, which lets the server resume from the last delivered comment.
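To make the resume behaviour concrete, here is a small sketch of the SSE wire format and of replaying missed comments after a reconnect. The helper names and the in-memory `history` list are assumptions for illustration; a real server would read the missed range from its own buffer or store.

```python
def format_sse_event(event_id: str, data: str) -> str:
    """Serialize one comment as a Server-Sent Event.

    The `id:` field is what the browser echoes back in Last-Event-ID.
    Multi-line payloads need one `data:` line per line of text, and a
    blank line terminates the event.
    """
    lines = [f"id: {event_id}"]
    lines.extend(f"data: {part}" for part in (data.splitlines() or [""]))
    return "\n".join(lines) + "\n\n"


def events_to_replay(history, last_event_id):
    """Pick the events a reconnecting client missed.

    `history` is a list of (event_id, data) tuples, oldest first
    (a hypothetical per-video buffer). A fresh connection (no
    Last-Event-ID) replays nothing; an unknown id falls back to
    replaying the whole buffer, which is an assumption of this sketch.
    """
    if last_event_id is None:
        return []
    ids = [event_id for event_id, _ in history]
    if last_event_id not in ids:
        return list(history)
    return history[ids.index(last_event_id) + 1:]
```

Because delivery is at-least-once, replaying slightly too much is safe as long as clients dedupe by commentId.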
High-level design
A stateless Comment Service (writes + history) is separated from a horizontally-scaled fleet of stateful Realtime Comment Service servers that own the SSE connections, glued by a pub/sub dispatcher.
flowchart LR
C["Client browser"]
GW["API Gateway / LB"]
CS["Comment Service"]
DB[("Comments DB: Cassandra")]
PS["Pub/Sub Dispatcher"]
RT["Realtime Comment Service x1000"]
ZK["ZooKeeper / etcd"]
C -->|"POST /comments/:videoId"| GW
GW -->|"write path"| CS
CS -->|"persist"| DB
CS -->|"publish comment"| PS
PS -->|"topic = hash(videoId) % N"| RT
RT -->|"SSE push"| GW
GW -->|"SSE events"| C
RT -.->|"heartbeat / conn counts"| ZK
Each realtime server keeps an in-memory
{ videoId: [connections] } map; when pub/sub delivers a
comment it pushes to exactly those local sockets — no DB read on the
broadcast path.
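A minimal sketch of that per-server map follows. The class and method names are hypothetical, a set is used instead of a list so removal is cheap, and `FakeConn` stands in for a real SSE socket.

```python
from collections import defaultdict


class ConnectionRegistry:
    """In-memory videoId -> connections map held by one realtime server."""

    def __init__(self):
        self._conns = defaultdict(set)

    def add(self, video_id, conn):
        self._conns[video_id].add(conn)

    def remove(self, video_id, conn) -> bool:
        """Drop a connection; True means it was the last one for the video,
        which is the signal to unsubscribe from that video's topic."""
        conns = self._conns.get(video_id)
        if conns is None:
            return False
        conns.discard(conn)
        if not conns:
            del self._conns[video_id]
            return True
        return False

    def broadcast(self, video_id, payload) -> int:
        """Push payload to every local connection for the video (no DB read)."""
        sent = 0
        for conn in list(self._conns.get(video_id, ())):
            conn.send(payload)
            sent += 1
        return sent


class FakeConn:
    """Stand-in for an SSE socket; records what it was sent."""

    def __init__(self):
        self.sent = []

    def send(self, payload):
        self.sent.append(payload)
```

A comment for a video with no local viewers costs one dictionary lookup and nothing else.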
Deep dive · SSE vs WebSocket vs long-poll
Delivery is one-directional, so SSE wins — it rides on plain HTTP, so existing gateways and proxies handle it without an upgrade handshake.
| Aspect | Long-poll | WebSocket | SSE (chosen) |
|---|---|---|---|
| Direction | Pseudo-push | Full duplex | Server → client |
| Protocol | HTTP | WebSocket (via HTTP Upgrade) | HTTP (streamed) |
| Infra friendliness | High | Low (upgrade everywhere) | High |
| Reconnect | Manual | App-level | Native + Last-Event-ID |
| Fit | Wasteful | Overkill | Best fit |
Deep dive · the pub/sub dispatcher
A comment for video1 has viewers spread across many of
the ~1000 servers. The dispatcher must deliver
only to servers holding connections for that video —
not broadcast to all 1000.
sequenceDiagram
participant V as Viewer A
participant CS as Comment Service
participant DB as Cassandra
participant PS as Pub/Sub
participant RT as Realtime Svc
participant V2 as Viewers B..N
V->>CS: POST comment for videoId
CS->>DB: persist comment
CS-->>V: 200 OK
CS->>PS: publish to topic hash(videoId) % N
PS->>RT: deliver new comment
RT->>RT: lookup conns for videoId
RT-->>V2: SSE event new comment
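Routing by topic hashing (hash(videoId) % N): each video maps to exactly one of N topics, and a server subscribes only to the topics of the videos it currently holds. Several videos can share a topic, so a server also receives comments for videos it does not hold and drops them after the local map lookup; inbound traffic is therefore roughly, not exactly, proportional to its own connections.
Pub/sub beats an explicit ZooKeeper registry for routing (looser coupling, broker absorbs churn); ZK still tracks health + connection counts. At-least-once delivery means clients dedupe by commentId; per-video ordering holds as long as a video stays on one topic and the broker preserves order within a topic.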
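The routing and dedupe rules can be sketched as below. Both names are hypothetical, SHA-256 is just one stable choice of hash (Python's built-in `hash()` is randomized per process for strings, so publishers and subscribers would disagree), and the bounded dedupe window is an assumption of this sketch.

```python
import hashlib
from collections import OrderedDict


def topic_for(video_id: str, num_topics: int) -> int:
    """Map a video to one of num_topics topics with a process-independent hash."""
    digest = hashlib.sha256(video_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_topics


class RecentIdDeduper:
    """Client-side dedupe for at-least-once delivery.

    Remembers the most recent `capacity` comment ids and reports whether
    an incoming id has been seen before.
    """

    def __init__(self, capacity: int = 1000):
        self._seen = OrderedDict()
        self._capacity = capacity

    def is_new(self, comment_id: str) -> bool:
        if comment_id in self._seen:
            return False
        self._seen[comment_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest id
        return True
```

The publisher and every realtime server must agree on both the hash function and N; changing N remaps videos to different topics, so it needs a coordinated rollout.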
Deep dive · scaling the connection servers
Each server holds a bounded number of SSE sockets (limited mainly by memory and file descriptors; ephemeral ports constrain the proxies in front rather than the accepting server), so ~1M connections on a hot video spread across many of the ~1000 servers.
sequenceDiagram
participant V as Viewer
participant GW as API Gateway
participant ZK as ZooKeeper
participant RT as Realtime Svc
V->>GW: Open SSE for videoId
GW->>ZK: pick healthy server for videoId
ZK-->>GW: server1 (least loaded)
GW->>RT: route connection
RT->>RT: add conn to map videoId -> [conn]
RT->>ZK: heartbeat conn count + status
RT-->>V: SSE stream open
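On server loss, clients reconnect automatically (native to SSE's EventSource). EventSource does not guarantee any jitter in its retry delay, so jitter has to be added, for example by varying the server-sent retry: value per connection or by reconnecting from client code with randomized backoff, to avoid a reconnect storm. A server unsubscribes from a video's topic when its last connection for it drops. Stateful sockets make deploys disruptive → graceful drain + staggered rollouts.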
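One common way to add the jitter is "full jitter" exponential backoff: pick a uniformly random delay between zero and an exponentially growing ceiling. The sketch below is illustrative only; the function name, the 1 s base, and the 30 s cap are assumptions, not values from the design.

```python
import random


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 30.0, rng=random) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    The ceiling doubles with each failed attempt up to `cap`; the actual
    delay is drawn uniformly from [0, ceiling] so that clients dropped at
    the same moment do not all come back at the same moment.
    """
    exponent = min(attempt, 32)  # clamp to keep the power small
    ceiling = min(cap, base * (2 ** exponent))
    return rng.uniform(0.0, ceiling)
```

The same numbers can be delivered through SSE itself by having the server send a different `retry:` value (in milliseconds) to each connection.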
Data model
Cassandra: write-optimized, partition-friendly for the videoId access pattern, aligned with availability > consistency.
CREATE TABLE comments_by_video (
video_id text, -- partition key (shard / fan-out unit)
created_at timeuuid, -- clustering key, DESC for recency + back-paging
comment_id uuid, -- client dedupe key (at-least-once delivery)
author_id uuid,
content text,
PRIMARY KEY ((video_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
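The cursor-based history endpoint maps naturally onto this table. The sketch below only builds the CQL text with bind markers; the function name is hypothetical, the cursor is assumed to be the `created_at` timeuuid of the last comment the client has, and the statements have not been run against a cluster.

```python
def build_page_query(direction: str, has_cursor: bool = True) -> str:
    """Build a CQL statement for one page of comments_by_video.

    direction="before" pages backwards (older comments) and relies on the
    table's DESC clustering order; direction="after" reverses the order to
    read forward from the cursor. Bind values, in order: video_id,
    the cursor (only when has_cursor is True), page size.
    """
    if direction not in ("before", "after"):
        raise ValueError("direction must be 'before' or 'after'")
    cql = (
        "SELECT created_at, comment_id, author_id, content "
        "FROM comments_by_video WHERE video_id = ?"
    )
    if has_cursor:
        op = "<" if direction == "before" else ">"
        cql += f" AND created_at {op} ?"
    if direction == "after":
        cql += " ORDER BY created_at ASC"
    return cql + " LIMIT ?"


print(build_page_query("before"))
```

A newcomer's first scroll-back request has no cursor, so it simply reads the newest page from the head of the partition.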
Bottleneck summary
- Hot-video egress (~4.8 GB/s) → push-only broadcast, connections spread across servers, regional edge fan-out.
- Hot Cassandra partition → bucket the partition key by time (videoId + hour) and cache the recent window.
- Dispatch amplification → hash(videoId) % N topics.
- Reconnect storms → native SSE reconnect + jittered backoff + graceful drain.
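The time-bucketing mitigation for a hot partition can be sketched as a key-derivation helper. The `videoId#YYYYMMDDHH` format and both function names are assumptions for illustration; the schema shown earlier would need the bucket added to its partition key to use this.

```python
from datetime import datetime, timedelta, timezone


def bucketed_partition_key(video_id: str, created_at: datetime) -> str:
    """Derive a per-hour partition key so one hot video spans many partitions."""
    if created_at.tzinfo is None:
        raise ValueError("created_at must be timezone-aware")
    hour = created_at.astimezone(timezone.utc).strftime("%Y%m%d%H")
    return f"{video_id}#{hour}"


def buckets_for_scrollback(video_id: str, newest: datetime, hours: int) -> list:
    """Partition keys to read, newest first, when paging back `hours` hours."""
    return [
        bucketed_partition_key(video_id, newest - timedelta(hours=i))
        for i in range(hours)
    ]
```

The trade-off is that scroll-back across an hour boundary becomes a read over more than one partition, which is why caching the recent window is paired with it.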