System Design Notes

Messaging

WhatsApp — Chat & Messaging

Global real-time 1:1 and group chat with media and offline delivery for billions of users. The core challenge: hold hundreds of millions of persistent connections and route each message to the right recipient's socket (or queue) in < 500 ms, guaranteeing delivery while not storing messages longer than necessary.

Requirements

Functional

Non-functional

Scale & back-of-the-envelope

API design

REST for the control plane + media; WebSocket for the data-plane hot path.

# Control plane (REST)
POST /chats | POST /chats/{id}/participants | GET /chats/{id}/messages
POST /media/presign  -> presigned S3 PUT + blobRef   # media goes direct to S3

# Data plane (WebSocket)
C->S: send_message {clientMsgId, chatId, contents}   # idempotent
C->S: ack_delivered {messageId} | ack_read {chatId, upToMessageId} | pull_backlog {sinceSeq}
S->C: message {...} | ack {SENT} | receipt {DELIVERED|READ} | presence {...}
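
The `# idempotent` note on send_message can be illustrated with a small sketch. It assumes the server remembers which (sender, clientMsgId) pairs it has already persisted, so a client retry returns the original ack instead of creating a second message. The `SendHandler` class, its in-memory dictionaries, and the counter-based messageId are hypothetical stand-ins, not part of the design above.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class SendHandler:
    """Hypothetical server-side handler; all names here are illustrative."""
    # (senderId, clientMsgId) -> messageId already assigned to that send
    seen: dict = field(default_factory=dict)
    # messageId -> stored payload (stand-in for the Messages table)
    stored: dict = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def send_message(self, sender_id: str, client_msg_id: str,
                     chat_id: str, contents: str) -> dict:
        key = (sender_id, client_msg_id)
        if key in self.seen:
            # Retry of a send that was already persisted: repeat the ack,
            # but do not store or route the message a second time.
            return {"type": "ack", "status": "SENT", "messageId": self.seen[key]}
        message_id = next(self._ids)
        self.stored[message_id] = {"chatId": chat_id, "contents": contents}
        self.seen[key] = message_id
        return {"type": "ack", "status": "SENT", "messageId": message_id}


handler = SendHandler()
first = handler.send_message("alice", "c-1", "chat-42", "hello")
retry = handler.send_message("alice", "c-1", "chat-42", "hello")  # e.g. ack was lost
```

Both calls return the same ack and only one message is stored, which is what lets a client safely resend after a timeout.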

High-level design

A Chat Registry (over ZooKeeper) tells a client which Chat Server to connect to; servers route to each other via Redis pub/sub; messages persist to DynamoDB; an Inbox queues messages for offline devices; media flows through S3; a Cleanup worker enforces "not stored unnecessarily."

flowchart LR
    C1["Client A (phone)"]
    C2["Client B (laptop)"]
    LB["Load Balancer"]
    REG["Chat Registry"]
    ZK["ZooKeeper"]
    WS1["Chat Server 1 (WebSocket)"]
    WS2["Chat Server 2 (WebSocket)"]
    PS["Redis Pub/Sub"]
    DB[("DynamoDB: chats, messages, inbox")]
    BLOB[("Blob Store S3")]
    CLEAN["Cleanup Worker"]
    PUSH["Push APNs / FCM"]
    C1 -->|"1 which server?"| REG
    REG --> ZK
    C1 -->|"2 WSS connect"| LB
    LB --> WS1
    C2 --> LB --> WS2
    WS1 --> PS --> WS2
    WS1 --> DB
    WS2 --> DB
    C1 -->|"media upload"| BLOB
    BLOB --> C2
    WS1 --> PUSH
    CLEAN --> DB
    ZK --> WS1
      

ZooKeeper stores ephemeral clientId → chatServer assignments — when a server dies its entries vanish, so routing self-heals; Redis pub/sub is the cross-server hop.
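
The self-healing behaviour can be modelled without a real ZooKeeper cluster. The sketch below is an in-memory stand-in, assuming each Chat Server owns the entries for the clients connected to it; `ChatRegistry` and its method names are hypothetical and do not correspond to any ZooKeeper client API.

```python
class ChatRegistry:
    """In-memory model of ephemeral clientId -> chatServer entries (illustrative)."""

    def __init__(self):
        self.assignments = {}  # clientId -> chatServer

    def register(self, client_id, server):
        # Called when a client opens a socket on `server`.
        self.assignments[client_id] = server

    def lookup(self, client_id):
        # None means "no live socket": the sender falls back to the Inbox + push.
        return self.assignments.get(client_id)

    def server_died(self, server):
        # Models ephemeral entries vanishing together with their owning server.
        self.assignments = {
            c: s for c, s in self.assignments.items() if s != server
        }


registry = ChatRegistry()
registry.register("client-a", "chat-server-1")
registry.register("client-b", "chat-server-2")

registry.server_died("chat-server-1")
after_crash = registry.lookup("client-a")        # entry is gone, no stale route

registry.register("client-a", "chat-server-2")   # client reconnects elsewhere
```

After the crash there is nothing to clean up by hand: the stale route simply no longer exists, and the reconnect writes a fresh one.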

Deep dive · delivery to an online user (receipts)

A message is persisted before it's acked to the sender (durability first), then routed to the recipient's server and pushed down their socket; receipts flow back to drive the tick states SENT (✓) → DELIVERED (✓✓) → READ (blue).

sequenceDiagram
    participant A as Client A
    participant SA as Chat Server A
    participant DB as DynamoDB
    participant SB as Chat Server B
    participant B as Client B
    A->>SA: send_message(clientMsgId, chatId, text)
    SA->>DB: persist message + inbox row (B)
    SA-->>A: ack SENT (1 tick)
    SA->>SB: route via Redis Pub/Sub
    SB->>B: deliver message over WSS
    B-->>SB: ack_delivered
    SB->>DB: mark delivered, clear inbox row
    SB->>SA: DELIVERED
    SA-->>A: DELIVERED (2 ticks)
    B->>SB: ack_read
    SB->>SA: READ
    SA-->>A: READ (blue ticks)
      

Persist-then-ack costs one hot-path DB write but is the only way to guarantee delivery across crashes. Redis pub/sub is at-most-once on the wire — safe because the persisted Inbox is the source of truth and is re-drained on reconnect.
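
A minimal sketch of that ordering, assuming a dictionary stands in for the Inbox table and a callable stands in for the Redis publish. `ChatServer`, `handle_send`, and the `events` log are hypothetical names used only to make the sequence visible.

```python
class ChatServer:
    """Illustrative persist-then-ack handler; not a real server implementation."""

    def __init__(self, inbox, publish):
        self.inbox = inbox      # recipientClientId -> list of messages (Inbox stand-in)
        self.publish = publish  # callable(recipient, message) -> bool; may drop
        self.events = []        # records the order of steps for inspection

    def handle_send(self, recipient, message):
        # 1. Persist first, so a crash after this point cannot lose the message.
        self.inbox.setdefault(recipient, []).append(message)
        self.events.append("persisted")
        # 2. Only now ack the sender (single tick).
        self.events.append("ack SENT")
        # 3. Best-effort cross-server hop; at-most-once, so it may be dropped.
        routed = self.publish(recipient, message)
        self.events.append("routed" if routed else "dropped")
        return routed


inbox = {}
# Simulate a publish that is lost on the wire.
server = ChatServer(inbox, publish=lambda recipient, message: False)
routed = server.handle_send("client-b", {"messageId": 1, "contents": "hi"})
```

Even though the publish was dropped, the message is still in the recipient's inbox and will be delivered when that inbox is next drained.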

Deep dive · delivery to an offline user

With no live socket, the message stays in the recipient's Inbox (QUEUED); a push notification wakes the app; on reconnect the client drains the inbox in order, acks each, and the server deletes delivered rows.

sequenceDiagram
    participant A as Client A
    participant SA as Chat Server A
    participant DB as Inbox / DynamoDB
    participant PUSH as APNs / FCM
    participant B as Client B (offline)
    A->>SA: send_message(clientMsgId, chatId, text)
    SA->>DB: write to B inbox (QUEUED)
    SA->>PUSH: trigger push notification
    PUSH->>B: notification (wake app)
    B->>SA: WSS connect + pull_backlog(sinceSeq)
    SA->>DB: read QUEUED for B (ordered)
    SA->>B: deliver backlog
    B-->>SA: ack_delivered (each)
    SA->>DB: delete delivered rows
      

DynamoDB TTL is a safety-net GC for delivered rows whose ack-driven delete was missed.
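
The drain-ack-delete loop and the TTL safety net can be sketched over a plain list of rows. The `seq` and `expiresAt` fields are assumptions: the notes name `sinceSeq` but do not define the row layout, and `expiresAt` is a hypothetical attribute holding an epoch-seconds expiry (the form DynamoDB TTL expects).

```python
def pull_backlog(inbox_rows, since_seq):
    """Return QUEUED rows newer than since_seq, oldest first."""
    pending = [
        r for r in inbox_rows
        if r["status"] == "QUEUED" and r["seq"] > since_seq
    ]
    return sorted(pending, key=lambda r: r["seq"])


def ack_delivered(inbox_rows, message_id):
    """Ack-driven delete: drop the acked row and return what remains."""
    return [r for r in inbox_rows if r["messageId"] != message_id]


def ttl_sweep(inbox_rows, now_epoch_s):
    """Model of the TTL safety net: drop rows whose expiresAt has passed."""
    return [r for r in inbox_rows if r["expiresAt"] > now_epoch_s]


rows = [
    {"messageId": "m3", "seq": 3, "status": "QUEUED", "expiresAt": 1000},
    {"messageId": "m1", "seq": 1, "status": "QUEUED", "expiresAt": 1000},
    {"messageId": "m2", "seq": 2, "status": "QUEUED", "expiresAt": 1000},
]

# The client already has seq 1, but that row's delete was missed earlier.
backlog = pull_backlog(rows, since_seq=1)
for row in backlog:
    rows = ack_delivered(rows, row["messageId"])

leftover = [r["messageId"] for r in rows]   # the row the ack path never removed
rows = ttl_sweep(rows, now_epoch_s=2000)    # TTL eventually collects it
```

Here `m1` is the kind of row the TTL exists for: the client no longer asks for it, so only expiry removes it.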

Deep dive · group fan-out & ordering

On send, look up ChatParticipant, write one inbox row per recipient-client, and route to those online. A group message is deleted only once every participant has acked.
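
A sketch of that fan-out, under stated assumptions: acks are tracked per recipient client (device) rather than per user, and the sender's own devices are excluded. `GroupFanout` and its fields are hypothetical in-memory stand-ins for the ChatParticipant, Client, and Inbox tables.

```python
class GroupFanout:
    """Illustrative fan-out on write with delete-when-all-acked bookkeeping."""

    def __init__(self, participants, devices):
        self.participants = participants  # chatId -> [userId]
        self.devices = devices            # userId -> [clientId]
        self.inbox = {}                   # clientId -> {messageId: status}
        self.pending = {}                 # messageId -> clientIds still to ack

    def send(self, chat_id, sender_id, message_id, online):
        targets = [
            client
            for user in self.participants[chat_id] if user != sender_id
            for client in self.devices[user]
        ]
        for client in targets:
            # One inbox row per recipient client, online or not.
            self.inbox.setdefault(client, {})[message_id] = "QUEUED"
        self.pending[message_id] = set(targets)
        # Only clients with a live socket are pushed to right now.
        return [client for client in targets if client in online]

    def ack(self, client_id, message_id):
        """Returns True once every recipient client has acked the message."""
        self.inbox.get(client_id, {}).pop(message_id, None)
        waiting = self.pending.get(message_id)
        if waiting is None:
            return False
        waiting.discard(client_id)
        if not waiting:
            del self.pending[message_id]  # safe to delete the message itself
            return True
        return False


fan = GroupFanout(
    participants={"chat-1": ["alice", "bob", "carol"]},
    devices={"alice": ["a-phone"], "bob": ["b-phone", "b-laptop"], "carol": ["c-phone"]},
)
push_now = fan.send("chat-1", "alice", "m1", online={"b-phone", "c-phone"})
acks = [fan.ack("b-phone", "m1"), fan.ack("c-phone", "m1"), fan.ack("b-laptop", "m1")]
```

The message becomes deletable only on the last ack, which in this example comes from the device that was offline at send time.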

flowchart TD
    A["Sender Client"] --> WS["Chat Server"]
    WS --> DB[("Lookup ChatParticipant")]
    DB --> FAN{"Fan-out per participant device"}
    FAN --> I1["Inbox: member 1"]
    FAN --> I2["Inbox: member 2"]
    FAN --> I3["Inbox: member 3"]
    I1 --> O1["Online: push via WSS"]
    I2 --> O2["Offline: keep QUEUED"]
    I3 --> O3["Online: push via WSS"]
    O1 --> ACK["Collect ACKs"]
    O3 --> ACK
    ACK --> DEL["Delete when ALL members ack"]
      

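Fan-out on write gives fast reads and simple offline delivery but is expensive for large groups (cap ~1,024 members). Ordering: assign a time-sortable ID (Snowflake/KSUID) at the first server; clients re-sequence by ID and dedupe by clientMsgId. IDs minted on different servers are only as well ordered as those servers' clocks, so this ordering is approximate rather than strict. A strict total order would need a sequencer, which is a bottleneck; scoping the guarantee to a single chat keeps that bottleneck per chat rather than global.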
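
To make the ordering scheme concrete, here is a sketch of a Snowflake-style generator (millisecond timestamp, 10-bit worker ID, 12-bit per-millisecond sequence) and the client-side re-sequencing step. The class name, the custom epoch, the injectable clock, and the choice to borrow the next millisecond when the sequence overflows are all assumptions for illustration. The sketch also assumes clientMsgId is globally unique (for example a UUID).

```python
import threading
import time


class SnowflakeLike:
    """Illustrative time-sortable ID generator; not a production implementation."""

    def __init__(self, worker_id, epoch_ms=1_600_000_000_000, clock=None):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self._worker_id = worker_id
        self._epoch_ms = epoch_ms
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = max(self._clock(), self._last_ms)  # never step backwards
            if now == self._last_ms:
                self._seq = (self._seq + 1) & 0xFFF
                if self._seq == 0:
                    now += 1  # sequence exhausted: borrow the next millisecond
            else:
                self._seq = 0
            self._last_ms = now
            return ((now - self._epoch_ms) << 22) | (self._worker_id << 12) | self._seq


def resequence(messages):
    """Client side: sort by ID, then drop repeats of the same clientMsgId."""
    seen, ordered = set(), []
    for message in sorted(messages, key=lambda m: m["id"]):
        if message["clientMsgId"] not in seen:
            seen.add(message["clientMsgId"])
            ordered.append(message)
    return ordered


gen = SnowflakeLike(worker_id=7, clock=lambda: 1_600_000_000_005)  # frozen clock
a, b = gen.next_id(), gen.next_id()

received = [
    {"id": b, "clientMsgId": "c-2"},
    {"id": a, "clientMsgId": "c-1"},
    {"id": b, "clientMsgId": "c-2"},  # duplicate delivery
]
ordered = resequence(received)
```

Within one generator the IDs are strictly increasing even inside a single millisecond; across generators the order is only as good as clock agreement between servers.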

Deep dive · multi-device & E2E encryption

Data model

Chat            id PK; name, isGroup, metadata
ChatParticipant chatId PK, participantId SK
Client          userId PK, clientId SK; pushToken, lastSeenSeq   # multi-device
Messages        id PK; chatId GSI, contents|blobRef, creatorId, timestamp
Inbox           recipientClientId PK, messageId SK; status QUEUED|DELIVERED  # mailbox
Availability    userId PK; status, connectedServer                # presence
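
The Inbox key choice is what makes an ordered drain cheap: a DynamoDB Query on one partition key returns items in ascending sort-key order by default, so a time-sortable messageId as the sort key yields the backlog oldest first. The toy table below models just that access pattern in memory; `InboxTable` and its methods are hypothetical and are not a DynamoDB client.

```python
from collections import defaultdict


class InboxTable:
    """Toy partition/sort-key table: recipientClientId (PK), messageId (SK)."""

    def __init__(self):
        self._partitions = defaultdict(dict)  # PK -> {SK: row}

    def put(self, recipient_client_id, message_id, status="QUEUED"):
        self._partitions[recipient_client_id][message_id] = {
            "messageId": message_id,
            "status": status,
        }

    def query(self, recipient_client_id):
        # One partition, rows returned in ascending sort-key order.
        partition = self._partitions.get(recipient_client_id, {})
        return [partition[key] for key in sorted(partition)]

    def delete(self, recipient_client_id, message_id):
        self._partitions.get(recipient_client_id, {}).pop(message_id, None)


table = InboxTable()
for message_id in (30, 10, 20):           # arrival order differs from ID order
    table.put("client-b", message_id)

drained = [row["messageId"] for row in table.query("client-b")]
table.delete("client-b", 10)              # ack-driven delete of one row
remaining = [row["messageId"] for row in table.query("client-b")]
```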

Design through-line

The Inbox/mailbox is the source of truth for delivery, the WebSocket is just a fast path, and deletion-after-ack keeps messages from being stored longer than needed. Together these give guaranteed delivery, low latency, and "messages not stored unnecessarily." The supporting stack: DynamoDB (partitioned by chatId / recipientClientId) for storage, S3 for media, ZooKeeper for routing, and Redis pub/sub for the cross-server hop.