System Design Notes All designs

Social & Feeds

Facebook Post Search

Full-text search over posts by keyword, sortable by likes or recency — building the inverted index, ingestion pipeline, ranking, and scatter-gather query layer ourselves (no Elasticsearch/Postgres FTS). The trick: keep two pre-sorted posting lists per keyword so the read path never sorts.

Requirements

Functional

Non-functional

Scale & back-of-the-envelope

Stream Volume Per second
Posts 1B/day ~10k/s
Likes 10B/day ~100k/s (10× posts — the dominant write problem)
Searches 1B/day ~10k/s
Retained posts ~3.6T ~3.6 PB raw → index must be sharded

Like-count churn directly mutates the likes-sorted index → needs an in-memory counter tier; 10k searches/s with 500 ms → hot-query caching + pre-sorted lists are essential.

API design

POST /posts                 { content }                 -> 201 { postId }
PUT  /posts/{id}/likes                                  -> 200 { likeCount }
GET  /search?query={kw}&sort={likes|time}&limit=20&cursor={opaque}
  -> { results:[ {postId, content, likeCount, createdAt} ], nextCursor }

Multiple terms are AND-intersected; sort selects which pre-sorted posting list to read. No OFFSET — a cursor encodes the last item's sort key ((likeCount, postId) or (createdAt, postId)) so each shard resumes with a range seek.

High-level design

A write path (create/like posts) and a read/ingestion path decoupled by CDC + Kafka — which buys the 1-minute eventual-consistency window.

flowchart LR
    Client["Clients"] --> GW["API Gateway"]
    GW --> PS["Post Service"]
    GW --> SS["Search Service + Cache"]
    GW --> LS["Likes Service"]
    PS --> PDB[("Posts DB")]
    LS --> LC{"Like Counts"}
    LS --> LDB[("Likes DB")]
    PDB -- CDC --> K{"Kafka"}
    LC -- CDC --> K
    LC --> S3[("Cold Storage S3")]
    K --> ING["Ingestion Service"]
    ING --> IDX[("Search Index")]
    SS --> IDX
      

The index stores, per keyword, two pre-sorted posting lists:

like:1:{keyword}  -> List<PostId>   # ordered by like count desc
time:2:{keyword}  -> List<PostId>   # ordered by createdAt desc

Deep dive · inverted index & real-time updates

An inverted index maps each keyword → a posting list of post IDs. Two lists per keyword support both sort orders without sorting at query time.

sequenceDiagram
    participant U as User
    participant PS as Post Service
    participant PDB as Posts DB
    participant K as Kafka
    participant IG as Ingestion Service
    participant IDX as Search Index
    U->>PS: POST /posts
    PS->>PDB: Insert post row
    PDB-->>K: CDC change event
    K->>IG: Consume post event
    IG->>IG: Tokenize content into keywords
    IG->>IDX: Append postId to each keyword list
    Note over IDX: lists kept sorted by time and by likes
      

Why CDC + Kafka, not dual-write? Decoupling (a slow index never blocks post creation), replayability (rebuild the whole index by replaying the log / S3), backpressure absorption, and no "wrote-DB-but-crashed-before-index" inconsistency. Ingestion writes immutable segments (LSM-style) that compact — turning random updates into sequential writes; idempotent by postId.

Deep dive · sharding by document vs keyword

flowchart TD
    subgraph ByDoc["Shard by Document"]
        Q1["Query taylor swift"] --> D1["Shard 1 docs"]
        Q1 --> D2["Shard 2 docs"]
        Q1 --> D3["Shard 3 docs"]
        D1 --> M1["Gather and rank"]
        D2 --> M1
        D3 --> M1
    end
    subgraph ByKey["Shard by Keyword"]
        Q2["Query taylor swift"] --> H["Route per keyword"]
        H --> K2["Shard holds taylor"]
        H --> K3["Shard holds swift"]
        K2 --> M2["Intersect and rank"]
        K3 --> M2
    end
      
By document (chosen) By keyword
Writes ✅ Balanced (one shard per post) Skewed by term
Hot term ✅ Spread across shards ❌ Hot shard ("taylor swift")
Reads ❌ Scatter-gather to all N shards ✅ Only as many shards as terms
Multi-term Merge globally Cross-shard intersection

Recommendation: shard by document — even write distribution (critical at 10k posts/s + 100k like-updates/s) with no hot-term shards, accepting scatter-gather reads. Tame fan-out with caching, per-shard early termination on pre-sorted lists, and replicas.

Deep dive · keeping like-counts fresh

The hardest part: 100k likes/s continuously reorder the like:1:{keyword} lists. A like appends to Likes DB and increments an in-memory Like Counts counter (Redis), periodically persisted to S3; count changes are CDC'd into the same Kafka and re-position the post.

flowchart LR
    Like["100k likes per second"] --> LS["Likes Service"]
    LS --> LDB[("Likes DB")]
    LS --> LC{"Like Counts counter"}
    LC -- CDC --> K{"Kafka"}
    LC --> S3[("Cold Storage S3")]
    K --> ING["Ingestion Service"]
    ING --> IDX[("Reorder like list")]
      

Don't index every like — reordering on every increment would melt the index. Batch/coalesce deltas over a window; only re-rank on bucket crossings (log-scaled bands — a post at 1M likes gaining 50 doesn't move, 10→100 does); keep a hot "top-K per keyword" updated aggressively and the long tail lazily. The 1-min freshness budget makes slightly stale ordering legal.

Deep dive · caching hot queries

At 10k searches/s, traffic is heavily skewed toward a few trending terms.

flowchart LR
    Req["Search Request"] --> SS["Search Service"]
    SS --> C{"Cache hit"}
    C -- Yes --> Resp["Return cached page"]
    C -- No --> RT["Scatter to index shards"]
    RT --> SH1["Shard A"]
    RT --> SH2["Shard B"]
    SH1 --> MG["Gather and merge"]
    SH2 --> MG
    MG --> RK["Rank by likes or time"]
    RK --> PG["Apply pagination cursor"]
    PG --> WC["Write to cache"]
    WC --> Resp
      

Cache key = (normalized query, sort, page/cursor); TTL ≤ the 1-min freshness budget so cached results never violate the contract. Results aren't personalized → hot first pages are even CDN-cacheable. Cache the expensive first page; deep pages are rare.

Data model

Posts DB      id PK, content, authorId, createdDate
Likes DB      (postId, likerId) PK, createdAt        # composite PK => idempotent likes
Like Counts   postId PK, likeCount                   # in-memory + S3 backup, CDC'd
Search Index  like:1:{keyword} -> List<PostId>        # ordered by likes desc
              time:2:{keyword} -> List<PostId>        # ordered by time desc
              (immutable compacted segments, early-termination reads)

Consistency posture

Search is eventually consistent (~1 min); posts and like-orderings converge through CDC → Kafka → ingestion. This trade — availability + low read latency over strict consistency — is what makes async indexing, count batching, and short-TTL caching all valid.