System Design Notes All designs

Fundamentals

Data Structures for System Design

The building blocks that show up again and again in interviews and real distributed systems — what each one is, its Big-O, how it works, when to reach for it, who uses it in production, and the trade-off it makes. At scale every structure trades one resource for another: a little accuracy for huge memory savings, fast writes for slower reads, or ordering for elastic scaling.

Part 1 · Probabilistic / sketch structures

All three share a theme: sub-linear memory, bounded error, single-pass, and mergeable across shards (Bloom = bitwise OR, Count-Min = element-wise add, HyperLogLog = element-wise max). Reach for them when an exact answer is too expensive in memory and an approximate one is good enough.

1 · Bloom filter — "is x in the set?"

Space-efficient probabilistic set membership. A query returns either "might be in the set" (possible false positive) or "definitely not in the set" (never a false negative). A bit array of m bits and k hash functions; insert sets k bits, query checks them — any 0 ⇒ absent.

flowchart LR
    K["key x"] --> H1["hash1"]
    K --> H2["hash2"]
    K --> H3["hash3"]
    H1 --> B["Bit array m bits"]
    H2 --> B
    H3 --> B
    B --> Q{"All k bits set?"}
    Q -- No --> N["Definitely not in set"]
    Q -- Yes --> M["Maybe in set"]
      

2 · Count-Min sketch — "roughly how many times have I seen x?"

Probabilistic frequency table for a stream. incr(x) to add, upper_bound(x) to read an over-estimate (never under-counts). A 2D array of counters (d rows × w cols) with one hash per row; the query takes the min of the d counters to strip collision inflation.

3 · HyperLogLog — "how many distinct items?"

Probabilistic cardinality estimator in tiny fixed memory. Hash each item, use the first p bits to pick one of m = 2^p registers, and track the longest run of leading zeros per register; cardinality comes from the harmonic mean. Two sketches merge by element-wise max.

Structure Answers API Error Merge
Bloom filter Is x a member? add, contains False positives only bitwise OR
Count-Min sketch How many x? incr, upper_bound Over-estimate only element add
HyperLogLog How many distinct? add, card ±small % element max

Part 2 · Storage-engine structures

4 · LSM-tree + SSTable (write-optimized)

Buffer writes in memory, flush them as immutable sorted files (SSTables), and merge in the background — optimized for high write throughput. A write appends to a WAL and a sorted in-memory memtable; when it fills, it flushes to an SSTable; background compaction merges files and drops tombstones. Reads check the memtable then SSTables newest→oldest, using a per-SSTable bloom filter to skip files.

sequenceDiagram
    participant C as Client
    participant W as WAL
    participant M as Memtable
    participant D as Disk SSTables
    C->>W: Append for durability
    C->>M: Insert into sorted memtable
    Note over M: Memtable fills up
    M->>D: Flush as immutable SSTable
    Note over D: Compaction merges SSTables
    C->>M: Read key
    C->>D: Miss checks Bloom then SSTables
    D-->>C: Value or not found
      

Who: RocksDB, LevelDB, Cassandra, HBase, ScyllaDB, Bigtable, DynamoDB storage. Trade-offs: superb writes but read & space/write amplification from compaction.

5 · B-tree / B+tree (read-optimized)

Balanced, high-fanout tree built for disk pages — the default relational index. Shallow (3–4 levels for billions of rows); a B+tree keeps values in linked leaves for fast range scans. Updates happen in place with node split/merge.

flowchart TD
    subgraph BT["B-Tree read optimized"]
        BR["Root"] --> BI1["Internal"]
        BR --> BI2["Internal"]
        BI1 --> BL1["Leaf"]
        BI1 --> BL2["Leaf"]
        BI2 --> BL3["Leaf"]
    end
    subgraph LSM["LSM-Tree write optimized"]
        LM["Memtable"] --> L0["L0 SSTables"]
        L0 --> L1["L1 SSTables"]
        L1 --> L2["L2 SSTables"]
    end
      

Rule of thumb: read-heavy + ranges + transactions → B-tree; write/ingest-heavy → LSM-tree.

Cost: search/insert/delete O(log n); range O(log n + k). Who: PostgreSQL, MySQL/InnoDB, Oracle, SQLite, MongoDB (WiredTiger).

6 · Skip list

Probabilistic ordered structure of linked lists with "express-lane" levels — balanced-tree performance without rotations and easy to make concurrent. Search/insert/delete O(log n) expected. Who: Redis sorted sets (ZSET), RocksDB/LevelDB memtables, Lucene, ConcurrentSkipListMap.

Part 3 · Distribution & integrity

7 · Consistent hashing

Maps keys and nodes onto a ring so adding/removing a node remaps only ~K/N keys instead of almost all of them — essential for elastic sharding. Use many virtual nodes per physical node for even load; a key is owned by the first node clockwise; replicate to the next R nodes.

flowchart TD
    subgraph RING["Hash ring clockwise"]
        NA["Node A vnodes"]
        NB["Node B vnodes"]
        NC["Node C vnodes"]
        NA --> NB --> NC --> NA
    end
    K1["key k1"] --> NA
    K2["key k2"] --> NB
    K3["key k3"] --> NC
      

Who: DynamoDB, Cassandra, Riak, memcached clients (ketama), Akamai, Discord. Trade-offs: hotspots without vnodes; destroys key order (no ranges). Alternatives: rendezvous (HRW), jump, maglev.

8 · Merkle tree (hash tree)

Tree of hashes — leaves hash data blocks, parents hash children, and the single root hash fingerprints the dataset, enabling cheap diffing. Two replicas compare roots and recurse only into differing subtrees. Build O(n), locate a difference O(log n). Who: Dynamo/Cassandra/Riak anti-entropy repair, Git, IPFS, Bitcoin/Ethereum, ZFS.

Part 4 · Spatial indexing

9 · Geohash

Encodes (lat, long) into a 1D base-32 string by interleaving bits — a shared prefix ⇒ spatial proximity, so geo search becomes a B-tree prefix/range query. Longer hash = smaller cell. Handle the boundary problem by also checking the 8 neighbor cells. Who: Redis GEO, Elasticsearch/MongoDB geo; relatives are Google S2, Uber H3.

10 · Quadtree

Recursively subdivides 2D space into four quadrants, splitting a cell only when it holds too many points — so it adapts to density. Search/insert O(log n) when balanced. Great for non-uniform density, viewport and kNN queries. The on-disk DB counterpart is the R-tree (PostGIS, MySQL spatial).

Part 5 · Text & search

11 · Trie (prefix tree)

Tree keyed by string prefixes — every root-to-node path spells a prefix. Insert/search/prefix lookup O(L) (key length), independent of the number of keys. Compress with radix/Patricia tries or an FST. Best for autocomplete, spell-check, IP longest-prefix match; augment nodes with precomputed top-K for ranked suggestions.

12 · Inverted index

The core of search engines: a map from each term → posting list of documents (and positions). Index time tokenizes/normalizes/stems; query time fetches posting lists, intersects/unions them, then ranks with TF-IDF / BM25. Stored in immutable segments that merge LSM-style. Who: Elasticsearch/Lucene, Solr, Postgres GIN, Splunk.

Cheat sheet — choosing a structure

Structure Best for Key op Marquee users
Bloom filter Probably present / definitely absent O(k) Cassandra, RocksDB, HBase
Count-Min sketch Approx frequency / top-K O(d) Streaming analytics, ad counts
HyperLogLog Approx distinct count O(1) add Redis, BigQuery, Presto
LSM-tree + SSTable Write-heavy KV / ingest O(1) write Cassandra, RocksDB, HBase
B-tree / B+tree Reads + range + OLTP O(log n) PostgreSQL, MySQL InnoDB
Skip list In-memory ordered, concurrent O(log n) Redis ZSET, memtables
Consistent hashing Elastic sharding / caching O(log V) Dynamo, Cassandra, ketama
Merkle tree Integrity / replica diff O(log n) Cassandra repair, Git, blockchains
Geohash Proximity via prefix O(precision) Redis GEO, Elasticsearch
Quadtree / R-tree Density-adaptive spatial O(log n) Maps, games, PostGIS
Trie Prefix search / autocomplete O(L) Autocomplete, Lucene FST
Inverted index Full-text search O(1) term Elasticsearch, Lucene

How to pick, in one breath

Membership at scale → bloom filter. "How many / top-K" in a stream → count-min (+ heap). "How many unique" over a firehose → HyperLogLog. Writes dominate → LSM-tree; reads + ranges + transactions → B-tree. Scaling nodes without reshuffling → consistent hashing. Repairing replica divergence → Merkle tree. "Find nearby" → geohash / quadtree. Prefix / autocomplete → trie; full-text → inverted index.