Fundamentals

Data Structures for System Design

The building blocks that show up again and again in interviews and real distributed systems — what each one is, its Big-O, how it works, when to reach for it, who uses it in production, and the trade-off it makes. At scale every structure trades one resource for another: a little accuracy for huge memory savings, fast writes for slower reads, or ordering for elastic scaling.

Part 1 · Probabilistic / sketch structures

All three share a theme: sub-linear memory, bounded error, single-pass, and mergeable across shards (Bloom = bitwise OR, Count-Min = element-wise add, HyperLogLog = element-wise max). Reach for them when an exact answer is too expensive in memory and an approximate one is good enough.

1 · Bloom filter — "is x in the set?"

Space-efficient probabilistic set membership. A query returns either "might be in the set" (possible false positive) or "definitely not in the set" (never a false negative). A bit array of m bits and k hash functions; insert sets k bits, query checks them — any 0 ⇒ absent.

flowchart LR
    K["key x"] --> H1["hash1"]
    K --> H2["hash2"]
    K --> H3["hash3"]
    H1 --> B["Bit array m bits"]
    H2 --> B
    H3 --> B
    B --> Q{"All k bits set?"}
    Q -- No --> N["Definitely not in set"]
    Q -- Yes --> M["Maybe in set"]

Cost: insert/query O(k); ~9.6 bits/item for 1% FPR — tiny vs storing keys.
Use when: membership test + space-constrained + can tolerate false positives.
Who: Cassandra, HBase, RocksDB/LevelDB/Bigtable (per-SSTable filters), RedisBloom, crawler URL dedup.
Trade-offs: can't delete (use Counting Bloom) or resize (Scalable Bloom); can't enumerate. Cuckoo filters add deletes.

2 · Count-Min sketch — "roughly how many times have I seen x?"

Probabilistic frequency table for a stream. incr(x) to add, upper_bound(x) to read an over-estimate (never under-counts). A 2D array of counters (d rows × w cols) with one hash per row; the query takes the min of the d counters to strip collision inflation.

Cost: update/query O(d); error ≤ ε·N with prob ≥ 1−δ.
Use when: count queries on known items, space-constrained, tolerate an upper bound.
Who: heavy-hitters / top-K (sketch + min-heap), trending topics, ad-click counts, per-flow network counters.
Trade-offs: over-estimates only; can't list items (point queries) → top-K needs an auxiliary heap.

3 · HyperLogLog — "how many distinct items?"

Probabilistic cardinality estimator in tiny fixed memory. Hash each item, use the first p bits to pick one of m = 2^p registers, and track the longest run of leading zeros per register; cardinality comes from the harmonic mean. Two sketches merge by element-wise max.

Cost: add O(1); error ≈ 1.04/√m. Redis uses 16384 registers → ~0.81% error in ~12 KB.
Use when: count uniques over a firehose, limited memory, tolerate small error; mergeable across shards/windows.
Who: Redis PFADD/PFCOUNT, BigQuery APPROX_COUNT_DISTINCT, Presto, Druid, Elasticsearch.
Trade-offs: answers only cardinality; can't list members; overkill for tiny sets.

Structure	Answers	API	Error	Merge
Bloom filter	Is x a member?	add, contains	False positives only	bitwise OR
Count-Min sketch	How many x?	incr, upper_bound	Over-estimate only	element add
HyperLogLog	How many distinct?	add, card	±small %	element max

Part 2 · Storage-engine structures

4 · LSM-tree + SSTable (write-optimized)

Buffer writes in memory, flush them as immutable sorted files (SSTables), and merge in the background — optimized for high write throughput. A write appends to a WAL and a sorted in-memory memtable; when it fills, it flushes to an SSTable; background compaction merges files and drops tombstones. Reads check the memtable then SSTables newest→oldest, using a per-SSTable bloom filter to skip files.

sequenceDiagram
    participant C as Client
    participant W as WAL
    participant M as Memtable
    participant D as Disk SSTables
    C->>W: Append for durability
    C->>M: Insert into sorted memtable
    Note over M: Memtable fills up
    M->>D: Flush as immutable SSTable
    Note over D: Compaction merges SSTables
    C->>M: Read key
    C->>D: Miss checks Bloom then SSTables
    D-->>C: Value or not found

Who: RocksDB, LevelDB, Cassandra, HBase, ScyllaDB, Bigtable, DynamoDB storage. Trade-offs: superb writes but read & space/write amplification from compaction.

5 · B-tree / B+tree (read-optimized)

Balanced, high-fanout tree built for disk pages — the default relational index. Shallow (3–4 levels for billions of rows); a B+tree keeps values in linked leaves for fast range scans. Updates happen in place with node split/merge.

flowchart TD
    subgraph BT["B-Tree read optimized"]
        BR["Root"] --> BI1["Internal"]
        BR --> BI2["Internal"]
        BI1 --> BL1["Leaf"]
        BI1 --> BL2["Leaf"]
        BI2 --> BL3["Leaf"]
    end
    subgraph LSM["LSM-Tree write optimized"]
        LM["Memtable"] --> L0["L0 SSTables"]
        L0 --> L1["L1 SSTables"]
        L1 --> L2["L2 SSTables"]
    end

Rule of thumb: read-heavy + ranges + transactions → B-tree; write/ingest-heavy → LSM-tree.

Cost: search/insert/delete O(log n); range O(log n + k). Who: PostgreSQL, MySQL/InnoDB, Oracle, SQLite, MongoDB (WiredTiger).

6 · Skip list

Probabilistic ordered structure of linked lists with "express-lane" levels — balanced-tree performance without rotations and easy to make concurrent. Search/insert/delete O(log n) expected. Who: Redis sorted sets (ZSET), RocksDB/LevelDB memtables, Lucene, ConcurrentSkipListMap.

Part 3 · Distribution & integrity

7 · Consistent hashing

Maps keys and nodes onto a ring so adding/removing a node remaps only ~K/N keys instead of almost all of them — essential for elastic sharding. Use many virtual nodes per physical node for even load; a key is owned by the first node clockwise; replicate to the next R nodes.

flowchart TD
    subgraph RING["Hash ring clockwise"]
        NA["Node A vnodes"]
        NB["Node B vnodes"]
        NC["Node C vnodes"]
        NA --> NB --> NC --> NA
    end
    K1["key k1"] --> NA
    K2["key k2"] --> NB
    K3["key k3"] --> NC

Who: DynamoDB, Cassandra, Riak, memcached clients (ketama), Akamai, Discord. Trade-offs: hotspots without vnodes; destroys key order (no ranges). Alternatives: rendezvous (HRW), jump, maglev.

8 · Merkle tree (hash tree)

Tree of hashes — leaves hash data blocks, parents hash children, and the single root hash fingerprints the dataset, enabling cheap diffing. Two replicas compare roots and recurse only into differing subtrees. Build O(n), locate a difference O(log n). Who: Dynamo/Cassandra/Riak anti-entropy repair, Git, IPFS, Bitcoin/Ethereum, ZFS.

Part 4 · Spatial indexing

9 · Geohash

Encodes (lat, long) into a 1D base-32 string by interleaving bits — a shared prefix ⇒ spatial proximity, so geo search becomes a B-tree prefix/range query. Longer hash = smaller cell. Handle the boundary problem by also checking the 8 neighbor cells. Who: Redis GEO, Elasticsearch/MongoDB geo; relatives are Google S2, Uber H3.

10 · Quadtree

Recursively subdivides 2D space into four quadrants, splitting a cell only when it holds too many points — so it adapts to density. Search/insert O(log n) when balanced. Great for non-uniform density, viewport and kNN queries. The on-disk DB counterpart is the R-tree (PostGIS, MySQL spatial).

Part 5 · Text & search

11 · Trie (prefix tree)

Tree keyed by string prefixes — every root-to-node path spells a prefix. Insert/search/prefix lookup O(L) (key length), independent of the number of keys. Compress with radix/Patricia tries or an FST. Best for autocomplete, spell-check, IP longest-prefix match; augment nodes with precomputed top-K for ranked suggestions.

12 · Inverted index

The core of search engines: a map from each term → posting list of documents (and positions). Index time tokenizes/normalizes/stems; query time fetches posting lists, intersects/unions them, then ranks with TF-IDF / BM25. Stored in immutable segments that merge LSM-style. Who: Elasticsearch/Lucene, Solr, Postgres GIN, Splunk.

Cheat sheet — choosing a structure

Structure	Best for	Key op	Marquee users
Bloom filter	Probably present / definitely absent	`O(k)`	Cassandra, RocksDB, HBase
Count-Min sketch	Approx frequency / top-K	`O(d)`	Streaming analytics, ad counts
HyperLogLog	Approx distinct count	`O(1)` add	Redis, BigQuery, Presto
LSM-tree + SSTable	Write-heavy KV / ingest	`O(1)` write	Cassandra, RocksDB, HBase
B-tree / B+tree	Reads + range + OLTP	`O(log n)`	PostgreSQL, MySQL InnoDB
Skip list	In-memory ordered, concurrent	`O(log n)`	Redis ZSET, memtables
Consistent hashing	Elastic sharding / caching	`O(log V)`	Dynamo, Cassandra, ketama
Merkle tree	Integrity / replica diff	`O(log n)`	Cassandra repair, Git, blockchains
Geohash	Proximity via prefix	`O(precision)`	Redis GEO, Elasticsearch
Quadtree / R-tree	Density-adaptive spatial	`O(log n)`	Maps, games, PostGIS
Trie	Prefix search / autocomplete	`O(L)`	Autocomplete, Lucene FST
Inverted index	Full-text search	`O(1)` term	Elasticsearch, Lucene

How to pick, in one breath

Membership at scale → bloom filter. "How many / top-K" in a stream → count-min (+ heap). "How many unique" over a firehose → HyperLogLog. Writes dominate → LSM-tree; reads + ranges + transactions → B-tree. Scaling nodes without reshuffling → consistent hashing. Repairing replica divergence → Merkle tree. "Find nearby" → geohash / quadtree. Prefix / autocomplete → trie; full-text → inverted index.