Fundamentals
Data Structures for System Design
The building blocks that show up again and again in interviews and real distributed systems — what each one is, its Big-O, how it works, when to reach for it, who uses it in production, and the trade-off it makes. At scale every structure trades one resource for another: a little accuracy for huge memory savings, fast writes for slower reads, or ordering for elastic scaling.
Part 1 · Probabilistic / sketch structures
All three share a theme: sub-linear memory, bounded error, single-pass, and mergeable across shards (Bloom = bitwise OR, Count-Min = element-wise add, HyperLogLog = element-wise max). Reach for them when an exact answer is too expensive in memory and an approximate one is good enough.
1 · Bloom filter — "is x in the set?"
Space-efficient probabilistic set membership. A query
returns either "might be in the set" (possible false
positive) or "definitely not in the set" (never a false
negative). A bit array of m bits and k hash
functions; insert sets k bits, query checks them — any 0
⇒ absent.
flowchart LR
K["key x"] --> H1["hash1"]
K --> H2["hash2"]
K --> H3["hash3"]
H1 --> B["Bit array m bits"]
H2 --> B
H3 --> B
B --> Q{"All k bits set?"}
Q -- No --> N["Definitely not in set"]
Q -- Yes --> M["Maybe in set"]
-
Cost: insert/query
O(k); ~9.6 bits/item for 1% FPR — tiny vs storing keys. - Use when: membership test + space-constrained + can tolerate false positives.
- Who: Cassandra, HBase, RocksDB/LevelDB/Bigtable (per-SSTable filters), RedisBloom, crawler URL dedup.
- Trade-offs: can't delete (use Counting Bloom) or resize (Scalable Bloom); can't enumerate. Cuckoo filters add deletes.
2 · Count-Min sketch — "roughly how many times have I seen x?"
Probabilistic frequency table for a stream.
incr(x) to add, upper_bound(x) to read an
over-estimate (never under-counts). A 2D array of
counters (d rows × w cols) with one hash per
row; the query takes the min of the
d counters to strip collision inflation.
-
Cost: update/query
O(d); error ≤ε·Nwith prob ≥1−δ. - Use when: count queries on known items, space-constrained, tolerate an upper bound.
- Who: heavy-hitters / top-K (sketch + min-heap), trending topics, ad-click counts, per-flow network counters.
- Trade-offs: over-estimates only; can't list items (point queries) → top-K needs an auxiliary heap.
3 · HyperLogLog — "how many distinct items?"
Probabilistic cardinality estimator in tiny fixed
memory. Hash each item, use the first p bits to pick one
of m = 2^p registers, and track the longest run of
leading zeros per register; cardinality comes from the harmonic mean.
Two sketches merge by element-wise max.
-
Cost: add
O(1); error ≈1.04/√m. Redis uses 16384 registers → ~0.81% error in ~12 KB. - Use when: count uniques over a firehose, limited memory, tolerate small error; mergeable across shards/windows.
-
Who: Redis
PFADD/PFCOUNT, BigQueryAPPROX_COUNT_DISTINCT, Presto, Druid, Elasticsearch. - Trade-offs: answers only cardinality; can't list members; overkill for tiny sets.
| Structure | Answers | API | Error | Merge |
|---|---|---|---|---|
| Bloom filter | Is x a member? | add, contains | False positives only | bitwise OR |
| Count-Min sketch | How many x? | incr, upper_bound | Over-estimate only | element add |
| HyperLogLog | How many distinct? | add, card | ±small % | element max |
Part 2 · Storage-engine structures
4 · LSM-tree + SSTable (write-optimized)
Buffer writes in memory, flush them as immutable sorted files (SSTables), and merge in the background — optimized for high write throughput. A write appends to a WAL and a sorted in-memory memtable; when it fills, it flushes to an SSTable; background compaction merges files and drops tombstones. Reads check the memtable then SSTables newest→oldest, using a per-SSTable bloom filter to skip files.
sequenceDiagram
participant C as Client
participant W as WAL
participant M as Memtable
participant D as Disk SSTables
C->>W: Append for durability
C->>M: Insert into sorted memtable
Note over M: Memtable fills up
M->>D: Flush as immutable SSTable
Note over D: Compaction merges SSTables
C->>M: Read key
C->>D: Miss checks Bloom then SSTables
D-->>C: Value or not found
Who: RocksDB, LevelDB, Cassandra, HBase, ScyllaDB, Bigtable, DynamoDB storage. Trade-offs: superb writes but read & space/write amplification from compaction.
5 · B-tree / B+tree (read-optimized)
Balanced, high-fanout tree built for disk pages — the default relational index. Shallow (3–4 levels for billions of rows); a B+tree keeps values in linked leaves for fast range scans. Updates happen in place with node split/merge.
flowchart TD
subgraph BT["B-Tree read optimized"]
BR["Root"] --> BI1["Internal"]
BR --> BI2["Internal"]
BI1 --> BL1["Leaf"]
BI1 --> BL2["Leaf"]
BI2 --> BL3["Leaf"]
end
subgraph LSM["LSM-Tree write optimized"]
LM["Memtable"] --> L0["L0 SSTables"]
L0 --> L1["L1 SSTables"]
L1 --> L2["L2 SSTables"]
end
Rule of thumb: read-heavy + ranges + transactions → B-tree; write/ingest-heavy → LSM-tree.
Cost: search/insert/delete O(log n);
range O(log n + k). Who: PostgreSQL,
MySQL/InnoDB, Oracle, SQLite, MongoDB (WiredTiger).
6 · Skip list
Probabilistic ordered structure of linked lists with "express-lane"
levels — balanced-tree performance
without rotations and easy to make concurrent.
Search/insert/delete O(log n) expected.
Who: Redis sorted sets (ZSET), RocksDB/LevelDB
memtables, Lucene, ConcurrentSkipListMap.
Part 3 · Distribution & integrity
7 · Consistent hashing
Maps keys and nodes onto a ring so adding/removing a node remaps only
~K/N keys instead of almost all of them — essential for
elastic sharding. Use many virtual nodes per physical
node for even load; a key is owned by the first node clockwise;
replicate to the next R nodes.
flowchart TD
subgraph RING["Hash ring clockwise"]
NA["Node A vnodes"]
NB["Node B vnodes"]
NC["Node C vnodes"]
NA --> NB --> NC --> NA
end
K1["key k1"] --> NA
K2["key k2"] --> NB
K3["key k3"] --> NC
Who: DynamoDB, Cassandra, Riak, memcached clients (ketama), Akamai, Discord. Trade-offs: hotspots without vnodes; destroys key order (no ranges). Alternatives: rendezvous (HRW), jump, maglev.
8 · Merkle tree (hash tree)
Tree of hashes — leaves hash data blocks, parents hash children, and
the single root hash fingerprints the dataset,
enabling cheap diffing. Two replicas compare roots
and recurse only into differing subtrees. Build O(n),
locate a difference O(log n).
Who: Dynamo/Cassandra/Riak anti-entropy repair, Git,
IPFS, Bitcoin/Ethereum, ZFS.
Part 4 · Spatial indexing
9 · Geohash
Encodes (lat, long) into a 1D
base-32 string by interleaving bits — a
shared prefix ⇒ spatial proximity, so geo search
becomes a B-tree prefix/range query. Longer hash = smaller cell.
Handle the boundary problem by also checking the 8
neighbor cells. Who: Redis GEO, Elasticsearch/MongoDB
geo; relatives are Google S2, Uber H3.
10 · Quadtree
Recursively subdivides 2D space into four quadrants, splitting a cell
only when it holds too many points — so it
adapts to density. Search/insert
O(log n) when balanced. Great for non-uniform density,
viewport and kNN queries. The on-disk DB counterpart is the
R-tree (PostGIS, MySQL spatial).
Part 5 · Text & search
11 · Trie (prefix tree)
Tree keyed by string prefixes — every root-to-node path spells a
prefix. Insert/search/prefix lookup O(L) (key length),
independent of the number of keys. Compress with radix/Patricia tries
or an FST. Best for autocomplete,
spell-check, IP longest-prefix match; augment nodes with precomputed
top-K for ranked suggestions.
12 · Inverted index
The core of search engines: a map from each term → posting list of documents (and positions). Index time tokenizes/normalizes/stems; query time fetches posting lists, intersects/unions them, then ranks with TF-IDF / BM25. Stored in immutable segments that merge LSM-style. Who: Elasticsearch/Lucene, Solr, Postgres GIN, Splunk.
Cheat sheet — choosing a structure
| Structure | Best for | Key op | Marquee users |
|---|---|---|---|
| Bloom filter | Probably present / definitely absent | O(k) |
Cassandra, RocksDB, HBase |
| Count-Min sketch | Approx frequency / top-K | O(d) |
Streaming analytics, ad counts |
| HyperLogLog | Approx distinct count | O(1) add |
Redis, BigQuery, Presto |
| LSM-tree + SSTable | Write-heavy KV / ingest | O(1) write |
Cassandra, RocksDB, HBase |
| B-tree / B+tree | Reads + range + OLTP | O(log n) |
PostgreSQL, MySQL InnoDB |
| Skip list | In-memory ordered, concurrent | O(log n) |
Redis ZSET, memtables |
| Consistent hashing | Elastic sharding / caching | O(log V) |
Dynamo, Cassandra, ketama |
| Merkle tree | Integrity / replica diff | O(log n) |
Cassandra repair, Git, blockchains |
| Geohash | Proximity via prefix | O(precision) |
Redis GEO, Elasticsearch |
| Quadtree / R-tree | Density-adaptive spatial | O(log n) |
Maps, games, PostGIS |
| Trie | Prefix search / autocomplete | O(L) |
Autocomplete, Lucene FST |
| Inverted index | Full-text search | O(1) term |
Elasticsearch, Lucene |
How to pick, in one breath
Membership at scale → bloom filter. "How many / top-K" in a stream → count-min (+ heap). "How many unique" over a firehose → HyperLogLog. Writes dominate → LSM-tree; reads + ranges + transactions → B-tree. Scaling nodes without reshuffling → consistent hashing. Repairing replica divergence → Merkle tree. "Find nearby" → geohash / quadtree. Prefix / autocomplete → trie; full-text → inverted index.