System Design Notes All designs

Fundamentals

Database Indexing

An index is an auxiliary structure that trades extra storage and slower writes for dramatically faster reads. Picking the right index type is one of the highest-leverage decisions in system design — and it ultimately drives which database you choose.

What an index is & why it exists

An index is a secondary structure derived from the primary data that lets the engine find rows without scanning the whole table. Without one, WHERE user_id = 42 is a full table scan — O(n). With one, the engine walks a balanced tree or probes a hash and lands directly — O(log n) or O(1) (≈23 hops for 10M rows vs 10M comparisons).

You gain You pay
Faster reads on indexed columns Slower writes (every write updates the index too)
Sorted access / range scans (tree indexes) Extra storage (~10–30% of table size per index)
Sometimes index-only reads Memory pressure + write amplification

Staff-level framing

Indexes don't make a database "faster" — they shift cost from read time to write time and storage. The skill is knowing which reads matter enough to pay that tax.

The decision framework

Two gates decide whether an index is even worth it; then the data shape decides which index.

flowchart TD
    A{"Need efficient data access?"} -->|No| FTS["Full Table Scan"]
    A -->|Yes| B{"Table over 10k rows?"}
    B -->|No| FTS
    B -->|Yes| C{"What type of data?"}
    C -->|Full text search| INV["Inverted Index"]
    C -->|Location data| GEO["Geospatial Index"]
    C -->|In-memory exact match| HASH["Hash Index"]
    C -->|Everything else| BTREE["B-Tree"]
      

Small tables live in memory — scanning is often faster than index indirection and keeps writes cheap. For large tables with real query patterns, pick by data shape; the B-tree is the 90% case.

B-tree / B+tree — the default ("everything else")

A balanced, high-fan-out tree. Each node is one disk page holding hundreds of keys, so the tree stays shallow (3–4 levels for billions of rows). In a B+tree all values live in the leaves and leaves are linked left-to-right — that's what makes range scans and ORDER BY cheap.

flowchart TD
    R["Root [30 | 60]"]
    R -->|"key under 30"| A["[10 | 20]"]
    R -->|"30 to 59"| B["[40 | 50]"]
    R -->|"60 and up"| C["[70 | 80]"]
    A --> LA["Leaf 10 20 25 to rows"]
    B --> LB["Leaf 40 50 55 to rows"]
    C --> LC["Leaf 70 80 90 to rows"]
    LA -.next.-> LB
    LB -.next.-> LC
      

Great at: equality O(log n), ranges (walk linked leaves), sorted retrieval, predictable read latency. Costs: writes are random I/O — inserts may split nodes and fragment pages under heavy write load.

LSM-tree — the write-optimized alternative

Never do random writes: buffer in memory, flush sorted immutable files, clean up later. Powers Cassandra, ScyllaDB, RocksDB, LevelDB, HBase.

flowchart TD
    W["Write / Update / Delete"] --> WAL["WAL on disk (durability)"]
    W --> MEM["Memtable (sorted, in-memory)"]
    MEM -->|"flush when full"| S0["SSTable L0 (immutable)"]
    S0 -->|compaction| S1["SSTable L1"]
    S1 -->|compaction| S2["SSTable L2 (largest)"]
    RD["Read"] --> MEM
    RD --> BF["Bloom filter per SSTable"]
    BF --> S0
    BF --> S1
      

Writes append to a WAL + sorted memtable; full memtables flush to immutable SSTables; background compaction merges them and drops tombstones. Reads check memtable then SSTables newest→oldest, with a per-SSTable bloom filter to skip files.

B-tree vs LSM — know this cold

Dimension B-Tree LSM-Tree
Write pattern In-place, random I/O Append-only, sequential I/O
Write throughput Moderate Very high
Read latency Low, predictable Higher / variable (multi-SSTable + bloom)
Range scans Excellent (linked leaves) Good but merges across levels
Write amplification Lower (page rewrites) Higher (rewritten across levels)
Background work Minimal Compaction (CPU/IO spikes)
Best for Read-heavy, OLTP, ranges Write-heavy, time-series, logs

Clustered vs secondary indexes

This is about where the row data lives. A clustered index is the table — leaves hold the full rows, physically ordered by key (only one per table; the PK in InnoDB). A secondary index is a separate structure whose leaves hold the key + a pointer back to the row (many allowed).

flowchart LR
    subgraph SEC["Secondary index on email"]
        SK["jane@x.com"] --> PTR["PK = 101"]
    end
    subgraph CLU["Clustered index on PK"]
        CK["101"] --> ROW["Full row 101 Jane ..."]
        CK2["102"] --> ROW2["Full row 102 ..."]
    end
    PTR -.second lookup.-> CK
      

A secondary-index query does two traversals (the "double lookup"): walk the secondary index to the PK, then walk the clustered index to fetch the row. InnoDB clusters by PK (keep PKs narrow — every secondary index stores the PK); Postgres uses an unordered heap with all-secondary indexes pointing at a tuple ID, which is why it needs VACUUM and supports index-only scans.

Composite & covering indexes

Composite — the leftmost-prefix rule

An index on (a, b, c) is sorted by a, then b, then c. It serves a, a+b, a+b+c — but not b alone (skips the leftmost column). Like a phone book sorted by (last, first): great for "Smith, John," useless for everyone named "John."

Covering index (index-only scan)

Contains every column a query needs, so the engine answers from the index and never touches the table — killing the double lookup. Postgres/SQL Server use INCLUDE (...); MySQL uses a composite that is a superset of selected + filtered columns. Trade-off: wider index = more storage + slower writes, so cover only hot queries.

Inverted, geospatial & hash indexes

Inverted index — full-text search

A leading-wildcard LIKE '%fast%' cannot use a B-tree (sorted by prefix). An inverted index precomputes term → posting list, so "fast" is an O(1)-ish lookup; multi-word queries intersect/union posting lists. Real engines add stemming, stop-words, positions/phrases, and BM25 ranking.

Geospatial — location is 2D

Hash index — in-memory exact match

O(1) equality lookups, but no ranges, no ordering (hashing destroys order) and wants to live in RAM. Used by Redis hash tables, in-execution join hash tables, Postgres USING hash.

When NOT to index

How indexing maps to choosing a database

"What index do I need?" is really "what storage engine do I need?" — each engine bakes in an index philosophy.

Workload signal Index that fits Databases that lead with it
General OLTP, ranges, joins B+tree PostgreSQL, MySQL/InnoDB, SQL Server
Massive write ingest, time-series LSM-tree Cassandra, ScyllaDB, RocksDB, HBase
In-memory key/value exact match Hash Redis, Memcached, DynamoDB
Full-text / relevance search Inverted index Elasticsearch, Solr/Lucene, Postgres GIN
Geospatial / "near me" Geohash / quadtree / R-tree PostGIS, Redis GEO, MongoDB 2dsphere
Analytics over huge scans (OLAP) Columnar + zone maps ClickHouse, BigQuery, Snowflake

Interview move

Reason from access pattern → index → engine. "Heavy writes, key reads" → LSM → Cassandra. "Relational + ad-hoc ranges" → B-tree → Postgres. "Search box with ranking" → inverted index → Elasticsearch. "Drivers within 2 km" → geo index → Redis GEO / PostGIS.