Fundamentals

Database Indexing

An index is an auxiliary structure that trades extra storage and slower writes for dramatically faster reads. Picking the right index type is one of the highest-leverage decisions in system design — and it ultimately drives which database you choose.

What an index is & why it exists

An index is a secondary structure derived from the primary data that lets the engine find rows without scanning the whole table. Without one, WHERE user_id = 42 is a full table scan — O(n). With one, the engine walks a balanced tree or probes a hash and lands directly — O(log n) or O(1) (≈23 hops for 10M rows vs 10M comparisons).

You gain	You pay
Faster reads on indexed columns	Slower writes (every write updates the index too)
Sorted access / range scans (tree indexes)	Extra storage (~10–30% of table size per index)
Sometimes index-only reads	Memory pressure + write amplification

Staff-level framing

Indexes don't make a database "faster" — they shift cost from read time to write time and storage. The skill is knowing which reads matter enough to pay that tax.

The decision framework

Two gates decide whether an index is even worth it; then the data shape decides which index.

flowchart TD
    A{"Need efficient data access?"} -->|No| FTS["Full Table Scan"]
    A -->|Yes| B{"Table over 10k rows?"}
    B -->|No| FTS
    B -->|Yes| C{"What type of data?"}
    C -->|Full text search| INV["Inverted Index"]
    C -->|Location data| GEO["Geospatial Index"]
    C -->|In-memory exact match| HASH["Hash Index"]
    C -->|Everything else| BTREE["B-Tree"]

Small tables live in memory — scanning is often faster than index indirection and keeps writes cheap. For large tables with real query patterns, pick by data shape; the B-tree is the 90% case.

B-tree / B+tree — the default ("everything else")

A balanced, high-fan-out tree. Each node is one disk page holding hundreds of keys, so the tree stays shallow (3–4 levels for billions of rows). In a B+tree all values live in the leaves and leaves are linked left-to-right — that's what makes range scans and ORDER BY cheap.

flowchart TD
    R["Root [30 | 60]"]
    R -->|"key under 30"| A["[10 | 20]"]
    R -->|"30 to 59"| B["[40 | 50]"]
    R -->|"60 and up"| C["[70 | 80]"]
    A --> LA["Leaf 10 20 25 to rows"]
    B --> LB["Leaf 40 50 55 to rows"]
    C --> LC["Leaf 70 80 90 to rows"]
    LA -.next.-> LB
    LB -.next.-> LC

Great at: equality O(log n), ranges (walk linked leaves), sorted retrieval, predictable read latency. Costs: writes are random I/O — inserts may split nodes and fragment pages under heavy write load.

LSM-tree — the write-optimized alternative

Never do random writes: buffer in memory, flush sorted immutable files, clean up later. Powers Cassandra, ScyllaDB, RocksDB, LevelDB, HBase.

flowchart TD
    W["Write / Update / Delete"] --> WAL["WAL on disk (durability)"]
    W --> MEM["Memtable (sorted, in-memory)"]
    MEM -->|"flush when full"| S0["SSTable L0 (immutable)"]
    S0 -->|compaction| S1["SSTable L1"]
    S1 -->|compaction| S2["SSTable L2 (largest)"]
    RD["Read"] --> MEM
    RD --> BF["Bloom filter per SSTable"]
    BF --> S0
    BF --> S1

Writes append to a WAL + sorted memtable; full memtables flush to immutable SSTables; background compaction merges them and drops tombstones. Reads check memtable then SSTables newest→oldest, with a per-SSTable bloom filter to skip files.

B-tree vs LSM — know this cold

Dimension	B-Tree	LSM-Tree
Write pattern	In-place, random I/O	Append-only, sequential I/O
Write throughput	Moderate	Very high
Read latency	Low, predictable	Higher / variable (multi-SSTable + bloom)
Range scans	Excellent (linked leaves)	Good but merges across levels
Write amplification	Lower (page rewrites)	Higher (rewritten across levels)
Background work	Minimal	Compaction (CPU/IO spikes)
Best for	Read-heavy, OLTP, ranges	Write-heavy, time-series, logs

Clustered vs secondary indexes

This is about where the row data lives. A clustered index is the table — leaves hold the full rows, physically ordered by key (only one per table; the PK in InnoDB). A secondary index is a separate structure whose leaves hold the key + a pointer back to the row (many allowed).

flowchart LR
    subgraph SEC["Secondary index on email"]
        SK["jane@x.com"] --> PTR["PK = 101"]
    end
    subgraph CLU["Clustered index on PK"]
        CK["101"] --> ROW["Full row 101 Jane ..."]
        CK2["102"] --> ROW2["Full row 102 ..."]
    end
    PTR -.second lookup.-> CK

A secondary-index query does two traversals (the "double lookup"): walk the secondary index to the PK, then walk the clustered index to fetch the row. InnoDB clusters by PK (keep PKs narrow — every secondary index stores the PK); Postgres uses an unordered heap with all-secondary indexes pointing at a tuple ID, which is why it needs VACUUM and supports index-only scans.

Composite & covering indexes

Composite — the leftmost-prefix rule

An index on (a, b, c) is sorted by a, then b, then c. It serves a, a+b, a+b+c — but not b alone (skips the leftmost column). Like a phone book sorted by (last, first): great for "Smith, John," useless for everyone named "John."

Equality before range: once you hit a range (>, BETWEEN, LIKE 'x%'), columns after it can't seek — put = predicates first, range last.
Match your most common query's prefix; one smart composite often replaces several single-column indexes.

Covering index (index-only scan)

Contains every column a query needs, so the engine answers from the index and never touches the table — killing the double lookup. Postgres/SQL Server use INCLUDE (...); MySQL uses a composite that is a superset of selected + filtered columns. Trade-off: wider index = more storage + slower writes, so cover only hot queries.

Inverted, geospatial & hash indexes

Inverted index — full-text search

A leading-wildcard LIKE '%fast%' cannot use a B-tree (sorted by prefix). An inverted index precomputes term → posting list, so "fast" is an O(1)-ish lookup; multi-word queries intersect/union posting lists. Real engines add stemming, stop-words, positions/phrases, and BM25 ranking.

Geospatial — location is 2D

Geohash: interleaves lat/long bits into a 1D sortable string (nearby ⇒ shared prefix), e.g. LA = 9qh16. "Find nearby" = prefix range scan + the 8 neighbor cells. Powers Redis GEO.
Quadtree: splits a cell into 4 only when it exceeds k points (density-adaptive).
R-tree: groups objects into minimum bounding rectangles — handles shapes/polygons; powers PostGIS.

Hash index — in-memory exact match

O(1) equality lookups, but no ranges, no ordering (hashing destroys order) and wants to live in RAM. Used by Redis hash tables, in-execution join hash tables, Postgres USING hash.

When NOT to index

Small tables (< ~10k rows) — the optimizer scans anyway; the index just slows writes.
Rarely filtered columns — unused indexes are pure write + storage tax.
Very low cardinality (e.g. a boolean) — a predicate matching half the table is cheaper as a scan.
Write-heavy, read-rarely tables — each index multiplies write cost; favor LSM engines.
Over-indexing — audit and drop unused indexes; they confuse the planner and slow every write.

How indexing maps to choosing a database

"What index do I need?" is really "what storage engine do I need?" — each engine bakes in an index philosophy.

Workload signal	Index that fits	Databases that lead with it
General OLTP, ranges, joins	B+tree	PostgreSQL, MySQL/InnoDB, SQL Server
Massive write ingest, time-series	LSM-tree	Cassandra, ScyllaDB, RocksDB, HBase
In-memory key/value exact match	Hash	Redis, Memcached, DynamoDB
Full-text / relevance search	Inverted index	Elasticsearch, Solr/Lucene, Postgres GIN
Geospatial / "near me"	Geohash / quadtree / R-tree	PostGIS, Redis GEO, MongoDB 2dsphere
Analytics over huge scans (OLAP)	Columnar + zone maps	ClickHouse, BigQuery, Snowflake

Interview move

Reason from access pattern → index → engine. "Heavy writes, key reads" → LSM → Cassandra. "Relational + ad-hoc ranges" → B-tree → Postgres. "Search box with ranking" → inverted index → Elasticsearch. "Drivers within 2 km" → geo index → Redis GEO / PostGIS.