Fundamentals
Database Indexing
An index is an auxiliary structure that trades extra storage and slower writes for dramatically faster reads. Picking the right index type is one of the highest-leverage decisions in system design — and it ultimately drives which database you choose.
What an index is & why it exists
An index is a
secondary structure derived from the primary data
that lets the engine find rows without scanning the whole table.
Without one, WHERE user_id = 42 is a full table scan —
O(n). With one, the engine walks a balanced tree or
probes a hash and lands directly — O(log n) or
O(1) (≈23 hops for 10M rows vs 10M comparisons).
| You gain | You pay |
|---|---|
| Faster reads on indexed columns | Slower writes (every write updates the index too) |
| Sorted access / range scans (tree indexes) | Extra storage (~10–30% of table size per index) |
| Sometimes index-only reads | Memory pressure + write amplification |
Staff-level framing
Indexes don't make a database "faster" — they shift cost from read time to write time and storage. The skill is knowing which reads matter enough to pay that tax.
The decision framework
Two gates decide whether an index is even worth it; then the data shape decides which index.
flowchart TD
A{"Need efficient data access?"} -->|No| FTS["Full Table Scan"]
A -->|Yes| B{"Table over 10k rows?"}
B -->|No| FTS
B -->|Yes| C{"What type of data?"}
C -->|Full text search| INV["Inverted Index"]
C -->|Location data| GEO["Geospatial Index"]
C -->|In-memory exact match| HASH["Hash Index"]
C -->|Everything else| BTREE["B-Tree"]
Small tables live in memory — scanning is often faster than index indirection and keeps writes cheap. For large tables with real query patterns, pick by data shape; the B-tree is the 90% case.
B-tree / B+tree — the default ("everything else")
A balanced, high-fan-out tree. Each node is one disk page holding
hundreds of keys, so the tree stays shallow (3–4 levels for billions
of rows). In a B+tree all values live in the leaves
and leaves are linked left-to-right — that's what
makes range scans and ORDER BY cheap.
flowchart TD
R["Root [30 | 60]"]
R -->|"key under 30"| A["[10 | 20]"]
R -->|"30 to 59"| B["[40 | 50]"]
R -->|"60 and up"| C["[70 | 80]"]
A --> LA["Leaf 10 20 25 to rows"]
B --> LB["Leaf 40 50 55 to rows"]
C --> LC["Leaf 70 80 90 to rows"]
LA -.next.-> LB
LB -.next.-> LC
Great at: equality O(log n), ranges
(walk linked leaves), sorted retrieval, predictable read latency.
Costs: writes are random I/O — inserts may split
nodes and fragment pages under heavy write load.
LSM-tree — the write-optimized alternative
Never do random writes: buffer in memory, flush sorted immutable files, clean up later. Powers Cassandra, ScyllaDB, RocksDB, LevelDB, HBase.
flowchart TD
W["Write / Update / Delete"] --> WAL["WAL on disk (durability)"]
W --> MEM["Memtable (sorted, in-memory)"]
MEM -->|"flush when full"| S0["SSTable L0 (immutable)"]
S0 -->|compaction| S1["SSTable L1"]
S1 -->|compaction| S2["SSTable L2 (largest)"]
RD["Read"] --> MEM
RD --> BF["Bloom filter per SSTable"]
BF --> S0
BF --> S1
Writes append to a WAL + sorted memtable; full memtables flush to immutable SSTables; background compaction merges them and drops tombstones. Reads check memtable then SSTables newest→oldest, with a per-SSTable bloom filter to skip files.
B-tree vs LSM — know this cold
| Dimension | B-Tree | LSM-Tree |
|---|---|---|
| Write pattern | In-place, random I/O | Append-only, sequential I/O |
| Write throughput | Moderate | Very high |
| Read latency | Low, predictable | Higher / variable (multi-SSTable + bloom) |
| Range scans | Excellent (linked leaves) | Good but merges across levels |
| Write amplification | Lower (page rewrites) | Higher (rewritten across levels) |
| Background work | Minimal | Compaction (CPU/IO spikes) |
| Best for | Read-heavy, OLTP, ranges | Write-heavy, time-series, logs |
Clustered vs secondary indexes
This is about where the row data lives. A clustered index is the table — leaves hold the full rows, physically ordered by key (only one per table; the PK in InnoDB). A secondary index is a separate structure whose leaves hold the key + a pointer back to the row (many allowed).
flowchart LR
subgraph SEC["Secondary index on email"]
SK["jane@x.com"] --> PTR["PK = 101"]
end
subgraph CLU["Clustered index on PK"]
CK["101"] --> ROW["Full row 101 Jane ..."]
CK2["102"] --> ROW2["Full row 102 ..."]
end
PTR -.second lookup.-> CK
A secondary-index query does two traversals (the
"double lookup"): walk the secondary index to the PK, then walk the
clustered index to fetch the row. InnoDB clusters by PK (keep PKs
narrow — every secondary index stores the PK); Postgres uses an
unordered heap with all-secondary indexes pointing at a tuple ID,
which is why it needs VACUUM and supports index-only
scans.
Composite & covering indexes
Composite — the leftmost-prefix rule
An index on (a, b, c) is sorted by a, then
b, then c. It serves a,
a+b, a+b+c — but not
b alone (skips the leftmost column). Like a phone book
sorted by (last, first): great for "Smith, John," useless for everyone
named "John."
-
Equality before range: once you hit a range
(
>,BETWEEN,LIKE 'x%'), columns after it can't seek — put=predicates first, range last. - Match your most common query's prefix; one smart composite often replaces several single-column indexes.
Covering index (index-only scan)
Contains every column a query needs, so the engine answers from the
index and never touches the table — killing the double lookup.
Postgres/SQL Server use INCLUDE (...); MySQL uses a
composite that is a superset of selected + filtered columns.
Trade-off: wider index = more storage + slower writes, so cover only
hot queries.
Inverted, geospatial & hash indexes
Inverted index — full-text search
A leading-wildcard LIKE '%fast%'
cannot use a B-tree (sorted by prefix). An inverted
index precomputes term → posting list, so
"fast" is an O(1)-ish lookup; multi-word
queries intersect/union posting lists. Real engines add stemming,
stop-words, positions/phrases, and BM25 ranking.
Geospatial — location is 2D
-
Geohash: interleaves lat/long bits into a 1D
sortable string (nearby ⇒ shared prefix), e.g.
LA = 9qh16. "Find nearby" = prefix range scan + the 8 neighbor cells. Powers Redis GEO. -
Quadtree: splits a cell into 4 only when it exceeds
kpoints (density-adaptive). - R-tree: groups objects into minimum bounding rectangles — handles shapes/polygons; powers PostGIS.
Hash index — in-memory exact match
O(1) equality lookups, but
no ranges, no ordering (hashing destroys order) and
wants to live in RAM. Used by Redis hash tables, in-execution join
hash tables, Postgres USING hash.
When NOT to index
- Small tables (< ~10k rows) — the optimizer scans anyway; the index just slows writes.
- Rarely filtered columns — unused indexes are pure write + storage tax.
- Very low cardinality (e.g. a boolean) — a predicate matching half the table is cheaper as a scan.
- Write-heavy, read-rarely tables — each index multiplies write cost; favor LSM engines.
- Over-indexing — audit and drop unused indexes; they confuse the planner and slow every write.
How indexing maps to choosing a database
"What index do I need?" is really "what storage engine do I need?" — each engine bakes in an index philosophy.
| Workload signal | Index that fits | Databases that lead with it |
|---|---|---|
| General OLTP, ranges, joins | B+tree | PostgreSQL, MySQL/InnoDB, SQL Server |
| Massive write ingest, time-series | LSM-tree | Cassandra, ScyllaDB, RocksDB, HBase |
| In-memory key/value exact match | Hash | Redis, Memcached, DynamoDB |
| Full-text / relevance search | Inverted index | Elasticsearch, Solr/Lucene, Postgres GIN |
| Geospatial / "near me" | Geohash / quadtree / R-tree | PostGIS, Redis GEO, MongoDB 2dsphere |
| Analytics over huge scans (OLAP) | Columnar + zone maps | ClickHouse, BigQuery, Snowflake |
Interview move
Reason from access pattern → index → engine. "Heavy writes, key reads" → LSM → Cassandra. "Relational + ad-hoc ranges" → B-tree → Postgres. "Search box with ranking" → inverted index → Elasticsearch. "Drivers within 2 km" → geo index → Redis GEO / PostGIS.