Zoom-in: Database Index

Query is slow. AI suggests adding an index. 10× faster afterward. Commit, done.

graph LR
    Q(["🔍 Query"]) -->|"WHERE email = 'x@y.com'"| T(["📋 Table"])
    T -->|"result"| R(["✅ Row"])
    style Q fill:#1e293b,stroke:#475569,color:#cbd5e1
    style T fill:#1e293b,stroke:#475569,color:#cbd5e1
    style R fill:#1e293b,stroke:#475569,color:#cbd5e1

Zoom in on the black box.

Layer 1 — No index: every query scans every row

Rows in a table are stored in insertion order — no logical ordering. Databases call this unordered storage the heap.

graph LR
    Q["🔍 WHERE email\n= 'x@y.com'"] -->|"scan"| R1["Row 1\nuser_a"]
    Q -->|"scan"| R2["Row 2\nuser_b"]
    Q -->|"scan"| R3["Row 3\nuser_c"]
    Q -->|"scan"| Rn["... Row N"]
    style Q fill:#1e3a5f,stroke:#3b82f6,color:#93c5fd
    style R1 fill:#1e293b,stroke:#334155,color:#94a3b8
    style R2 fill:#1e293b,stroke:#334155,color:#94a3b8
    style R3 fill:#1e293b,stroke:#334155,color:#94a3b8
    style Rn fill:#1e293b,stroke:#334155,color:#94a3b8

To find rows matching a WHERE condition, the database reads every row from top to bottom. With 10,000 rows, that's barely noticeable. With 10 million rows, the database checks all 10 million rows for every single query.

→ Problem left: need a way to find rows without scanning the entire table.

Layer 2 — B-tree Index: O(log n) lookups

An index is a separate data structure maintained alongside the table. The most common is a B-tree.

graph TD
    root["50"] --> L["25"]
    root --> R["75"]
    L --> LL["10"]
    L --> LR["37"]
    R --> RL["60"]
    R --> RR["90"]

    style root fill:#3b2a1a,stroke:#f59e0b,color:#fcd34d
    style L fill:#1e293b,stroke:#475569,color:#cbd5e1
    style R fill:#1e293b,stroke:#475569,color:#cbd5e1
    style LL fill:#1e293b,stroke:#334155,color:#94a3b8
    style LR fill:#1e293b,stroke:#334155,color:#94a3b8
    style RL fill:#1e293b,stroke:#334155,color:#94a3b8
    style RR fill:#1e293b,stroke:#334155,color:#94a3b8

Instead of scanning every row, the database navigates the tree: compare at each node, go left or right. With 10 million rows, a B-tree needs at most 23 steps to find any value (log₂ 10,000,000 ≈ 23).

Each leaf node stores the column value and a pointer back to the matching row in the heap. Find it in the B-tree → jump directly to that row.

→ Problem left: the B-tree must be updated every time data in the table changes.

Layer 3 — Write cost: every change has a price

Adding an index means every write must maintain two structures: the heap and the B-tree.

sequenceDiagram
    participant A as Application
    participant DB as Database

    A->>DB: INSERT INTO users (email, ...)
    DB->>DB: Write row to heap
    DB->>DB: Update B-tree index
    DB-->>A: OK

    A->>DB: UPDATE users SET email = ...
    DB->>DB: Update row in heap
    DB->>DB: Remove old node + insert new node into B-tree
    DB-->>A: OK

A table with 5 indexes means every INSERT updates 6 structures. An UPDATE on an indexed column must remove the old node and insert a new one in the right position — two operations instead of one.

→ Problem left: too many indexes slow writes — too few slow reads. Is there a smarter option?

Layer 4 — Covering Index: no heap lookup needed

Normally: find in B-tree → get pointer → read heap for remaining columns. A covering index includes every column the query needs upfront.

graph LR
    Q["🔍 SELECT email, name\nWHERE email = 'x@y.com'"] --> I["🌳 Index\n(email, name)"]
    I -->|"all data — no heap needed"| R["✅ Result"]
    style Q fill:#1e3a5f,stroke:#3b82f6,color:#93c5fd
    style I fill:#3b2a1a,stroke:#f59e0b,color:#fcd34d
    style R fill:#1a3a2a,stroke:#22c55e,color:#86efac

CREATE INDEX idx_users_email_name ON users (email, name);
-- SELECT email, name WHERE email = ... never touches the heap

Databases call this an Index Only Scan — faster than a regular Index Scan because it eliminates the heap lookup entirely. This is why some queries become unexpectedly fast after adding a multi-column index.

Full picture

How a query executes depending on what indexes exist.

graph TD
    Q["🔍 SELECT email, name\nWHERE email = 'x@y.com'"] --> P["Query Planner"]

    P -->|"No index"| FS["Full Scan\nO(n) — reads entire heap"]
    P -->|"Index on email"| IS["Index Scan\nO(log n) — B-tree\n→ pointer → heap"]
    P -->|"Covering index\n(email, name)"| CS["Index Only Scan\nO(log n) — B-tree\n→ result, no heap"]

    FS --> R["✅ Result"]
    IS --> R
    CS --> R

    style Q fill:#1e3a5f,stroke:#3b82f6,color:#93c5fd
    style P fill:#3b2a1a,stroke:#f59e0b,color:#fcd34d
    style FS fill:#3b1a1a,stroke:#ef4444,color:#fca5a5
    style IS fill:#1e293b,stroke:#475569,color:#cbd5e1
    style CS fill:#1a3a2a,stroke:#22c55e,color:#86efac
    style R fill:#1a3a2a,stroke:#22c55e,color:#86efac

Three common misconceptions

Indexes always make things faster

Not for small tables or queries returning most of the table. If a query fetches 80% of rows, a full scan can be faster because it avoids the B-tree-to-heap jump. The query planner knows this — it sometimes intentionally ignores an index.

Index every column to be safe

Each index is another structure to maintain. A write-heavy table with 10 indexes will show noticeably slower INSERTs. Only add an index when there's a specific query to optimize — and use EXPLAIN ANALYZE to measure before and after.

Column order in a composite index doesn't matter

Index (last_name, first_name) helps WHERE last_name = 'Nguyen' but does nothing for WHERE first_name = 'Nam' alone — the database still scans everything. Columns with higher cardinality should come first.

Takeaway

An index isn't a free speedup — it's a trade-off: faster reads, slower writes, extra storage. B-trees exist because the heap has no order; covering indexes exist because even the B-tree-to-heap jump has a cost.

When a query is slow, the right question isn't "did we add an index" but "what is the query planner doing" — EXPLAIN ANALYZE shows whether a full scan, index scan, or index only scan is happening.

Article assisted by Amy 🌸 - AI Assistant. Content reviewed by the author.

Zoom-in: Database Index

Layer 1 — No index: every query scans every row

Layer 2 — B-tree Index: O(log n) lookups

Layer 3 — Write cost: every change has a price

Layer 4 — Covering Index: no heap lookup needed

Full picture

Three common misconceptions

Takeaway

Related Posts

Zoom-in: TCP

Zoom-in: OAuth 2.0

Zoom-in: Load Balancer