Understanding Database Indexes: How They Work and When to Use Them

At some point, every developer writes a query that works fine in development and runs for 30 seconds in production. The data grew, nobody noticed, and now you have a problem. Indexes are almost always the fix — but adding them without understanding them leads to a different set of problems: bloated storage, slow writes, and indexes the query planner never actually uses.

What an Index Actually Is

The most common database index type is a B-tree (balanced tree) — see the Wikipedia overview of B-trees for the underlying data structure. Conceptually, it's a separate data structure maintained alongside your table that stores sorted copies of a column's values, with pointers back to the original rows.

Think of it like a book index. The book itself has information scattered across hundreds of pages. The index at the back lets you jump directly to "concurrency" on page 217 without reading the whole book. The database table is the book; the index is the lookup structure.

A B-tree index lets the database do a logarithmic search rather than a linear scan. For a table with 1 million rows, a full scan examines 1,000,000 rows. A B-tree lookup makes roughly 20 comparisons (log₂ of 1,000,000), spread across just a few tree levels. That's a difference of several orders of magnitude on large tables.

Full Table Scan vs Index Scan

When you run a query, the database engine creates an execution plan. It decides whether to use an available index or just scan every row in the table. The decision is based on cost estimation.

-- Without an index on email:
-- Full table scan: reads every row, checks each one
SELECT * FROM users WHERE email = 'alice@example.com';

-- With an index on email:
-- Index scan: jumps directly to matching rows
SELECT * FROM users WHERE email = 'alice@example.com';

The query looks identical. The difference is entirely in the execution plan. You can see it with EXPLAIN:

EXPLAIN SELECT * FROM users WHERE email = 'alice@example.com';

PostgreSQL will show you whether it's doing a Seq Scan (sequential/full scan) or an Index Scan. For a detailed breakdown including actual timing:

EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';

EXPLAIN ANALYZE actually runs the query, so you get both the plan and the real execution times. It's the most useful diagnostic tool in your SQL toolkit. MySQL has the same feature with slightly different output format — see the MySQL EXPLAIN reference for the columns it emits.
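As a sketch of the workflow (table and index names follow the examples above; the plan lines in the comments are illustrative, since exact output varies by version and data):

```sql
-- Create the index the query above needs
CREATE INDEX idx_users_email ON users (email);

-- Re-check the plan: the Seq Scan should become an Index Scan
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';
-- Before the index (illustrative): Seq Scan on users ...
-- After the index (illustrative):  Index Scan using idx_users_email on users ...
```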

Cardinality and Selectivity

Not every column benefits equally from an index. The key concept is selectivity — how much an index narrows down the result set.

High cardinality means many distinct values: email addresses, UUIDs, timestamps. An index on a high-cardinality column is very selective — it eliminates most rows quickly.

Low cardinality means few distinct values: boolean flags, status codes with 3–5 values. An index used for WHERE active = true when 90% of users are active is nearly useless: it eliminates almost nothing, and the query planner will often prefer a full scan anyway.

A rough rule: if a typical query on a column matches more than ~10% of the table's rows, an index on that column alone probably won't help and may be ignored by the planner.
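One way to gauge this before adding an index is to look at the value distribution directly (orders and status here are assumed example names):

```sql
-- How many rows share each value? A handful of huge buckets
-- means a single-column index on status won't be selective.
SELECT status, COUNT(*) AS rows_per_value
FROM orders
GROUP BY status
ORDER BY rows_per_value DESC;
```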

Composite Indexes and Column Order

A composite index covers multiple columns:

CREATE INDEX idx_orders_user_status ON orders (user_id, status);

This index can speed up queries that filter on user_id alone, or on user_id AND status together. It cannot efficiently speed up queries that filter on status alone (without user_id).

This is the left-prefix rule: a composite index is usable for any prefix of its columns, left to right. So (user_id, status, created_at) helps queries that filter on:

  • user_id
  • user_id, status
  • user_id, status, created_at

But not on just status or just created_at.
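For example, with the three-column index above (example queries, assuming the orders schema from earlier):

```sql
CREATE INDEX idx_orders_user_status_created
    ON orders (user_id, status, created_at);

-- Can use the index: the filters form a left prefix
SELECT * FROM orders WHERE user_id = 42;
SELECT * FROM orders WHERE user_id = 42 AND status = 'shipped';

-- Cannot use it efficiently: no left prefix (user_id is missing)
SELECT * FROM orders WHERE status = 'shipped';
```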

Column order in composite indexes matters enormously. Put columns used in equality conditions (WHERE col = ?) before columns used in range conditions (WHERE col > ?); among the equality columns, putting the most selective one first is a reasonable default.
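A sketch of the equality-before-range guideline (assumed schema):

```sql
-- Equality column (user_id) first, range column (created_at) last
CREATE INDEX idx_orders_user_created ON orders (user_id, created_at);

-- The index narrows to one user, then range-scans that user's dates
SELECT * FROM orders
WHERE user_id = 42 AND created_at > '2024-01-01';
```

With the columns reversed, the range condition would come first in the index, and the equality on user_id could no longer narrow the scan as tightly.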

Covering Indexes

A covering index is one that contains all the columns a query needs — so the database can answer entirely from the index without touching the table at all.

-- Query: get names and emails of active users
SELECT name, email FROM users WHERE active = true;

-- Covering index for this query:
CREATE INDEX idx_users_active_name_email ON users (active, name, email);

When the index includes the WHERE column plus every SELECT column, the database never needs to look up the actual table row. For high-traffic read queries, this can be a meaningful performance win.
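On PostgreSQL 11+, you can also keep non-filtered columns out of the key and attach them as index-only payload with INCLUDE (same hypothetical users table):

```sql
-- name and email are stored in the index leaf pages but
-- don't participate in the sort order or uniqueness
CREATE INDEX idx_users_active_covering
    ON users (active) INCLUDE (name, email);
```

This keeps the key small while still allowing index-only scans for the query above.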

When Indexes Hurt

Indexes aren't free. Every index you add has costs:

Write overhead. Every INSERT, UPDATE, or DELETE must maintain all indexes on that table. A table with eight indexes takes eight times as many index updates on write. For tables with heavy write traffic — event logs, audit tables, queues — that overhead adds up.

Storage. Indexes take disk space. A large table with many indexes can have indexes as large as the table itself. On cloud databases priced by storage, this matters.

Query planner confusion. Too many similar or redundant indexes means the planner has more options to consider and can occasionally make a worse choice. Keep your index set clean.

Indexes You Probably Need

Primary keys — your database creates these automatically. They're unique indexes on the PK column.

Foreign keys — if you JOIN on orders.user_id = users.id, you need an index on orders.user_id. Many databases don't create these automatically (PostgreSQL doesn't; MySQL's InnoDB does). Missing FK indexes are a common source of slow JOINs.
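Following the JOIN example above, the fix is a plain index on the child table's FK column (o.total is an assumed column for illustration):

```sql
-- PostgreSQL does not create this for you when you declare the FK
CREATE INDEX idx_orders_user_id ON orders (user_id);

SELECT u.name, o.total
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.id = 42;
```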

Columns in frequent WHERE clauses — if you query by email, username, or any field that acts as a lookup key, it needs an index.

Columns you ORDER BY on large result sets — sorting a large table without an index is expensive.

Indexes That Are Probably Wasted

Low-cardinality boolean or status columns in isolation — as covered above, these rarely help.

Columns that are never actually queried — review your index list occasionally. Indexes accumulate over time as features are added and removed.

Duplicate or redundant indexes — if you have (user_id) and (user_id, status), the single-column index is redundant for any query the composite index covers. Drop it.
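On PostgreSQL, you can list a table's indexes to spot this kind of overlap (index names here are hypothetical):

```sql
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'orders';

-- If (user_id) is a left prefix of an existing (user_id, status)
-- index, the single-column index is redundant:
DROP INDEX idx_orders_user_id;
```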

A Practical Debugging Workflow

When a query is slow:

  1. Run EXPLAIN ANALYZE on the slow query.
  2. Look for Seq Scan on large tables — that's where you need an index.
  3. Check whether an existing index exists but isn't being used (the column might not be selective enough, or the query might not match the index's left prefix).
  4. Add the index, re-run EXPLAIN ANALYZE, and confirm it's being used.
  5. Measure the actual query time before and after.

The PostgreSQL documentation on `EXPLAIN` is thorough and worth bookmarking. MySQL's EXPLAIN output is similar but formatted differently.

Our SQL Formatter is useful for cleaning up complex queries before you analyze them — a well-formatted query is much easier to reason about when you're reading an execution plan.

For context on how indexes relate to broader database design, see SQL Basics: From Zero to Writing Real Queries and NoSQL vs SQL: Choosing the Right Database.

Indexes are one of the highest-leverage tools in database performance. Understand them rather than just adding them blindly — you'll add the right ones, skip the useless ones, and know exactly where to look when a query slows down.

FAQ

How many indexes is too many on a single table?

There's no hard rule, but most teams keep it under 5-7 indexes per table. Each index costs disk space and slows down writes — a table with 10 indexes takes 10× the index-update overhead per insert. For write-heavy tables (logs, audit, queues), keep indexes minimal. For read-heavy tables (lookup tables, reports), more indexes are usually fine. Always measure: EXPLAIN ANALYZE your hot queries and see if the indexes are actually used.

Should I index every foreign key column?

Almost always yes. PostgreSQL doesn't auto-create FK indexes (MySQL InnoDB does). Without an FK index, JOINs and cascade deletes scan the whole child table. The classic symptom: a DELETE FROM users WHERE id = 5 that locks for 30 seconds because Postgres has to scan the orders table looking for user_id = 5. Add the index — it's almost always worth it.

What's the difference between a B-tree and a hash index?

B-tree supports range queries (<, >, BETWEEN, ORDER BY), prefix matching (LIKE 'abc%'), and equality. Hash index only supports equality (=). For pure equality lookups on high-cardinality columns, hash is slightly faster. For everything else, B-tree wins. PostgreSQL defaults to B-tree because it covers the most cases; specialized hash indexes are rarely needed.
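For completeness, PostgreSQL's syntax for the non-default type (sessions and token are assumed names):

```sql
-- B-tree is the default:
CREATE INDEX idx_sessions_token ON sessions (token);

-- Explicit hash index: equality lookups only
CREATE INDEX idx_sessions_token_hash ON sessions USING HASH (token);
```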

Can indexes help with full-text search?

Standard B-tree indexes don't help with full-text search (WHERE body LIKE '%word%' can't use an index because the leading wildcard defeats the sorted order a B-tree relies on). For real full-text search, use specialized index types: PostgreSQL's GIN with tsvector, MySQL's FULLTEXT indexes, or external systems (Elasticsearch, Meilisearch). These are inverted indexes that map words to documents, not B-trees.
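A minimal PostgreSQL sketch of the GIN approach (docs and body are assumed names):

```sql
CREATE INDEX idx_docs_body_fts
    ON docs USING GIN (to_tsvector('english', body));

-- The query must use the same tsvector expression to match the index
SELECT * FROM docs
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'word');
```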

Why isn't my new index being used?

Several common reasons: the column isn't selective enough (the planner thinks a sequential scan is cheaper), the WHERE clause uses functions (LOWER(email) can't use an index on email), the type doesn't match (text column compared to numeric), or statistics are stale (run ANALYZE). Check EXPLAIN to see what the planner picked and why; the cost numbers tell you whether the planner considered your index at all.
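The function case has a direct fix: an expression index that matches the query exactly (PostgreSQL syntax, assuming the users table from earlier):

```sql
-- A plain index on (email) cannot serve LOWER(email) = ...
CREATE INDEX idx_users_email_lower ON users (LOWER(email));

-- Now the predicate matches the indexed expression
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
```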

Should I index columns I ORDER BY?

Yes, especially for LIMIT queries. An index on the ORDER BY column lets the database read rows in order without sorting — critical for queries like "get the latest 100 events." A composite index on (user_id, created_at DESC) accelerates "latest events per user" without a sort step. Without the index, the database materializes the full result set, sorts it, then takes the top N.
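The "latest events per user" pattern from the answer above, sketched out (events is an assumed table):

```sql
CREATE INDEX idx_events_user_created
    ON events (user_id, created_at DESC);

-- Rows come out of the index already in the right order,
-- so the LIMIT stops the scan after 100 rows with no sort step
SELECT * FROM events
WHERE user_id = 42
ORDER BY created_at DESC
LIMIT 100;
```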

What's a partial index?

An index that only covers rows matching a WHERE clause: CREATE INDEX idx_active_users ON users (email) WHERE active = true. Useful when most queries filter on a specific subset (active records, recent rows, non-soft-deleted entries). Partial indexes are smaller, faster to maintain, and skip the rows your queries never touch. PostgreSQL supports them natively; MySQL has limited workarounds via indexed generated columns.

How do I find unused indexes?

PostgreSQL: SELECT * FROM pg_stat_user_indexes WHERE idx_scan = 0 shows indexes never used since stats reset. MySQL: enable Performance Schema and query sys.schema_unused_indexes. Run this on a representative production workload — indexes that look unused in dev might be critical in prod. After confirming an index is genuinely unused for weeks, drop it; every dropped index is one less write to maintain.