Databases
Cost-Based Query Optimizer
The component that decides how to run your SQL — before a single row moves
A cost-based query optimizer chooses an execution plan by estimating the cost of join orders and access paths from table statistics, then picking the cheapest plan via dynamic programming over the join lattice.
- First designIBM System R, 1979
- Join-order searchDP over 2ⁿ subsets
- Left-deep plans (n tables)n! orders
- Dominant cost drivercardinality estimation
- Runsonce per query, before execution
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How a cost-based optimizer works
SQL is declarative: you say what rows you want, never how to fetch them. For any non-trivial query there are thousands of correct ways to produce the same answer — scan a table or seek an index, join in this order or that one, hash the build side or sort-merge it. They all return identical rows but their runtimes can differ by a factor of a million. The optimizer's job is to pick a fast one without trying them all.
It does this in three stages, and crucially it never looks at your actual data while planning — only at precomputed statistics in the system catalog:
- Enumerate access paths. For each table, list the ways to read it: a sequential scan, an index scan on each available index, an index-only scan, a bitmap scan. Each carries an estimated cost in abstract units (typically modeled as disk pages read + CPU per row).
- Search join orders. Decide which order to join the tables and which physical join algorithm — nested loop, hash join, or sort-merge — to use at each step. This is the combinatorial heart of the problem.
- Cost and choose. Assign every candidate plan a number from the cost model and keep the cheapest. The winner becomes the physical plan tree handed to the executor.
The cost of every operator depends on cardinality — how many rows flow through it. A filter WHERE age > 40 on a 10-million-row table might pass 4 million rows or 40, and the optimizer's entire plan hinges on that estimate. Cardinality is derived from selectivity: the optimizer multiplies the table's row count by the estimated selectivity of each predicate, reading those selectivities off histograms and distinct-value counts gathered earlier by ANALYZE.
Join ordering: the combinatorial core
Joining is associative and commutative — (A ⋈ B) ⋈ C returns the same rows as A ⋈ (C ⋈ B) — so the optimizer is free to reorder. But the order it picks decides the size of the intermediate results, and intermediate size is what dominates cost. Join the two tables that produce 50 rows first, and the third join touches 50 rows; join the wrong pair first and you might materialize 50 million rows before the selective join ever runs.
The trouble is the size of the search space. For n tables the number of left-deep join orders (each join's left input is the running result, the right input a base table) is n!. Allow bushy trees, where two intermediate results join each other, and the count balloons to (2n−2)! / (n−1)! — over 17 billion for just 10 tables.
IBM's System R solved this in 1979 with the algorithm still used today: dynamic programming over subsets of tables. The insight is the principle of optimality — the cheapest plan for a set of tables must be built from the cheapest plans for its subsets. So you compute the best plan for every 1-table set, then every 2-table set, then every 3-table set, up to the full set, memoizing each. This turns the factorial enumeration into roughly O(3ⁿ) work and O(2ⁿ) memory — still exponential, but tractable up to about 12 tables before optimizers switch to a randomized or greedy heuristic (Postgres flips to its Genetic Query Optimizer at geqo_threshold = 12).
When cost-based optimization matters
- Analytical (OLAP) queries that join many fact and dimension tables — the place where a bad join order is most catastrophic and good statistics pay back the most.
- Ad-hoc and BI workloads where queries are not known in advance and you cannot hand-tune each one.
- Mixed-selectivity predicates where the right plan depends on the literal values —
WHERE country = 'US'(millions of rows) wants a different plan thanWHERE country = 'NR'(a handful). - Skewed data where one value dominates a column and a histogram is needed to plan correctly.
Cost-based optimization is overkill for trivial single-table point lookups — there the planner's heuristics fire instantly and there is nothing to choose. It is also a liability for latency-critical OLTP queries where plan stability matters more than peak efficiency; those workloads often pin plans or use prepared statements to avoid re-planning on every execution.
Cost-based vs rule-based vs heuristic optimization
| Cost-based (CBO) | Rule-based (RBO) | Heuristic / greedy | Adaptive / learned | |
|---|---|---|---|---|
| How it chooses | cheapest by cost model | fixed priority rules | greedy local choices | feedback from past runs |
| Needs statistics | yes — central | no | partial | yes + runtime samples |
| Adapts to data distribution | yes | no | weakly | strongly |
| Plan predictability | can shift with stats | fully deterministic | deterministic | changes over time |
| Search cost | O(3ⁿ) DP, capped | O(1) per query | O(n²) greedy join order | amortized over runs |
| Worst-case failure | bad plan from stale stats | ignores cheaper index plan | misses global optimum | cold-start mis-plan |
| Real-world use | Postgres, Oracle 10g+, SQL Server | legacy Oracle ≤9i | Postgres GEQO, MySQL | SQL Server, Redshift adaptive |
The headline trade-off is adaptivity versus predictability. A rule-based optimizer always picks the same plan for the same query text, which is comforting but blind to the actual data. A cost-based optimizer reads the data's shape from statistics and usually wins big — but when those statistics lie, it can swing from a great plan to a disastrous one with no code change at all. Modern engines are universally cost-based; the interesting frontier is adaptive optimization that corrects its own estimation errors at runtime.
What the numbers actually say
- The search space is the wall. 10 tables give 3,628,800 left-deep orders and over 17 billion bushy ones. System R's DP visits only the
2¹⁰ = 1024subsets, costing each once — about a 3,500× reduction versus enumerating left-deep plans, and millions-fold versus bushy. - Cardinality errors compound geometrically. A landmark 2015 study (Leis et al., "How Good Are Query Optimizers, Really?") showed estimation error grows by roughly an order of magnitude per join: a per-table error of 2× becomes a 1000× error after a 10-way join. The cost model is fine; the inputs rot.
- Join algorithm choice is a 100×+ lever. A nested-loop join of an
M-row andN-row table costsO(M·N); a hash join costsO(M + N). For two 1-million-row tables that is 10¹² probe operations versus 2×10⁶ — the difference between an hour and a second. - Statistics are cheap to maintain. Postgres samples ~300 ×
statistics_targetrows (30,000 by default) regardless of table size, soANALYZEon a billion-row table reads only tens of thousands of rows — seconds of work that prevents hours of bad plans.
JavaScript: System R join-order DP
This is the heart of every cost-based optimizer in miniature — a bitmask dynamic program over subsets of tables that finds the cheapest left-deep join order from base-table cardinalities and a join-selectivity model.
// tables: [{ name, rows }]; sel(i, j) = join selectivity between tables i and j (0..1)
function optimizeJoinOrder(tables, sel) {
const n = tables.length;
const FULL = (1 << n) - 1;
// best[mask] = { cost, card, order } — cheapest plan joining exactly the tables in mask
const best = new Array(1 << n);
// Base case: single tables cost a scan, cardinality = row count.
for (let i = 0; i < n; i++) {
best[1 << i] = { cost: tables[i].rows, card: tables[i].rows, order: [i] };
}
// Build up by subset size, smallest first (DP / principle of optimality).
for (let mask = 1; mask <= FULL; mask++) {
if (best[mask]) continue; // singletons already done
// Try every way to split mask into (sub) ⋈ (one new table).
for (let sub = (mask - 1) & mask; sub; sub = (sub - 1) & mask) {
const rest = mask ^ sub;
// left-deep: rest must be a single table (a base relation on the right)
if (rest & (rest - 1)) continue; // skip if rest has >1 bit set
const left = best[sub];
if (!left) continue;
const j = Math.log2(rest) | 0; // index of the new table
// Estimate output cardinality: product of inputs × combined selectivity.
let card = left.card * tables[j].rows;
for (const i of left.order) card *= sel(i, j);
// Hash-join cost ≈ build + probe + produce; add the left subtree's cost.
const cost = left.cost + left.card + tables[j].rows + card;
if (!best[mask] || cost < best[mask].cost) {
best[mask] = { cost, card, order: [...left.order, j] };
}
}
}
return best[FULL]; // { cost, card, order: [tableIndices in join order] }
}
Two details are load-bearing. The inner loop sub = (sub - 1) & mask is the classic trick for iterating over all submasks of a bitmask in O(2^popcount) total. And by requiring rest to be a single bit, we restrict to left-deep trees — drop that check and you explore bushy plans too, at the cost of a much larger search.
Python: cardinality and cost estimation
The DP above is only as good as the numbers fed into it. Here is the estimation layer — turning catalog statistics into the selectivity and cardinality figures the optimizer reasons over.
from dataclasses import dataclass
@dataclass
class ColumnStats:
rows: int # table cardinality
distinct: int # number of distinct values (NDV)
null_frac: float # fraction of NULLs
# histogram: list of equi-depth bucket bounds (omitted for brevity)
def selectivity_eq(stats: ColumnStats) -> float:
"""Selectivity of `col = const` under the uniform-distribution assumption."""
if stats.distinct == 0:
return 0.0
return (1.0 - stats.null_frac) / stats.distinct # 1 / NDV
def selectivity_range(stats, lo, hi, col_min, col_max) -> float:
"""Selectivity of `lo <= col < hi`, assuming uniform spread (no histogram)."""
span = col_max - col_min
if span <= 0:
return 1.0
frac = (min(hi, col_max) - max(lo, col_min)) / span
return max(0.0, min(1.0, frac)) * (1.0 - stats.null_frac)
def join_cardinality(left_rows, right_rows, ndv_left, ndv_right) -> float:
"""Textbook equi-join estimate: rows / max(distinct values on each side)."""
denom = max(ndv_left, ndv_right)
if denom == 0:
return 0.0
return (left_rows * right_rows) / denom
# Example: filter a 10M-row table on a column with 50 distinct values, then
# equi-join it to a 1M-row dimension whose key has 1M distinct values.
fact = ColumnStats(rows=10_000_000, distinct=50, null_frac=0.0)
after_filter = fact.rows * selectivity_eq(fact) # 10M / 50 = 200,000 rows
joined = join_cardinality(after_filter, 1_000_000, 200_000, 1_000_000)
print(int(after_filter), int(joined)) # 200000 200000
Notice the chain of assumptions: 1/NDV assumes every value is equally common (the uniformity assumption), and multiplying per-column selectivities assumes columns are independent. Both are routinely false in real data, and their failure is precisely why optimizers misestimate. Histograms patch uniformity; extended/multi-column statistics patch independence.
Variants and search strategies worth knowing
Bottom-up (System R / Selinger) DP. The classic — build optimal plans for ever-larger table sets. Used by Postgres for small joins, DB2, and most textbooks. Optimal within its plan space but exponential, so capped at ~12 tables.
Top-down with memoization (Volcano / Cascades). Start from the goal expression and recursively explore transformations, memoizing equivalent sub-expressions in a "memo" structure. Cascades, used by SQL Server and Apache Calcite, makes the optimizer extensible: rules are pluggable, so adding a new operator or rewrite is a rule rather than a rewrite of the core.
Genetic / randomized search. When the table count exceeds the DP threshold, the search space is too big to enumerate. Postgres's GEQO treats join orders as chromosomes and runs a genetic algorithm; others use simulated annealing or iterative improvement. These trade optimality guarantees for a plan in bounded time.
Adaptive query processing. Re-optimize at runtime when reality diverges from the estimate. SQL Server's Adaptive Joins defer the nested-loop-vs-hash decision until the build-side row count is actually known; Oracle's adaptive plans and Redshift's runtime statistics do similar. This directly attacks the cardinality-estimation problem instead of trying to predict it perfectly up front.
Learned optimizers. Research systems (Neo, Bao) replace or steer the cost model with a neural network trained on past query latencies. Promising on stable workloads, but cold-start behavior and explainability keep them out of mainstream production for now.
Common pitfalls and edge cases
- Stale statistics. The single most common cause of a sudden bad plan. Bulk-load a table and forget to run
ANALYZE, and the optimizer still thinks it is empty — picking nested loops that explode. Auto-analyze thresholds exist precisely to prevent this. - Correlated columns. The independence assumption multiplies selectivities, so
WHERE city = 'Paris' AND country = 'France'is estimated far too low because the optimizer doesn't know city implies country. Fix with multi-column / extended statistics. - Parameter sniffing. A prepared statement is planned once for the first parameter value, then reused. If the first value was atypical (a rare country code), every later execution inherits a plan tuned for the wrong selectivity. SQL Server and Oracle both have to actively defend against this.
- Estimation error compounding up the tree. An error in a leaf cardinality propagates and amplifies through every join above it. This is why deep join trees are fragile and why adaptive re-optimization targets exactly this failure mode.
- OR predicates and functions on indexed columns.
WHERE UPPER(name) = 'X'orWHERE a = 1 OR b = 2often defeat the histogram, falling back to a fixed default selectivity (Postgres uses 0.5% for unknown equalities) that can be wildly wrong. - Treating estimated cost as a runtime prediction. Cost units are abstract and only meaningful for ranking plans against each other. A plan with cost 1000 is not "10× slower" than one with cost 100 in wall-clock terms — never read the absolute number as milliseconds.
Frequently asked questions
What is the difference between a query optimizer and a query executor?
The optimizer is a planning phase that runs once per query: it explores candidate plans, estimates each one's cost from statistics, and emits the cheapest physical plan as a tree of operators. The executor then runs that fixed plan, pulling rows through the operator tree. The optimizer never touches data — it reasons only over catalog statistics like row counts and histograms.
Why is join ordering the hardest part of query optimization?
The number of possible join orders for n tables grows faster than exponentially — there are n! distinct left-deep orders, and far more — (2n−2)! / (n−1)! — when bushy trees are allowed. For 10 tables that is 3.6 million left-deep plans and over 17 billion bushy ones. The optimizer cannot enumerate them all, so it uses dynamic programming (System R) or randomized search to find a near-optimal order without exhaustive enumeration.
What are statistics in a cost-based optimizer?
Statistics are precomputed summaries of the data the optimizer consults instead of scanning rows: table row counts, number of distinct values per column, min/max bounds, null fractions, and histograms of value distribution. They are gathered by ANALYZE (Postgres) or DBMS_STATS (Oracle) and let the optimizer estimate how many rows a filter or join will produce — its cardinality.
Why do query plans sometimes go wrong?
Almost always because of cardinality misestimation. The cost model is accurate, but it is fed row-count estimates that can be off by orders of magnitude when statistics are stale, columns are correlated, or a predicate is unusual. A single estimate that is 1000× too low can cascade up the join tree and make the optimizer pick a nested-loop join over a hash join, turning a 1-second query into a 1-hour one.
Is a cost-based optimizer always better than a rule-based one?
Usually, but not always. A rule-based optimizer applies fixed heuristics (use an index if one exists, drive from the smallest table) and is deterministic and predictable. A cost-based optimizer adapts to actual data distribution and usually wins on complex queries, but it can pick a catastrophically bad plan when its statistics lie. Modern systems are all cost-based; Oracle dropped its rule-based optimizer in 10g.
What is a left-deep plan and why do optimizers prefer it?
A left-deep join tree always uses the running intermediate result as the left input and a base table as the right input, so joins chain linearly. Optimizers favor left-deep trees because they keep the search space tractable (n! orders instead of the much larger bushy space) and because they pipeline well — the build side of each hash join is a base table whose statistics are known exactly.