Databases

Hash Join

Q: Why build the hash table on the smaller table?

The hash table has to fit in memory, and you scan the other table only once to probe it. Building on the smaller (the "build" side) minimizes memory use and the chance of spilling to disk. The optimizer estimates which input is smaller from table statistics; if it guesses wrong, the build side may not fit and the join spills.

Q: When is a hash join faster than a nested loop join?

Hash join wins when both inputs are large and unsorted and the predicate is an equality. It is O(n + m) instead of the nested loop's O(n · m). A nested loop join still wins when one side is tiny or there is an index on the inner table's join column, because then the inner lookup is O(log m) per outer row with no build cost.

Q: Can a hash join do range or inequality joins?

No. A hash table answers "does this exact key exist?" so hash join only handles equi-joins (a = b). For range predicates (a < b, a BETWEEN x AND y) you need a sort-merge join or a nested loop with an index range scan. This is the single biggest limitation of hash join.

Q: What is a grace hash join?

When the build side does not fit in memory, grace hash join partitions both tables into buckets by hashing the join key, writing each bucket to disk. Because matching keys land in the same bucket on both sides, it then joins the buckets pairwise in memory. It trades one extra round of disk I/O for the ability to join inputs far larger than RAM.

Q: Why can a hash join blow up on skewed data?

If millions of build rows share one join key, they all chain into a single hash bucket, and every matching probe row must scan that long chain. The join degrades toward O(n · m) for the skewed key. Partitioning does not help because skew sends a whole partition to one bucket — engines add skew-handling heuristics or fall back to other join methods.

Q: Does the output of a hash join preserve any order?

No. Rows come out in probe-scan order interleaved with hash-bucket order, which is effectively arbitrary. If the query has an ORDER BY, the planner must add an explicit sort afterward — one reason sort-merge join can win when the result needs to be sorted anyway.

Build a hash table on one table, fire the other at it — and the n×m nested loop collapses to a single pass

A hash join matches rows from two tables by building an in-memory hash table on the smaller table's join key, then probing it with each row of the larger table — turning an O(n·m) nested loop into O(n + m).

Time complexityO(n + m)
Build memory≈ size of smaller table
Predicate supportEqui-joins only
Output orderUnsorted
Spills to diskGrace / hybrid variants

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How a hash join works

Suppose you want to join an orders table (10 million rows) to a customers table (200 thousand rows) on customer_id. The naive approach — for each order, scan every customer looking for a match — does 10M × 200K = 2 trillion comparisons. That is a nested loop join, and on cold inputs it is hopeless.

A hash join replaces the inner scan with a hash-table lookup, and it runs in two phases:

Build. Pick the smaller input (here, customers). Scan it once. For every row, compute hash(customer_id) and insert the row into an in-memory hash table keyed on that value. After the build phase the hash table holds all 200K customers, bucketed by key.
Probe. Scan the larger input (orders) once. For every order, compute hash(customer_id), jump to the matching bucket, and emit a joined row for each customer in that bucket whose key equals the order's key.

The build side is read once; the probe side is read once. Each row touches the hash table in expected O(1) time, so the whole join is O(n + m) rather than O(n · m). The 2-trillion-comparison nested loop becomes roughly 10.2 million hashes and lookups — a five-orders-of-magnitude difference.

The one hard requirement: the predicate must be an equality. A hash table can answer "is key k present?" but it cannot answer "which keys are less than k?" So hash join handles orders.customer_id = customers.id but never orders.amount > customers.credit_limit.

When to choose a hash join

Both inputs are large and unsorted. This is the home turf. Neither side has a useful index on the join column, and sorting both sides for a merge join would cost more than building one hash table.
The predicate is a pure equi-join. Exactly one equality on the join keys, no range component.
The smaller side fits (mostly) in the memory budget. If the build side fits in the engine's working memory, you get the clean in-memory join with no spill.
You don't need the result sorted. If there is no ORDER BY on the join key, the unordered output of a hash join costs you nothing.

Avoid it when one input is tiny (a nested loop with an index is cheaper), when the join is a range or inequality (use sort-merge), or when the data is badly skewed on the join key (a single mega-bucket erases the O(1) lookup).

Hash join vs the other two join algorithms

Every relational engine ships three physical join operators, and the optimizer picks between them per query based on cardinality estimates, available indexes, and memory.

	Hash join	Nested loop join	Sort-merge join
Time complexity	O(n + m)	O(n · m), or O(n · log m) with inner index	O(n log n + m log m), or O(n + m) if pre-sorted
Memory	≈ smaller table (build side)	O(1)	O(n + m) for sort buffers
Predicates	Equi-join only	Any (equality, range, inequality, function)	Equality and range
Output order	Unsorted	Outer-table order	Sorted on join key
Needs index?	No	Wins with index on inner	Wins if inputs already sorted
Best when	Two big unsorted tables, equi-join	One side tiny, or indexed lookup	Inputs pre-sorted, or result needs sorting

The rule of thumb planners encode: small × anything → nested loop with index; big × big equi-join → hash join; big × big where both sides arrive sorted (e.g. from index scans) → sort-merge. Postgres, SQL Server, and Oracle implement all three; MySQL added hash join in 8.0.18 but still has no general-purpose sort-merge join (it falls back to nested loop or hash join); SQLite implements only nested loop join, using a transient B-tree "automatic index" rather than a hash table when an equi-join would otherwise lack a usable index.

What the numbers actually say

Comparisons: 2 trillion → 10.2 million. For the 10M × 200K example above, nested loop does n·m = 2×10¹² comparisons; hash join does one hash per build row plus one hash + bucket scan per probe row ≈ 1.02×10⁷ operations. That is the difference between minutes and milliseconds of CPU.
Build memory ≈ smaller table × row overhead. 200K customer rows at ~120 bytes each plus hash-table pointer overhead lands around 30–40 MB — trivially in-memory. The same join with a 50M-row build side would need several gigabytes, so against a typical per-operator memory budget of a few hundred megabytes it would spill to disk.
Grace hash join costs one extra I/O pass. Partitioning both inputs to disk and reading them back roughly doubles the I/O of the build/probe scans, but lets you join inputs hundreds of times larger than RAM. The classic GRACE database machine result: join cost stays near linear in total data even when nothing fits in memory.
Expected bucket length ≈ 1 with a good hash and load factor < 1. That is what keeps probe lookups O(1). Under heavy skew, one bucket can hold millions of rows and that single key's join degrades to O(n·m) for those rows.

JavaScript implementation

An in-memory hash join. build is the smaller table, probe the larger. JavaScript Map is itself a hash table, and one key can match many rows, so each map value is an array of build rows — this is what makes the join handle one-to-many correctly.

function hashJoin(build, probe, buildKey, probeKey) {
  // Phase 1 — BUILD: index the smaller table by its join key.
  const table = new Map();
  for (const row of build) {
    const k = row[buildKey];
    if (k == null) continue;            // SQL NULL never equi-joins
    let bucket = table.get(k);
    if (!bucket) table.set(k, (bucket = []));
    bucket.push(row);                   // many rows may share a key
  }

  // Phase 2 — PROBE: scan the larger table, look each row up.
  const out = [];
  for (const r of probe) {
    const bucket = table.get(r[probeKey]);
    if (!bucket) continue;              // no match — inner join drops it
    for (const b of bucket) out.push({ ...b, ...r });
  }
  return out;
}

const customers = [{ id: 1, name: 'Ada' }, { id: 2, name: 'Linus' }];
const orders    = [{ oid: 'A', cust: 1 }, { oid: 'B', cust: 1 }, { oid: 'C', cust: 9 }];
hashJoin(customers, orders, 'id', 'cust');
// [{ id:1, name:'Ada', oid:'A', cust:1 }, { id:1, name:'Ada', oid:'B', cust:1 }]
// order C (cust 9) has no matching customer → dropped by the inner join

To make it a left outer join that preserves every probe row, build the hash table on the other (right) table and probe with the table you want fully preserved. Then emit a joined row for every probe row, substituting nulls whenever the probe finds no matching bucket — so unmatched left rows still appear in the output instead of being dropped.

Python implementation

The same two-phase structure, plus a sketch of the grace variant that partitions to disk when the build side is too large to hold at once.

from collections import defaultdict

def hash_join(build, probe, build_key, probe_key):
    # Phase 1 — BUILD
    table = defaultdict(list)
    for row in build:
        k = row[build_key]
        if k is None:                 # NULLs do not equi-join
            continue
        table[k].append(row)

    # Phase 2 — PROBE
    for r in probe:
        for b in table.get(r[probe_key], ()):   # empty tuple = no match
            yield {**b, **r}

# Grace hash join: when `build` won't fit, partition BOTH inputs by
# hash(key) % P so matching keys land in the same partition pair,
# then join the (much smaller) partitions one pair at a time.
def grace_hash_join(build, probe, build_key, probe_key, P=64):
    b_parts = [[] for _ in range(P)]
    p_parts = [[] for _ in range(P)]
    for row in build:
        b_parts[hash(row[build_key]) % P].append(row)   # in reality: spill to disk
    for row in probe:
        p_parts[hash(row[probe_key]) % P].append(row)
    for i in range(P):
        # Each pair now fits in memory — recurse / fall back to in-memory join.
        yield from hash_join(b_parts[i], p_parts[i], build_key, probe_key)

Two production-grade details the toy code skips. First, the partition hash must differ from the in-bucket hash; reusing the same function makes every key in a partition collide into one bucket. Second, real engines spill partitions to disk and keep only the partition currently being joined resident — that is the whole point of grace.

Variants worth knowing

Classic (in-memory) hash join. Build side fits in working memory; single build pass, single probe pass. The fastest case and the one the visualization above shows.

Grace hash join. Build side doesn't fit. Partition both inputs by hashing the key into P buckets, spill buckets to disk, then join matching bucket pairs one at a time. Named after the 1980s GRACE database machine. Recursive grace re-partitions any bucket that is still too big.

Hybrid hash join. The default in most engines. Like grace, but it keeps the first partition resident in memory while partitioning the rest to disk, so it never pays disk I/O for the part of the build side that fits. Degrades gracefully from pure in-memory to fully partitioned as memory shrinks.

Symmetric (pipelined) hash join. Used in streaming and adaptive query engines. Build a hash table on both sides; each arriving row probes the opposite table and is then inserted into its own. Produces results incrementally without waiting for either input to finish — essential when one input is an unbounded stream.

Bloom-filter / sideways-information-passing hash join. Build a Bloom filter of the build keys and push it down to the probe-side scan, so rows that can't possibly match are discarded before they're even read off disk. Spark, Snowflake, and most MPP engines do this as "runtime filter" pushdown.

Common bugs and edge cases

Building on the wrong (larger) side. Bad cardinality estimates make the optimizer build on the big table; the hash table blows past the memory budget and spills, killing performance. Stale statistics are the usual culprit — run ANALYZE.
Treating NULL as a value. In SQL, NULL = NULL is unknown, so NULL keys must never match. Skip NULL keys in build and probe, or you'll emit phantom rows.
Reusing one hash function for partition and bucket. In grace/hybrid joins, the partitioning hash and the in-memory bucket hash must be independent, or every key in a partition collides into a single chain.
Data skew on the join key. A "guest" / NULL-surrogate / default key shared by millions of rows creates one giant bucket; that key's join degrades to quadratic. Detect skew and split the hot key out, or fall back to another join.
Assuming sorted output. Code (or a downstream merge step) that relies on hash-join output being ordered will silently break. Add an explicit sort if you need order.
Forgetting one-to-many fan-out. A bucket can hold many build rows; emitting only the first match drops valid join results. Always iterate the whole bucket.

Frequently asked questions

Why build the hash table on the smaller table?