Databases

Sort-Merge Join

Sort both sides once, then zip them together in a single pass

Sort-merge join joins two tables by sorting both on the join key, then merging them in a single synchronized pass with two cursors — O(n log n + m log m) when sorting dominates, O(n + m) when the inputs already arrive sorted.

  • Sort phaseO(n log n + m log m)
  • Merge phaseO(n + m)
  • Pre-sorted inputsO(n + m)
  • Worst case (skewed key)O(n · m)
  • Join typeEqui-join only
  • Output orderSorted on join key

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How sort-merge join works

A join answers a simple question — "for each row on the left, which rows on the right share the same key?" — but answering it naively is quadratic. A nested-loop join compares every left row against every right row: 1 million rows joined to 1 million rows is 1012 comparisons. Sort-merge join trades that quadratic scan for a tiny bit of preparation. Once both inputs are sorted on the join key, the matches line up next to each other, and you can find them all in one walk down the two sorted lists.

The algorithm has exactly two phases:

  1. Sort. Sort the left input on the join key. Sort the right input on the same key. For large inputs this is an external merge sort — break the data into memory-sized runs, sort each, and merge the runs. If the data already comes pre-sorted (from a B+ tree index, or as the output of a previous operator), this phase is free.
  2. Merge. Place a cursor at the top of each sorted input. Compare the two keys under the cursors. If the left key is smaller, advance the left cursor; if the right key is smaller, advance the right cursor; if they're equal, emit the match and advance carefully (more on that below). Repeat until one input runs out.

The merge pass is the same merge step you already know from merge sort — but instead of interleaving two sorted lists into one, you're pairing up rows with equal keys. It scans each input exactly once, so the merge itself is O(n + m).

The one subtlety is duplicate keys. If key 42 appears three times on the left and twice on the right, the join must emit all 3 × 2 = 6 pairs — a Cartesian product of the two matching groups. You handle this by marking the start of the right group, replaying it for every left row that shares the key, and only moving on once the left key changes.

The synchronized merge, step by step

Consider these two already-sorted inputs joining on a shared key:

Left  (L):  10  20  20  30  50
Right (R):  20  20  30  40  50

cursors:  l=10, r=20  → 10 < 20, advance L
          l=20, r=20  → match! mark r at 20. Emit (20,20a),(20,20b).
                         left has another 20 → restore r, emit (20,20a),(20,20b)
                         left key changes to 30 → drop the mark, advance R past the 20s
          l=30, r=30  → match! emit (30,30)
          l=50, r=40  → 50 > 40, advance R
          l=50, r=50  → match! emit (50,50)
          L or R exhausted → done

Two things to notice. First, every comparison either emits a row or advances exactly one cursor, so the work is bounded by the input sizes plus the output size. Second, the mark-and-restore on the right cursor is the only place the linear scan breaks down — when both sides have long duplicate runs, you replay the right run once per left duplicate, which is where the O(n · m) worst case comes from.

When to choose sort-merge join

  • At least one input is already sorted on the join key — a clustered B+ tree index scan, or the output of an upstream sort or merge join. The expensive phase disappears and merge join beats hash join outright.
  • The query needs sorted output anyway — an ORDER BY on the join key, a GROUP BY, a DISTINCT, or a downstream merge join. The sort pays for two things at once.
  • Memory is tight. Hash join wants the smaller input's hash table resident in RAM; when it spills, performance falls off a cliff. External merge sort degrades gracefully and streams in large sequential I/Os.
  • Outer joins and full outer joins fall out naturally: when one key is smaller, emit its row NULL-padded instead of skipping it.
  • Large-to-large joins where neither input fits in memory. Two sorted streams merging is the canonical big-data join, which is why it's the workhorse of MapReduce, Spark, and most distributed query engines.

Skip it when both inputs are unsorted, fit in memory, and the output order doesn't matter — that's hash join's home turf. Skip it for a tiny dimension table joined to a huge fact table — an index nested-loop join probing the small table is usually cheaper than sorting the big one.

Sort-merge join vs hash join vs nested-loop join

Sort-merge joinHash joinNested-loop joinIndex nested-loop
Time (unsorted)O(n log n + m log m)O(n + m) expectedO(n · m)O(n · log m)
Time (pre-sorted input)O(n + m)O(n + m)O(n · m)O(n · log m)
Worst caseO(n · m) on skewed keysO(n · m) on hash collisions / skewO(n · m)O(n · log m)
Memory neededsort buffers; spills gracefullybuild-side hash table; cliff on spillO(1)O(1) + index
Output sorted?Yes, on join key (free)NoPreserves outer orderPreserves outer order
Equi-join only?YesYesNo — any predicateNo — any indexable predicate
I/O patternSequential (sort + merge)Random once table spillsSequential outer, repeated innerRandom index probes
Best forLarge↔large, pre-sorted, sorted outputLarge↔large, unsorted, memory OKTiny inputs, non-equi predicatesBig fact ↔ small indexed dim

The headline trade-off: hash join is asymptotically cheaper on cold unsorted inputs (no log factor), but sort-merge join is more robust. It produces sorted output for free, spills to disk gracefully, handles outer joins cleanly, and never builds a hash table that can explode under key skew or hash collisions. Most cost-based optimizers — PostgreSQL, Oracle, SQL Server — carry all of these and pick per query based on cardinality estimates and existing sort orders.

What the numbers actually say

  • The sort dominates. For two unsorted 10-million-row inputs, the merge is ~20 million comparisons (O(n + m)), but each external sort is ~10M · log₂(10M) ≈ 10M · 23 ≈ 230 million comparisons. The sort is roughly 20× the merge — so the whole game is whether you can avoid the sort.
  • Pre-sorted input is the win condition. If both inputs arrive sorted, you skip ~460 million comparisons and the join collapses to a single 20-million-step pass. That's why a planner that sees a usable index will often pick merge join even when hash join looks cheaper on paper.
  • External sort I/O. External merge sort reads and writes each input roughly 1 + ⌈log_B(N/M)⌉ times, where M is memory and B is the merge fan-in. With a few gigabytes of sort memory and a high fan-in, even a terabyte input is sorted in 2–3 passes — all sequential I/O, where a disk happily does 200+ MB/s versus a few MB/s on random probes.
  • Skew is the trap. If one key value (say a NULL-replacement or a default) appears in 30% of both inputs, the matching cross-product alone is 0.3n × 0.3m = 0.09·n·m rows — quadratic output that no algorithm can dodge, but sort-merge's right-cursor replay makes the CPU cost visible too.

JavaScript implementation

This is the merge phase for an inner equi-join with full duplicate handling. It assumes both inputs are already sorted on the key; in a real engine you'd sort them first.

// left, right: arrays of rows already sorted ascending by keyFn.
// Emits one combined row per matching pair (Cartesian within a key group).
function sortMergeJoin(left, right, keyFn, combine) {
  const out = [];
  let i = 0, j = 0;

  while (i < left.length && j < right.length) {
    const lk = keyFn(left[i]);
    const rk = keyFn(right[j]);

    if (lk < rk) {
      i++;                       // left key trails — advance left
    } else if (lk > rk) {
      j++;                       // right key trails — advance right
    } else {
      // Equal keys: cross-product of the two matching runs.
      const mark = j;            // remember where the right run begins
      while (i < left.length && keyFn(left[i]) === lk) {
        j = mark;                // restore right cursor for each left row
        while (j < right.length && keyFn(right[j]) === lk) {
          out.push(combine(left[i], right[j]));
          j++;
        }
        i++;
      }
      // j already sits just past the right run; left advanced past its run.
    }
  }
  return out;
}

// Sort-then-merge wrapper for unsorted inputs:
function join(left, right, keyFn, combine) {
  const byKey = (a, b) => (keyFn(a) < keyFn(b) ? -1 : keyFn(a) > keyFn(b) ? 1 : 0);
  return sortMergeJoin([...left].sort(byKey), [...right].sort(byKey), keyFn, combine);
}

The mark / restore pair is the whole trick. When the keys are equal we freeze the right cursor at the start of its run, then for every left row with that key we rewind j back to mark and replay the right run. Drop the restore and you'd emit only the first left row's matches.

Python implementation

The same algorithm, plus a left-outer variant — note how cleanly the outer case slots into the "left key is smaller" branch.

def sort_merge_join(left, right, key, combine):
    """Inner equi-join of two iterables already sorted by `key`."""
    left, right = sorted(left, key=key), sorted(right, key=key)
    out, i, j = [], 0, 0
    n, m = len(left), len(right)

    while i < n and j < m:
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            mark = j
            while i < n and key(left[i]) == lk:
                j = mark
                while j < m and key(right[j]) == lk:
                    out.append(combine(left[i], right[j]))
                    j += 1
                i += 1
    return out


def left_outer_merge_join(left, right, key, combine, null_row):
    """Left rows with no match are emitted padded with `null_row`."""
    left, right = sorted(left, key=key), sorted(right, key=key)
    out, i, j = [], 0, 0
    n, m = len(left), len(right)

    while i < n:
        if j >= m or key(left[i]) < key(right[j]):
            out.append(combine(left[i], null_row))   # no match for this left row
            i += 1
        elif key(left[i]) > key(right[j]):
            j += 1
        else:
            lk, mark = key(left[i]), j
            while i < n and key(left[i]) == lk:
                j = mark
                while j < m and key(right[j]) == lk:
                    out.append(combine(left[i], right[j]))
                    j += 1
                i += 1
    return out

The left-outer version replaces "advance left silently" with "emit a NULL-padded row, then advance left." That single line is the entire difference between an inner and a left-outer merge join — hash join needs a separate bitmap of matched build-side rows to do the same thing.

Variants worth knowing

One-sided sort (merge join on an indexed side). If only one input is sorted — say it's the output of an index scan — you sort just the other input and merge. The planner loves this because it halves the sort cost.

Many-to-one (unique key) fast path. When the right side's key is unique (a primary-key side), there are no duplicate runs to replay, so the mark-and-restore is unnecessary and the merge is strictly O(n + m) with no skew risk. Engines detect this from key constraints and skip the backup logic.

Band join / merge-based inequality join. A generalization that handles t1.x BETWEEN t2.y - k AND t2.y + k by keeping a sliding window of right rows instead of an equality run. Hash join simply can't do this, so merge-style joins remain relevant for range predicates.

Sort-merge group-by (aggregation). The exact same sorted-run logic, but instead of joining two inputs you fold each run of equal keys into one aggregate row. If you understand merge join you already understand sort-based aggregation.

Distributed shuffle-sort-merge. In Spark and MapReduce, both sides are partitioned by a hash of the key (the "shuffle"), each partition is sorted, and partitions with the same hash are merge-joined on different machines. It's the default big-data join precisely because it spills and streams instead of demanding a resident hash table.

Common bugs and edge cases

  • Forgetting to restore the right cursor. The single most common bug — drop the j = mark line and many-to-many joins silently lose rows. Test with duplicates on both sides; one-sided duplicates won't catch it.
  • Comparing keys with the wrong collation or type. The merge assumes both inputs are sorted by the same comparison. If the left side was sorted case-sensitively and the right case-insensitively, equal keys won't line up and matches vanish. Mismatched numeric vs string sort order is the same trap.
  • NULL handling. SQL says NULL never equals NULL, so NULL keys must not join. But sort order places all NULLs together (usually first or last), so a careless merge will happily cross-product them. Filter NULL keys before the merge.
  • Assuming the merge is always O(n + m). Under heavy two-sided duplication the right run is replayed per left duplicate, degrading toward O(n · m). The cost model must account for key cardinality, not just row counts.
  • Using merge join when a hash table would do. If both inputs are small and unsorted and you don't need sorted output, you paid two log factors for nothing — hash join is simpler and faster.
  • Re-sorting an already-sorted input. If the planner can't tell an input is sorted (lost interesting-order tracking after a projection), it re-sorts needlessly. Preserving and propagating sort order through operators is half of why query optimizers are hard.

Frequently asked questions

When does the database planner pick sort-merge join over hash join?

When at least one input already arrives sorted on the join key — for example from a B+ tree index scan or as the output of an earlier sort or merge join — the sort phase is free and merge join wins. It's also preferred when the output must be sorted anyway (ORDER BY or a downstream merge), and when memory is too tight for a hash table but enough for an external sort.

Why does sort-merge join need to back up the right cursor?

When the join key has duplicates on both sides, every left row in a key group must match every right row in that group — a full cross-product. After advancing the left cursor within the group, you rewind the right cursor to the start of its matching run and replay it. Without this 'mark and restore' you would emit only the first pair of each group and silently drop rows.

What is the time complexity of sort-merge join?

If both inputs must be sorted, it is O(n log n + m log m) dominated by the two sorts, plus O(n + m) for the merge in the common case. If the inputs are already sorted, it drops to O(n + m). The worst case is O(n · m) when one key value matches every row on both sides, because the output is itself that large.

Can sort-merge join handle non-equi joins like a range condition?

Not directly. The synchronized single-pass merge relies on equality of the sort keys, so plain sort-merge join only does equi-joins. Inequality joins (t1.x < t2.y) need a band-join or nested-loop variant; hash join can't do them at all, which is one place merge-style joins still have an edge.

Does sort-merge join work for outer joins?

Yes. During the merge, when the left key is smaller than the right key you emit the left row padded with NULLs (for a left outer join) and advance left; the symmetric case handles right outer joins, and doing both handles a full outer join. This is a natural extension that hash join needs extra bookkeeping to match.

How does sort-merge join behave on disk versus in memory?

It is sequential-I/O friendly. Both sorts use external merge sort, which streams runs in large sequential reads and writes, and the final merge pass scans each input once front-to-back. That sequential access pattern is far kinder to spinning disks and SSDs than the random probes of a hash join whose hash table has spilled to disk.