Spatial Data Structures

R-Tree

Nested bounding rectangles — the GIS workhorse

An R-tree indexes spatial objects under nested minimum bounding rectangles. Queries prune subtrees whose MBR misses the search rect. Branching M=50–100 gives O(log_M N). PostGIS runs on it.

  • Range queryO(log_M N + k)
  • InsertO(log_M N)
  • Branching M (page-sized)50–100 typical
  • Disk-friendlyOne node = one page
  • Handles arbitrary geometryYes — via MBR approximation
  • BalancedYes — all leaves at same depth

Interactive visualization

Watch nested bounding rectangles cluster objects, then see a query prune entire subtrees in one step.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How an R-tree works

Imagine a map of a country with ten million landmarks — every shop, river, park, road, building. You want to know: "What's inside this map viewport right now?" A linear scan visits all ten million; the viewport contains maybe a hundred. The R-tree, invented by Antonin Guttman in 1984, makes that query log-time by grouping geometries hierarchically under their bounding rectangles.

Three rules govern the structure:

  1. Every node has between m and M entries (with m ≈ M/2, and M chosen so a node fits in one disk page).
  2. Every entry is a bounding rectangle plus a pointer. In leaves, the pointer goes to a geometry. In internal nodes, the pointer goes to a child node, and the bounding rectangle is the MBR of everything in that subtree.
  3. All leaves are at the same depth. The tree is balanced — height is O(log_M N).

That's enough to make range queries fast. To answer "what intersects rectangle Q?", start at the root. For each child entry, test if its MBR intersects Q. If not, skip the entire subtree — possibly thousands of geometries pruned with one rectangle-rectangle test. If yes, recurse. At a leaf, test Q against the actual geometry (the MBR is only an approximation).

Building the tree

R-trees are built incrementally by insertion or in bulk by Sort-Tile-Recursive (STR) packing. The incremental insert algorithm:

  1. Choose subtree. Starting at root, pick the child whose MBR needs the smallest enlargement to contain the new entry. Ties broken by smaller MBR area.
  2. Descend to a leaf.
  3. Insert into the leaf. If the leaf is now overfull (>M entries), split it.
  4. Adjust ancestors — every ancestor's MBR may need to expand, and a split propagates up until a non-overfull node absorbs it (or the root splits and the tree grows by one level).

Splitting is where R-tree variants differ. The original Guttman R-tree uses a quadratic-cost heuristic to minimize the total area of the two resulting MBRs. The R*-tree (Beckmann 1990) uses a smarter splitter that also minimizes overlap and perimeter — giving 20–50% faster queries on real GIS data at the cost of more expensive inserts. STR packing builds the whole tree at once from a sorted list, giving the densest tree at the cost of being a batch operation.

When to use an R-tree

  • Geospatial range queries. "Find every POI in this map viewport" — exactly what R-trees were designed for. PostGIS uses GiST + R-tree-like operator class as the default spatial index.
  • Bounding-box collision broad-phase. Some physics engines and CAD systems index rigid bodies' AABBs in an R-tree so each frame's narrow-phase only considers overlapping pairs.
  • Disk-based spatial data. R-trees beat tries and k-d trees on disk because each node is a page; one I/O pulls all the children. With M=50, even a billion-row table is 5 levels deep — five page reads.
  • Tile servers and map rendering. Mapbox Vector Tiles, Cesium 3D Tiles, and most map renderers query an R-tree per zoom level to pick visible geometries.
  • Multidimensional indexing beyond 2D. R-trees generalize to any dimension by using axis-aligned bounding boxes. PostGIS uses a GIST R-tree for 3D geometry (geom Z, M coordinates) and even time intervals.

R-tree vs k-d tree vs grid vs quadtree

R-treek-d treeGridQuadtree
Geometry typeAny (rect, polygon)PointsAnyAny (assigned by center)
Range queryO(log_M N + k)O(√N + k) for 2DO(cells × density)O(log N + k)
BalancedYes — alwaysOnce at build; degrades on insertN/A (flat)No — depth varies
Disk pagesOne node per pageHard to mapLinear filePossible (linear quadtree)
Dynamic updatesO(log_M N) — supports itRebuild recommendedO(1) bucket swapO(log N)
Overlapping nodesAllowed (must descend into multiple)No — strict partitionCells overlap on big objectsNo
Best forGIS, mixed shapes, on-diskStatic points, low-D NNUniform density, fixed scaleSparse points, 2D screen-space
Used byPostGIS, Oracle Spatialscikit-learn kNN, PCLVoxel games, NPC AIImage compression, games

Rule of thumb: R-tree for production GIS (you want disk-friendly, balanced, polygon-capable). k-d tree for in-memory nearest-neighbor over points. Grid for uniform dynamic point clouds. Quadtree for hierarchical screen-space.

Pseudo-code — range query

// Query rectangle Q; return all entries intersecting Q.

search(node, Q):
    results = []
    if node.isLeaf:
        for entry in node.entries:
            if intersects(entry.mbr, Q):
                if intersects(entry.geometry, Q):     // exact test
                    results.append(entry)
    else:
        for child in node.children:
            if intersects(child.mbr, Q):              // prune by MBR
                results.extend(search(child, Q))
    return results

// Insert a new entry e with rectangle e.mbr.

insert(root, e):
    leaf = chooseLeaf(root, e.mbr)
    leaf.entries.append(e)
    if len(leaf.entries) > M:
        a, b = split(leaf)                            // R* variant minimizes overlap
        propagateUp(a, b)
    else:
        adjustMBRs(leaf, e.mbr)

JavaScript implementation

// Minimal R-tree with quadratic split. For production, use RBush or rbush.
// MBR is { minX, minY, maxX, maxY }.

class RTreeNode {
  constructor(isLeaf = true) { this.entries = []; this.isLeaf = isLeaf; this.mbr = null; }
}

function intersects(a, b) {
  return a.minX <= b.maxX && a.maxX >= b.minX &&
         a.minY <= b.maxY && a.maxY >= b.minY;
}
function area(r) { return (r.maxX - r.minX) * (r.maxY - r.minY); }
function combined(a, b) {
  return { minX: Math.min(a.minX, b.minX), minY: Math.min(a.minY, b.minY),
           maxX: Math.max(a.maxX, b.maxX), maxY: Math.max(a.maxY, b.maxY) };
}
function enlargement(a, b) { return area(combined(a, b)) - area(a); }

class RTree {
  constructor(M = 9) { this.M = M; this.m = Math.ceil(M * 0.4); this.root = new RTreeNode(true); }

  search(rect, node = this.root, out = []) {
    if (!node.mbr || !intersects(node.mbr, rect)) {
      if (node !== this.root) return out;  // pruned
    }
    if (node.isLeaf) {
      for (const e of node.entries) if (intersects(e.mbr, rect)) out.push(e);
    } else {
      for (const child of node.entries)
        if (!child.mbr || intersects(child.mbr, rect)) this.search(rect, child, out);
    }
    return out;
  }

  chooseLeaf(node, mbr) {
    if (node.isLeaf) return node;
    let best = null, bestE = Infinity, bestA = Infinity;
    for (const child of node.entries) {
      const e = enlargement(child.mbr, mbr);
      const a = area(child.mbr);
      if (e < bestE || (e === bestE && a < bestA)) { best = child; bestE = e; bestA = a; }
    }
    return this.chooseLeaf(best, mbr);
  }

  insert(entry) {
    const leaf = this.chooseLeaf(this.root, entry.mbr);
    leaf.entries.push(entry);
    this._updateMBR(leaf);
    if (leaf.entries.length > this.M) this._split(leaf);
  }

  _updateMBR(node) {
    if (node.entries.length === 0) { node.mbr = null; return; }
    let r = { ...node.entries[0].mbr };
    for (let i = 1; i < node.entries.length; i++) r = combined(r, node.entries[i].mbr);
    node.mbr = r;
  }

  // Quadratic split: pick two seeds whose combined MBR area is worst, distribute the rest.
  _split(node) {
    const E = node.entries;
    let s1 = 0, s2 = 1, worst = -Infinity;
    for (let i = 0; i < E.length; i++) for (let j = i + 1; j < E.length; j++) {
      const d = area(combined(E[i].mbr, E[j].mbr)) - area(E[i].mbr) - area(E[j].mbr);
      if (d > worst) { worst = d; s1 = i; s2 = j; }
    }
    const A = new RTreeNode(node.isLeaf); A.entries.push(E[s1]);
    const B = new RTreeNode(node.isLeaf); B.entries.push(E[s2]);
    for (let i = 0; i < E.length; i++) {
      if (i === s1 || i === s2) continue;
      const enlA = enlargement(this._mbr(A), E[i].mbr);
      const enlB = enlargement(this._mbr(B), E[i].mbr);
      (enlA < enlB ? A : B).entries.push(E[i]);
    }
    this._updateMBR(A); this._updateMBR(B);
    // ... propagate up (omitted for brevity; rewire parent or grow root)
  }

  _mbr(n) { this._updateMBR(n); return n.mbr; }
}

For production JavaScript work, use the RBush library — STR-packed bulk loading, R*-style split heuristics, and battle-tested in Mapbox GL and OpenLayers.

Python implementation

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class MBR:
    minX: float; minY: float; maxX: float; maxY: float

    def intersects(self, other):
        return (self.minX <= other.maxX and self.maxX >= other.minX and
                self.minY <= other.maxY and self.maxY >= other.minY)

    def area(self):
        return (self.maxX - self.minX) * (self.maxY - self.minY)

    def combined(self, other):
        return MBR(min(self.minX, other.minX), min(self.minY, other.minY),
                   max(self.maxX, other.maxX), max(self.maxY, other.maxY))


@dataclass
class Entry:
    mbr: MBR
    payload: object


class RTree:
    def __init__(self, M=9):
        self.M = M
        self.root = {'is_leaf': True, 'entries': [], 'mbr': None}

    def search(self, rect: MBR, node=None, out=None) -> List[Entry]:
        node = node or self.root
        out = out if out is not None else []
        if node['is_leaf']:
            for e in node['entries']:
                if rect.intersects(e.mbr):
                    out.append(e)
        else:
            for child in node['entries']:
                if rect.intersects(child['mbr']):
                    self.search(rect, child, out)
        return out

    def insert(self, entry: Entry):
        leaf = self._choose_leaf(self.root, entry.mbr)
        leaf['entries'].append(entry)
        self._update_mbr(leaf)
        # split + propagate omitted for brevity

    def _choose_leaf(self, node, mbr):
        if node['is_leaf']:
            return node
        best = min(node['entries'],
                   key=lambda c: c['mbr'].combined(mbr).area() - c['mbr'].area())
        return self._choose_leaf(best, mbr)

    def _update_mbr(self, node):
        if not node['entries']:
            node['mbr'] = None; return
        boxes = [e.mbr if isinstance(e, Entry) else e['mbr'] for e in node['entries']]
        r = boxes[0]
        for b in boxes[1:]:
            r = r.combined(b)
        node['mbr'] = r

For real Python work, use Rtree (libspatialindex bindings), GeoPandas's spatial index, or PostGIS's GiST index via psycopg2. These give R*-tree heuristics and bulk loading out of the box.

Variants

  • Guttman R-tree (1984). The original. Quadratic-cost split heuristic minimizing total MBR area. Solid but not optimal.
  • R*-tree (Beckmann 1990). Refined splits and chooseSubtree heuristics; also reinserts the worst entries after a split. 20–50% faster queries; the modern default.
  • R+-tree (Sellis 1987). No overlap between sibling MBRs — geometries that span boundaries are duplicated into each overlapping leaf. Faster queries; more storage and more complex updates.
  • Hilbert R-tree (Kamel 1994). Uses Hilbert space-filling curve to order entries before insertion, giving consistently tight MBRs. Excellent for static data.
  • STR-packed R-tree (Leutenegger 1997). Bulk-load by sorting on x, splitting into strips, sorting each strip on y. The fastest way to build a static R-tree from known data; used by RBush.
  • Priority R-tree (Arge 2004). Provably optimal worst-case query time. Rarely used in practice — the R*-tree is fast enough and simpler.

Performance — costed claims

  • Query cost. An R-tree with branching factor M holds N entries in a tree of height ⌈log_M(N)⌉. With M=50 and N=10⁶, that's 4 levels. Each level visits a few subtrees on average, so a range query touches ~20 nodes — 20 page reads on disk, sub-millisecond on SSD.
  • Build time. STR-packing N entries is O(N log N) — sort by x, partition, sort each partition by y, partition again. Building a 10M-entry tree takes 1–5 seconds in RBush; incremental insertion takes much longer.
  • R*-tree advantage. The R*-tree's split heuristic produces 20–50% smaller average MBR overlap on real datasets (Beckmann's original paper measured this on TIGER street data). Queries are correspondingly faster.
  • PostGIS GiST R-tree benchmark. A spatial range query on a 10M-row table with a GiST index typically runs in 1–10 ms; the same query on an unindexed table takes 5–30 seconds — three to four orders of magnitude.
  • Memory per entry. Roughly 40–80 bytes (4 floats for MBR + pointer + metadata). At M=50 with N=10⁶, total node memory is ~80 MB plus the geometries themselves.

Common bugs and edge cases

  • Degenerate MBRs. A point's MBR has zero area; many split heuristics divide by area and produce NaN. Always handle the zero-area case explicitly or seed insertion order.
  • Overlapping MBRs at the root. If sibling MBRs overlap heavily, queries descend into multiple subtrees and lose the log factor. The R*-tree mitigates this; the original Guttman tree does not.
  • Skewed data degradation. A million points concentrated in a city plus a thousand scattered across the country produce wildly different MBR densities. Use STR-packing on bulk data, or use the R*-tree's reinsertion mechanism.
  • Forgetting the exact intersection test. The MBR is only an approximation — a point inside the MBR of a polygon isn't necessarily inside the polygon. Always re-test against the actual geometry at the leaf.
  • Concurrent updates. Inserts can cascade splits up to the root, holding write locks at every level. Naive implementations bottleneck under contention. Production systems use B-link-style optimistic concurrency or lock-coupling.
  • Page size mismatch. If your nodes don't fit in one disk page, every leaf access becomes two I/Os. Compute M from the page size and entry size precisely; common M values are 50, 60, or 100.

Frequently asked questions

What is an R-tree and what's it for?

An R-tree is a balanced spatial index. Each leaf stores up to M geometries (rectangles, polygons, anything with a bounding box). Each internal node stores up to M bounding boxes, each pointing at a child whose contents fit inside it. A range query — 'give me every shape that intersects this rectangle' — descends only into subtrees whose MBR overlaps the query. Used wherever you need spatial range, nearest-neighbor, or intersection queries: PostGIS, MySQL Spatial, Oracle Spatial, MongoDB 2dsphere, SQLite R*Tree.

How does the R-tree differ from a k-d tree or quadtree?

Quadtrees and k-d trees partition space — every point lives in exactly one region. R-trees partition objects — bounding boxes may overlap, and a query may need to descend into multiple sibling subtrees if their MBRs both intersect the query. The trade-off: R-trees handle rectangles and polygons natively (no need to splat geometry into multiple cells) and they're disk-friendly (each node fits in one page). Quadtrees beat them on uniformly-distributed point data; R-trees beat them on real-world geometries.

What is M and why does it matter?

M is the maximum number of entries per node — typically chosen so one node fills one disk page (4–8 KB). For a typical 4 KB page and 64-byte entries, M is around 50–100. Query cost is O(log_M N), so larger M means a shallower tree and fewer disk reads. There's also a minimum m (often M/2 or 0.4·M) that triggers rebalancing on deletion.

What's an R*-tree and why is it better?

The R*-tree (Beckmann et al., 1990) refines the original R-tree's heuristics for splitting overfull nodes and choosing insertion subtrees. It minimizes total MBR area, perimeter, and overlap simultaneously. Result: 20–50% faster queries on real datasets at the cost of more expensive inserts. PostGIS and most modern systems implement the R*-tree, not the original Guttman R-tree.

Why are R-trees the default in geospatial databases?

Three reasons: they're balanced (the tree height stays logarithmic), they handle arbitrary geometries via MBR approximation, and they map naturally to disk pages. A 10-million-row PostGIS table with an R-tree GiST index can answer a viewport range query in 1–10 ms — same as a B-tree lookup on a non-spatial column. The shape is irrelevant to the index; only the bounding box matters.

How does the MBR query actually prune?

At each node, test the query rectangle against every child MBR. If a child's MBR doesn't intersect, skip its entire subtree — possibly thousands of geometries. If it does, recurse. At leaves, do an exact intersection test against the actual geometry (the MBR is only an approximation; a point inside the MBR isn't necessarily inside the underlying polygon).

Can R-trees do nearest-neighbor search?

Yes — using a best-first priority queue keyed by minimum distance from the query point to each node's MBR. Pop the closest node, expand its children, push them with their min distances, and repeat. Stop when the top of the queue is a real geometry. This is the Roussopoulos-Hjaltason-Kelley algorithm; k-nearest-neighbor variants extend it naturally. PostGIS exposes this as the '<->' operator.