Computer Graphics

Bounding Volume Hierarchy (BVH)

How one ray skips a million triangles to find the one it hits

A bounding volume hierarchy (BVH) is a tree of nested bounding boxes built over a scene's primitives, turning O(n) ray and collision queries into roughly O(log n) by letting one missed box prune an entire subtree.

Ray query (expected)O(log n)
Worst-case queryO(n)
Build (SAH)O(n log n)
Bounding shapeAABB (slab test)
Refit on motionO(n), bottom-up

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How a BVH accelerates a query

A renderer tracing one ray against a scene of a million triangles, with no acceleration structure, does a million ray-triangle intersection tests — for every ray. A 1080p frame at one ray per pixel is two million rays, so that's two trillion intersection tests per frame. At even a few nanoseconds each, you're looking at minutes per frame. Brute force does not render anything in real time.

A bounding volume hierarchy fixes this by precomputing a tree. The leaves hold the actual primitives (triangles, spheres, whole meshes). Every internal node stores a bounding volume — almost always an axis-aligned bounding box, or AABB — that completely encloses everything beneath it. The root box wraps the whole scene; each child box wraps half of it; and so on down to the leaves.

The key invariant is one line of geometry: if a ray misses a node's box, it cannot possibly hit anything inside that box. So traversal becomes a guarded descent. Test the root. If the ray misses, the whole scene is rejected in one test. If it hits, test both children; recurse only into the ones the ray actually pierces. Each miss prunes an entire subtree — potentially hundreds of thousands of primitives — for the cost of a single box intersection.

Because each successful descent roughly halves the live candidate set, a balanced tree resolves a ray in on the order of log₂ n box tests plus a handful of leaf-primitive tests, instead of n. That's the whole idea: pay O(n log n) once at build time to make every subsequent query logarithmic.

Why axis-aligned boxes, and the slab test

The bounding volume could be a sphere, an oriented box, or a convex polytope. In practice almost everyone uses the AABB, for one reason: the per-node test is dirt cheap. An AABB is two corner points, min and max. A ray hits it iff the ray's entry and exit times across the three axis-aligned slabs overlap. That's the slab test — branch-free, no square roots, and it vectorizes:

// Ray: origin o, precomputed inverse direction invD = 1/d
function hitAABB(o, invD, min, max, tMin, tMax) {
  for (let a = 0; a < 3; a++) {
    let t0 = (min[a] - o[a]) * invD[a];
    let t1 = (max[a] - o[a]) * invD[a];
    if (invD[a] < 0) [t0, t1] = [t1, t0];   // handle negative direction
    tMin = t0 > tMin ? t0 : tMin;
    tMax = t1 < tMax ? t1 : tMax;
    if (tMax <= tMin) return false;          // slabs don't overlap → miss
  }
  return true;
}

Six multiplies, a few comparisons, no division (the reciprocal is precomputed once per ray). A tighter volume — an oriented bounding box or a 26-DOP — culls more rays per node, but the test costs several times more and the node is bigger in memory. AABBs win because the test is so cheap you can afford to visit more nodes; this is the same trade-off that makes a shallow, wide structure beat a deep, tight one in practice.

When to reach for a BVH

Ray and path tracing — the canonical use. Modern GPUs (NVIDIA RTX, AMD RDNA, Apple silicon) build and traverse BVHs in dedicated hardware.
Broad-phase collision detection — overlap two BVHs and recurse only where parent boxes touch, so most non-colliding pairs are rejected without ever comparing the underlying shapes.
Frustum and occlusion culling — reject whole branches of the scene graph that fall outside the camera in one box-vs-frustum test.
Dynamic and deforming scenes — unlike a kd-tree, a BVH refits cheaply when objects move, because each primitive stays in exactly one leaf.
Nearest-neighbour and range queries over spatial data, where the box hierarchy bounds the search radius.

If your scene is static and you want the absolute fastest traces, a kd-tree may edge out a BVH. If your data is points rather than extended objects, a kd-tree or octree is usually the cleaner fit. The BVH's sweet spot is extended primitives that move.

BVH vs other spatial acceleration structures

	BVH	kd-tree	Octree / Quadtree	Uniform grid	BSP tree
Partitions	Objects	Space (axis planes)	Space (regular octants)	Space (fixed cells)	Space (arbitrary planes)
Primitive in N nodes	Exactly 1	Possibly many (straddlers)	Possibly many	Possibly many	Possibly many
Sibling box overlap	Allowed	None (disjoint)	None	None	None
Build cost (good quality)	O(n log n) SAH	O(n log n), higher constant	O(n log n)	O(n)	O(n log n)+
Static ray-trace speed	Fast	Often fastest	Moderate	Poor for non-uniform	Moderate
Dynamic / deforming scenes	Refit in O(n) — best	Must rebuild	Rebuild / re-insert	Re-bin, cheap	Rebuild
Memory per primitive	~32–64 B/node, ~2n nodes	Higher (duplication)	Higher (empty cells)	Depends on density	High
Hardware support	Yes (RTX, RDNA, Metal)	No	No	No	No

The headline distinction is object-partitioning versus space-partitioning. A kd-tree splits space with planes, so a triangle crossing a plane lands in both children and gets tested twice — but the children never overlap, so a ray descends exactly one path. A BVH splits the objects, so every triangle lives in one leaf — but sibling boxes can overlap, so a ray may have to descend both children near a boundary. In practice the BVH's no-duplication, easy-refit properties make it the default for everything except static offline renderers.

What the numbers actually say

From O(n) to ~O(log n). For a 1-million-triangle scene, a brute-force trace is 1,000,000 ray-triangle tests per ray; a balanced BVH is on the order of log₂(10⁶) ≈ 20 box tests plus a few leaf tests. That's roughly a 50,000× reduction in tests per ray — the difference between minutes and milliseconds per frame.
SAH is worth ~1.5–2×. A tree built with the Surface Area Heuristic typically traces 1.5–2× faster than one built by naive median splitting, because it puts the cuts where rays are statistically least likely to need both children.
Node count ≈ 2n − 1. A binary BVH with n leaves has n − 1 internal nodes, so total storage is linear in the primitive count. At ~32 bytes per node, a 10M-triangle scene's BVH is roughly 640 MB — a real GPU-memory concern, which is why compressed and wide (4- or 8-ary) BVHs exist.
Refit is O(n) and embarrassingly parallel, versus an O(n log n) rebuild — which is why games refit most frames and only rebuild when box quality has drifted too far.
Tree depth ≈ log₂ n + slack. A million-leaf SAH tree is roughly 25–30 levels deep; the traversal stack is shallow enough to live in a fixed-size array, no heap allocation per ray.

JavaScript implementation

A compact median-split builder plus a recursive traversal. This builds the tree the simple way (split on the longest axis at the centroid median); swapping in SAH means replacing the one chooseSplit line.

// A primitive exposes a bounding box {min:[x,y,z], max:[x,y,z]} and a centroid.
function makeNode() { return { min:[ Infinity, Infinity, Infinity],
                               max:[-Infinity,-Infinity,-Infinity],
                               left:null, right:null, prims:null }; }

function expand(box, b) {           // grow box to include b's bounds
  for (let a = 0; a < 3; a++) {
    if (b.min[a] < box.min[a]) box.min[a] = b.min[a];
    if (b.max[a] > box.max[a]) box.max[a] = b.max[a];
  }
}

function build(prims) {
  const node = makeNode();
  for (const p of prims) expand(node, p.box);          // node bound = union of all
  if (prims.length <= 4) { node.prims = prims; return node; }   // leaf

  // Split on the axis with the widest centroid spread, at the median.
  const axis = widestAxis(prims);
  prims.sort((p, q) => p.centroid[axis] - q.centroid[axis]);
  const mid = prims.length >> 1;

  node.left  = build(prims.slice(0, mid));
  node.right = build(prims.slice(mid));
  return node;
}

// hitAABB (above) tests the box against the ray within [0, hit.t]; hitPrim
// returns a primitive hit distance or Infinity. `hit` is a shared record
// {t, prim} so the closest distance found so far prunes every later box.
function traverse(node, ray, hit = { t: Infinity, prim: null }) {
  if (hitAABB(ray.o, ray.invD, node.min, node.max, 0, hit.t) === false) return hit;  // box beyond closest hit → prune subtree
  if (node.prims) {                                            // leaf: test prims
    for (const p of node.prims) {
      const t = hitPrim(p, ray);
      if (t < hit.t) { hit.t = t; hit.prim = p; }
    }
    return hit;
  }
  // Visit the nearer child first; its hit shrinks hit.t and prunes the far box.
  const [near, far] = orderChildren(node, ray);
  traverse(near, ray, hit);
  traverse(far,  ray, hit);
  return hit;
}

Two details that matter for speed. First, orderChildren visits whichever child the ray enters first; once you've found a hit in the near child, any far-child box whose entry distance exceeds the current closest hit can be pruned outright. Second, real engines flatten this recursion into an array with a small explicit stack — pointer-chasing a tree per ray wrecks cache locality on a GPU.

Python implementation

The same algorithm, plus a SAH-flavoured split that scores candidate cut points by surface area times primitive count.

import math

class Node:
    __slots__ = ('lo', 'hi', 'left', 'right', 'prims')
    def __init__(self):
        self.lo = [math.inf]*3
        self.hi = [-math.inf]*3
        self.left = self.right = self.prims = None

def expand(node, box):
    lo, hi = box
    for a in range(3):
        node.lo[a] = min(node.lo[a], lo[a])
        node.hi[a] = max(node.hi[a], hi[a])

def surface_area(lo, hi):
    d = [hi[a] - lo[a] for a in range(3)]
    return 2 * (d[0]*d[1] + d[1]*d[2] + d[2]*d[0])

def build(prims):
    node = Node()
    for p in prims:
        expand(node, p.box)
    if len(prims) <= 4:
        node.prims = prims
        return node

    # SAH: try splits along the widest axis; pick the lowest-cost cut.
    axis = max(range(3), key=lambda a: node.hi[a] - node.lo[a])
    prims.sort(key=lambda p: p.centroid[axis])
    best_cost, best_i = math.inf, len(prims) // 2
    n = len(prims)
    for i in range(1, n):                       # split prims[:i] | prims[i:]
        l_lo = [min(p.box[0][a] for p in prims[:i]) for a in range(3)]
        l_hi = [max(p.box[1][a] for p in prims[:i]) for a in range(3)]
        r_lo = [min(p.box[0][a] for p in prims[i:]) for a in range(3)]
        r_hi = [max(p.box[1][a] for p in prims[i:]) for a in range(3)]
        cost = (surface_area(l_lo, l_hi) * i +
                surface_area(r_lo, r_hi) * (n - i))
        if cost < best_cost:
            best_cost, best_i = cost, i

    node.left  = build(prims[:best_i])
    node.right = build(prims[best_i:])
    return node

The naive SAH sweep above is O(n²) per node because it recomputes child bounds for every candidate split. Production builders fix this with a single sorted sweep that accumulates prefix/suffix box areas, dropping each node's split search to O(n) and the whole build to O(n log n). Faster still: bin the centroids into 8–16 buckets and only evaluate splits at bucket boundaries.

Variants worth knowing

SAH BVH. The quality baseline. Score every candidate split by SA(left)·N(left) + SA(right)·N(right) and take the minimum. The surface area approximates the probability a uniformly random ray enters that box. SAH trees trace 1.5–2× faster than median-split trees.

LBVH (Linear BVH). Sort primitives along a Morton / Z-order curve and build the tree by splitting on the highest differing bit. It's O(n) after the sort and trivially parallel on a GPU, at the cost of somewhat lower tree quality than SAH.

Wide BVH (BVH4 / BVH8). Instead of two children per node, store four or eight. Fewer levels means less pointer chasing and lets SIMD/GPU test all children of a node at once. This is the layout RTX and RDNA hardware actually traverse.

Two-level BVH (TLAS / BLAS). Each mesh gets its own bottom-level BVH (a BLAS) in object space; a top-level BVH (a TLAS) holds one box per instance with its transform. Move an instance and you rebuild only the cheap TLAS — this is how an animated scene with thousands of rigid instances stays fast, and it's exactly the abstraction the Vulkan and DirectX ray-tracing APIs expose.

Refit + rebuild. For deforming geometry, refit the existing topology bottom-up each frame (O(n)) and trigger a full rebuild only when a quality metric — total node surface area versus the last rebuild — crosses a threshold.

Common bugs and edge cases

Forgetting to widen boxes for floating-point error. A leaf box that exactly hugs its triangle can report a miss for a ray that grazes the surface, because the slab test rounds the wrong way. Pad leaf bounds by a few ULPs (or use conservative rounding) so the box never excludes a real hit.
Not pruning the far child with the current closest hit. If you traverse both children unconditionally, you still test every box even after you've found a hit. Pass the closest t down and reject any box whose entry distance is farther.
Visiting children in the wrong order. Descending the far child first means the near hit can't prune it. Order by the ray's entry distance into each child's box.
Degenerate centroid-coincident splits. When many primitives share a centroid (e.g. a fan of coplanar triangles), the median split puts them all on one side and the tree degenerates toward O(n). Fall back to an object-median or spatial split for that node.
Overlapping sibling boxes inflating cost. Because BVH siblings can overlap, a ray near a boundary descends both. Large overlap is the classic symptom of a bad split — SAH exists precisely to minimize it.
Refitting forever without rebuilding. Refit preserves topology, so after enough motion the boxes balloon and overlap heavily; traversal slowly degrades to brute force. Track box quality and rebuild when it drifts.
Recursive traversal on the GPU. GPUs have no real call stack per thread; a recursive traverser either crashes or serializes. Flatten the tree to an array and walk it with a fixed-size local stack.

Frequently asked questions

Why is a BVH roughly O(log n) instead of O(n)?

A ray that misses a node's bounding box cannot hit anything inside it, so the traversal skips that node's entire subtree with a single test. Each successful descent roughly halves the candidate set, so a balanced tree of n primitives needs on the order of log n box tests per ray instead of n primitive tests. The bound is expected, not worst-case — a degenerate scene or a ray that grazes many overlapping boxes can still cost O(n).

What is the difference between a BVH and a kd-tree?

A BVH partitions the objects: each primitive lives in exactly one leaf, and sibling boxes may overlap in space. A kd-tree partitions the space: a primitive that straddles a splitting plane is referenced by both sides. BVHs handle dynamic and deforming scenes better because you can refit boxes without re-splitting; kd-trees often trace static scenes slightly faster but cost more to build and store.

What is the Surface Area Heuristic (SAH)?

The SAH estimates the cost of a candidate split as the probability a ray hits each child box — proportional to the child's surface area — times the number of primitives it holds. Builders evaluate many split planes and pick the one with the lowest estimated cost. SAH trees trace 1.5–2× faster than median-split trees, at the price of a more expensive build.

Why are bounding boxes axis-aligned (AABB) instead of tight-fitting?

An axis-aligned box stores just two corner points and tests against a ray with the branch-free slab method in a handful of FLOPs. Tighter volumes — oriented boxes, spheres, k-DOPs — prune more rays but cost more per test and more memory. AABBs win because the per-node test is so cheap that you can afford to visit more of them.

How do you update a BVH when objects move?

For small motion, refit: walk the tree bottom-up and re-expand each parent box to enclose its (moved) children, keeping the topology — O(n) and trivially parallel. Refitting degrades quality as boxes drift apart, so engines periodically rebuild from scratch or use incremental rotations. Fully rebuilding every frame is common for fast-deforming geometry.

Are BVHs only for ray tracing?

No. The same tree of nested boxes accelerates broad-phase collision detection (overlap two BVHs and recurse only where parent boxes touch), frustum culling, nearest-neighbour and range queries, and physics sweep tests. Ray tracing is the highest-profile use because GPUs now build and traverse BVHs in hardware, but the structure predates real-time ray tracing by decades.