Space-Filling Curves
Morton Curve (Z-Order)
Interleave x and y bits — get a 1D index that preserves 2D locality
The Morton curve, or Z-order, maps multi-dimensional points to a 1D integer by interleaving the bits of the coordinates. Points near each other in 2D share long prefixes — exactly why GPU textures, octrees, and geospatial indexes use it.
- Encodinginterleave bits
- Encode costO(log n) shifts
- Localitygood, jumpy at cell edges
- Hardwarex86 PDEP/PEXT, ~3 ns
- InventedMorton 1966, IBM
- Used inoctrees, GPU swizzling, geohashing
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How bit interleaving works
Morton order is the simplest way to turn a 2D (or 3D, or n-D) coordinate into a 1D integer while preserving most of the spatial relationships. The recipe: take the binary representation of each coordinate, then weave the bits together.
For a 2D point with x and y coordinates, bit i of x lands at position 2i of the output; bit i of y lands at position 2i+1. Walking through the example from the lede — point (3, 5) — gives:
x = 3 = 0 1 1 (bit2 bit1 bit0)
y = 5 = 1 0 1 (bit2 bit1 bit0)
interleave (low→high): y0 x0 y1 x1 y2 x2
1 1 0 1 1 0
= 0b011011
= 27
So Morton(3, 5) = 27. Decoding is the inverse: take every other bit of the Morton code to recover x and y separately. The whole encoding–decoding pair is a half-dozen instructions on modern CPUs.
The Z shape
If you list the Morton codes for every cell in a 4×4 grid and connect them in order, you get a path. The path follows nested Z shapes — bottom-left, bottom-right, top-left, top-right within each 2×2 block, then the four 2×2 blocks are visited in the same Z order.
4×4 grid with Morton indices:
10 11 14 15
8 9 12 13
2 3 6 7
0 1 4 5
Path: 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8 → ... → 15
└─ Z ─┘ └─ Z ─┘ └─ Z ─┘ └─ Z ─┘
└────── Z ──────┘
└────── Z ──────┘
The Z is fractal: nest a 4×4 Z inside each cell of a larger 4×4 Z and you get the 16×16 Z, and so on. The same construction works in 3D — the curve traces nested N shapes through cube octants.
Why locality is mostly preserved
Two cells whose Morton codes share the top 2k bits lie inside the same 2n−k × 2n−k sub-block (for an n-bit-per-coordinate grid). So sorting by Morton code clusters cells by spatial vicinity. Range queries become contiguous-ish scans; nearest-neighbor queries focus on a handful of nearby ranges.
The catch: the cell after the last cell of a Z block can be diagonally opposite the last cell. The transition 3 → 4 in the 4×4 grid above jumps from the top-right corner of the bottom-left 2×2 block to the bottom-left corner of the bottom-right 2×2 block — adjacent in 1D but distance 2 apart in 2D. The Hilbert curve fixes this by always stepping to a neighboring cell; Morton trades that off for cheaper encoding.
Fast encoding: magic numbers
The textbook implementation interleaves bits with a sequence of shifts and masks:
// Spread the low 16 bits of x into the even positions of a 32-bit result.
uint32_t spread_bits(uint32_t x) {
x &= 0x0000FFFF;
x = (x | (x << 8)) & 0x00FF00FF;
x = (x | (x << 4)) & 0x0F0F0F0F;
x = (x | (x << 2)) & 0x33333333;
x = (x | (x << 1)) & 0x55555555;
return x;
}
uint32_t morton2D(uint16_t x, uint16_t y) {
return spread_bits(x) | (spread_bits(y) << 1);
}
Each shift-and-mask step doubles the spacing between the bits we care about and zeros out everything else. After 5 such steps the 16 input bits sit at positions 0, 2, 4, ... 30 with zero gaps between. Combined cost: about 10 instructions, ~3 ns on Skylake.
On x86 with the BMI2 extension, the PDEP (parallel-deposit) instruction does the spread in one instruction:
#include <immintrin.h>
uint32_t morton2D_bmi2(uint16_t x, uint16_t y) {
return _pdep_u32(x, 0x55555555u) | _pdep_u32(y, 0xAAAAAAAAu);
}
Two instructions, one cycle latency each. The fastest possible encoder for 2D Morton codes on modern hardware.
Application: GPU texture swizzling
A GPU samples a 2D texture by interpolating from four neighboring texels. If texels are stored in row-major order (the usual array layout), four neighboring texels span addresses that differ by either ±1 or ±width — meaning the rows hit two distinct cache lines. With Morton-order layout (often called texture swizzling), the four neighbors usually sit on the same cache line. The cache hit rate jumps from roughly 50% to 95% for bilinear filtering of mipmapped textures.
Modern GPUs (AMD GCN, NVIDIA Maxwell+) store textures in Morton order automatically — the driver writes them swizzled when you call glTexImage2D. The optimization is invisible to graphics programmers, but it's why GPU memory bandwidth efficiency is high.
Application: linear octrees
An octree stores 3D points by recursively subdividing space into octants. The naive representation is a tree of nodes, each holding 8 child pointers. A linear octree skips the pointers: each leaf is identified by its Morton code (interleaving x, y, z bits), and the whole tree is just a sorted array of those codes.
Range queries — "give me every point inside this AABB" — become binary searches on the Morton-sorted array. Neighbor queries become bit operations on the Morton code. Point Cloud Library, CGAL, and the spatial indexes inside game engines like Unreal use linear octrees with Morton encoding because they're half the memory of pointer-based octrees and friendlier to GPU traversal.
Application: geohashing
Geohashes (used by Elasticsearch, Redis Geo, MongoDB 2dsphere) are essentially Morton codes over latitude and longitude, base-32-encoded for string-friendliness. A 12-character geohash specifies a position to ±3.7 cm. Range queries on a B-tree of geohashes return contiguous prefixes — the foundation of fast "find restaurants within 1 km" queries in production databases.
The same trick works in any spatial database: encode (x, y) or (x, y, z) coordinates as a Morton code, build a B-tree on the code, and you get range scans that are mostly local. Postgres's btree_gist extension and SQL Server's spatial indexes both use this strategy.
Morton vs Hilbert vs row-major
| Row-major | Morton (Z-order) | Hilbert | |
|---|---|---|---|
| Encode time | i = y·width + x | ~3 ns (BMI2) | ~20 ns (state machine) |
| Decode time | x = i % w; y = i / w | ~3 ns (BMI2) | ~20 ns |
| Adjacent in 1D ⇒ adjacent in 2D | Same row only | Usually, but jumps | Always |
| Range scan locality | Strips of rows | Z-shape clusters | Smoothly contiguous |
| Used in | Plain 2D arrays | Octrees, GPU swizzle, geohash | R-tree indexing, DB |
| Cache hit rate (bilinear) | ~50% | ~92% | ~95% |
| Code length | 1 line | ~10 lines | ~50 lines |
| Generalizes to n-D | No | Trivially | Yes but complex |
If you need pure encoding speed, pick Morton. If you need range-scan locality, pick Hilbert. If your access pattern is row-strided already, plain row-major wins.
Python implementation
def morton2d_encode(x, y):
"""Interleave bits of x and y. x, y assumed < 2**16."""
def spread(v):
v &= 0xFFFF
v = (v | (v << 8)) & 0x00FF00FF
v = (v | (v << 4)) & 0x0F0F0F0F
v = (v | (v << 2)) & 0x33333333
v = (v | (v << 1)) & 0x55555555
return v
return spread(x) | (spread(y) << 1)
def morton2d_decode(m):
def compact(v):
v &= 0x55555555
v = (v | (v >> 1)) & 0x33333333
v = (v | (v >> 2)) & 0x0F0F0F0F
v = (v | (v >> 4)) & 0x00FF00FF
v = (v | (v >> 8)) & 0x0000FFFF
return v
return compact(m), compact(m >> 1)
# Verify
assert morton2d_encode(3, 5) == 27
assert morton2d_decode(27) == (3, 5)
print("OK")
3D Morton code
Extending to 3D just changes the masks: bit i of x goes to position 3i, bit i of y to 3i+1, bit i of z to 3i+2. The spread function needs to leave two-bit gaps instead of one. Linear octrees and most volumetric renderers use the 3D variant.
uint64_t spread3(uint32_t v) {
uint64_t x = v & 0x1FFFFF;
x = (x | (x << 32)) & 0x1F00000000FFFFULL;
x = (x | (x << 16)) & 0x1F0000FF0000FFULL;
x = (x | (x << 8)) & 0x100F00F00F00F00FULL;
x = (x | (x << 4)) & 0x10C30C30C30C30C3ULL;
x = (x | (x << 2)) & 0x1249249249249249ULL;
return x;
}
uint64_t morton3D(uint32_t x, uint32_t y, uint32_t z) {
return spread3(x) | (spread3(y) << 1) | (spread3(z) << 2);
}
Common pitfalls
- Believing the curve is continuous. Morton is discrete — there's no "in-between" point. Consecutive indices can be diagonal jumps. Don't use Morton for tasks that need adjacency on every step (Hilbert is what you want).
- Non-power-of-two grids. Bit interleaving assumes 2^n × 2^n. For 1080×1920 textures the Morton code "extends" the grid to the next power of two; you must check bounds explicitly or pad.
- Overflow at high resolutions. A 16-bit x and 16-bit y produce a 32-bit Morton code. Going to 24-bit coords needs a 48-bit code — use uint64_t.
- Computing neighbor codes naively. Adding 1 to a Morton code doesn't give the next neighbor in 2D. To increment x by 1, isolate the x-bits, increment, and reinterleave — or use Karras's bit-twiddling shortcut: add 0x55555555 with carries masked to even positions.
- Comparing Morton codes for spatial distance. Morton index difference is not a good distance metric — points near a Z boundary can have large index gaps. Use Euclidean distance on decoded coordinates.
- Forgetting Morton order has different access patterns than row-major. A loop
for i in range(n²): process(image[i])in Morton order processes pixels in Z-shape jumps. The output of an image filter doesn't look natural unless you decode back to (x, y) on every step.
Frequently asked questions
How does the Morton code interleave bits?
Take the binary representation of x and y. Bit i of x becomes bit 2i of the Morton code; bit i of y becomes bit 2i+1. For x = 3 = 0b011 and y = 5 = 0b101, interleaving gives bits (from low to high) y0 x0 y1 x1 y2 x2 = 1 1 0 1 1 0 = 0b011011 = 27. So Morton(3, 5) = 27. The classic fast implementation uses Magic Numbers: a sequence of shifts and masks that spread an n-bit integer into a 2n-bit interleaved layout in O(log n) operations.
Why is it called Z-order?
If you draw the path that consecutive Morton indices trace through a 2D grid — 0, 1, 2, 3, then 4, 5, 6, 7, and so on — the path looks like nested Z shapes. Within each 2×2 cell the Z goes: bottom-left, bottom-right, top-left, top-right. Then the four 2×2 cells of a 4×4 grid are visited in the same Z order. Then the four 4×4 cells in an 8×8 grid. The fractal Z structure repeats at every scale, which is why the curve is also called the Z-order curve or N-order curve.
Why does Morton order give good locality?
Two points that share a long Morton-code prefix sit inside the same small bounding box. Specifically, if Morton(p) and Morton(q) agree on the top 2k bits, then p and q sit inside the same k × k cell. So sorting points by Morton index roughly groups spatially nearby points. The grouping isn't perfect — adjacent points across a cell boundary can have wildly different Morton codes (this is the "jump" problem the Hilbert curve fixes) — but it's good enough that GPU texture cache hit rates improve dramatically when texels are stored in Morton order rather than row-major.
What is a linear octree?
A linear octree stores 3D spatial data not as a tree of nodes but as a sorted array of Morton codes. Each leaf cell has a 3-bit-per-level code identifying its path from the root. The whole tree is just the sorted list of codes for occupied cells. Range queries become binary searches; neighbor queries become a few bit operations on the code. Linear octrees use roughly half the memory of pointer-based octrees and are friendly to GPU traversal. PCL (Point Cloud Library), CGAL, and most game-engine spatial indexes use linear octrees with Morton-order encoding.
Morton vs Hilbert curve — which is better?
Hilbert has strictly better locality: any two consecutive indices on a Hilbert curve are spatially adjacent. Morton has jumps — the index after 3 in a 4×4 grid is 4, but those two cells are at opposite corners. The trade-off is encoding cost: Morton's interleave-bits encoding takes 10 ns on modern x86 (using PDEP/PEXT instructions), while Hilbert encoding needs a state machine with ~20 ns per coordinate pair. For workloads dominated by indexing (GPU textures, batch geometry), Morton wins on speed. For workloads dominated by range scans (geospatial databases), Hilbert wins on cache hit rate.
Can Morton codes be computed branchlessly?
Yes — and the standard implementation is a sequence of bitwise shifts and masks. For 32-bit input x: x = (x | (x << 16)) & 0x0000FFFF0000FFFF, x = (x | (x << 8)) & 0x00FF00FF00FF00FF, then continuing with shifts of 4, 2, 1 and the appropriate masks 'spreads' the bits with one zero gap between each. Repeat for y and OR the two together (with y shifted by one). On x86 with BMI2 the PDEP instruction does the spread in a single instruction. The whole 64-bit Morton encode is ~3 ns on Skylake.