Memory Allocation
Buddy Allocator
Split on alloc, merge on free — power-of-two blocks the kernel can coalesce in O(1)
A buddy allocator manages memory as power-of-two blocks. Splits halve blocks on alloc; frees merge sibling buddies — fast, bounded-fragmentation page allocation.
- Block sizes2^0 page .. 2^(MAX_ORDER-1) pages
- Linux default range4 KB to 4 MB (MAX_ORDER = 11)
- Alloc costO(log MAX_ORDER) splits
- Free costO(log MAX_ORDER) coalesces
- Buddy addressX XOR (1 << order)
- Internal fragmentationBounded, < 2× requested
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How a buddy allocator works
Think of memory as a single block of 2^N pages. The buddy allocator keeps one free list per "order" — order 0 is one page (4 KB), order 1 is two contiguous pages, order 10 is 1024 contiguous pages (4 MB). At first only the topmost order is populated with one big block; all other lists are empty.
An allocation for K pages rounds up to the smallest order k with 2^k ≥ K, then asks for a block at that order. If the order-k list is empty, the allocator looks at order k+1: take one block, split it in half, push the right half onto the order-k free list, and recurse with the left half (which may itself need splitting). On free, the allocator computes the buddy address — for a block at offset X and order k, the buddy lives at offset X XOR (1 << k) — and checks whether the buddy is also free. If yes, coalesce into an order-(k+1) block and retry the merge one level up. This walk takes at most MAX_ORDER steps.
order block size sample state before alloc(16 pages)
10 1024 pages [ ▣ one free 1024-page block ]
9 512 pages [ ]
8 256 pages [ ]
7 128 pages [ ]
6 64 pages [ ]
5 32 pages [ ]
4 16 pages [ ] ← target order
3 8 pages
...
Asking for 16 pages (order 4):
Step 1 (split 1024 → 512+512): o10 [ ] o9 [ ▣ ▢ ]
Step 2 (split 512 → 256+256): o9 [ ▣ ] o8 [ ▢ ▣ ]
Step 3 (split 256 → 128+128): o8 [ ▣ ] o7 [ ▢ ▣ ]
Step 4 (split 128 → 64 +64 ): o7 [ ▣ ] o6 [ ▢ ▣ ]
Step 5 (split 64 → 32 +32 ): o6 [ ▣ ] o5 [ ▢ ▣ ]
Step 6 (split 32 → 16 +16 ): o5 [ ▣ ] o4 [ ▢ ▣ ]
Allocate the left 16-page block (▢ becomes "in use").
Result: kept right-half blocks free at o5, o6, o7, o8, o9; one o4 in use.
The same 6 splits run for any first allocation up to 16 pages — afterward
there are free blocks at every order so subsequent allocs are cheap.
On free, the same path is walked backwards. A handed-back order-4 block whose buddy is still free coalesces with it into an order-5; if that order-5's buddy is also free, into an order-6; and so on. Eventually the structure returns to one giant block at order 10.
The moving pieces
Free-area lists. One doubly-linked list per order. Linux's struct free_area array lives inside each zone — the buddy is per-zone, not global.
Order metadata. Each page's order is stored in page->private when the page is a free buddy block. Knowing the order is what makes buddy XOR-computation possible.
Migrate type per block. Free blocks are further classified — UNMOVABLE, MOVABLE, RECLAIMABLE, CMA — so the compactor and reclaim can act selectively. The same buddy structure runs once per migrate type.
Allocation context. GFP flags decide which zone the buddy runs in (zones partition physical memory by purpose). The buddy itself is identical across zones; the zone is just the address range it manages.
Why a buddy allocator
- Bounded fragmentation. The worst-case allocation wastes less than half the block — a request for 65 pages becomes an order-7 (128-page) block, 63 pages wasted internally. Smaller requests waste correspondingly less, and the waste returns to the free list on free.
- O(1) buddy address. Coalescing is a single XOR, not a search through a free tree. Free is genuinely cheap.
- Physically contiguous blocks. By construction, every block the buddy hands out is physically contiguous — exactly what DMA, hugepages, and the slab allocator need.
- Compaction-friendly. A page can be moved and the buddy still tracks its new free state correctly; coalescing simply happens when neighbors line up.
Buddy allocators lose for many small (sub-page) requests, where slab is better; for very-large allocations that exceed MAX_ORDER, where CMA or hugetlb reserve memory in advance; and for workloads where coalescing never catches up (constant churn of mid-order blocks at very high rate).
Buddy vs other memory allocators
| Buddy | Slab | First-fit / best-fit | Segregated free lists | Bump allocator | Region (arena) | |
|---|---|---|---|---|---|---|
| Granularity | 2^k pages | Fixed object size | Variable bytes | Discrete size classes | Bytes (no free) | Bytes, bulk free |
| Alloc cost | O(log MAX_ORDER) | O(1) pop | O(N) search | O(1) per class | O(1) pointer bump | O(1) pointer bump |
| Free cost | O(log MAX_ORDER) merge | O(1) push | O(N) coalesce | O(1) push | Not supported | O(1) region reset |
| Internal frag | Up to ~50% | Near zero | Low | Class rounding | None | None |
| External frag | From split history | None | Common | Common | N/A | N/A |
| Used by | Linux page allocator, FreeBSD vm_phys | kernel objects, jemalloc tcache | Old C malloc | tcmalloc, mimalloc | Compilers, parsers | Game engines, Erlang heaps |
The buddy's signature property is the symmetric structure of alloc and free: both are recursive walks across orders bounded by MAX_ORDER, and both rely on the same XOR-based sibling lookup. Few other allocators have such a clean reversibility.
What the numbers actually look like
- Orders 2^0 to 2^10 in Linux default. MAX_ORDER = 11 means free-area arrays span order 0 (4 KB) to order 10 (4 MB). Some configs bump to MAX_ORDER = 12 or higher for 1 GB gigantic-page support.
- Internal fragmentation bounded by 2×. Worst case: request 2^k + 1 pages, get 2^(k+1) — half the block is wasted. Across millions of allocations of varied sizes the average waste is well under 50% and is fully reclaimable on free.
- Coalesce walk < log₂(memory) steps. Even on a 1 TB-physical-memory NUMA box, MAX_ORDER bounded by zone size limits each free to under 20 buddy checks — microseconds at worst.
- /proc/buddyinfo shows per-order free counts. Healthy zones show non-zero counts across many orders. Fragmented zones cluster everything at order 0 — symptom of failed high-order allocations and a trigger for kcompactd.
- External fragmentation forces compaction. Allocations >= order 3 occasionally fail under high uptime; the kernel runs
vm.compaction_proactiveness-driven sweeps to relocate movable pages and reform high-order blocks. - Pages per second in steady state. A loaded Linux server allocates and frees tens of millions of pages per second through the buddy — measurable via
/proc/vmstat'spgalloc_*andpgfreecounters.
JavaScript implementation
// A power-of-two buddy allocator. Pages addressed by index 0..(1<<MAX)-1.
// Each order keeps a Set of free block start-offsets.
class BuddyAllocator {
constructor(maxOrder = 10) { // 1024 pages → 4 MB if 4 KB pages
this.maxOrder = maxOrder;
this.free = [];
for (let k = 0; k <= maxOrder; k++) this.free.push(new Set());
this.free[maxOrder].add(0); // one big block at startup
this.orderOf = new Map(); // allocated start → order
}
_splitDown(order, targetOrder) {
// Pop a block from `order`, split until reaching targetOrder.
while (order > targetOrder) {
const startIter = this.free[order].values().next();
const start = startIter.value;
this.free[order].delete(start);
const half = 1 << (order - 1);
this.free[order - 1].add(start); // left half goes free
this.free[order - 1].add(start + half); // right half goes free
order--;
}
}
alloc(pages) {
const order = Math.max(0, Math.ceil(Math.log2(Math.max(1, pages))));
if (order > this.maxOrder) return null;
// Find smallest order with a free block ≥ requested order.
let k = order;
while (k <= this.maxOrder && this.free[k].size === 0) k++;
if (k > this.maxOrder) return null; // OOM at this size
this._splitDown(k, order);
const startIter = this.free[order].values().next();
const start = startIter.value;
this.free[order].delete(start);
this.orderOf.set(start, order);
return { start, order, size: 1 << order };
}
free_(block) {
let { start, order } = block;
while (order < this.maxOrder) {
const buddy = start ^ (1 << order); // XOR gives sibling address
if (!this.free[order].has(buddy)) break; // buddy busy → stop merging
this.free[order].delete(buddy);
start = Math.min(start, buddy); // lower address survives
order++; // merged block lives one order up
}
this.free[order].add(start);
this.orderOf.delete(block.start);
}
buddyinfo() { // mimic /proc/buddyinfo
return this.free.map((s, k) => ({ order: k, count: s.size }));
}
}
// Use:
const ba = new BuddyAllocator(10);
const a = ba.alloc(16); // 16 pages → order 4 → splits down from order 10
const b = ba.alloc(64); // 64 pages → order 6 → reuses split tree
ba.free_(a); // free order 4; coalesces with right buddy if free
console.log(ba.buddyinfo());
Two lines do all the structural work: start ^ (1 << order) computes the buddy, and Math.min(start, buddy) picks which address represents the merged larger block. The rest is just bookkeeping.
Python implementation — with migrate types
from math import log2, ceil
from collections import defaultdict
from typing import Optional
MOVABLE, UNMOVABLE, RECLAIMABLE = 'movable', 'unmovable', 'reclaimable'
class Buddy:
"""Power-of-two buddy allocator with per-migrate-type free lists."""
def __init__(self, max_order: int = 10) -> None:
self.max_order = max_order
# free[order][migrate_type] = set of start offsets
self.free: list[dict[str, set[int]]] = [
defaultdict(set) for _ in range(max_order + 1)
]
self.free[max_order][MOVABLE].add(0)
# metadata for allocated blocks
self.allocated: dict[int, tuple[int, str]] = {}
@staticmethod
def _order_for(pages: int) -> int:
return max(0, ceil(log2(max(1, pages))))
def _split_down(self, order: int, target: int, mt: str) -> int:
"""Take one block at `order` (of migrate type `mt`), split to `target`. Return start."""
start = self.free[order][mt].pop()
while order > target:
half = 1 << (order - 1)
self.free[order - 1][mt].add(start + half) # right half free
order -= 1
return start
def alloc(self, pages: int, mt: str = MOVABLE) -> Optional[tuple[int, int]]:
target = self._order_for(pages)
if target > self.max_order: return None
# Walk orders looking for a free block of migrate type `mt`.
for k in range(target, self.max_order + 1):
if self.free[k][mt]:
start = self._split_down(k, target, mt)
self.allocated[start] = (target, mt)
return start, target
# Fallback: steal from another migrate type at a high order, mark it mt.
for k in range(target, self.max_order + 1):
for src in (UNMOVABLE, RECLAIMABLE, MOVABLE):
if self.free[k][src]:
start = self.free[k][src].pop()
while k > target:
half = 1 << (k - 1)
self.free[k - 1][mt].add(start + half)
k -= 1
self.allocated[start] = (target, mt)
return start, target
return None
def free_(self, start: int) -> None:
order, mt = self.allocated.pop(start)
while order < self.max_order:
buddy = start ^ (1 << order)
if buddy not in self.free[order][mt]:
break
self.free[order][mt].remove(buddy)
start = min(start, buddy)
order += 1
self.free[order][mt].add(start)
def buddyinfo(self) -> list[dict[str, int]]:
return [
{'order': k, 'movable': len(self.free[k][MOVABLE]),
'unmovable': len(self.free[k][UNMOVABLE]),
'reclaimable': len(self.free[k][RECLAIMABLE])}
for k in range(self.max_order + 1)
]
The migrate-type axis is what makes Linux's buddy compaction-friendly. Free movable blocks coalesce with movable buddies; the compactor can relocate movable pages to free up neighbors of unmovable ones; the buddy's XOR identity continues to work because the address arithmetic is independent of the migrate-type bookkeeping.
Variants and real-world deployments
Knuth's original buddy system (1968). The Art of Computer Programming, vol. 1, describes the classic algorithm: power-of-two free lists, XOR-buddy lookup, recursive split/merge.
Linux page allocator. The canonical production buddy. Per-zone, per-migrate-type, integrated with the per-CPU pageset cache, kcompactd defragmentation, and CMA reservations.
FreeBSD vm_phys. Multi-segment buddy with NUMA domain awareness. Different segmentation strategy from Linux zones but identical buddy mechanics.
seL4 untyped memory. A capability-based microkernel where untyped (raw physical) memory is allocated and retyped into kernel objects. The retype operation is buddy-like — split or coalesce capability ranges.
jemalloc extents. Userspace allocator with buddy-style coalescing for its arena-level extent management. Tracks dirty/clean extents and madvise-releases dirty pages periodically.
RTOS allocators (FreeRTOS heap_4). Embedded systems often ship a buddy or buddy-flavored allocator because bounded coalescing latency matters for hard real-time deadlines.
Hugepage subsystem. Hugepages are pre-reserved order-9 (2 MB) or order-18 (1 GB) blocks. Either come from the buddy at boot or are carved out of CMA — both lean on the buddy's contiguity guarantees.
Common bugs and edge cases
- Failure to find an order despite plenty of memory. Classic external-fragmentation hit: hundreds of MB free but all at order 0. Symptom:
__alloc_pagesfailure for order 4+. Fix: compaction or proactive defrag. - Double-free corruption. Freeing the same block twice puts it on the order's free list twice, and the next alloc returns it to two callers. Linux uses
page->flagsandpage->privatesanity checks; production builds keep some checks even outside DEBUG. - Cross-migrate-type pollution. A misbehaving allocator hands out a movable block as if it were unmovable; later, the compactor refuses to migrate it. Long-term: high-order allocations never succeed even though aggregate free is huge.
- MAX_ORDER ceiling for 1 GB pages. Standard MAX_ORDER = 11 cannot express a 1 GB block. Gigantic hugepages live in a parallel reservation pool (hugetlbfs) rather than going through the buddy.
- NUMA-cross coalesce attempts. Each zone has its own buddy; coalescing across zone boundaries is forbidden, even if addresses look adjacent. Bugs that drop this check have created cross-zone "buddies" that crashed the kernel.
- Order-0 thrash. A workload that churns single-page allocs through the buddy at high rate beats the per-CPU pageset cache and contends on the zone lock. Symptom: ~30% CPU in
__rmqueue. - Off-by-one MAX_ORDER. Some allocations sit at exactly MAX_ORDER-1 (the largest representable buddy block). Code that asks for "MAX_ORDER pages" instead of "(1 << (MAX_ORDER-1)) pages" silently fails — a perennial papercut.
Frequently asked questions
Why power-of-two blocks?
Power-of-two sizes have an O(1) buddy address: the buddy of block at offset X size 2^k is the block at offset X XOR 2^k. Coalescing on free reduces to a bit flip and a list-membership check. Any other size would require interval-tree lookups for sibling discovery. The cost is internal fragmentation — a 17-page request rounds up to 32 pages — but the rounding is bounded and the bookkeeping is constant-time.
How does coalescing actually work?
On free, the allocator XORs the freed block's address with its size to compute its buddy's address. It then checks whether the buddy is on the free list of the same order. If yes: remove the buddy from its free list, take the lower address of the two as the new block of order k+1, and recurse — the new larger block may itself have a free buddy. The recursion terminates when buddies are not both free, or when the maximum order is reached. The whole walk is O(log MAX_ORDER), so under 20 steps even for huge memories.
What is MAX_ORDER in Linux?
MAX_ORDER is the largest power-of-two block the buddy allocator manages. Default Linux: MAX_ORDER = 11, so orders 0..10 cover 2^0 = 1 page (4 KB) through 2^10 = 1024 pages (4 MB). Some architectures bump it up to support gigantic hugepages contiguously. Allocations larger than 2^(MAX_ORDER-1) pages have to come from CMA, hugetlb, or contiguous-memory carve-outs because the buddy refuses to express them as a single block.
What is the difference between a buddy allocator and slab?
Buddy hands out variable-power-of-two contiguous page blocks; slab hands out fixed-size objects. The two stack: a slab cache refills itself by requesting one or more pages from the buddy (typically a single page, sometimes order-1 or order-2 for larger objects). Slab amortizes the buddy's split/merge work across many objects of the same size, and the buddy in turn makes sure each slab gets a physically contiguous run.
What is external fragmentation in a buddy allocator?
External fragmentation appears when the free list at some order is large but every entry is split across non-mergeable buddies. /proc/buddyinfo is the diagnostic: a healthy zone has counts at multiple orders; a fragmented one has thousands at order 0 and zero at order 6+. To allocate an order-9 huge page in this state, the kernel must run the compactor to physically move movable pages out of the way, coalesce released buddies, and re-form a large contiguous block.
How does Linux track buddies?
Each zone keeps an array of free_area structures, one per order. Each free_area holds a list of free blocks at that order, broken out by migrate type (UNMOVABLE, MOVABLE, RECLAIMABLE, CMA, etc.). The struct page for each first-page-of-block holds the order in page->private and the migrate type in page flags. Coalescing checks the buddy's page->private to confirm it is also a free block of the same order before merging.