Memory Allocation

Zone Allocator

Partition physical memory by purpose — DMA, normal, movable — then allocate from the right pool

A zone allocator divides physical RAM into hardware-purpose ranges — DMA, DMA32, normal, highmem, movable — each with its own free-page lists, watermarks, and fallback order. Linux's page allocator runs on top of it.

  • Zones in Linux: DMA, DMA32, NORMAL, HIGHMEM, MOVABLE, DEVICE
  • ZONE_DMA range: First 16 MB (ISA legacy)
  • ZONE_DMA32 range: First 4 GB
  • Per-CPU cache: Lockless order-0 allocations
  • Watermarks: MIN, LOW, HIGH
  • NUMA-aware: Zonelist interleaves nodes by distance

How a zone allocator works

Physical memory looks uniform from a 64-bit programmer's perspective. It isn't. An old PCI card may only DMA into the first 4 GB. An ISA card may only reach the first 16 MB. On 32-bit Linux, the kernel can only directly map the low ~896 MB at any moment. Hugepage allocation needs physically contiguous megabytes. A NUMA box has separate banks per socket with different access latencies. The zone allocator is the abstraction Linux uses to encode all of those constraints into one structure: "from which range of physical memory can this request be served?"

Physical memory layout (x86_64, one NUMA node)
0           16MB         4GB                            top
│  ZONE_DMA  │ ZONE_DMA32 │      ZONE_NORMAL            │
                                  └─[ZONE_MOVABLE]──────┘  (optional)

Each zone holds:
  • Free-page lists per order (0..MAX_ORDER-1; with the classic default MAX_ORDER=11 the largest block is order 10, i.e. 4 MB with 4 KB pages)
  • Per-CPU pagesets (lockless order-0 caches per CPU)
  • Watermarks (MIN, LOW, HIGH)
  • LRU lists (active/inactive) used by reclaim
  • Statistics counters (free pages, allocated, reclaimed)

An allocation request arrives with GFP (get-free-pages) flags such as GFP_KERNEL, GFP_DMA, GFP_DMA32, GFP_HIGHUSER_MOVABLE. Those flags select the highest zone the request may use. The zonelist then gives a fallback order: prefer the local NUMA node, prefer that zone, then fall back to lower (more constrained) zones and more distant nodes. The allocator walks the zonelist, watermark-checks each candidate, and on the first one with enough free pages delegates to the buddy allocator for the actual order-N block split.
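The selection step above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the flag strings, `ZoneIdx`, `HIGHEST_ZONE`, `fallback_order`, and `pick_zone` are names invented for this sketch, real GFP masks are bit flags rather than strings, and NUMA node ordering is left out.

```python
from enum import IntEnum
from typing import Optional


class ZoneIdx(IntEnum):
    """Toy subset of zones, ordered from most to least constrained."""
    DMA = 0
    DMA32 = 1
    NORMAL = 2
    MOVABLE = 3


# Hypothetical mapping for this sketch: flag name -> highest zone it may use.
HIGHEST_ZONE = {
    "GFP_DMA": ZoneIdx.DMA,
    "GFP_DMA32": ZoneIdx.DMA32,
    "GFP_KERNEL": ZoneIdx.NORMAL,
    "GFP_HIGHUSER_MOVABLE": ZoneIdx.MOVABLE,
}


def fallback_order(gfp: str) -> list[ZoneIdx]:
    """Zones to try: the highest permitted zone first, then downward."""
    top = HIGHEST_ZONE[gfp]
    return [ZoneIdx(i) for i in range(top, -1, -1)]


def pick_zone(gfp: str, free: dict, wmark: dict, pages: int) -> Optional[ZoneIdx]:
    """Return the first zone that stays at or above its watermark after the allocation."""
    for zone in fallback_order(gfp):
        if free[zone] - pages >= wmark[zone]:
            return zone
    return None  # every candidate is too low; the caller would have to reclaim
```

Fallback only ever moves toward lower zones: a GFP_DMA request never sees ZONE_NORMAL, while a GFP_KERNEL request may end up in ZONE_DMA32 when ZONE_NORMAL is short.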

The moving parts of a Linux zone

Free-page lists. Each zone holds MAX_ORDER doubly-linked lists, one per power-of-two block size. Order 0 is one page (4 KB). Order 1 is two contiguous pages. Order 10 is 1024 pages (4 MB). The buddy allocator splits a higher-order block when no lower-order block is free, and merges buddies on free.
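The split-and-merge behaviour can be illustrated with a minimal buddy allocator. This is a sketch, not the kernel's code: the `Buddy` class and its method names are made up here, free lists are plain Python sets, and the managed range is assumed to be a multiple of the largest block size.

```python
MAX_ORDER = 11  # orders 0..10, as in the text above


class Buddy:
    def __init__(self, total_pages: int = 1 << (MAX_ORDER - 1)) -> None:
        # free_lists[order] holds the first pfn of each free block of 2**order pages.
        self.free_lists = [set() for _ in range(MAX_ORDER)]
        top = MAX_ORDER - 1
        # Assumption: total_pages is a multiple of the largest block size.
        for pfn in range(0, total_pages, 1 << top):
            self.free_lists[top].add(pfn)

    def alloc(self, order: int):
        """Return the first pfn of a 2**order block, or None if nothing fits."""
        for cur in range(order, MAX_ORDER):
            if not self.free_lists[cur]:
                continue
            pfn = min(self.free_lists[cur])
            self.free_lists[cur].remove(pfn)
            # Split down to the requested order, freeing the upper half each time.
            while cur > order:
                cur -= 1
                self.free_lists[cur].add(pfn + (1 << cur))
            return pfn
        return None

    def free(self, pfn: int, order: int) -> None:
        """Free a block and merge it with its buddy for as long as the buddy is free."""
        while order < MAX_ORDER - 1:
            buddy = pfn ^ (1 << order)  # the buddy differs only in bit `order`
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            pfn = min(pfn, buddy)
            order += 1
        self.free_lists[order].add(pfn)
```

Allocating a single page from a fresh 1024-page range leaves one free block at each of orders 0 through 9; freeing that page merges everything back into one order-10 block.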

Per-CPU pagesets. A small per-CPU cache of order-0 pages (default high watermark ~32-512 pages per CPU). Allocation pulls from the local cache without a zone lock; only batch refills hit the lock. This is the difference between linear-scaling and contention-bound allocators under many-core load.
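A rough model of the batching idea, assuming made-up names (`ToyZone`, `PerCpuPageset`) and illustrative `batch` and `high` values rather than kernel defaults. The counter stands in for zone-lock acquisitions so the effect of batching is visible.

```python
from collections import deque


class ToyZone:
    def __init__(self, pages: int) -> None:
        self.free = deque(range(pages))
        self.lock_acquisitions = 0  # how often the "zone lock" was taken

    def take_batch(self, n: int) -> list:
        self.lock_acquisitions += 1
        return [self.free.popleft() for _ in range(min(n, len(self.free)))]

    def give_batch(self, pfns: list) -> None:
        self.lock_acquisitions += 1
        self.free.extend(pfns)


class PerCpuPageset:
    """Order-0 page cache owned by one CPU (illustrative sizes)."""

    def __init__(self, zone: ToyZone, batch: int = 31, high: int = 186) -> None:
        self.zone, self.batch, self.high = zone, batch, high
        self.cache: list = []

    def alloc_page(self):
        if not self.cache:
            # Cache miss: refill a whole batch under one lock acquisition.
            self.cache = self.zone.take_batch(self.batch)
        return self.cache.pop() if self.cache else None

    def free_page(self, pfn: int) -> None:
        self.cache.append(pfn)
        if len(self.cache) > self.high:
            # Over the high mark: drain one batch back to the zone.
            drained, self.cache = self.cache[:self.batch], self.cache[self.batch:]
            self.zone.give_batch(drained)
```

With these values, 62 single-page allocations touch the zone lock twice instead of 62 times, and freeing those pages back stays entirely in the local cache.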

Watermarks. Three thresholds — MIN, LOW, HIGH — per zone. Free count below LOW wakes kswapd to reclaim asynchronously. Below MIN, the allocator stalls and runs direct reclaim in the calling task. Above HIGH, reclaim is idle.
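The three thresholds amount to a small decision function. The sketch below follows the description above; `ReclaimAction`, `reclaim_action`, and `kswapd_should_sleep` are hypothetical names, and the real allocator has more cases (for example, flags that forbid blocking).

```python
from enum import Enum


class ReclaimAction(Enum):
    NONE = "none"                      # enough free memory, nothing to do
    WAKE_KSWAPD = "wake kswapd"        # background reclaim, caller not blocked
    DIRECT_RECLAIM = "direct reclaim"  # caller reclaims before getting pages


def reclaim_action(free_pages: int, wmark_min: int, wmark_low: int) -> ReclaimAction:
    """Classify what an allocation attempt triggers at the current free count."""
    if free_pages < wmark_min:
        return ReclaimAction.DIRECT_RECLAIM
    if free_pages < wmark_low:
        return ReclaimAction.WAKE_KSWAPD
    return ReclaimAction.NONE


def kswapd_should_sleep(free_pages: int, wmark_high: int) -> bool:
    """Once woken, background reclaim keeps going until free pages reach HIGH."""
    return free_pages >= wmark_high
```

The gap between LOW and HIGH gives hysteresis: kswapd starts below LOW but only stops at HIGH, so it does not flap on every allocation.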

Migration types. Inside each zone, free-page lists are further split by migrate type: UNMOVABLE (kernel pinned), RECLAIMABLE (caches), MOVABLE (user pages, page cache), CMA. Keeping movable and unmovable pages separated is what makes compaction productive — defragmentation only succeeds in the MOVABLE migration list.
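A toy version of per-migrate-type free lists, restricted to order-0 pages. The `FALLBACKS` table and the `MigrateLists` class are assumptions made for this sketch (CMA is omitted); the point is only that a request is served from its own type first and steals from another type when its own list is empty.

```python
# Assumed fallback preferences for this sketch; CMA is left out.
FALLBACKS = {
    "UNMOVABLE": ["RECLAIMABLE", "MOVABLE"],
    "RECLAIMABLE": ["UNMOVABLE", "MOVABLE"],
    "MOVABLE": ["RECLAIMABLE", "UNMOVABLE"],
}


class MigrateLists:
    def __init__(self) -> None:
        # One free list per migrate type.
        self.free = {t: [] for t in FALLBACKS}

    def add_free(self, migratetype: str, pfns) -> None:
        self.free[migratetype].extend(pfns)

    def alloc(self, migratetype: str):
        """Return (pfn, source_type), or None when every list is empty.

        source_type differs from the requested type when the allocation
        had to fall back to another list.
        """
        for source in [migratetype, *FALLBACKS[migratetype]]:
            if self.free[source]:
                return self.free[source].pop(), source
        return None
```

Every fallback from UNMOVABLE into the MOVABLE list plants a page that compaction cannot relocate inside an otherwise movable region, which is why keeping the types separated matters.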

Why a zone allocator is necessary

  • Hardware constraints. DMA-capable devices have addressability limits. A 32-bit PCI card cannot scatter into pages above 4 GB. The zone allocator guarantees the kernel can hand out a low-physical page when an old device asks.
  • Reclaim priority. Highmem and movable pages are cheap to reclaim — they can be swapped or evicted. Normal-zone pages are precious because the kernel's direct map relies on them. Per-zone watermarks let reclaim aggressively pressure highmem before touching normal.
  • Huge-page enablement. Transparent huge pages need physically contiguous 2 MB extents. ZONE_MOVABLE plus the compactor collect movable pages so that the buddy allocator can stitch together higher-order blocks even after years of uptime.
  • NUMA locality. Each node has its own zones. Allocations default to local; cross-socket pages cost 30-100 percent more access latency. The zonelist encodes this preference.

Zone allocator vs other allocator layers

                 | Zone allocator           | Buddy allocator       | Slab/SLUB/SLOB        | vmalloc                    | kmalloc             | Userspace malloc
Granularity      | Whole zones (ranges)     | Power-of-2 pages      | Objects (cache-sized) | Virtually contiguous pages | ≤8 KB objects       | Bytes
Picks zone       | Yes                      | No (lives inside one) | No (lives inside one) | No                         | Via slab            | N/A
Output           | Free-page list reference | 2^k page block        | Object pointer        | Virtual address range      | Kernel pointer      | User pointer
Lock granularity | Per-zone                 | Per-zone-order        | Per-cache             | Global vmap area           | Inherited from slab | Per-thread arena
NUMA-aware       | Yes                      | Via zone              | Yes (per-node cache)  | Indirectly                 | Via slab            | Glibc malloc: yes
Layer in Linux   | mm/page_alloc.c          | mm/page_alloc.c       | mm/slub.c             | mm/vmalloc.c               | mm/slab_common.c    | libc

In practice these layers stack. kmalloc(64) → slab cache → buddy refill at order 0 → zone selected by GFP flags → per-CPU pageset hit. The zone allocator is the layer that maps a high-level "I need clean kernel memory" into a specific physical range with the right hardware properties.

What the numbers actually look like

  • Zone enumeration. Linux defines up to six zones: ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM (32-bit only), ZONE_MOVABLE, ZONE_DEVICE. On a typical 64-bit server: DMA (~16 MB), DMA32 (~4 GB), NORMAL (the rest), optionally MOVABLE carved out by kernelcore= or movablecore=.
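  • Watermarks are a small fraction of zone size. min_free_kbytes defaults to roughly 4 × sqrt(lowmem in KB), which works out to about 16 MB on a 16 GB box (around 0.1% of memory); watermark_scale_factor = 10 means LOW is ~0.1% of zone size above MIN, and HIGH is ~0.2% above MIN.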
  • Per-CPU pagelist high. Default 6× batch size. Tuned via vm.percpu_pagelist_high_fraction. A 96-core box with 256 pages per CPU caches roughly 25,000 pages (~100 MB) before touching the zone lock.
  • Zonelist size. On an 8-node NUMA box with 4 zones each, the zonelist has 32 entries — every zone of every node in distance order. This is what allocators walk during direct reclaim.
  • Fragmentation telemetry. /proc/buddyinfo shows per-zone free counts at each order; /proc/pagetypeinfo breaks them out further by migrate type. A long-running kernel often shows most free pages stuck at order 0, evidence that compaction needs to run.
  • NUMA penalty. Remote-node access is 1.3-2× slower; cross-zone fallback within a node is free. The zonelist is built to prefer local-node fallback over remote-same-zone.
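To read the fragmentation telemetry mentioned above, a small parser is enough. The sketch assumes the usual /proc/buddyinfo line shape ("Node N, zone NAME" followed by one free-block count per order); the `SAMPLE` text is synthetic data written for this example, and the function names are made up.

```python
# Synthetic sample in the usual /proc/buddyinfo layout (not real measurements).
SAMPLE = """\
Node 0, zone      DMA      1      1      0      0      2      1      1      0      1      1      3
Node 0, zone   Normal    512    128     16      4      0      0      0      0      0      0      0
"""


def parse_buddyinfo(text: str) -> dict:
    """Map (node, zone) -> list of free-block counts, indexed by order."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue  # skip anything that does not look like a zone line
        node = int(parts[1].rstrip(","))
        result[(node, parts[3])] = [int(x) for x in parts[4:]]
    return result


def free_pages_by_order(counts: list) -> list:
    """Convert block counts into page counts: a block of order k holds 2**k pages."""
    return [count << order for order, count in enumerate(counts)]


def high_order_share(counts: list, min_order: int = 9) -> float:
    """Fraction of free pages sitting in blocks of at least min_order."""
    pages = free_pages_by_order(counts)
    total = sum(pages)
    return sum(pages[min_order:]) / total if total else 0.0
```

On a live system the input would come from `open("/proc/buddyinfo").read()`. A zone whose share of high-order pages is near zero, like the synthetic Normal zone here, has free memory but little of it usable for huge pages.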

JavaScript implementation (sketch)

// A toy zone allocator inspired by Linux. Each zone has free-page lists and
// watermarks. Allocation picks a zone by GFP flag, falls back on shortage.

const GFP = {
  DMA:    'DMA',
  DMA32:  'DMA32',
  NORMAL: 'NORMAL',
  MOVABLE: 'MOVABLE',
};

class Zone {
  constructor(name, startPfn, endPfn) {
    this.name = name;
    this.startPfn = startPfn;            // page-frame numbers [start, end)
    this.endPfn = endPfn;
    this.freePages = new Set();          // pfns currently free
    for (let p = startPfn; p < endPfn; p++) this.freePages.add(p);
    const total = endPfn - startPfn;
    this.wmark = { min: total * 0.005, low: total * 0.01, high: total * 0.02 };
  }
  freeCount() { return this.freePages.size; }
  okToAllocate(n) { return this.freeCount() - n >= this.wmark.min; }
  alloc(n) {
    if (!this.okToAllocate(n)) return null;
    const pfns = [];
    for (const p of this.freePages) {
      if (pfns.length === n) break;      // check first so n = 0 takes nothing
      pfns.push(p);
    }
    for (const p of pfns) this.freePages.delete(p);
    return pfns;
  }
  free(pfns) { for (const p of pfns) this.freePages.add(p); }
}

class ZoneAllocator {
  constructor() {
    // Pretend physical layout: pfn 0..3 = DMA, 4..63 = DMA32, 64..767 = NORMAL,
    // 768..1023 = MOVABLE. Ranges must not overlap, or a pfn could be handed out twice.
    this.zones = {
      DMA:    new Zone('ZONE_DMA',    0,    4),
      DMA32:  new Zone('ZONE_DMA32',  4,    64),
      NORMAL: new Zone('ZONE_NORMAL', 64,   768),
      MOVABLE: new Zone('ZONE_MOVABLE', 768, 1024),
    };
    // Per-flag zonelist (preferred → fallback).
    this.zonelist = {
      DMA:    [this.zones.DMA],
      DMA32:  [this.zones.DMA32, this.zones.DMA],
      NORMAL: [this.zones.NORMAL, this.zones.DMA32, this.zones.DMA],
      MOVABLE: [this.zones.MOVABLE, this.zones.NORMAL, this.zones.DMA32],
    };
  }
  allocPages(n, flag = GFP.NORMAL) {
    for (const zone of this.zonelist[flag]) {
      const pages = zone.alloc(n);
      if (pages) return { zone: zone.name, pfns: pages };
    }
    throw new Error('OOM');
  }
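  // allocPages() returns the zone's display name (e.g. 'ZONE_DMA'), so look it up by name.
  freePages(zoneName, pfns) {
    const zone = Object.values(this.zones).find((z) => z.name === zoneName);
    if (!zone) throw new Error(`unknown zone: ${zoneName}`);
    zone.free(pfns);
  }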
  status() {
    return Object.fromEntries(
      Object.entries(this.zones).map(([k, z]) => [k, z.freeCount()])
    );
  }
}

const za = new ZoneAllocator();
za.allocPages(2, GFP.DMA);       // pulled from ZONE_DMA (asking for all 4 pages would breach the min watermark and throw OOM)
za.allocPages(8, GFP.DMA32);     // pulled from ZONE_DMA32 (or fallback DMA)
za.allocPages(64, GFP.NORMAL);   // pulled from ZONE_NORMAL

The real Linux page allocator buries this logic in __alloc_pages and walks a more nuanced zonelist that combines NUMA node distance with zone fallback order — but the structure is the same: per-zone free lists, watermark gating, preferred-then-fallback selection.

Python implementation — NUMA-aware zonelist

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Zone:
    name: str
    node_id: int
    start_pfn: int
    end_pfn: int
    free: set = field(default_factory=set)

    def __post_init__(self) -> None:
        self.free = set(range(self.start_pfn, self.end_pfn))
        total = self.end_pfn - self.start_pfn
        self.wmark_min = int(total * 0.005)
        self.wmark_low = int(total * 0.01)
        self.wmark_high = int(total * 0.02)

    def free_count(self) -> int: return len(self.free)

    def alloc(self, n: int) -> Optional[list[int]]:
        if self.free_count() - n < self.wmark_min:
            return None
        # Take the n lowest free pfns; slicing also handles n == 0 correctly.
        pfns = sorted(self.free)[:n]
        self.free.difference_update(pfns)
        return pfns

    def release(self, pfns: list[int]) -> None:
        self.free.update(pfns)

class NUMAZoneAllocator:
    """Multi-node zone allocator with distance-aware fallback."""
    def __init__(self, node_count: int = 2, memory_per_node_pages: int = 1024) -> None:
        self.nodes = []
        for n in range(node_count):
            base = n * memory_per_node_pages
            self.nodes.append({
                'DMA':    Zone(f'node{n}.DMA',    n, base + 0,    base + 4),
                'DMA32':  Zone(f'node{n}.DMA32',  n, base + 4,    base + 64),
                'NORMAL': Zone(f'node{n}.NORMAL', n, base + 64,   base + memory_per_node_pages),
            })
        # Distance matrix (toy): node N to itself is 10, to others is 20.
        self.distance = [
            [10 if i == j else 20 for j in range(node_count)] for i in range(node_count)
        ]

    def zonelist(self, preferred_node: int, flag: str) -> list[Zone]:
        order = sorted(range(len(self.nodes)),
                       key=lambda n: self.distance[preferred_node][n])
        result: list[Zone] = []
        for n in order:
            zone = self.nodes[n].get(flag)
            if zone is None: continue
            result.append(zone)
            # fall back through cheaper zones within the same node
            if flag == 'NORMAL':
                result.extend([self.nodes[n]['DMA32'], self.nodes[n]['DMA']])
            elif flag == 'DMA32':
                result.append(self.nodes[n]['DMA'])
        # de-duplicate preserving order
        seen = set()
        return [z for z in result if not (z.name in seen or seen.add(z.name))]

    def alloc_pages(self, count: int, flag: str = 'NORMAL', preferred_node: int = 0):
        for zone in self.zonelist(preferred_node, flag):
            pfns = zone.alloc(count)
            if pfns is not None:
                return zone, pfns
        raise MemoryError('OOM — every zone in zonelist exhausted')

This captures the essentials: each zone is its own free-page arena; the zonelist orders zones by preference; allocation walks the list watermark-checking each candidate. The Linux source builds the zonelist at boot time (and on memory hot-add), then uses pre-built zonelists in the hot path instead of recomputing.

Variants and real-world deployments

Linux page allocator. The canonical implementation in mm/page_alloc.c. Zones plus per-CPU pagesets plus the buddy allocator. Tunables: vm.min_free_kbytes, vm.watermark_scale_factor, vm.zone_reclaim_mode.

FreeBSD vm_phys. Conceptually similar: it partitions physical memory into free pools by domain (NUMA node) and reservoir (cache color). The vm_phys layer sits below the vm_pageout reclaim mechanism.

illumos / Solaris page coloring zones. Page allocator stripes pages across "cache colors" so consecutive virtual addresses don't collide in the L2. Implemented as a different axis of zoning on top of the standard memseg list.

Windows kernel. Different terminology — "page lists" (Standby, Modified, Free, Zeroed, Bad) replace zones, but the kernel also tracks physical address constraints via the MEMORY_CACHING_TYPE and DMA arrangements through MmAllocateContiguousMemory.

seL4 untyped memory. A capability-based microkernel — physical memory is allocated as Untyped capabilities and retyped into pages, with capability ranges replacing zones. Different model, same problem: associate hardware constraints with allocation policy.

Userspace allocators. jemalloc uses arenas, tcmalloc uses per-thread or per-CPU caches, and both use size classes (per-allocation-size) for similar lock-contention-avoidance reasons; the analogy covers the locking structure only, since neither has hardware-zone semantics.

Common bugs and edge cases

  • Low-zone exhaustion under DMA-heavy workloads. Massive raw-disk IO can drain ZONE_DMA32 even when there are gigabytes free in ZONE_NORMAL. The OOM killer fires despite plenty of free memory overall, because the request is restricted to the low zones.
  • Skewed reclaim because of zone watermarks. A node with a tiny DMA zone hits MIN on DMA quickly, triggers direct reclaim, and stalls allocations even though aggregate node memory is fine.
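  • Pinning movable pages. Pages that are long-term pinned (for example, buffers registered for RDMA or device passthrough) cannot be migrated, so they behave like UNMOVABLE pages and break compaction wherever they sit. mlock by itself keeps pages resident but does not necessarily prevent migration. Long-running databases that pin huge buffers see this.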
  • Bad NUMA defaults. Without numactl, an allocator can fill node 0 first and starve node 1's local applications. vm.zone_reclaim_mode = 1 reclaims locally before crossing nodes.
  • kernelcore= too small. ZONE_MOVABLE configured too aggressively leaves so little ZONE_NORMAL that the kernel itself can't allocate slab caches under load.
  • Per-CPU pageset drain on hot-unplug. CPU offline must drain its pagesets back to the zone; bugs here have caused page-list corruption in earlier kernels.
  • Zonelist regression after memory hot-add. Adding memory to a zone requires rebuilding the zonelist. Misbuilt zonelists cause silent NUMA-locality regressions that show up as sudden 30% latency increases.

Frequently asked questions

Why does Linux split memory into zones?

Because not every byte of physical memory is equally useful. Legacy ISA DMA devices can only address the first 16 MB (ZONE_DMA). 32-bit PCI devices reach 4 GB (ZONE_DMA32). On 32-bit kernels, the kernel could only directly map the low ~896 MB (ZONE_NORMAL); anything above had to be mapped on demand (ZONE_HIGHMEM). Movable pages (ZONE_MOVABLE) reserve a range from which the kernel will not pin allocations, leaving the area defragmentable. Each zone is its own arena with its own free-page lists, watermarks, and reclaim priority.

What is the fallback order between zones?

Each node has a zonelist: an ordered list of zones to try, starting from the highest zone the request may use. A request for ZONE_NORMAL can fall back to ZONE_DMA32, then ZONE_DMA; a request for ZONE_HIGHMEM falls back to ZONE_NORMAL, then DMA32, then DMA. The reverse is forbidden: a device that needs DMA-addressable memory cannot accept a high-memory page. On NUMA boxes, zonelists also interleave by node distance, so allocations prefer local-node zones before remote-node ones.

What are zone watermarks?

Each zone tracks three thresholds on its free-page count: WMARK_HIGH (above this, reclaim sleeps), WMARK_LOW (below this, kswapd wakes up and reclaims asynchronously), and WMARK_MIN (below this, allocations stall and direct reclaim runs in the requesting task's context). Crossing watermarks fires the page-reclaim machinery — slab shrinkers, LRU page eviction, swap. Tunable via /proc/sys/vm/min_free_kbytes and watermark_scale_factor.
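How the two tunables turn into page counts can be worked through numerically. This is a simplified single-zone model, not the kernel's exact code: it assumes 4 KB pages, a square-root default for min_free_kbytes with clamp bounds that vary by kernel version, and a watermark gap of watermark_scale_factor/10000 of the zone (or a quarter of MIN, if larger). All function names are hypothetical.

```python
import math

PAGE_KB = 4  # assumption: 4 KB pages


def default_min_free_kbytes(lowmem_kbytes: int) -> int:
    """Square-root heuristic, 4 * sqrt(lowmem_kbytes), with assumed clamp bounds."""
    value = math.isqrt(lowmem_kbytes * 16)
    return max(128, min(value, 262144))


def zone_watermarks(managed_pages: int, min_free_kbytes: int,
                    scale_factor: int = 10) -> dict:
    """MIN/LOW/HIGH in pages for one zone that owns all of memory."""
    wmark_min = min_free_kbytes // PAGE_KB
    # Gap between consecutive watermarks: scale_factor is in units of 1/10000.
    gap = max(wmark_min // 4, managed_pages * scale_factor // 10000)
    return {"min": wmark_min, "low": wmark_min + gap, "high": wmark_min + 2 * gap}


# Example: 16 GiB of memory treated as a single zone.
lowmem_kb = 16 * 1024 * 1024
marks = zone_watermarks(lowmem_kb // PAGE_KB, default_min_free_kbytes(lowmem_kb))
```

Under these assumptions the 16 GiB example gives MIN = 4096 pages (16 MB), LOW = 8290 pages, and HIGH = 12484 pages. Raising watermark_scale_factor widens the gaps, so kswapd starts earlier and direct reclaim becomes rarer.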

How does ZONE_MOVABLE help memory compaction?

Long-running kernels fragment physical memory — pages get pinned for DMA or for the kernel's direct-map identity range, leaving free pages scattered. Defragmenting them by moving in-use pages requires that those pages be relocatable. ZONE_MOVABLE collects allocations the kernel promises to keep movable: user-space pages, page cache, anonymous pages. The compactor can rearrange them to free contiguous physical extents — which the buddy allocator then merges into high-order blocks, enabling huge-page allocation.
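The effect of compaction, and the damage one unmovable page does, shows up even in a toy model. The sketch below uses two scanners that walk toward each other, one looking for movable pages from the bottom and one looking for free pages from the top. The single-character page encoding and the function names are assumptions of this sketch.

```python
def compact(pages: list) -> list:
    """Toy compaction over a list of page states.

    'F' = free, 'M' = movable and in use, 'U' = unmovable and in use.
    Movable pages found from the bottom are migrated into free slots
    found from the top, until the two scanners meet.
    """
    pages = list(pages)
    migrate, free = 0, len(pages) - 1
    while migrate < free:
        if pages[migrate] != "M":
            migrate += 1          # nothing to move here
        elif pages[free] != "F":
            free -= 1             # not a usable target
        else:
            pages[migrate], pages[free] = "F", "M"
            migrate += 1
            free -= 1
    return pages


def largest_free_run(pages: list) -> int:
    """Length of the longest run of contiguous free pages."""
    best = run = 0
    for state in pages:
        run = run + 1 if state == "F" else 0
        best = max(best, run)
    return best
```

Compacting `MFMFMFMF` yields `FFFFMMMM`, raising the largest free run from 1 to 4. With one unmovable page, `MUMFFF` becomes `FUFFMM`: three pages are free in the low half, but the `U` splits them into runs of 1 and 2.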

What is per-CPU page caching?

Locking the zone's free list on every single-page allocation would bottleneck multi-core workloads. Linux's pcp (per-CPU pageset) keeps a small cache of free order-0 pages local to each CPU, refilled in batches from the zone's free list. Allocation pulls from the local cache without taking the zone lock; freeing pushes back to the local cache. Only batch refills or drains touch the zone lock. Tunable via vm.percpu_pagelist_high_fraction.

How are NUMA nodes and zones related?

On NUMA hardware each socket has its own physical memory bank. Each node has its own zones — node 0 has its own ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, etc., as does node 1, node 2, .... Zonelists then interleave node and zone preference: a local ZONE_NORMAL is tried before any remote zone, then progressively distant nodes. Userspace controls NUMA policy via numactl, mbind, and set_mempolicy, but the zone allocator is what physically maps the policy to free-page lists.

Is ZONE_HIGHMEM still relevant?

Not on 64-bit systems, where the kernel's address space is effectively unbounded and the entire physical memory fits in the direct map. 64-bit kernels only build ZONE_DMA (legacy ISA), ZONE_DMA32 (32-bit PCI), ZONE_NORMAL, and optionally ZONE_MOVABLE. ZONE_HIGHMEM exists in the source tree but is conditionally compiled in only for 32-bit configurations — a vanishing case in modern deployments.