Paging

Translate every address in 30 cycles or less

Paging splits memory into fixed-size pages (typically 4 KB) and translates virtual addresses to physical ones through per-process page tables. The TLB caches recent translations, the MMU walks the page table only on a TLB miss, and the OS pages cold data out to disk and faults it back in on demand.

  • Default page size: 4 KB
  • Hugepage sizes (x86_64): 2 MB, 1 GB
  • Levels (x86_64 4-level): PML4 → PDPT → PD → PT
  • L1 dTLB entries (typical): 64
  • Full page-walk cost: 30–100 cycles

How paging translates an address

Every memory access your program makes goes through two address spaces: the virtual address the CPU computes, and the physical address the DRAM actually answers to. Paging is the bookkeeping that maps one to the other in fixed-size chunks.

On x86_64 with 4 KB pages, a 48-bit virtual address breaks down like this:

bits 47..39  →  PML4 index    (top-level table, 1 per process)
bits 38..30  →  PDPT index    (page-directory pointer table)
bits 29..21  →  PD   index    (page directory)
bits 20..12  →  PT   index    (page table)
bits 11.. 0  →  offset within the 4 KB page

The MMU does a four-level walk: read PML4[i] to find the PDPT base, read PDPT[j] to find the PD base, read PD[k] to find the PT base, read PT[l] to find the physical frame number, then concatenate that with the 12-bit offset to produce the physical address. That is four dependent memory reads per translation, on top of the access itself, which would be catastrophic if not for the TLB.
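The bit slicing above can be sketched in a few lines of Python. This is a minimal illustration that assumes 48-bit addresses and 4 KB pages; the helper name `split_va` is made up for this example.

```python
def split_va(va: int) -> dict:
    """Split a 48-bit x86_64 virtual address into its four table indices and page offset."""
    return {
        "pml4": (va >> 39) & 0x1FF,    # bits 47..39
        "pdpt": (va >> 30) & 0x1FF,    # bits 38..30
        "pd": (va >> 21) & 0x1FF,      # bits 29..21
        "pt": (va >> 12) & 0x1FF,      # bits 20..12
        "offset": va & 0xFFF,          # bits 11..0
    }


parts = split_va(0x7FFF_A1B2_C3D4)
print(parts)
# {'pml4': 255, 'pdpt': 510, 'pd': 269, 'pt': 300, 'offset': 980}
```

Each index is 9 bits wide because a 4 KB table holds 512 eight-byte entries.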

The TLB caches recent virtual-to-physical translations. A hit returns the physical address in about one cycle, overlapped with the L1 cache lookup; a miss triggers the walk above. On a typical x86 server CPU, the L1 dTLB has 64 entries for 4 KB pages, backed by a unified L2 TLB of a few thousand entries. With 4 KB pages those 64 entries cover 256 KB of working set; 64 entries of 2 MB hugepages cover 128 MB (many cores provide fewer L1 entries for 2 MB pages, so treat this as an upper bound). That's why hugepages matter for in-memory databases.
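To see why reach matters, here is a toy model of a 64-entry TLB. It assumes a fully associative TLB with LRU replacement, which real hardware only approximates, and the names `tlb_miss_rate` and `strided_scan` are invented for this sketch.

```python
from collections import OrderedDict


def tlb_miss_rate(addresses, page_size, entries=64):
    """Fraction of accesses that miss a fully associative LRU TLB (a simplification)."""
    tlb = OrderedDict()
    misses = total = 0
    for va in addresses:
        total += 1
        vpn = va // page_size
        if vpn in tlb:
            tlb.move_to_end(vpn)           # refresh LRU position on a hit
        else:
            misses += 1
            tlb[vpn] = True
            if len(tlb) > entries:
                tlb.popitem(last=False)    # evict the least recently used entry
    return misses / total if total else 0.0


def strided_scan(working_set, stride, passes):
    """Yield addresses for repeated strided passes over a working set."""
    for _ in range(passes):
        yield from range(0, working_set, stride)


# 1 MiB working set, one touch per 4 KB, four passes.
print(tlb_miss_rate(strided_scan(1 << 20, 4096, 4), page_size=4096))      # 1.0
print(tlb_miss_rate(strided_scan(1 << 20, 4096, 4), page_size=2 << 20))   # 0.0009765625
```

With 4 KB pages the 256-page working set cycles through a 64-entry LRU TLB and every access misses; with 2 MB pages the whole region sits under one entry and only the first access misses.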

A 4-level walk, step by step

# Virtual addr 0x00007fff_a1b2_c3d4 = 0b 011111111_111111110_100001101_100101100_001111010100
#                                        PML4=255  PDPT=510  PD=269    PT=300    offset=0x3d4
walk(va):
    cr3 = current_pgd                  # CR3 holds the physical addr of PML4
    pml4e = mem[cr3 + pml4_index(va)*8]
    if not pml4e.present: page_fault()
    pdpte = mem[pml4e.frame*4096 + pdpt_index(va)*8]
    if not pdpte.present: page_fault()
    if pdpte.huge_1g: return pdpte.frame*4096 + (va & 0x3FFFFFFF)   # 1 GB page: 30-bit offset
    pde = mem[pdpte.frame*4096 + pd_index(va)*8]
    if not pde.present: page_fault()
    if pde.huge_2m:  return pde.frame*4096 + (va & 0x1FFFFF)        # 2 MB page: 21-bit offset
    pte = mem[pde.frame*4096 + pt_index(va)*8]
    if not pte.present: page_fault()
    return pte.frame*4096 + (va & 0xFFF)                            # 4 KB page: 12-bit offset

Each level entry is 8 bytes; each table is 4 KB (512 entries). A single PTE encodes the physical frame number plus permission bits — present, writable, user/supervisor, accessed, dirty, NX (no-execute), and PCD/PWT for cache attributes. The "Accessed" and "Dirty" bits are what the kernel uses to decide which pages are cold (eviction) and which need writeback (flush).
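A runnable version of the walk helps make the entry layout concrete. The sketch below models physical memory as a Python dict and uses the architectural bit positions for the present, writable, user, accessed, dirty, page-size, and NX flags. Everything else (the `walk` and `map_4k` helpers, the table addresses) is made up for illustration, and reserved bits in hugepage entries are ignored.

```python
PRESENT, WRITABLE, USER = 1 << 0, 1 << 1, 1 << 2
ACCESSED, DIRTY = 1 << 5, 1 << 6
HUGE = 1 << 7                          # PS bit: a PDPT/PD entry maps a 1 GB / 2 MB page
NX = 1 << 63
ADDR_MASK = 0x000F_FFFF_FFFF_F000      # bits 12..51 hold the physical frame address


class PageFault(Exception):
    pass


def walk(mem, cr3, va):
    """Translate va using a dict `mem` that maps physical address -> 64-bit entry."""
    table = cr3
    for shift in (39, 30, 21, 12):
        entry = mem.get(table + ((va >> shift) & 0x1FF) * 8, 0)
        if not entry & PRESENT:
            raise PageFault(hex(va))
        base = entry & ADDR_MASK
        is_leaf = shift == 12 or (shift in (30, 21) and bool(entry & HUGE))
        if is_leaf:
            return base + (va & ((1 << shift) - 1))   # frame base + offset within the page
        table = base
    raise AssertionError("unreachable")


def map_4k(mem, cr3, va, frame, tables):
    """Install a 4 KB mapping; `tables` gives three free physical pages for PDPT, PD, PT."""
    flags = PRESENT | WRITABLE | USER
    table = cr3
    for shift, nxt in zip((39, 30, 21), tables):
        mem[table + ((va >> shift) & 0x1FF) * 8] = nxt | flags
        table = nxt
    mem[table + ((va >> 12) & 0x1FF) * 8] = frame | flags | ACCESSED


mem = {}
VA = 0x7FFF_A1B2_C3D4
map_4k(mem, 0x1000, VA, frame=0x5000, tables=(0x2000, 0x3000, 0x4000))
print(hex(walk(mem, 0x1000, VA)))      # 0x53d4
```

The loop reads one entry per level, so a 4 KB translation costs four lookups and a 2 MB translation stops after three.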

Paging vs segmentation vs combined

Property      | Pure paging           | Pure segmentation       | Segments + paging           | Inverted page table | Hugepages            | Single-level
Unit          | 4 KB page             | Variable segment        | Both                        | Fixed page          | 2 MB / 1 GB page     | Fixed page
Translation   | 4-level walk + TLB    | Limit + base check      | Segment + page walk         | Hash probe          | 2- or 3-level walk   | One indirection
Fragmentation | Internal only         | External grows          | Both                        | Internal only       | Internal (worse)     | Internal only
Sharing       | Page                  | Segment (object)        | Either                      | Page                | Hugepage             | Page
PT overhead   | ~0.2% of RAM          | Tiny per segment        | Sum                         | Linear in phys RAM  | ~512× less           | Linear in addr space
Used by       | Linux, Windows, macOS | OS/2 1.x, DOS extenders | x86 protected mode (legacy) | PowerPC, IA-64      | Linux THP, DBs, JVMs | Toy / academic

Pure segmentation suffers external fragmentation and complicates sharing: variable sizes force the kernel to play Tetris with RAM. Paging trades that for bounded internal fragmentation (on average a half-empty tail page per region). x86 historically applied segmentation first and paging second, because protected-mode paging was bolted onto an already segmented model; long mode flattened the segments, so modern systems run essentially "pure" paging with hugepages layered on.
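The "bounded" part is easy to check with arithmetic. The hypothetical helper below computes how many bytes go unused in the last page of a region. The waste is always less than one page, which is also why hugepages fragment worse.

```python
def tail_waste(region_bytes: int, page_size: int) -> int:
    """Bytes left unused in the final, partially filled page of a region."""
    return -region_bytes % page_size


for page_size in (4096, 2 << 20):
    print(page_size, tail_waste(10_000, page_size))
# 4096 2288
# 2097152 2087152
```

A 10,000-byte region wastes about 2 KB with 4 KB pages but almost the full 2 MB when backed by a single hugepage.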

C: probe page-table info from /proc

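// Linux exposes per-page mapping info via /proc/self/pagemap (one 64-bit entry per virtual page).
// Since Linux 4.2 the PFN field reads as 0 without CAP_SYS_ADMIN; run as root to see real frame numbers.
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    void *p = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    *(volatile char*)p = 1;                 // force allocation

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    uint64_t vpn = (uintptr_t)p >> 12;       // virtual page number (assumes 4 KB pages)
    uint64_t entry;
    if (pread(fd, &entry, sizeof entry, (off_t)(vpn * 8)) != (ssize_t)sizeof entry) {
        perror("pread");
        return 1;
    }
    close(fd);

    int present  = (entry >> 63) & 1;
    int swapped  = (entry >> 62) & 1;
    uint64_t pfn = entry & ((1ULL << 55) - 1);   // bits 0-54: physical frame number when present
    printf("vpn=%" PRIx64 " pfn=%" PRIx64 " present=%d swap=%d\n", vpn, pfn, present, swapped);
    return 0;
}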

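Each /proc/self/pagemap entry tells you the physical frame (zeroed unless the reader has CAP_SYS_ADMIN), whether the page is resident, and whether it has been swapped out. Indexing /proc/kpageflags by PFN adds per-frame flags such as huge, THP, and dirty; NUMA placement is reported separately in /proc/PID/numa_maps. Together these are useful for diagnosing why a workload thrashes the TLB.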
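The same entry can be decoded from Python. The sketch below follows the bit layout in the kernel's pagemap documentation (bit 63 present, bit 62 swapped, bit 61 file-backed or shared-anonymous, bit 56 exclusively mapped, bit 55 soft-dirty, bits 0–54 PFN); the function names are made up for this example, and the reader function only works on Linux.

```python
import os
import struct


def decode_pagemap_entry(entry: int) -> dict:
    """Decode one 64-bit /proc/PID/pagemap entry into named fields."""
    present = bool((entry >> 63) & 1)
    swapped = bool((entry >> 62) & 1)
    return {
        "present": present,
        "swapped": swapped,
        "file_or_shared": bool((entry >> 61) & 1),
        "exclusive": bool((entry >> 56) & 1),
        "soft_dirty": bool((entry >> 55) & 1),
        # Bits 0-54 are a PFN only for resident pages (and read as 0 without CAP_SYS_ADMIN).
        "pfn": entry & ((1 << 55) - 1) if present else None,
    }


def read_pagemap_entry(vaddr: int, path: str = "/proc/self/pagemap") -> int:
    """Read the raw entry for the page containing vaddr (Linux only)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    fd = os.open(path, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, (vaddr // page_size) * 8)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError("short read from pagemap")
    return struct.unpack("=Q", raw)[0]


print(decode_pagemap_entry((1 << 63) | (1 << 56) | 0x1234))
```

Keeping the decoder separate from the reader makes it easy to test the bit handling without root or even without Linux.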

Python: madvise hugepages and faulting cost

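import mmap, time

SIZE = 1 << 30                           # 1 GiB
m = mmap.mmap(-1, SIZE, mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
              mmap.PROT_READ | mmap.PROT_WRITE)
m.madvise(mmap.MADV_HUGEPAGE)            # request transparent hugepages

# Python's mmap module may not export MADV_POPULATE_WRITE; 23 is its value in the Linux UAPI headers.
MADV_POPULATE_WRITE = getattr(mmap, "MADV_POPULATE_WRITE", 23)
try:
    m.madvise(MADV_POPULATE_WRITE)       # prefault to avoid faults during the loop
except OSError:
    pass                                 # kernel older than 5.14: pages fault lazily in the loop

t0 = time.perf_counter()
view = memoryview(m)
for off in range(0, SIZE, 4096):
    view[off] = 1                        # touch every page
print(f"touched {SIZE >> 30} GiB in {(time.perf_counter()-t0)*1e3:.1f} ms")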

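MADV_POPULATE_WRITE (Linux 5.14+) prefaults the entire mapping with write permissions in one call, avoiding up to 262,144 minor faults during the loop (one per 4 KB page in 1 GiB). MADV_HUGEPAGE tells the kernel "promote me to 2 MB pages if you can". On a workload that touches the whole region, each 2 MB translation covers 512× more memory and its walk is one level shorter. How much of that shows up as fewer TLB misses and cheaper walks depends on the access pattern and the CPU, so measure rather than assume a fixed factor.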
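Before trusting a hugepage benchmark it is worth checking whether the kernel actually handed out 2 MB pages. A minimal sketch, assuming the standard Linux paths /sys/kernel/mm/transparent_hugepage/enabled and /proc/self/smaps_rollup; the helper names are invented for this example.

```python
def parse_thp_mode(text: str):
    """Return the active choice from a line like 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def anon_huge_kb(smaps_text: str):
    """Return the AnonHugePages value in kB from smaps or smaps_rollup text."""
    for line in smaps_text.splitlines():
        if line.startswith("AnonHugePages:"):
            return int(line.split()[1])
    return None


def read_or_none(path: str):
    """Read a file, returning None where it does not exist (e.g. non-Linux systems)."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None


mode_text = read_or_none("/sys/kernel/mm/transparent_hugepage/enabled")
rollup = read_or_none("/proc/self/smaps_rollup")
print("THP mode:", parse_thp_mode(mode_text) if mode_text else "unavailable")
print("AnonHugePages kB:", anon_huge_kb(rollup) if rollup else "unavailable")
```

If AnonHugePages stays at 0 after the madvise call, the mapping is still on 4 KB pages and any measured difference is noise.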

Node.js: huge V8 heaps and TLB cost

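// Run with: node big.js
// Buffers live outside the V8 heap, so --max-old-space-size does not apply to them.
const SIZE = 1024 * 1024 * 1024;        // 1 GiB; 8 GiB exceeds buffer.constants.MAX_LENGTH on older Node releases
const buf = Buffer.alloc(SIZE, 1);      // fill so every page is backed before timing
let sum = 0;
const start = process.hrtime.bigint();
for (let off = 0; off < buf.length; off += 4096) sum += buf[off];
const ns = Number(process.hrtime.bigint() - start);
console.log(`scan: ${(ns / 1e6).toFixed(0)} ms (checksum ${sum})`);

// Compare runs with /sys/kernel/mm/transparent_hugepage/enabled set to "always" and "never":
// with 2 MB pages the scan needs 512 translations instead of 262,144 4 KB PTEs.

V8's --huge-max-old-generation-size flag only raises the old-generation size cap on machines with plenty of RAM; it does not request hugepages, and Buffer memory is allocated outside the V8 heap anyway. Whether a large buffer ends up on 2 MB pages depends on the system THP policy (always, madvise, or never). Where THP does apply, scans of multi-GB buffers speed up modestly, on the order of 1.1–1.3×; on pointer-chasing code such as graph traversal the gain comes mostly from saved TLB walks rather than fewer cycles spent in the loop body. Treat these figures as rough and measure on your own hardware.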

Page-size and policy variants

  • 4 KB base pages. The default everywhere. Each PTE covers 4 KB; a 4-level walk is 4 dependent memory reads.
  • 2 MB hugepages. One page-directory entry covers 2 MB; the walk skips the bottom level. Each TLB entry now reaches 512× more memory.
  • 1 GB gigantic pages. One PDPT entry covers 1 GB; the walk skips the bottom two levels. Usually reserved at boot with hugepagesz=1G hugepages=N on the kernel command line, used by HPC and DBs with tens of GB working sets.
  • Transparent Hugepages (THP). Kernel opportunistically promotes contiguous 4 KB pages to 2 MB. Fewer config knobs, occasional pause-time spikes from khugepaged defragmentation. Can be set always, madvise, or never.
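  • NUMA-aware paging. On multi-socket systems the kernel by default allocates a page on the node of the CPU that first touches it, and automatic NUMA balancing (numa_balancing) can migrate pages later. Cross-socket access is typically 1.5–3× slower.
  • 5-level paging. Linux supports a 57-bit virtual address space via a fifth table level on Ice Lake and later. It raises the limits from 256 TiB of virtual and 64 TiB of physical address space to 128 PiB and 4 PiB, which matters only on machines with tens of terabytes of RAM or processes that map enormous sparse regions.
  • Demand paging vs prefaulting. Default is lazy: pages fault on first touch. Latency-sensitive code prefaults with MAP_POPULATE, MADV_POPULATE_WRITE, or mlock; MADV_WILLNEED only starts readahead for file-backed or swapped-out data.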

Costed claims

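  • 4 KB page: 12 bits of offset and 36 bits of virtual page number in a 48-bit x86_64 address (the PTE's frame field is wider, up to 40 bits). A PTE is 8 bytes and a table holds 512 of them, so each table fits exactly one 4 KB page.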
  • TLB reach: 64 dTLB entries × 4 KB = 256 KB. With 2 MB hugepages, 64 × 2 MB = 128 MB — a 512× increase per TLB entry.
  • Page-walk cost: 4 dependent memory reads, ~30 cycles best case (all in L1), 100+ cycles when the table itself is in L3 or DRAM.
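  • Page-table size: ~0.2% overhead (8 bytes of PTE per 4 KB page), so a 1 GB process needs ~2 MB of page tables. With 2 MB hugepages the leaf-level cost drops 512×, to about 4 KB for the same process (~0.0004%).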
  • Minor fault: ~1 µs (allocate frame, install PTE, return). Major fault: ~100 µs on NVMe SSD, ~10 ms on spinning disk.
  • TLB shootdown: ~5–20 µs cross-core; on big systems with many CPUs, batched TLB invalidations are a measurable cost of memory unmapping.
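The page-table and walk-cost figures above follow from simple arithmetic. The sketch below reproduces them; the function names are made up, only leaf-level entries are counted, and the cycle counts are the rough figures from this list rather than measurements.

```python
def leaf_table_bytes(mapped_bytes: int, page_size: int, entry_size: int = 8) -> int:
    """Bytes of leaf page-table entries needed to map a region (upper levels ignored)."""
    pages = -(-mapped_bytes // page_size)      # ceiling division
    return pages * entry_size


def avg_translation_cycles(miss_rate: float, hit_cycles: float = 1, walk_cycles: float = 30) -> float:
    """Expected cycles per translation: every access pays the hit cost, misses add a walk."""
    return hit_cycles + miss_rate * walk_cycles


GIB = 1 << 30
print(leaf_table_bytes(GIB, 4096))         # 2097152  (2 MB, about 0.2% of 1 GB)
print(leaf_table_bytes(GIB, 2 << 20))      # 4096     (512 entries)
print(64 * 4096, 64 * (2 << 20))           # 262144 134217728  (TLB reach: 256 KB vs 128 MB)
print(avg_translation_cycles(0.01))        # about 1.3 cycles at a 1% miss rate
```

Even a 1% miss rate adds roughly a third of a cycle to every access at the best-case walk cost, and far more when the tables miss in cache.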

Common bugs and edge cases

  • Page-fault storm. A process touches a huge mapping non-sequentially and faults on every page. Symptoms: fast-growing minflt/majflt counters in /proc/PID/stat, low IPC. Fix: MAP_POPULATE, MADV_WILLNEED for file-backed data, or warm up sequentially.
  • TLB shootdown stalls. Unmapping or changing protections requires invalidating other CPUs' TLBs (an IPI). On 96-core boxes this can hit tens of microseconds. Batching mprotect and mremap calls reduces the count.
  • THP latency tail. With THP defrag enabled, a fault may run direct compaction to assemble a 2 MB page, and khugepaged collapses pages in the background; either can add stalls in the multi-millisecond range. Latency-critical apps switch THP to madvise or disable it.
  • NUMA placement bug. A worker thread allocates its pages from one socket then migrates; subsequent accesses cross sockets. Fix: numactl --membind or pin threads with numactl --cpunodebind.
  • Huge swap thrash. When RSS exceeds RAM the kernel evicts pages to swap; once you start hitting them again it spirals. vm.swappiness, zswap, and proper sizing matter.
  • 32-bit address-space exhaustion. 3 GB of usable virtual memory; large mmaps fail with ENOMEM even with plenty of physical RAM. Ancient bug class still alive in embedded contexts.
  • Forgot to flush after PTE change. Kernel code that updates a PTE without invalidating the TLB sees stale translations — a very subtle source of memory corruption when writing kernel modules.
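Fault counts are easy to observe from inside a process. The sketch below uses `resource.getrusage`, whose `ru_minflt` field reports the same minor-fault counter as /proc/PID/stat on Linux. The function name is invented for this example, and the exact count varies with THP settings and allocator behavior.

```python
import mmap
import resource


def minor_faults_while_touching(size: int = 16 << 20) -> int:
    """Map `size` bytes of anonymous memory, touch each page once, return the minor-fault delta."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    m = mmap.mmap(-1, size)                    # anonymous mapping, lazily backed
    try:
        for off in range(0, size, mmap.PAGESIZE):
            m[off] = 1                         # first touch of each page triggers a fault
    finally:
        m.close()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    return after - before


delta = minor_faults_while_touching()
print(f"minor faults while touching 16 MiB: {delta}")
```

On a 4 KB-page system the delta is usually close to the number of pages touched (4,096 for 16 MiB); a much smaller number suggests the kernel backed the region with hugepages.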

Frequently asked questions

Why 4 KB pages?

It's a balance. Smaller pages waste less to internal fragmentation but need bigger page tables and put more pressure on the TLB; larger pages reduce TLB pressure but waste memory on partially used regions. 4 KB has been the x86 default since the 386. Modern CPUs also support 2 MB and 1 GB hugepages for workloads that benefit from less translation overhead.

What does the TLB do?

The Translation Lookaside Buffer caches virtual-to-physical translations so the MMU doesn't have to walk the four-level page table on every access. A typical x86 has 64 L1 dTLB entries plus a few thousand in L2; a hit costs a single cycle, a miss triggers a 4-level walk costing 30–100 cycles.

What's a page fault and how is it different from a segfault?

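A page fault is the MMU raising an exception because the PTE for an address is missing, not present, or forbids the access (for example a write to a read-only page). The kernel decides what to do: minor fault (the page is already in memory, so just install or fix up a PTE, as with copy-on-write), major fault (the page must be read from disk), or invalid access (deliver SIGSEGV, which is the segfault). Page faults are routine; segfaults are bugs.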

Why are 1 GB hugepages useful?

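Each TLB entry covers 1 GB instead of 4 KB, a 262,144× increase in reach per entry (CPUs usually provide far fewer 1 GB entries than 4 KB ones, so total reach grows by less). Databases and JVMs with large heaps can see TLB-miss rates drop from a few percent to near zero, recovering on the order of 5–15% throughput on memory-bound workloads.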

Does swapping still matter on machines with lots of RAM?

Yes — even on RAM-rich servers the kernel uses anonymous swap to evict cold pages and free RAM for the page cache. zswap and zram compress evicted pages instead of writing them to disk. On laptops with NVMe, swap to disk costs a couple hundred microseconds and is often invisible.