It's a balance. Smaller pages waste less memory to internal fragmentation but need bigger page tables and put more pressure on the TLB; larger pages reduce TLB pressure but waste memory on partially used regions. 4 KB has been the sweet spot since the 386. Modern CPUs also support 2 MB and 1 GB hugepages for workloads that benefit from less translation overhead.

What does the TLB do?

The Translation Lookaside Buffer caches virtual-to-physical translations so the MMU doesn't have to walk the four-level page table on every access. A typical x86 has 64 L1 dTLB entries plus a few thousand in L2; a hit costs a single cycle, a miss triggers a 4-level walk costing 30–100 cycles.

What's a page fault and how is it different from a segfault?

A page fault is raised when the MMU finds no valid PTE for an address, or a PTE whose permission bits forbid the access. The kernel decides what to do: minor fault (page is already in memory, for example in the page cache, so just install a PTE), major fault (page must be read from disk), or invalid access (deliver SIGSEGV; that's the segfault). Page faults are routine; segfaults are bugs.

Why are 1 GB hugepages useful?

Each TLB entry covers 1 GB instead of 4 KB — a 262,144× increase in TLB reach. Databases and JVMs with large heaps see TLB-miss rates drop from a few percent to near zero, recovering 5–15% throughput on memory-bound workloads.

Does swapping still matter on machines with lots of RAM?

Yes — even on RAM-rich servers the kernel uses anonymous swap to evict cold pages and free RAM for the page cache. zswap and zram compress evicted pages instead of writing them to disk. On laptops with NVMe, swap to disk costs a couple hundred microseconds and is often invisible.

Paging — Page Tables, TLB and Demand Paging Explained

How paging translates an address

Every memory access your program makes goes through two address spaces: the virtual address the CPU computes, and the physical address the DRAM actually answers to. Paging is the bookkeeping that maps one to the other in fixed-size chunks.

On x86_64 with 4 KB pages, a 48-bit virtual address breaks down like this:

bits 47..39  →  PML4 index    (top-level table, 1 per process)
bits 38..30  →  PDPT index    (page-directory pointer table)
bits 29..21  →  PD   index    (page directory)
bits 20..12  →  PT   index    (page table)
bits 11.. 0  →  offset within the 4 KB page

The MMU does a four-level walk: read PML4[i] to find the PDPT base, read PDPT[j] to find the PD base, read PD[k] to find the PT base, read PT[l] to find the physical frame number, then concatenate that with the 12-bit offset to produce the physical address. That is four dependent memory reads per translation before the load or store itself can start, which would be catastrophic if not for the TLB.
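As a minimal sketch of the bit split described above, the helper below extracts the four 9-bit indices and the 12-bit offset from a 48-bit address. It assumes 4 KB pages and 4-level paging; the function name `split_va` is illustrative and not part of any library.

```python
def split_va(va: int) -> dict:
    """Split a 48-bit x86_64 virtual address into 4-level paging fields."""
    return {
        "pml4": (va >> 39) & 0x1FF,    # bits 47..39
        "pdpt": (va >> 30) & 0x1FF,    # bits 38..30
        "pd": (va >> 21) & 0x1FF,      # bits 29..21
        "pt": (va >> 12) & 0x1FF,      # bits 20..12
        "offset": va & 0xFFF,          # bits 11..0
    }


fields = split_va(0x00007FFF_A1B2_C3D4)
print(fields)
```

Each index selects one of 512 entries in its table, which is why every field except the offset is masked with 0x1FF.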

The TLB caches recent virtual-to-physical translations. A hit returns the physical address in 1 cycle; a miss triggers the walk above. On a typical x86 server CPU, L1 dTLB has 64 entries for 4 KB pages, plus a unified L2 TLB of a few thousand entries. With 4 KB pages those 64 entries cover 256 KB of working set; with 2 MB hugepages, the same 64 entries cover 128 MB. That's why hugepages matter for in-memory databases.
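The reach arithmetic above is easy to reproduce. The sketch below assumes, as the text does, that the same 64 entries are usable for every page size; real CPUs often provide fewer entries for larger pages, so treat the output as an upper bound. The name `tlb_reach` is illustrative.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when every TLB entry maps a distinct page."""
    return entries * page_size


for name, size in (("4 KB", 4 * KB), ("2 MB", 2 * MB), ("1 GB", 1 * GB)):
    print(f"{name} pages: 64 entries reach {tlb_reach(64, size) // KB:,} KB")
```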

A 4-level walk, step by step

; Virtual addr 0x00007fff_a1b2_c3d4 = 0b 011111111_111111110_100001101_100101100_001111010100
;                                        PML4=255  PDPT=510  PD=269    PT=300    offset=0x3d4
walk(va):
    cr3 = current_pgd                  # CR3 holds the physical addr of PML4
    pml4e = mem[cr3 + pml4_index(va)*8]
    if not pml4e.present: page_fault()
    pdpte = mem[pml4e.frame*4096 + pdpt_index(va)*8]
    if not pdpte.present: page_fault()
    if pdpte.huge_1g: return pdpte.frame*4096 + (va & 0x3FFFFFFF)
    pde = mem[pdpte.frame*4096 + pd_index(va)*8]
    if not pde.present: page_fault()
    if pde.huge_2m:  return pde.frame*4096 + (va & 0x1FFFFF)
    pte = mem[pde.frame*4096 + pt_index(va)*8]
    if not pte.present: page_fault()
    return pte.frame*4096 + (va & 0xFFF)

Each level entry is 8 bytes; each table is 4 KB (512 entries). A single PTE encodes the physical frame number plus permission bits — present, writable, user/supervisor, accessed, dirty, NX (no-execute), and PCD/PWT for cache attributes. The "Accessed" and "Dirty" bits are what the kernel uses to decide which pages are cold (eviction) and which need writeback (flush).
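The walk above can be simulated in a few lines. This is a toy model under stated assumptions: physical memory is a Python dict from address to 64-bit entry, only the present bit (bit 0), the page-size bit (bit 7) and the address field (bits 51..12) are modelled, and the table addresses 0x1000 to 0x4000 and the frame 0xABCDE are made up for illustration.

```python
PRESENT = 1 << 0                      # bit 0: entry is valid
HUGE = 1 << 7                         # bit 7 (PS): entry maps a large page directly
ADDR_MASK = 0x000F_FFFF_FFFF_F000     # bits 51..12: next table or frame address


class PageFault(Exception):
    pass


def walk(mem: dict, cr3: int, va: int) -> int:
    """Translate va using 8-byte entries stored in mem (physical addr -> entry)."""
    table = cr3
    for shift in (39, 30, 21, 12):    # PML4, PDPT, PD, PT
        entry = mem.get(table + ((va >> shift) & 0x1FF) * 8, 0)
        if not entry & PRESENT:
            raise PageFault(hex(va))
        base = entry & ADDR_MASK
        # The PT level always ends the walk; a set PS bit at the PDPT or PD
        # level ends it early (1 GB or 2 MB page). Bases are assumed aligned.
        if shift == 12 or (shift in (30, 21) and entry & HUGE):
            return base + (va & ((1 << shift) - 1))
        table = base
    raise AssertionError("unreachable")


va = 0x7FFF_A1B2_C3D4
mem = {
    0x1000 + 255 * 8: 0x2000 | PRESENT,       # PML4[255] -> PDPT at 0x2000
    0x2000 + 510 * 8: 0x3000 | PRESENT,       # PDPT[510] -> PD at 0x3000
    0x3000 + 269 * 8: 0x4000 | PRESENT,       # PD[269]   -> PT at 0x4000
    0x4000 + 300 * 8: 0xABCDE000 | PRESENT,   # PT[300]   -> frame 0xABCDE
}
pa = walk(mem, cr3=0x1000, va=va)
print(hex(pa))
```

Replacing the PD entry with one that has the HUGE bit set makes the same function return after three reads, which is the shortcut 2 MB pages take.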

Paging vs segmentation vs combined

	Pure paging	Pure segmentation	Segments + paging	Inverted page table	Hugepages	Single-level
Unit	4 KB page	Variable segment	Both	Fixed page	2 MB / 1 GB page	Fixed page
Translation	4-level walk + TLB	Limit + base check	Segment + page walk	Hash probe	2- or 3-level walk	One indirection
Fragmentation	Internal only	External grows	Both	Internal only	Internal (worse)	Internal only
Sharing	Page	Segment (object)	Either	Page	Hugepage	Page
PT overhead	~0.2% of RAM	Tiny per segment	Sum	Linear in phys RAM	~512× less	Linear in addr space
Used by	Linux, Windows, macOS	OS/2 1.x, DOS extenders	x86 protected mode (legacy)	PowerPC, IA-64	Linux THP, DBs, JVMs	Toy / academic

Pure segmentation suffers external fragmentation and complicates sharing — variable sizes force the kernel to play Tetris with RAM. Paging trades that for bounded internal fragmentation (half-empty tail page per region). x86 historically did segmentation then paging because protected mode bolted onto a segmented model; long mode flattened segments, so modern systems run essentially "pure" paging with hugepages layered on.
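To make the "half-empty tail page" cost concrete, the sketch below computes the bytes wasted in the last page of a region for two page sizes. The region size is a hypothetical example and `tail_waste` is an illustrative name.

```python
def tail_waste(region_bytes: int, page_size: int) -> int:
    """Bytes left unused in the last page when a region is rounded up to whole pages."""
    return -region_bytes % page_size


region = 10 * (1 << 20) + 1            # hypothetical mapping: 10 MiB plus one byte
for label, size in (("4 KB", 4 << 10), ("2 MB", 2 << 20)):
    print(f"{label} pages waste {tail_waste(region, size):,} bytes")
```

The waste is bounded by one page minus one byte per region, so the bound grows with the page size.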

C: probe page-table info from /proc

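// Linux exposes per-page mapping info via /proc/self/pagemap (one 64-bit entry per page).
// Since Linux 4.2 the PFN field reads as 0 unless the process has CAP_SYS_ADMIN.
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, (size_t)page, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    *(volatile char*)p = 1;                 // force allocation

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    uint64_t vpn = (uintptr_t)p / (uintptr_t)page;   // virtual page number
    uint64_t entry;
    if (pread(fd, &entry, sizeof entry, (off_t)(vpn * sizeof entry)) != (ssize_t)sizeof entry) {
        perror("pread");
        return 1;
    }
    close(fd);

    int present  = (entry >> 63) & 1;
    int swapped  = (entry >> 62) & 1;
    uint64_t pfn = entry & ((1ULL << 55) - 1);   // bits 0..54: physical frame number
    printf("vpn=%" PRIx64 " pfn=%" PRIx64 " present=%d swap=%d\n", vpn, pfn, present, swapped);
    return 0;
}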

Each /proc/self/pagemap entry tells you the physical frame, whether the page is resident, and whether it's been swapped. Indexing /proc/kpageflags by that frame number shows per-frame flags such as huge, THP, dirty, and LRU state; NUMA placement is reported separately in /proc/self/numa_maps. Together they are useful for diagnosing why a workload thrashes the TLB.
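The entry layout can also be decoded offline. The sketch below follows the bit assignments in the kernel's pagemap documentation (bit 63 present, bit 62 swapped, bit 61 file-backed or shared-anonymous, bit 55 soft-dirty, bits 0..54 PFN or swap location); the function name and the returned dictionary shape are illustrative choices.

```python
def decode_pagemap(entry: int) -> dict:
    """Decode one 64-bit /proc/<pid>/pagemap entry into named fields."""
    present = bool((entry >> 63) & 1)
    swapped = bool((entry >> 62) & 1)
    low = entry & ((1 << 55) - 1)              # bits 0..54
    return {
        "present": present,
        "swapped": swapped,
        "file_or_shared": bool((entry >> 61) & 1),
        "soft_dirty": bool((entry >> 55) & 1),
        "pfn": low if present else None,       # reads as 0 without CAP_SYS_ADMIN
        "swap_type": low & 0x1F if swapped else None,
        "swap_offset": low >> 5 if swapped else None,
    }


resident = decode_pagemap((1 << 63) | 0x12345)
print(resident)
```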

Python: madvise hugepages and faulting cost

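import mmap, time

SIZE = 1 << 30                           # 1 GiB
# Python's mmap module may not export MADV_POPULATE_WRITE; 23 is its value in the Linux UAPI headers.
MADV_POPULATE_WRITE = getattr(mmap, "MADV_POPULATE_WRITE", 23)

m = mmap.mmap(-1, SIZE, mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
              mmap.PROT_READ | mmap.PROT_WRITE)
m.madvise(mmap.MADV_HUGEPAGE)            # request transparent hugepages
try:
    m.madvise(MADV_POPULATE_WRITE)       # prefault to avoid faults during the loop
except OSError:
    pass                                 # kernel older than 5.14: pages fault lazily in the loop

t0 = time.perf_counter()
view = memoryview(m)
for off in range(0, SIZE, 4096):
    view[off] = 1                        # touch every page
print(f"touched {SIZE >> 30} GiB in {(time.perf_counter()-t0)*1e3:.1f} ms")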

MADV_POPULATE_WRITE (Linux 5.14+) prefaults the entire mapping with write permissions in one call, avoiding millions of minor faults during the loop. MADV_HUGEPAGE tells the kernel "promote me to 2 MB pages if you can". On a workload that touches the whole region, each walk is one level shorter and each TLB entry covers 512× more memory; as a rough guide, hugepages cut walk cost by about 4× and TLB miss rate by about 100×, depending on access pattern and TLB size.
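Whether the "if you can" request is honoured depends on the system-wide THP mode. A small sketch for reading it follows, assuming the usual sysfs format in which the active choice is wrapped in square brackets (for example `always [madvise] never`); the helper name `active_thp_mode` is illustrative.

```python
def active_thp_mode(text: str) -> str:
    """Return the bracketed choice from a line such as 'always [madvise] never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no active mode in {text!r}")


if __name__ == "__main__":
    path = "/sys/kernel/mm/transparent_hugepage/enabled"
    try:
        with open(path) as f:
            print("THP mode:", active_thp_mode(f.read()))
    except (OSError, ValueError):
        print("THP mode not available on this system")
```

With the mode set to never, MADV_HUGEPAGE has no effect and the timing above reflects 4 KB pages only.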

Node.js: huge V8 heaps and TLB cost

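// Run with: node big.js
// Buffer memory lives outside the V8 heap, so --max-old-space-size does not apply to it.
// Older Node releases cap Buffer length near 4 GiB; lower SIZE if the allocation throws.
const SIZE = 8 * 1024 * 1024 * 1024;     // 8 GiB
const buf = Buffer.allocUnsafe(SIZE);
for (let off = 0; off < buf.length; off += 4096) buf[off] = 1; // fault every page in first
let sum = 0;
const start = process.hrtime.bigint();
for (let off = 0; off < buf.length; off += 4096) sum += buf[off];
const ns = Number(process.hrtime.bigint() - start);
console.log(`scan: ${(ns / 1e6).toFixed(0)} ms (checksum ${sum})`);

// Compare: rerun with transparent hugepages set to "always" versus "never" in
// /sys/kernel/mm/transparent_hugepage/enabled. With 2 MB pages the scan needs
// about 4 thousand translations instead of 2 million PTEs.

V8's --huge-max-old-generation-size flag only raises the default old-generation limit on machines with plenty of RAM; it does not request hugepages. Whether a large Node allocation ends up on 2 MB pages is decided by the kernel's THP setting. On scans of multi-GB buffers the speedup is typically 1.1–1.3×; on memory-bound code (graph traversal, JSON serialization) the gain comes mostly from saved TLB walks rather than fewer cycles spent in the loop body.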

Page-size and policy variants

4 KB base pages. The default everywhere. Each PTE covers 4 KB; a 4-level walk is 4 dependent memory reads before the data access itself.
2 MB hugepages. One page-directory entry covers 2 MB; the walk skips the bottom level. Each TLB entry now reaches 512× more memory.
1 GB gigantic pages. One page-directory-pointer entry covers 1 GB; the walk skips the bottom two levels. Reserved at boot via hugepagesz=1G hugepages=N on the kernel command line, used by HPC and DBs with tens of GB working sets.
Transparent Hugepages (THP). Kernel opportunistically promotes contiguous 4 KB pages to 2 MB. Fewer config knobs, occasional pause-time spikes from khugepaged defragmentation. Can be set always, madvise, or never.
NUMA-aware paging. On multi-socket systems the kernel allocates a process's pages on the node where the touching thread runs (first touch), and automatic NUMA balancing (the numa_balancing sysctl) can migrate them later. Cross-socket access is 1.5–3× slower.
5-level paging. Linux supports a 57-bit virtual address space via a fifth table level on Ice Lake and later, useful when a process needs more than 128 TB of virtual address space or the machine has more than 64 TB of physical memory.
Demand paging vs prefaulting. Default is lazy: pages fault on first touch. Latency-sensitive code prefaults with MAP_POPULATE or mlock, or starts readahead on file-backed mappings with MADV_WILLNEED.
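Demand paging can be observed from user space by counting minor faults around the first and second touch of a fresh mapping. The sketch below assumes a Unix system where the `resource` module reports `ru_minflt`; exact counts vary with kernel, THP settings, and whatever else the interpreter allocates, so no specific numbers are claimed.

```python
import mmap
import resource

PAGE = mmap.PAGESIZE
SIZE = 64 * 1024 * 1024                       # 64 MiB anonymous mapping


def minor_faults() -> int:
    """Minor page faults taken by this process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


m = mmap.mmap(-1, SIZE)                        # anonymous; no frames assigned yet
before = minor_faults()
for off in range(0, SIZE, PAGE):
    m[off] = 1                                 # first touch of each page
first_pass = minor_faults() - before

before = minor_faults()
for off in range(0, SIZE, PAGE):
    m[off] = 2                                 # pages are already resident
second_pass = minor_faults() - before
m.close()

print(f"first pass: {first_pass} minor faults, second pass: {second_pass}")
```

The first pass is where lazy allocation pays its cost; prefaulting moves that cost to mapping time so it does not land in a latency-sensitive loop.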

Costed claims

4 KB page: 12 bits of offset plus 36 bits of virtual page number (four 9-bit indices) in a 48-bit address. PTE is 8 bytes, so a 512-entry page table is 4 KB and fits exactly one page.
TLB reach: 64 dTLB entries × 4 KB = 256 KB. With 2 MB hugepages, 64 × 2 MB = 128 MB — a 512× increase per TLB entry.
Page-walk cost: 4 dependent memory reads, ~30 cycles best case (all in L1), 100+ cycles when the table itself is in L3 or DRAM.
Page-table size: ~0.2% overhead (8 bytes of PTE per 4 KB page); a 1 GB process needs ~2 MB of page tables. With 2 MB hugepages the leaf level shrinks 512×, to ~0.0004%.
Minor fault: ~1 µs (allocate frame, install PTE, return). Major fault: ~100 µs on NVMe SSD, ~10 ms on spinning disk.
TLB shootdown: ~5–20 µs cross-core; on big systems with many CPUs, batched TLB invalidations are a measurable cost of memory unmapping.
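The page-table size claim can be checked with a short calculation. This sketch assumes a single contiguous, aligned mapping and counts every table on the path (PML4, PDPT, PD, PT); the function names are illustrative.

```python
ENTRIES = 512             # 8-byte entries per 4 KB table
TABLE_BYTES = 4096


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def table_bytes(mapped: int, leaf_page: int) -> int:
    """Page-table bytes for one contiguous, aligned mapping of `mapped` bytes.

    leaf_page is 4 KB (leaf in PT), 2 MB (leaf in PD) or 1 GB (leaf in PDPT).
    """
    levels = {4 << 10: 4, 2 << 20: 3, 1 << 30: 2}[leaf_page]
    tables = 0
    units = ceil_div(mapped, leaf_page)        # leaf entries needed
    for _ in range(levels):
        units = ceil_div(units, ENTRIES)       # tables needed at this level
        tables += units
    return tables * TABLE_BYTES


for label, size in (("4 KB", 4 << 10), ("2 MB", 2 << 20), ("1 GB", 1 << 30)):
    print(f"1 GiB mapped with {label} pages: {table_bytes(1 << 30, size):,} bytes of tables")
```

For 1 GiB with 4 KB pages the 512 leaf tables dominate (about 2 MB); with larger pages only the handful of upper-level tables remain.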

Common bugs and edge cases

Page-fault storm. A process touches a huge mapping non-sequentially and faults on every page. Symptoms: fast-growing minflt/majflt counters in /proc/PID/stat, low IPC. Fix: MAP_POPULATE, MADV_WILLNEED, or warm up sequentially.
TLB shootdown stalls. Unmapping or changing protections requires invalidating other CPUs' TLBs (an IPI). On 96-core boxes this can hit 10s of microseconds. Batched mprotect and mremap reduce the count.
THP latency tail. khugepaged running compaction can introduce stop-the-world stalls in the multi-millisecond range. Latency-critical apps switch THP to madvise or disable it.
NUMA placement bug. A worker thread allocates its pages from one socket then migrates; subsequent accesses cross sockets. Fix: numactl --membind or pin threads with numactl --cpunodebind.
Huge swap thrash. When RSS exceeds RAM the kernel evicts pages to swap; once you start hitting them again it spirals. vm.swappiness, zswap, and proper sizing matter.
32-bit address-space exhaustion. 3 GB of usable virtual memory; large mmaps fail with ENOMEM even with plenty of physical RAM. Ancient bug class still alive in embedded contexts.
Forgot to flush after PTE change. Kernel code that updates a PTE without invalidating the TLB sees stale translations — a very subtle source of memory corruption when writing kernel modules.
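When chasing a page-fault storm, the fault counters in /proc/PID/stat are the quickest signal. The sketch below pulls them out, assuming the field order documented in proc(5), where minflt is field 10 and majflt is field 12; the helper name `fault_counts` and the sample line are illustrative.

```python
def fault_counts(stat_line: str) -> tuple:
    """Return (minflt, majflt) from the contents of /proc/<pid>/stat."""
    # The comm field is wrapped in parentheses and may contain spaces, so split
    # after the last ')'. The remainder starts at field 3 (state).
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[7]), int(rest[9])          # fields 10 (minflt) and 12 (majflt)


sample = "1234 (my prog) S 1 1234 1234 0 -1 4194304 5021 0 17 0 3 1 0 0 20 0 1 0"
print(fault_counts(sample))

if __name__ == "__main__":
    try:
        with open("/proc/self/stat") as f:
            print("self:", fault_counts(f.read()))
    except OSError:
        print("/proc is not available on this system")
```

Sampling the counters twice and taking the difference gives a fault rate, which is more telling than the absolute values.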

Related concepts