Transparent Huge Pages

One TLB entry that maps 512 pages — automatically, with real tradeoffs

Transparent Huge Pages (THP) is a Linux memory feature that automatically backs process memory with 2MB pages instead of 4KB pages, shortening page-table walks and extending the reach of each TLB entry by 512×, at the cost of allocation latency, defrag stalls, and internal fragmentation.

  • Base page (x86-64): 4 KB
  • Huge page (PMD): 2 MB
  • 4KB pages per huge page: 512
  • TLB reach gain: up to 512×
  • Modes: always / madvise / never
  • Introduced: Linux 2.6.38 (2011)

Why pages exist, and why 4KB hurts

Every memory access your program makes uses a virtual address. The CPU has to translate that to a physical address before it can touch RAM, and it does so by walking a tree of page tables. On x86-64 with 4-level paging, the smallest unit of mapping is the 4KB page, and a full translation costs four memory loads — one per level of the tree.

Four loads per access would be ruinous, so the CPU caches recent translations in the TLB (Translation Lookaside Buffer). A TLB hit makes translation effectively free; a TLB miss triggers a hardware page-table walk. The catch is that the TLB is tiny — a modern core has on the order of 64 entries in its L1 data TLB and roughly 1024–1536 in its L2 (unified) TLB. With 4KB pages, 1536 entries map only 1536 × 4KB = 6MB of memory. Any working set larger than 6MB starts missing the TLB constantly, and each miss costs tens to hundreds of cycles.

The fix is a bigger page. If one page is 2MB, then one TLB entry maps 2MB, and the same 1536 entries now reach 1536 × 2MB = 3GB — a 512× increase in TLB reach. The page-table walk also gets shorter: a 2MB page is mapped directly at the PMD level (the third of four levels), so a walk that misses the TLB does three loads instead of four. THP is Linux's way of getting that benefit without rewriting the application.
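
As a rough illustration of the walk described above, the sketch below splits a 48-bit x86-64 virtual address into its 9-bit table indices and page offset. The helper name split_vaddr and its field names are made up for this example; only the 9/9/9/9/12 bit layout comes from the 4-level scheme itself. With a 2MB page the walk stops at the PMD, so the low 21 bits become the offset.

```python
def split_vaddr(vaddr: int, huge: bool = False) -> dict:
    """Split a 48-bit x86-64 virtual address into page-table indices.

    Illustrative helper, not a kernel API. Each level indexes a
    512-entry table, so each index is 9 bits wide.
    """
    fields = {
        "pgd": (vaddr >> 39) & 0x1FF,  # level 1 index
        "pud": (vaddr >> 30) & 0x1FF,  # level 2 index
        "pmd": (vaddr >> 21) & 0x1FF,  # level 3 index
    }
    if huge:
        # 2MB page: the PMD entry maps the page, low 21 bits are the offset
        fields["offset"] = vaddr & ((1 << 21) - 1)
    else:
        fields["pte"] = (vaddr >> 12) & 0x1FF  # level 4 index
        fields["offset"] = vaddr & ((1 << 12) - 1)
    return fields


addr = 0x7F12_3456_789A  # arbitrary example address
small = split_vaddr(addr)
huge = split_vaddr(addr, huge=True)
# Every field except "offset" corresponds to one table load on a TLB miss
print("4KB walk loads:", len(small) - 1, small)
print("2MB walk loads:", len(huge) - 1, huge)
```

The 4KB walk reports four loads and the 2MB walk three, and the huge-page offset is simply the 4KB PTE index and offset glued together.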

The mechanism: promotion, collapse, and split

Andrea Arcangeli's THP patches were merged into Linux 2.6.38 in 2011. The "transparent" part means the kernel manufactures and dismantles huge pages on your behalf through three operations:

  • Fault-time promotion. When a process faults on a fresh anonymous mapping and the region is at least 2MB and 2MB-aligned, the page-fault handler tries to allocate one physically-contiguous 2MB page directly, installing a single PMD entry instead of a 4KB PTE. This is the cheapest path — no later collapse needed.
  • Background collapse via khugepaged. Memory that ended up as 4KB pages anyway (because no contiguous 2MB block was free at fault time) gets a second chance. The kernel thread khugepaged periodically scans address spaces for 2MB-aligned ranges backed by 4KB pages and collapses them: it allocates a fresh 2MB page, copies the existing 4KB pages into it, and rewrites the page table to point at the huge page. With the default max_ptes_none setting of 511, the range does not have to be fully populated; unmapped slots are zero-filled.
  • Splitting. A huge page is not forever. If the kernel needs to swap part of it, change protection on a sub-range, or hand a CoW copy to a child process, it splits the 2MB page back into 512 individual 4KB PTEs so it can operate at fine granularity again.

The arithmetic that governs all of this: on x86-64, 2MB ÷ 4KB = 512 base pages per huge page. The page must be naturally aligned — its physical and virtual start addresses are multiples of 2MB. That alignment requirement is the whole reason THP can be slow: the page allocator has to produce a contiguous, aligned 2MB chunk (an "order-9" block, since 2⁹ = 512), and on a fragmented system that may require compaction.
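
The alignment rule can be sketched in a few lines. The helper below (huge_aligned_span, a name invented for this example) counts how many naturally aligned 2MB slots fit inside a mapping. It is only the arithmetic, under the assumption that nothing else, such as VMA flags or mode settings, rules the range out.

```python
HUGE = 2 * 1024 * 1024  # 2MB huge page size on x86-64


def huge_aligned_span(start: int, length: int, huge: int = HUGE):
    """Return (first, last_end, count) for the 2MB-aligned slots that fit
    inside [start, start + length). Hypothetical helper for illustration."""
    end = start + length
    first = (start + huge - 1) & ~(huge - 1)  # round start up to 2MB
    last_end = end & ~(huge - 1)              # round end down to 2MB
    count = max(0, (last_end - first) // huge)
    return first, last_end, count


# A 5MB mapping that starts 4KB past a 2MB boundary
start = 0x7F00_0000_1000
first, last_end, count = huge_aligned_span(start, 5 * 1024 * 1024)
print(hex(first), hex(last_end), count)

# 512 base pages per huge page means an order-9 allocation
print("order of a 2MB block:", (HUGE // 4096).bit_length() - 1)
```

Here a 5MB mapping yields only one huge-page slot, because the misaligned head and tail must stay on 4KB pages.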

When THP helps and when it hurts

THP is a near-free win when:

  • The working set is large and densely accessed. In-memory analytics, scientific simulation, JVM heaps, and large hash tables all sweep gigabytes and thrash a 4KB TLB. They are exactly the workloads that benefit.
  • Memory is allocated in big, long-lived chunks and touched fully. A 4GB tensor that lives for the duration of training is an ideal candidate.
  • You are not latency-sensitive at the tail. If you care about throughput, not p99.9 jitter, the occasional defrag stall is amortized away.

THP hurts when:

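  • You fork-and-CoW heavily. Redis BGSAVE snapshots via fork and PostgreSQL forks a backend per connection (MongoDB's THP guidance stems from sparse access patterns rather than fork). A 4KB write to a shared huge page can force the kernel to copy the whole 2MB page, up to 512× as much memory during the save; kernels from 5.8 on split the mapping and copy only the faulting 4KB page, which removes the copy cost but also the huge page.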
  • You need predictable tail latency. Synchronous compaction to assemble a 2MB block can stall a thread for milliseconds. Latency-sensitive services (low-latency trading, real-time media) are the classic THP victims.
  • Your heap is sparse. Pointer-chasing structures that touch one cache line per 2MB region pay full 2MB resident cost for 64 bytes of useful data — runaway internal fragmentation.
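
To put a rough number on the sparse-heap case, the sketch below compares the resident cost of the same scattered accesses under 4KB and 2MB pages. The workload (2,000 random cache-line touches over a 1GB heap) is made up, and the model assumes the worst case where every touched region really is backed by a full page of the given size.

```python
import random

KB, MB = 1024, 1024 * 1024


def resident_bytes(addresses, page_size):
    """Bytes of RAM held if every touched page is fully resident."""
    return len({a // page_size for a in addresses}) * page_size


random.seed(0)  # fixed seed so the run is repeatable
heap = 1024 * MB
# 2,000 cache-line touches scattered across a 1GB heap (made-up workload)
touches = [random.randrange(0, heap, 64) for _ in range(2000)]

useful = len(touches) * 64
for name, size in (("4KB", 4 * KB), ("2MB", 2 * MB)):
    rss = resident_bytes(touches, size)
    print(f"{name}: {rss / MB:.1f} MB resident for {useful / KB:.0f} KB of data")
```

The 4KB case is bounded by 2,000 pages (under 8MB), while the 2MB case can approach the full 1GB heap, which is the inflation the last bullet describes.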

THP vs explicit huge pages vs 4KB

| | 4KB base pages | Transparent Huge Pages | hugetlbfs (explicit) |
| --- | --- | --- | --- |
| Page size | 4 KB | 2 MB (anonymous; mTHP adds intermediate sizes) | 2 MB or 1 GB, fixed |
| Application changes | none | none (or one madvise call) | mmap a reserved pool / SHM_HUGETLB |
| Allocation | on demand, never fails for size | opportunistic, falls back to 4KB | pre-reserved at boot, fails if pool empty |
| Can be swapped | yes | yes (splits first) | no (pinned in RAM) |
| TLB reach (1536-entry L2 TLB) | 6 MB | up to 3 GB | 3 GB (2MB) / 1.5 TB (1GB) |
| Latency predictability | high | low (defrag/compaction stalls) | high (no runtime allocation) |
| Internal fragmentation | minimal | up to ~2 MB per sparse mapping | up to page size, by design |
| Best for | tiny/sparse/latency-critical heaps | large dense heaps, hands-off tuning | databases, VMs, deterministic HPC |

The headline difference is who decides. With hugetlbfs you decide, up front, and pay for determinism with rigidity (the pool is reserved whether you use it or not). With THP the kernel decides, opportunistically, and you trade determinism for zero configuration. 4KB is the safe default that never surprises you and never speeds you up.

What the numbers actually say

  • 512× TLB reach. One PMD entry maps 2MB; 512 PTEs map the same range. A 1536-entry L2 TLB covers 6MB of 4KB pages but 3GB of huge pages.
  • 3-level walk instead of 4. A huge page is installed at the PMD level, so a TLB miss costs 3 dependent loads (~3 × 100ns worst case off-package) rather than 4. The bigger win is simply missing far less often.
  • 5–30% throughput on TLB-bound code. Published measurements on memory-intensive HPC and database benchmarks routinely land in this band; some pointer-light streaming kernels see more, latency-bound code sees none.
  • 512× CoW write amplification. Under fork-based snapshotting, one 4KB store forces a 2MB copy: 2MB ÷ 4KB = 512× the data movement per dirtied page during the save window.
  • Millisecond defrag stalls. Synchronous compaction to free a contiguous 2MB block has been measured to add multi-millisecond pauses to individual allocations — catastrophic for a service with a 1ms p99 budget.
  • khugepaged default cadence. Scans pages_to_scan (default 4096) per pass, sleeping scan_sleep_millisecs (default 10000ms) between passes — so collapse is gentle, not instantaneous.
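
The khugepaged cadence in the last bullet translates into a surprisingly slow scan rate. The sketch below is a back-of-the-envelope model, not the kernel's scheduling logic: it counts only the sleep between passes and assumes khugepaged spends its whole budget on one region, whereas the real thread shares it across every eligible address space. The function name is invented for this example; the defaults are the ones quoted above.

```python
PAGE = 4096  # base page size in bytes
GB = 1024 ** 3


def khugepaged_scan_seconds(region_bytes, pages_to_scan=4096,
                            scan_sleep_millisecs=10_000):
    """Rough lower bound on the time to scan a region once, counting
    only the sleep after each pass (the scan work itself is ignored)."""
    pages = -(-region_bytes // PAGE)          # ceiling division
    passes = -(-pages // pages_to_scan)       # passes needed to cover it
    return passes * scan_sleep_millisecs / 1000


print(khugepaged_scan_seconds(1 * GB))  # 640.0
```

At the defaults that is 16MB per pass, so a single sweep over 1GB takes more than ten minutes, which is why fault-time promotion does most of the work in practice.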

Modeling TLB reach in JavaScript

You can't allocate huge pages from JavaScript, but the cost model is the same arithmetic the kernel reasons about. This simulates TLB reach and the miss rate for a given working set under each page size:

const KB = 1024, MB = 1024 * KB;

// A simple capacity-TLB model: if the working set fits in TLB reach,
// near-zero misses; beyond it, miss rate scales with the overflow.
function tlbModel({ workingSet, pageSize, tlbEntries }) {
  const reach = pageSize * tlbEntries;          // bytes the TLB can map
  const pagesTouched = Math.ceil(workingSet / pageSize);
  // fraction of touched pages that cannot be resident in the TLB
  const missRate = pagesTouched <= tlbEntries
    ? 0
    : (pagesTouched - tlbEntries) / pagesTouched;
  return { reach, pagesTouched, missRate };
}

const tlbEntries = 1536;                         // typical L2 dTLB
const workingSet = 2 * 1024 * MB;                // 2 GB hot region

const small = tlbModel({ workingSet, pageSize: 4 * KB,  tlbEntries });
const huge  = tlbModel({ workingSet, pageSize: 2 * MB,  tlbEntries });

console.log('4KB  reach:', (small.reach / MB).toFixed(1), 'MB,',
            'miss rate:', (small.missRate * 100).toFixed(1) + '%');
console.log('2MB  reach:', (huge.reach  / MB).toFixed(1), 'MB,',
            'miss rate:', (huge.missRate  * 100).toFixed(1) + '%');
console.log('reach ratio:', huge.reach / small.reach + '×');
// 4KB  reach: 6.0 MB,  miss rate: 99.7%
// 2MB  reach: 3072.0 MB, miss rate: 0.0%
// reach ratio: 512×

The model captures the core intuition: a 2GB hot region misses the 4KB TLB almost every access (only 6MB of it can be resident at once), but fits entirely inside huge-page TLB reach. The 512× ratio falls straight out of 2MB ÷ 4KB.

Driving THP from Python (Linux)

On real Linux you control THP through /sys and, per-mapping, through madvise. Here is how to read the global mode, and how to opt a specific allocation in or out:

import ctypes, mmap, os

SYS = "/sys/kernel/mm/transparent_hugepage/"

def thp_mode():
    # File looks like: "always [madvise] never" — brackets mark the active mode
    with open(SYS + "enabled") as f:
        line = f.read()
    return line.split("[")[1].split("]")[0]

def thp_stats():
    # Count THP-backed bytes for the *current* process
    total = 0
    with open(f"/proc/{os.getpid()}/smaps") as f:
        for line in f:
            if line.startswith("AnonHugePages:"):
                total += int(line.split()[1])  # KiB
    return total  # KiB currently backed by huge pages

# madvise(2) advice constants (Linux asm-generic)
MADV_HUGEPAGE   = 14   # opt this range INTO THP (used in 'madvise' mode)
MADV_NOHUGEPAGE = 15   # opt this range OUT, even in 'always' mode

libc = ctypes.CDLL("libc.so.6", use_errno=True)

def advise(buf, advice):
    addr = ctypes.addressof((ctypes.c_char * len(buf)).from_buffer(buf))
    # round down to page boundary; madvise needs page-aligned start
    page = mmap.PAGESIZE
    start = addr & ~(page - 1)
    length = len(buf) + (addr - start)
    if libc.madvise(ctypes.c_void_p(start), ctypes.c_size_t(length), advice) != 0:
        raise OSError(ctypes.get_errno(), "madvise failed")

print("global THP mode:", thp_mode())

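# Allocate a 64 MB private anonymous region and request huge-page backing.
# Python's mmap defaults to MAP_SHARED, which would be shmem-backed and
# would not show up under AnonHugePages, so MAP_PRIVATE is set explicitly.
region = mmap.mmap(-1, 64 * 1024 * 1024,
                   flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)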
advise(region, MADV_HUGEPAGE)
region.write(b"\x00" * len(region))   # fault it in so promotion can happen
print("AnonHugePages:", thp_stats(), "KiB")

The key idea is the madvise handshake. In madvise mode the kernel only promotes regions you flagged with MADV_HUGEPAGE, so a memory allocator (jemalloc, tcmalloc, the JVM) can opt its big arenas in while leaving small allocations on 4KB. Conversely, MADV_NOHUGEPAGE lets a latency-critical buffer escape THP even when the global mode is always.

Variants and related knobs worth knowing

1GB huge pages (PUD-level). On CPUs with the pdpe1gb feature, the page table can map a full 1GB at the PUD level. THP historically only ever produced 2MB pages for anonymous memory; 1GB pages are practically only available through hugetlbfs. They give enormous nominal TLB reach (1.5TB if 1536 entries could each hold a 1GB mapping, though many cores provide far fewer 1GB-capable entries) but coarse, wasteful granularity.

Multi-size THP (mTHP). Recent kernels (6.8 and later) generalize THP beyond a single 2MB size to a range of intermediate orders, for example 16KB, 64KB, or 1MB "medium" pages, so the kernel can pick a size that balances TLB reach against fragmentation instead of making an all-or-nothing 2MB jump. Each size is controlled separately under /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/.
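
A quick way to see which sizes a given kernel exposes is to list those per-size directories. The sketch below assumes each directory contains an enabled file, as on mTHP-capable kernels; the helper name mthp_settings is made up, and on older kernels it simply returns an empty dict.

```python
from pathlib import Path


def mthp_settings(root="/sys/kernel/mm/transparent_hugepage"):
    """Map each per-size directory (e.g. 'hugepages-64kB') to the raw
    contents of its 'enabled' file. Returns {} where mTHP is absent."""
    base = Path(root)
    if not base.is_dir():
        return {}
    settings = {}
    for d in sorted(base.glob("hugepages-*kB")):
        enabled = d / "enabled"
        if enabled.is_file():
            settings[d.name] = enabled.read_text().strip()
    return settings


for name, value in mthp_settings().items():
    print(f"{name}: {value}")
```

The root parameter exists only so the function can be pointed at a test directory; on a real system the default path is the one named above.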

File-backed and shmem THP. THP began as anonymous-only, but the kernel later gained huge-page support for tmpfs/shmem and, more recently, for the page cache of regular files, so executables and mapped files can also benefit from 2MB TLB entries.

The defrag knob. Separate from enabled, the file /sys/kernel/mm/transparent_hugepage/defrag controls how hard the kernel works to assemble a contiguous block: always (synchronous, can stall), defer (kick kswapd/kcompactd and use 4KB now), defer+madvise, madvise, or never. Most latency tuning happens here, not in enabled.
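
Reading the knob follows the same bracket convention as the enabled file. The sketch below is a small parser for that format; the function names are invented for this example, and the sample string just lists the five values named above with one of them marked active.

```python
def parse_sysfs_choice(text: str):
    """Parse a sysfs line such as 'always defer [madvise] never'.

    Returns (active, options). Hypothetical helper; the brackets mark
    the currently selected value.
    """
    options, active = [], None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        options.append(token)
    return active, options


def read_defrag(path="/sys/kernel/mm/transparent_hugepage/defrag"):
    """Read and parse the defrag knob on a live Linux system."""
    with open(path) as f:
        return parse_sysfs_choice(f.read())


sample = "always defer defer+madvise [madvise] never"
active, options = parse_sysfs_choice(sample)
print(active, options)
if active == "always":
    print("synchronous compaction on fault: expect possible stalls")
```

Changing the value means writing one of the option strings back to the same file as root, which is left out here on purpose.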

Common pitfalls and surprises

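  • Blaming THP for "phantom" memory growth. Huge-page-backed memory becomes resident in 2MB steps, so RSS (and AnonHugePages in /proc/PID/smaps) can jump by 2MB when a process touches a single byte of a new region. The process looks like it grew when it only touched one new huge-page region.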
  • Latency spikes traced to synchronous defrag. The symptom is sporadic multi-millisecond stalls under memory fragmentation. The fix is usually the defrag knob (set to defer or madvise), not disabling THP wholesale.
  • CoW write amplification after fork. Databases that fork to snapshot see write storms because a 4KB store dirties a whole 2MB page; this is the #1 reason Redis/Mongo/Postgres ship "disable THP" in their docs.
  • Expecting 1GB pages from THP. THP for anonymous memory produces 2MB pages; if you need 1GB pages you must use hugetlbfs with an explicit reservation.
  • Forgetting madvise must be page-aligned. madvise(MADV_HUGEPAGE) on a non-page-aligned start address returns EINVAL; round the start down and extend the length.
  • Measuring on a freshly-booted box. Right after boot, memory is unfragmented and THP allocation almost always succeeds, so benchmarks look great. The fragmentation-driven stalls only appear after the system has been churning memory for hours, so test under realistic uptime.
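
Several of these pitfalls can be spotted from the THP counters in /proc/vmstat. The sketch below parses a handful of them and computes how often fault-time allocation fell back to 4KB pages. The sample values are made up, and the helper names are invented for this example; the counter names are the ones THP-enabled kernels report.

```python
THP_COUNTERS = ("thp_fault_alloc", "thp_fault_fallback",
                "thp_collapse_alloc", "compact_stall")


def parse_vmstat(text: str, names=THP_COUNTERS) -> dict:
    """Pick selected 'name value' lines out of /proc/vmstat-style text."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in names:
            out[parts[0]] = int(parts[1])
    return out


def fallback_ratio(stats: dict) -> float:
    """Share of huge-page fault attempts that fell back to 4KB pages."""
    ok = stats.get("thp_fault_alloc", 0)
    failed = stats.get("thp_fault_fallback", 0)
    total = ok + failed
    return failed / total if total else 0.0


# Made-up sample values, formatted like /proc/vmstat
sample = """thp_fault_alloc 900
thp_fault_fallback 100
thp_collapse_alloc 42
compact_stall 7
"""
stats = parse_vmstat(sample)
print(stats, f"fallback ratio: {fallback_ratio(stats):.0%}")
```

On a live system the input would be the contents of /proc/vmstat, sampled before and after a benchmark; a fallback ratio that climbs with uptime is a sign that fragmentation, not the workload, is changing the results.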

Frequently asked questions

What is the difference between transparent huge pages and hugetlbfs?

hugetlbfs is explicit: you reserve a fixed pool of huge pages at boot and the application maps them on purpose, so allocation never fails silently and the memory is never swapped or split. THP is automatic: the kernel promotes ordinary anonymous memory to 2MB pages opportunistically at fault time or in the background via khugepaged, with no application changes — but it can also fall back to 4KB, split pages under memory pressure, and add unpredictable latency.

How much can transparent huge pages reduce TLB misses?

One 2MB huge-page TLB entry covers the same address range as 512 separate 4KB entries (2MB ÷ 4KB = 512). So a working set that needed 512 TLB entries can fit in one, and a typical 1536-entry data TLB can map 3GB with huge pages versus only 6MB with 4KB pages — a 512× increase in TLB reach. Workloads that thrash the TLB, like large in-memory databases and HPC kernels, can see 5–30% throughput gains.

Why do Redis, MongoDB, and Postgres recommend disabling THP?

The fork-heavy ones, Redis in particular, snapshot memory via fork and rely on copy-on-write. With THP a single 4KB write touches a 2MB page, so the kernel may have to copy the whole 2MB, 512× the write amplification during a save (kernels from 5.8 on split the huge page and copy 4KB instead). THP's synchronous defragmentation can also stall a thread for milliseconds while it compacts memory to assemble a 2MB block, producing latency spikes. The usual guidance is to set THP to 'never' or 'madvise' rather than 'always'.

What do the always, madvise, and never THP modes mean?

'always' promotes every eligible mapping to huge pages automatically — maximum TLB benefit, maximum risk of latency spikes. 'never' disables THP entirely; only explicit hugetlbfs huge pages remain. 'madvise' is the middle ground: THP is used only for memory the application explicitly opts into with madvise(MADV_HUGEPAGE). Most production guidance favors 'madvise' so latency-sensitive code stays on 4KB while heavy allocators opt in.

What is khugepaged and what does it do?

khugepaged is a background kernel thread that scans process address spaces looking for 2MB-aligned ranges of 4KB pages it can 'collapse' into a single 2MB huge page. It runs periodically (default: scan up to 4096 pages, then sleep 10 seconds), so memory that wasn't huge at fault time becomes huge later without the application doing anything. The collapse briefly blocks access to that range while the data is copied and the page tables are rewritten.

Why might transparent huge pages waste memory?

THP backs memory in 2MB units. If a process touches one byte of a fresh, eligible region, the kernel may hand it a full 2MB page, wasting up to 2MB minus one byte of internal fragmentation. Sparse, pointer-heavy heaps that touch scattered cache lines across many pages are the worst case; THP can inflate their resident set substantially, which is exactly why memory-frugal services often turn it off.