Computer Architecture

TLB (Translation Lookaside Buffer)

A 64-1024-entry CPU cache for page table walks — miss costs 100+ cycles

The TLB is a CPU-level cache that stores recent virtual-address-to-physical-address translations. On every memory access, the CPU first checks the TLB; on a hit (~99% with locality), translation is free (1 cycle); on a miss, the CPU walks the page table — 4 levels on x86-64, costing 100-300 cycles. Modern x86 has separate L1 ITLB (instruction, ~64 entries) and L1 DTLB (data, ~64 entries) plus an L2 STLB (~1500 entries). TLB shootdowns — when one core invalidates a mapping and signals others to flush via IPI — cost ~5-10 µs and are a major source of multi-core overhead in kernel code.

  • L1 entries: 64-128 per type
  • L2 STLB: ~1500
  • Hit latency: 1 cycle
  • Miss (page walk): 100-300 cycles
  • Shootdown: 5-10 µs (IPI)
  • Hugepage support: 2 MB and 1 GB
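
As a rough sketch of how these figures combine, the average translation cost per access is a weighted sum of the hit and miss latencies. The function name and the 200-cycle default (a midpoint of the 100-300 cycle range) are illustrative assumptions, not measured values.

```python
def avg_translation_cycles(hit_rate, hit_cycles=1.0, miss_cycles=200.0):
    """Expected translation cost per memory access, in cycles."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


if __name__ == "__main__":
    for rate in (0.999, 0.99, 0.90):
        cost = avg_translation_cycles(rate)
        print(f"hit rate {rate:.1%}: {cost:.2f} cycles per access")
```

Under these assumptions, dropping from a 99% to a 90% hit rate raises the average from about 3 cycles to about 21 cycles per access.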

Why TLB matters

  • Database scans. A sequential scan of a 100 GB table with 4 KB pages requires 25M unique translations. Even a perfect 1500-entry L2 STLB covers only 6 MB of working set; the rest miss continuously. Postgres, MySQL, and ClickHouse all benefit measurably from hugepages.
  • JVM heap walks. The OpenJDK G1 collector touches every reachable object during marking. On a 64 GB heap with 4 KB pages, a full mark can touch up to 16M distinct pages, each a potential page walk. Containers running JVMs frequently configure 2 MB hugepages explicitly to cut this cost.
  • HPC and ML. Dense matrix code, GPU host-side staging buffers, and PyTorch CUDA tensor allocations hit TLB pressure. Linux Transparent HugePages (THP) and explicit hugetlbfs commonly yield 5-20% speedups.
  • Container density. Each container has its own page tables; high container counts on shared hosts mean frequent CR3 switches and TLB invalidations. PCID (kernel 4.14+) and per-cpuset hugepage pools mitigate.
  • Side-channel security. The Meltdown mitigation forced KPTI (originally KAISER), which separates user and kernel page tables. Each syscall now switches CR3; without PCID, that flushes the TLB entirely. The performance hit was 5-30% on syscall-heavy workloads until PCID/INVPCID restored most of it.
  • Memory-mapped I/O. Files backed by mmap use page faults to lazily load and the TLB to translate. madvise(MADV_HUGEPAGE) tells the kernel to back this region with 2 MB pages where possible.
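
The translation counts quoted in the list above follow from dividing the working set by the page size. A minimal sketch, assuming binary units (1 GB = 2^30 bytes); `pages_needed` is an illustrative name:

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30


def pages_needed(working_set_bytes, page_bytes):
    """Number of pages (distinct translations) needed to cover the working set."""
    return -(-working_set_bytes // page_bytes)  # ceiling division


if __name__ == "__main__":
    sets = (("100 GB table", 100 * GIB), ("64 GB heap", 64 * GIB))
    sizes = (("4 KB", 4 * KIB), ("2 MB", 2 * MIB), ("1 GB", GIB))
    for label, size in sets:
        for name, page in sizes:
            print(f"{label}, {name} pages: {pages_needed(size, page):,}")
```

With binary units the 100 GB table comes to about 26M 4 KB pages, which the text rounds to 25M; the 64 GB heap comes to about 16.8M, rounded to 16M.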

Anatomy of the modern TLB hierarchy

An Intel Sapphire Rapids core (representative of 2023-era x86) has:

  • L1 ITLB. 128 entries for 4 KB pages, 8 entries for 2 MB pages. Indexed in parallel with instruction fetch; ~1 cycle access.
  • L1 DTLB. 96 entries for 4 KB pages, 32 entries for 2 MB, 4 entries for 1 GB.
  • L2 STLB (Shared TLB). 2048 entries serving both instruction and data, mostly for 4 KB and 2 MB pages.
  • Paging-structure caches. Internal caches of intermediate page-walk results (PML4, PDPT, PD entries) so a partial walk avoids redoing the upper levels.

AMD Zen 4 has slightly different sizes: L1 DTLB 72 entries, L2 STLB 3072 entries. Apple Silicon (M-series) is believed to be comparable, but Apple does not publish exact figures.

The 4-level page walk

x86-64 with 4-level paging splits a 48-bit virtual address into:

  • Bits 47-39: PML4 index (9 bits → 512 entries).
  • Bits 38-30: PDPT index.
  • Bits 29-21: Page Directory index.
  • Bits 20-12: Page Table index.
  • Bits 11-0: Page offset (4 KB page).

The walk: CR3 → PML4[i1] → PDPT[i2] → PD[i3] → PT[i4] → physical page frame. Each step loads an 8-byte entry from the next-level table. With cold caches, that's 4 memory accesses; with paging-structure cache hits, 1-2. Typical observed cost: 100-300 cycles per miss on modern x86. 5-level paging (introduced in Ice Lake / Sapphire Rapids server) extends to 57-bit virtual addresses with one more level.
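
The bit layout above can be expressed as shifts and masks. This is a minimal sketch for 4 KB pages under 4-level paging; `split_va` and the dictionary keys are illustrative names.

```python
def split_va(va):
    """Split a 48-bit virtual address into 4-level paging indexes (4 KB pages)."""
    if not 0 <= va < (1 << 48):
        raise ValueError("expected the low 48 bits of a virtual address")
    return {
        "pml4": (va >> 39) & 0x1FF,   # bits 47-39
        "pdpt": (va >> 30) & 0x1FF,   # bits 38-30
        "pd": (va >> 21) & 0x1FF,     # bits 29-21
        "pt": (va >> 12) & 0x1FF,     # bits 20-12
        "offset": va & 0xFFF,         # bits 11-0
    }


if __name__ == "__main__":
    example = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
    print(hex(example), split_va(example))
```

Each 9-bit index selects one of 512 eight-byte entries, so every table occupies exactly one 4 KB page.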

Hugepages and TLB reach

"TLB reach" is the working set covered by a fully-occupied TLB. Math:

  • L2 STLB 2048 entries × 4 KB = 8 MB reach.
  • L2 STLB 2048 entries × 2 MB = 4 GB reach.
  • L1 DTLB 4 entries × 1 GB = 4 GB reach (plus all 2 MB and 4 KB tracked in STLB).

For a database with a 64 GB buffer pool:

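  • 4 KB pages: TLB covers 0.012% of the working set; during scans, almost every newly touched page misses (~99%).
  • 2 MB pages: TLB covers 6.25%; miss rate falls to 5-15%.
  • 1 GB pages: the pool needs only 64 translations. The 4 L1 DTLB entries alone cover 6.25%, so near-zero misses depend on a second-level TLB that also holds 1 GB entries; where it does, coverage reaches 100%.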
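
The reach and coverage figures above can be reproduced with a few lines of arithmetic. This is a sketch using the entry counts quoted in this section; the 64-entry 1 GB case is a hypothetical stand-in for a TLB level able to hold one entry per gigabyte of the pool, not a documented hardware size.

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30


def tlb_reach(entries, page_bytes):
    """Bytes of address space covered when every entry is in use."""
    return entries * page_bytes


def coverage(entries, page_bytes, working_set_bytes):
    """Fraction of the working set covered (can exceed 1.0)."""
    return tlb_reach(entries, page_bytes) / working_set_bytes


if __name__ == "__main__":
    pool = 64 * GIB
    cases = [
        ("STLB, 4 KB pages", 2048, 4 * KIB),
        ("STLB, 2 MB pages", 2048, 2 * MIB),
        ("L1 DTLB, 1 GB pages", 4, GIB),
        ("hypothetical 64 entries, 1 GB pages", 64, GIB),
    ]
    for label, entries, page in cases:
        reach_mb = tlb_reach(entries, page) / MIB
        print(f"{label}: reach {reach_mb:,.0f} MB, "
              f"coverage {coverage(entries, page, pool):.3%}")
```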

Configuration on Linux:

  • echo 1024 > /proc/sys/vm/nr_hugepages reserves 1024 × 2 MB = 2 GB of hugepages.
  • Postgres maps shared_buffers onto them with huge_pages = on.
  • The JVM uses them with -XX:+UseLargePages.
  • Applications can call madvise(MADV_HUGEPAGE) on a mapped region to request transparent hugepages for it.
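
For the madvise route, Python 3.8+ exposes the call on mmap objects. The sketch below is an illustration, not a tuning recipe: `map_with_thp_hint` and `HUGE_2MB` are hypothetical names, the hint is advisory, and the constant exists only on Linux builds, so the code falls back to an ordinary mapping elsewhere.

```python
import mmap

HUGE_2MB = 2 * 1024 * 1024


def map_with_thp_hint(length=4 * HUGE_2MB):
    """Create an anonymous mapping and ask the kernel to back it with THP.

    Returns (mapping, hinted). hinted is False when the hint is unavailable
    on this platform or the kernel refuses it.
    """
    if hasattr(mmap, "MAP_PRIVATE") and hasattr(mmap, "MAP_ANONYMOUS"):
        buf = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    else:
        buf = mmap.mmap(-1, length)  # fallback for platforms without these flags
    hinted = False
    advice = getattr(mmap, "MADV_HUGEPAGE", None)  # Linux-only constant
    if advice is not None:
        try:
            buf.madvise(advice)
            hinted = True
        except OSError:
            pass  # e.g. THP disabled or not built into the kernel
    return buf, hinted


if __name__ == "__main__":
    region, ok = map_with_thp_hint()
    print(f"mapped {len(region)} bytes, THP hint accepted: {ok}")
    region.close()
```

Whether the region actually ends up on 2 MB pages can be checked afterwards through the AnonHugePages field in /proc/<pid>/smaps.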

TLB shootdowns and IPI overhead

The shootdown problem: when CPU 0 unmaps a page (via munmap, mprotect, or page table edit), every CPU that has cached the old translation must invalidate. Typical sequence:

  1. CPU 0 modifies the page table entry.
  2. CPU 0 sends an IPI (Inter-Processor Interrupt) to all CPUs in the process's CPU mask.
  3. Each receiving CPU executes INVLPG or full TLB flush, ACKs.
  4. CPU 0 waits for all ACKs before continuing.

Cost: 5-10 µs per shootdown round-trip; scales with CPU count. A 64-core machine with frequent munmaps (e.g. JVM compaction, JIT code unloading, mmap-heavy DBs) can lose 10-30% throughput. Linux mitigations: batched shootdowns (one IPI for many invalidations), lazy TLB on idle CPUs, RCU-deferred page table freeing.
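
The four-step sequence above can be modeled with a toy simulation. This is a sketch of the protocol's shape only: the `Cpu` class and `shootdown` function are hypothetical, IPIs are modeled as direct method calls, and no timing is simulated.

```python
class Cpu:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}  # virtual page number -> physical frame number

    def handle_shootdown_ipi(self, vpn):
        """Step 3: drop the stale entry (as INVLPG would) and acknowledge."""
        self.tlb.pop(vpn, None)
        return True  # ACK


def shootdown(page_table, initiator, cpu_mask, vpn):
    """Unmap vpn and invalidate it on every CPU in cpu_mask; returns ACK count."""
    page_table.pop(vpn, None)        # step 1: modify the page table entry
    initiator.tlb.pop(vpn, None)     # the initiator invalidates locally
    remote = [cpu for cpu in cpu_mask if cpu is not initiator]
    acks = 0
    for cpu in remote:               # step 2: one IPI per remote CPU
        if cpu.handle_shootdown_ipi(vpn):
            acks += 1
    if acks != len(remote):          # step 4: all ACKs are required to proceed
        raise RuntimeError("missing shootdown acknowledgement")
    return acks


if __name__ == "__main__":
    cpus = [Cpu(i) for i in range(4)]
    table = {0x10: 0x99}
    for cpu in cpus:
        cpu.tlb[0x10] = 0x99
    print("ACKs received:", shootdown(table, cpus[0], cpus, 0x10))
```

The loop over remote CPUs is why the cost grows with core count: the initiator cannot continue until the slowest receiver has acknowledged.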

PCID/ASID and context switches

Without PCID, every CR3 write (that is, every switch to a different address space) flushes all non-global TLB entries. With PCID enabled (Linux 4.14+), CR3 writes can set the No-Flush bit, retaining entries tagged for other processes. The CPU only matches entries whose PCID equals the current process's PCID.

Limits: the x86 PCID field is 12 bits (4096 unique PCIDs), but Linux does not give each process its own. It keeps a small per-CPU pool of PCID slots (six dynamic ASIDs in current kernels) and recycles them, so a CPU that rotates through more address spaces than it has slots still flushes when a slot is reused. The INVPCID instruction (Haswell, 2013+) lets the kernel selectively invalidate one PCID's entries without disturbing the current PCID.

Post-Meltdown KPTI (Kernel Page Table Isolation) doubles the CR3 churn — every syscall switches between user and kernel page tables. Without PCID this would devastate syscall throughput; with PCID, the cost drops to ~3-7%.
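
The effect of tagging can be illustrated with a toy TLB keyed by (PCID, virtual page number). This is a simplified model with unlimited capacity and hypothetical names (`ToyTlb`, `count_misses`); it does not reflect how any real kernel allocates PCIDs.

```python
class ToyTlb:
    """Tiny TLB model: entries are keyed by (pcid, virtual page number)."""

    def __init__(self, use_pcid):
        self.use_pcid = use_pcid
        self.entries = set()
        self.current_pcid = 0

    def switch_to(self, pcid):
        """Model a CR3 write on a context switch."""
        if not self.use_pcid:
            self.entries.clear()  # untagged entries must be flushed
            pcid = 0              # every address space uses the same tag
        self.current_pcid = pcid

    def access(self, vpn):
        """Return True on a hit; on a miss, fill the entry."""
        key = (self.current_pcid, vpn)
        if key in self.entries:
            return True
        self.entries.add(key)
        return False


def count_misses(use_pcid, rounds=3, pages=4):
    """Two processes alternate on one core, each touching the same few pages."""
    tlb = ToyTlb(use_pcid)
    misses = 0
    for _ in range(rounds):
        for pcid in (1, 2):
            tlb.switch_to(pcid)
            for vpn in range(pages):
                if not tlb.access(vpn):
                    misses += 1
    return misses


if __name__ == "__main__":
    print("misses with PCID:   ", count_misses(True))
    print("misses without PCID:", count_misses(False))
```

In this model the tagged TLB misses only on each process's first pass, while the untagged one misses on every page after every switch.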

Measurement

  • perf stat. perf stat -e dTLB-load-misses,iTLB-load-misses ./binary reports miss counts. Compare to total memory accesses for miss rate.
  • perf record on dTLB-load-misses. Samples high-miss instructions for source-level attribution.
  • cat /proc/<pid>/smaps. Reports per-mapping page size (KernelPageSize); the AnonHugePages field shows THP usage.
  • Intel VTune Memory Access analysis. Surfaces TLB miss as a top-level bottleneck category with line-by-line attribution.
  • Linux numastat / numactl. NUMA misses compound TLB misses by adding cross-socket page-walk latency.
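
To total THP usage from smaps programmatically, the AnonHugePages lines can be summed. A minimal sketch, assuming the usual "AnonHugePages: N kB" line format; both function names are illustrative.

```python
def anon_huge_kb(smaps_text):
    """Sum the AnonHugePages fields (in kB) across all mappings in smaps text."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("AnonHugePages:"):
            total += int(line.split()[1])
    return total


def own_thp_kb(path="/proc/self/smaps"):
    """THP-backed anonymous memory of this process in kB, or None if unreadable."""
    try:
        with open(path) as f:
            return anon_huge_kb(f.read())
    except OSError:
        return None  # not Linux, or /proc is not mounted
```

Calling `own_thp_kb()` inside a process before and after enabling hugepages gives a quick check that the setting took effect.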

Common misconceptions

  • "TLB is a small irrelevant cache." TLB pressure is often cited among the top server bottlenecks, alongside L3 misses and branch mispredictions. Profilers such as VTune (Top-Down analysis) can attribute 5% or more of pipeline stalls to iTLB/dTLB misses on database, JVM, and container workloads.
  • "Context switches always flush TLB." Only without PCID. Linux 4.14+ retains other processes' entries via PCID; switches are dramatically cheaper. Older kernels and architectures without PCID/ASID still flush.
  • "All cores share TLB." No. Each core has its own TLBs, and on x86 cross-core invalidation has traditionally required explicit IPI shootdowns. This is why an unmap on one core can interrupt every other core running the same address space.
  • "Hugepages are always faster." Sparse access patterns can waste memory: a single byte read in a 1 GB page allocates 1 GB of physical memory. Workloads with random access over a tiny subset of huge data can lose to 4 KB pages. THP also stalls fork()/copy-on-write because copying 2 MB pages is more expensive.
  • "x86-64 always uses 4-level paging." Recent Intel (Ice Lake+) and AMD (Zen 4+) support 5-level paging for 57-bit virtual addresses, used in HPC and large-memory cloud instances. 4-level remains the default.
  • "Userspace can't influence TLB." madvise(MADV_HUGEPAGE), posix_memalign with hugepage alignment, and explicit hugetlbfs mounts give userspace direct levers. Database tuning guides routinely require sysadmin actions for hugepage reservation.

Practical impact

  • Reserve hugepages on database hosts. Postgres, MySQL, ClickHouse, MongoDB all gain 5-20% on large working sets.
  • Pin JVMs to NUMA nodes with numactl --cpunodebind=0 --membind=0; a TLB miss whose page walk goes to a remote node pays cross-socket latency on top of the walk itself.
  • Verify PCID on guest VMs. Check with grep pcid /proc/cpuinfo; some older hypervisors hide the flag from guests. Without PCID, KPTI overhead can reach 30%.
  • Batch munmap calls. One large munmap is one shootdown; ten small ones are ten shootdowns.
  • Consider memory-mapped I/O carefully. mmap of large files, especially with random access, compounds TLB and page cache pressure; large reads with read() avoid the TLB tax for streaming workloads.

Frequently asked questions

What is a TLB miss?

A TLB miss occurs when the CPU needs to translate a virtual address but no matching entry exists in the TLB. The CPU must then walk the page table in main memory to find the translation, which on x86-64 is a 4-level walk costing 4 cache-line accesses (or 4 memory loads if those entries are not cached). Total cost: 100-300 cycles, or roughly 30-100 nanoseconds at a clock of about 3 GHz. The translation is then inserted into the TLB for future use. A TLB miss rate above 1% on a working set is a serious performance bottleneck; databases and HPC applications have entire optimizations dedicated to TLB pressure.
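
The cycle and nanosecond figures are related by the core clock. A small sketch of the conversion, where the 3 GHz default is an assumption chosen to match the quoted range rather than a property of any particular CPU:

```python
def cycles_to_ns(cycles, clock_ghz=3.0):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    if clock_ghz <= 0:
        raise ValueError("clock_ghz must be positive")
    return cycles / clock_ghz  # 1 GHz is one cycle per nanosecond


if __name__ == "__main__":
    for cycles in (100, 300):
        print(f"{cycles} cycles is about {cycles_to_ns(cycles):.0f} ns at 3 GHz")
```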

How does a page table walk work on x86-64?

x86-64 uses 4-level paging (or 5-level on recent server chips). The 48-bit virtual address splits into four 9-bit indexes plus a 12-bit offset. The CR3 register points to the top-level page directory (PML4). Walk: index PML4 with bits 47-39 to find PDPT physical address; index PDPT with bits 38-30 to find PD; index PD with bits 29-21 to find PT; index PT with bits 20-12 to find the physical page frame. Add the 12-bit offset for the final physical address. Each of the 4 lookups is a separate memory access; modern CPUs have a paging-structure cache to accelerate this.
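
The walk can be mimicked in software with nested dictionaries standing in for the four tables. This is a toy model under stated assumptions: `walk` and `map_page` are hypothetical helpers, the top-level dictionary plays the role of the table CR3 points to, and permission bits, hugepages, and the paging-structure cache are ignored.

```python
LEVEL_SHIFTS = (39, 30, 21, 12)  # PML4, PDPT, PD, PT


def map_page(cr3, va, frame):
    """Install a 4 KB mapping, creating intermediate tables as needed."""
    table = cr3
    for shift in LEVEL_SHIFTS[:-1]:
        table = table.setdefault((va >> shift) & 0x1FF, {})
    table[(va >> 12) & 0x1FF] = frame


def walk(cr3, va):
    """Translate va; returns (physical address, number of entry loads)."""
    entry, loads = cr3, 0
    for shift in LEVEL_SHIFTS:
        index = (va >> shift) & 0x1FF
        loads += 1  # one entry load per level
        if index not in entry:
            raise LookupError(f"page fault: no entry for bits {shift + 8}-{shift}")
        entry = entry[index]
    # After the last level, entry is the physical frame number.
    return (entry << 12) | (va & 0xFFF), loads


if __name__ == "__main__":
    root = {}
    map_page(root, 0x123456789ABC, frame=0x42)
    phys, loads = walk(root, 0x123456789ABC)
    print(f"physical address {phys:#x} after {loads} loads")
```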

What is a TLB shootdown?

When a process or kernel changes a page table entry (e.g. unmaps memory via munmap, or changes permissions via mprotect), every CPU that has the old translation cached in its TLB must invalidate that entry. The kernel sends an Inter-Processor Interrupt (IPI) to each affected CPU, asking it to execute INVLPG or full TLB flush. Cost: 5-10 microseconds per shootdown round-trip on modern x86. Heavily-multi-threaded workloads (containers, JVM apps, databases) can lose 10-30% performance to shootdown storms; mitigations include batching unmaps, lazy TLB shootdowns, and per-CPU page tables.

How do hugepages reduce TLB pressure?

x86-64 supports 4 KB, 2 MB, and 1 GB page sizes. A 2 MB page covers 512× more memory per TLB entry than a 4 KB page; a 1 GB page covers 262144×. Database scans on a 100 GB working set with 4 KB pages need 25M TLB entries (vastly exceeds even L2 STLB at ~1500); with 1 GB pages, just 100 entries. Linux Transparent HugePages (THP) automatically promotes contiguous 4K pages to 2 MB pages; explicit hugetlbfs gives manual control. Trade-offs: hugepages waste memory on sparse access patterns and complicate copy-on-write.

Why do ASIDs/PCIDs avoid full flushes on context switch?

Without ASIDs (Address Space Identifiers), every context switch must flush the entire TLB because the same virtual address means different physical addresses for different processes. With PCID (x86) or ASID (ARM, MIPS, etc), each process gets a tag stored alongside each TLB entry; the CPU only matches entries with the current PCID. Switching CR3 with the NoFlush bit avoids the full flush, retaining other processes' entries for when they're scheduled again. Linux enabled PCID by default in kernel 4.14 (2017), saving 5-15% on syscall-heavy workloads after Meltdown mitigations.

What is INVLPG?

INVLPG (Invalidate Page) is the x86 instruction that removes a single virtual address's translation from the TLB. Used by the kernel after modifying a page table entry to ensure stale translations don't persist. INVLPG costs ~100-300 cycles per invocation and only invalidates on the executing CPU; cross-CPU invalidation requires an IPI shootdown. INVPCID (introduced in Haswell 2013) extends this to invalidate translations for a specific PCID, avoiding side effects on the current process's TLB. ARM's equivalent is TLBI; RISC-V uses SFENCE.VMA.