Computer Architecture
The Translation Lookaside Buffer (TLB)
The 64-entry cache that stands between every memory access and a 4-level page walk
The translation lookaside buffer (TLB) is a small, fast cache of recent virtual-to-physical page translations that lets the CPU skip a multi-level page-table walk — turning a 100–300-cycle lookup into a 1-cycle hit on every memory access.
- Hit latency≈ 1 cycle
- Miss latency (uncached walk)100–300 cycles
- L1 TLB size≈ 64 entries
- L2 STLB size1,024–2,048 entries
- Typical hit rate> 99%
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
Why the TLB exists
Your program never touches a real memory address. It works in virtual addresses — a private, contiguous space that the operating system maps onto scattered physical frames of DRAM. That mapping lives in the page table, and on every single load, store, and instruction fetch the CPU has to translate the virtual page into a physical frame before it can talk to memory. There is no shortcut at the hardware level: no translation, no access.
The problem is that reading the page table is itself a memory operation — and on a 64-bit machine the page table is a tree, not a flat array. On x86-64 with 4 KB pages it is four levels deep: PML4 → PDPT → PD → PT. Walking it means four dependent DRAM references just to discover where one byte physically lives. If each reference misses the data caches, that is 100–300 cycles of pure overhead on an access that should take a handful of cycles. Doing that on every instruction would make virtual memory unusable.
The translation lookaside buffer is the fix. It is a tiny, dedicated cache inside the memory management unit (MMU) that remembers the results of recent page-table walks. Because programs exhibit strong locality — they hammer the same few pages over and over — a 64-entry TLB serves well over 99% of accesses from a single fast lookup. The expensive page walk only happens on the rare miss. The idea dates back to the IBM System/360 Model 67 (1967) and was central to making practical virtual memory ship.
The mechanism: splitting the address
A virtual address divides cleanly into two parts: the high bits are the virtual page number (VPN), and the low bits are the page offset. With 4 KB pages the offset is the bottom 12 bits (because 212 = 4096), and everything above is the page number. The offset is never translated — byte 37 of a virtual page is byte 37 of the corresponding physical frame. Only the page number gets mapped.
63 12 11 0
+----------------------------------+--------------+
| Virtual Page Number (VPN) | Page Offset |
+----------------------------------+--------------+
| |
translated by TLB passes through
v |
+----------------------------------+ |
| Physical Frame Number (PFN) | |
+----------------------------------+------v-------+
| PFN | Page Offset | = physical address
+----------------------------------+--------------+
The TLB is content-addressed by the VPN. On each access the MMU presents the VPN; if a valid entry matches, the TLB returns the physical frame number (PFN) plus the page's permission bits, and the offset is concatenated on to form the physical address. This is overlapped with the L1 cache lookup — in a virtually-indexed, physically-tagged (VIPT) L1 cache the index bits come from the untranslated offset, so the cache set can be selected while the TLB resolves the tag. Translation is effectively free on a hit.
On a miss, the hardware page walker (on x86 and ARM) or a software handler (on classic MIPS/SPARC) performs the walk, then refills the TLB with the new entry, evicting a victim. The effective cost of address translation is just the average-memory-access-time formula applied to the TLB:
EAT = hit_rate × hit_cost + miss_rate × walk_cost
# Example: 99% hit, 1-cycle hit, 200-cycle walk
EAT = 0.99 × 1 + 0.01 × 200 = 0.99 + 2.0 = 2.99 cycles/translation
# Drop to 95% hit and the picture changes:
EAT = 0.95 × 1 + 0.05 × 200 = 0.95 + 10.0 = 10.95 cycles/translation
That non-linearity is the whole story: a 4-point drop in hit rate nearly quadruples translation cost. TLB performance is dominated by the tail.
Structure: set-associativity and reach
Like any hardware cache, a TLB has a fixed number of entries organized into sets. An entry's set is chosen by the low bits of the VPN; within a set the entry may sit in any way. L1 TLBs are often fully associative (any VPN can go anywhere) because they are small enough that a parallel compare across all 64 entries is cheap; the larger L2 STLB is typically 8- or 12-way set-associative to keep the comparison tractable.
The figure of merit is TLB reach — the total memory the TLB can map without a miss:
reach = entries × page_size
64 entries × 4 KB = 256 KB # easily blown past by any real workload
64 entries × 2 MB = 128 MB # one huge-page entry per 512 small pages
64 entries × 1 GB = 64 GB # gigantic pages for VMs and databases
This is why a database scanning a 100 GB table thrashes a 4 KB-page TLB relentlessly — its working set needs ~25 million translations, but only 64 fit. Switch it to 2 MB huge pages and the same data needs ~50,000 translations, of which the hot ones comfortably fit in the L2 STLB. The TLB never changed; the granularity did.
When the TLB matters (and when it doesn't)
- Large working sets with poor locality — graph analytics, in-memory databases, sparse linear algebra. These touch many pages with little reuse, so the TLB misses constantly and page walks dominate.
- Pointer-chasing data structures — hash tables, trees, linked lists spread across the heap. Each dereference can land on a fresh page.
- Virtual machines — under nested paging the guest and host page tables are walked, so a miss can cost a 2-dimensional walk of up to 24 references. TLB pressure is amplified.
It matters far less for tight loops over small arrays (a few pages, near-100% hit rate) or for code dominated by compute rather than memory traffic. If perf stat shows your dTLB miss rate is a fraction of a percent and walk cycles are a rounding error, the TLB is not your bottleneck — look at the data caches instead.
TLB vs the data cache vs the page table
| TLB | L1 data cache | Page table | |
|---|---|---|---|
| What it stores | VPN → PFN + permission bits | Memory contents (cache lines) | The complete VPN → PFN mapping |
| Keyed by | Virtual page number | Physical address (VIPT) | Indexed by VPN slices per level |
| Lives in | MMU (dedicated SRAM) | On-die SRAM | DRAM (cacheable in L1/L2/L3) |
| Size | 64 (L1) / ~1–2K (L2) entries | 32–48 KB | Up to gigabytes |
| Hit cost | ≈ 1 cycle | 4–5 cycles | n/a — it is the slow path |
| Miss cost | 100–300 cycle page walk | Next-level cache / DRAM | n/a |
| Coherence | Software-managed (INVLPG, shootdowns) | Hardware cache-coherence protocol | OS owns it |
| Flushed on context switch? | Yes, unless PCID/ASID-tagged | No (physically tagged) | No — only the base register swaps |
The key distinction people miss: the data caches and the TLB are queried in parallel, not in sequence. The TLB caches where data lives; the data cache stores the data itself. A TLB hit followed by an L1 miss is common and cheap; a TLB miss followed by an L1 hit on the page-table memory is the lucky case where the walk's references are themselves cached.
What the numbers actually say
- A miss costs ~100–300 cycles. At 3 GHz that is roughly 33–100 ns per uncached page walk. Three of the four levels are usually cached, so a "warm" walk often lands nearer 30–50 cycles, but cold walks dominate the average.
- The tail dominates. Going from 99% to 95% hit rate raises translation cost from ~3 to ~11 cycles per access — a 3.7× regression from a seemingly small change, because misses are 200× more expensive than hits.
- Huge pages multiply reach by 512×. A 4 KB → 2 MB switch turns a 256 KB TLB reach into 128 MB with the identical 64 entries. Real databases (PostgreSQL, MySQL, ClickHouse) and the JVM (
-XX:+UseLargePages) report 5–20% throughput gains on TLB-bound workloads. - Page-walk caches help. Modern CPUs add small caches for the upper page-table levels so a miss usually only re-reads the last level or two, not all four — cutting typical miss cost roughly in half.
- Nested virtualization is brutal. A guest miss under two-dimensional paging can require up to 24 memory references (the famous "5×5 minus 1" worst case), making TLB misses several times costlier inside a VM.
JavaScript model of a set-associative TLB
Here is a working LRU set-associative TLB simulator. It splits the address, looks up the set, and on a miss invokes a page-walk callback and refills the least-recently-used way. It tracks hit rate so you can see the EAT effect live.
class TLB {
// sets × ways entries; offsetBits sets the page size (12 = 4 KB).
constructor({ sets = 16, ways = 4, offsetBits = 12, pageWalk }) {
this.sets = sets; this.ways = ways;
this.offsetBits = offsetBits;
this.setBits = Math.log2(sets);
this.pageWalk = pageWalk; // (vpn) => pfn, the slow path
this.store = Array.from({ length: sets }, () => []); // each: [{vpn, pfn, used}]
this.clock = 0;
this.hits = 0; this.misses = 0;
}
vpnOf(addr) { return Math.floor(addr / (1 << this.offsetBits)); }
offsetOf(addr) { return addr & ((1 << this.offsetBits) - 1); }
setOf(vpn) { return vpn & (this.sets - 1); }
translate(addr) {
const vpn = this.vpnOf(addr);
const set = this.store[this.setOf(vpn)];
const entry = set.find(e => e.vpn === vpn);
this.clock++;
if (entry) { // ---- TLB HIT ----
entry.used = this.clock; // touch for LRU
this.hits++;
return entry.pfn * (1 << this.offsetBits) + this.offsetOf(addr);
}
// ---- TLB MISS: walk the page table, then refill ----
this.misses++;
const pfn = this.pageWalk(vpn);
if (set.length < this.ways) {
set.push({ vpn, pfn, used: this.clock });
} else { // evict the LRU way
let lru = 0;
for (let i = 1; i < set.length; i++) if (set[i].used < set[lru].used) lru = i;
set[lru] = { vpn, pfn, used: this.clock };
}
return pfn * (1 << this.offsetBits) + this.offsetOf(addr);
}
flush() { this.store = this.store.map(() => []); } // e.g. context switch
hitRate() { return this.hits / (this.hits + this.misses); }
}
// Identity-ish page walk for the demo: vpn maps to frame vpn+0x100.
const tlb = new TLB({ sets: 16, ways: 4, pageWalk: vpn => vpn + 0x100 });
// Stride access that fits in reach vs one that thrashes it:
for (let i = 0; i < 1000; i++) tlb.translate((i % 32) * 4096); // 32 pages, 64 slots
console.log(tlb.hitRate().toFixed(3)); // ~0.968 — almost all hits after warm-up
Two details worth flagging. First, the offset is masked out and re-attached untouched — the TLB never sees or changes it. Second, flush() models a context switch on a TLB without address-space tags; with PCID/ASID you would instead key entries by (asid, vpn) and never flush on switch.
Python: measuring the EAT cliff
This Python version reuses the same structure and adds a famous experiment learners search for — sweeping the working-set size to find the point where it exceeds TLB reach and the hit rate falls off a cliff.
from math import log2
class TLB:
def __init__(self, sets=16, ways=4, offset_bits=12, page_walk=None):
self.sets, self.ways = sets, ways
self.offset_bits = offset_bits
self.page_walk = page_walk or (lambda vpn: vpn + 0x100)
self.store = [[] for _ in range(sets)] # each way: [vpn, pfn, used]
self.clock = self.hits = self.misses = 0
def _vpn(self, a): return a >> self.offset_bits
def _offset(self, a): return a & ((1 << self.offset_bits) - 1)
def _set(self, vpn): return vpn & (self.sets - 1)
def translate(self, addr):
vpn = self._vpn(addr)
ways = self.store[self._set(vpn)]
self.clock += 1
for w in ways:
if w[0] == vpn: # HIT
w[2] = self.clock
self.hits += 1
return (w[1] << self.offset_bits) | self._offset(addr)
# MISS: walk + refill (LRU eviction)
self.misses += 1
pfn = self.page_walk(vpn)
if len(ways) < self.ways:
ways.append([vpn, pfn, self.clock])
else:
ways.sort(key=lambda w: w[2]) # LRU first
ways[0] = [vpn, pfn, self.clock]
return (pfn << self.offset_bits) | self._offset(addr)
def hit_rate(self): return self.hits / (self.hits + self.misses)
# Find the cliff: hit rate vs working-set size (in pages), 64-entry TLB.
for pages in (16, 32, 64, 96, 128, 256):
tlb = TLB(sets=16, ways=4) # 16 × 4 = 64 entries
for i in range(20_000):
tlb.translate((i % pages) * 4096)
print(f"{pages:4d} pages hit_rate={tlb.hit_rate():.3f}")
# 16 pages hit_rate=0.999 (fits comfortably)
# 32 pages hit_rate=0.998 (still room to spare)
# 64 pages hit_rate=0.997 (exactly fills the TLB)
# 96 pages hit_rate=0.000 (overflow → LRU thrash: every revisit was just evicted)
# 256 pages hit_rate=0.000 (working set ≫ reach — total collapse)
The cliff between 64 and 96 pages is the whole lesson: once the working set exceeds TLB reach, the hit rate collapses and page walks take over. The drop to a literal 0.000 here is the worst case — a strictly sequential round-robin scan against LRU, where every page is evicted just before it is touched again. A real workload with some temporal locality lands somewhere between this floor and the near-perfect regime, but the wall is just as real. CPUs soften it with a second-level STLB, yet the shape is identical — and it is exactly what huge pages push back out of reach.
Variants worth knowing
Multi-level TLBs. Just like the data-cache hierarchy, modern cores have a small, fast, fully-associative L1 dTLB and iTLB backed by a larger, slower, set-associative L2 "STLB" (shared TLB) of 1,024–2,048 entries. An L1 miss that hits the STLB costs a handful of cycles, far cheaper than a full page walk.
Separate instruction and data TLBs. Like split L1 caches, most cores keep a dTLB for loads/stores and an iTLB for instruction fetches, so the two streams don't evict each other.
Tagged TLBs (PCID / ASID). Each entry carries an address-space identifier so translations from multiple processes coexist. Switching processes swaps the active tag instead of flushing — Linux turned PCID on by default in kernel 4.14, which mattered enormously after the Meltdown mitigation made flushes frequent.
Software-managed TLBs. On classic MIPS and SPARC the hardware has no page walker; a TLB miss traps to an OS handler that fills the entry. Slower per miss but far more flexible in page-table format — the kernel can use any structure it likes.
Page-walk caches (PWC / paging-structure caches). Small caches for the upper page-table levels (PML4/PDPT/PD). They don't cache final translations like a TLB; they shortcut the walk so a miss usually re-reads only the bottom level or two.
Common bugs, myths, and edge cases
- Forgetting to invalidate after a mapping change. If the OS edits a page-table entry, every core caching the old translation must invalidate it (
INVLPGlocally, plus a TLB shootdown IPI to remote cores). Skip this and a core happily uses a stale, possibly freed, physical frame — a security and correctness disaster. - Assuming the offset is translated. It isn't. Only the page number maps; the bottom 12 bits (for 4 KB pages) pass through unchanged. Treating the full address as the cache key is the classic beginner mistake.
- Believing the TLB caches data. It caches translations, not memory contents. A TLB hit still costs a data-cache or DRAM access to fetch the actual bytes.
- Ignoring the context-switch flush. On hardware without PCID/ASID, every switch starts cold and the next thousand accesses pay page walks — a real cost on syscall-heavy and thread-heavy workloads.
- Mixing page sizes carelessly. Most L1 TLBs have separate sub-structures for 4 KB and 2 MB entries with different (small) capacities. Fragmenting an allocation across both can leave you with fewer huge-page slots than you expected.
- Blaming the TLB without measuring. Use
perf stat -e dTLB-load-misses,dTLB-load-misses-walk-durationfirst. If walk cycles are negligible, the TLB is not your problem and huge pages won't help.
Frequently asked questions
What does the TLB actually store?
Each TLB entry maps a virtual page number to a physical frame number, plus the permission and status bits from the page-table entry: present, writable, user/supervisor, dirty, accessed, and the no-execute bit. It is keyed by the virtual page number, not the full address — the page offset is never translated and passes straight through.
Why is a TLB miss so much more expensive than a hit?
A hit returns the physical frame in about one cycle, overlapped with the cache lookup so it is effectively free. A miss forces a page-table walk. On x86-64 that is a four-level descent — PML4, PDPT, PD, PT — and each level is a separate memory reference, so a fully uncached walk costs roughly 100 to 300 cycles.
How big is a TLB and why is it so small?
A typical L1 data TLB holds 64 entries and the L2 (STLB) holds 1,024 to 2,048. It is tiny because it must answer in a single cycle on the critical path of every load and store; making it larger would lengthen that cycle. The page table, by contrast, lives in DRAM and can be gigabytes.
Does the TLB get flushed on a context switch?
Historically yes — switching the page-table base register invalidated every entry, since virtual addresses now mean a different process. Modern CPUs avoid this by tagging entries with an address-space identifier (PCID on x86, ASID on ARM), so a switch keeps both processes' translations alive in the TLB.
How do huge pages help the TLB?
One 2 MB huge page covers the same memory as 512 4 KB pages but uses a single TLB entry, so the same 64-entry TLB can reach 512× more memory. A 64-entry TLB covers 256 KB with 4 KB pages but 128 MB with 2 MB pages, which is why databases and JVMs enable huge pages to cut page-walk overhead.
Is the TLB part of the CPU cache or separate?
It is a separate, dedicated cache inside the memory management unit (MMU), distinct from the L1/L2/L3 data and instruction caches. The data caches store memory contents keyed by physical address; the TLB stores address translations keyed by virtual page number. The two are queried in parallel so translation overlaps the cache access.