TLB Shootdown
Inter-processor flush — the scaling tax on every munmap
When one CPU modifies a page table entry, every other CPU that may hold a stale TLB copy must flush it. The IPI plus INVLPG round trip costs roughly 1–14 µs on machines with up to 64 cores and keeps growing with core count.
- Base IPI cost: ~1 µs
- Per-receiver cost: ~200 ns
- 4-core total: ~1.6 µs
- 64-core total: ~14 µs
- 256-core total: ~50 µs
- x86 instructions: INVLPG, INVPCID
How a TLB shootdown unfolds
Every CPU caches its own copy of recent virtual-to-physical translations in a per-core Translation Lookaside Buffer. The TLB is fast — a hit is part of address-generation, effectively free — but it's also stale-prone. The page-table entry it caches can be modified by any thread that holds the same mm_struct: another core executing munmap, mprotect, or just servicing a copy-on-write fault.
Hardware doesn't snoop the TLB the way the cache coherence protocol snoops the L1. Once a translation lands in a TLB it stays there until something explicitly invalidates it. So when one core changes a PTE, software has to broadcast a "throw out your copy" message to every other core that might have it cached — that's the shootdown.
On Linux, the shootdown choreography is roughly:
- The initiating CPU acquires the page-table lock and modifies the PTE.
- It walks the mm_struct's cpumask, the set of CPUs that have run a thread of this process and may still cache its translations.
- It sends an Inter-Processor Interrupt (IPI) to every other CPU in that mask.
- Each receiver enters an interrupt handler, executes INVLPG or INVPCID for the relevant address(es), and signals completion in shared memory.
- The initiator spins on the completion counter until every recipient has acked.
- The initiator releases the lock and continues.
The whole dance is serializing for the initiator and disruptive for the receivers: each one has to stop whatever user-space code it was running, enter the interrupt handler, execute kernel code, and resume.
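The bookkeeping behind those steps can be sketched as a toy, single-threaded model. This is not kernel code: the struct and function names are hypothetical, the cpumask is simplified to a 64-bit word, and the actual interrupt delivery and INVLPG are elided. It only shows how the target set is derived from the mask and how the initiator knows when the last ack has arrived.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one shootdown round on a machine with at most 64 CPUs.
 * Bit i of the mask is set if CPU i may hold translations for the mm.
 * All names are illustrative, not kernel APIs. */
struct shootdown_round {
    uint64_t targets;  /* CPUs that still owe an ack */
    int      pending;  /* number of acks the initiator waits for */
};

static int popcount64(uint64_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;    /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Initiator: every CPU in the mask except itself gets an IPI. */
struct shootdown_round shootdown_begin(uint64_t mm_cpumask, int self)
{
    assert(self >= 0 && self < 64);
    struct shootdown_round r;
    r.targets = mm_cpumask & ~(UINT64_C(1) << self);
    r.pending = popcount64(r.targets);
    return r;
}

/* Receiver: after invalidating locally (elided), record the ack.
 * Returns 1 if the ack was counted, 0 if this CPU was not a target
 * or had already acked. */
int shootdown_ack(struct shootdown_round *r, int cpu)
{
    assert(r != 0 && cpu >= 0 && cpu < 64);
    uint64_t bit = UINT64_C(1) << cpu;
    if (!(r->targets & bit))
        return 0;
    r->targets &= ~bit;
    r->pending--;
    return 1;
}

/* Initiator: the spin loop exits once this returns non-zero. */
int shootdown_done(const struct shootdown_round *r)
{
    return r->pending == 0;
}
```

With a mask of 0x0F and CPU 0 as initiator, the round starts with three pending acks; the initiator's wait is bounded by the slowest of those three receivers.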
Cost scaling with core count
| Cores in mm cpumask | Total shootdown cost | Critical path | Throughput impact |
|---|---|---|---|
| 2 | ~1.2 µs | 1 IPI + 1 ack | ~1k shootdowns/s before pain |
| 4 | ~1.6 µs | 3 IPIs in parallel | ~600 shootdowns/s/core |
| 16 | ~4 µs | 15 IPIs, ack wait | ~250 shootdowns/s/core |
| 64 | ~14 µs | 63 IPIs, longest tail wins | ~70 shootdowns/s/core |
| 128 (2× 64-core EPYC) | ~28 µs | IPI traverses inter-socket | ~35 shootdowns/s/core |
| 256+ (HPC node) | ~50–100 µs | Coherence + IPI fanout | ~10 shootdowns/s/core |
The pattern: base latency plus a fan-out cost. The "base" comes from APIC programming, interrupt delivery, and the kernel's interrupt-entry path; the per-receiver cost is the actual INVLPG plus the bus turnaround for the ack. Modern AMD EPYC and Intel Granite Rapids amortize a lot of this with broadcast IPI, but the trend with more cores per node is still upward.
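As a rough illustration, the base-plus-fan-out pattern can be written as a linear model. The constants are the article's headline estimates (~1 µs base, ~200 ns per receiver), not measurements, and the function name is made up for this sketch. The model deliberately ignores inter-socket hops and tail latency, which is why the 128-core and 256+ rows in the table sit above what it predicts.

```c
#include <assert.h>

/* Headline estimates from the text, in nanoseconds. */
enum { BASE_IPI_NS = 1000, PER_RECEIVER_NS = 200 };

/* Estimated cost of one shootdown when `cpus_in_mask` CPUs (including
 * the initiator) are in the mm's cpumask. With a single CPU there is
 * nobody to interrupt, so the IPI cost is zero. */
long shootdown_cost_ns(int cpus_in_mask)
{
    assert(cpus_in_mask >= 1);
    if (cpus_in_mask == 1)
        return 0;
    return BASE_IPI_NS + (long)PER_RECEIVER_NS * (cpus_in_mask - 1);
}
```

For 4, 16, and 64 CPUs this gives 1,600 ns, 4,000 ns, and 13,600 ns, matching the rounded table entries.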
When shootdowns fire
- munmap. The page is gone; any TLB caching it must be invalidated. Frequent in dynamic-linker workloads (DSO load/unload), database page replacement, and certain GC algorithms that release back-store ranges.
- mprotect / mremap. Permissions or layout changed, so TLB-cached protection bits are wrong.
- MADV_DONTNEED. The kernel may zero pages, freeing the backing — TLB entries that mapped them must die.
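- Copy-on-write resolution. The first write to a shared page gives the writer a private copy, so the writer's PTE now points at a different physical page. Other threads of the same process that cached the old read-only translation must drop it.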
- NUMA balancing (auto-numa). The kernel migrates a page between nodes and remaps all PTEs pointing to it.
- Transparent huge-page split/collapse. A 2 MiB page is split back into 4 KiB pages (or 4 KiB pages are collapsed into one 2 MiB page), invalidating both the PMD-level and the constituent PTE-level TLB entries.
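- Reclaim, KSM merges, and kernel direct-map adjustments. Swap-out and KSM merges rewrite user PTEs; page-stealer activity in the slab allocator and other direct-map changes invalidate kernel translations.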
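From user space, several of these triggers are ordinary system calls. The sketch below is a minimal, hypothetical helper (the function name is invented here) that walks one anonymous mapping through mprotect, MADV_DONTNEED, and munmap. Whether each call actually sends IPIs depends on how many CPUs are in the process's cpumask; in a single-threaded process the flushes stay local.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Exercise three calls that change live PTEs. Returns 0 on success,
 * -1 if any call fails or behaves unexpectedly. */
int exercise_flush_triggers(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page * 4;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 42;  /* fault the page in so a PTE (and a TLB entry) exists */

    /* Permission change: cached protection bits become stale. */
    if (mprotect(p, (size_t)page, PROT_READ) != 0)
        goto fail;
    if (mprotect(p, (size_t)page, PROT_READ | PROT_WRITE) != 0)
        goto fail;

    /* Drop the backing page: the old translation must be invalidated. */
    if (madvise(p, (size_t)page, MADV_DONTNEED) != 0)
        goto fail;
    if (p[0] != 0)  /* private anonymous memory reads back as zero */
        goto fail;

    /* Unmap: every translation for the range must go. */
    return munmap(p, len);

fail:
    munmap(p, len);
    return -1;
}
```

Running this from one thread of a many-threaded process while watching the TLB row of /proc/interrupts is one way to see which calls generate remote flushes on a given machine.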
Code paths and intrinsics
```c
// Linux kernel: a TLB shootdown is roughly this.
// (Simplified sketch; the real x86 path is flush_tlb_mm_range() with
// mmu_gather batching, and the local_*/invlpg helpers are illustrative.)
struct flush_tlb_info {
    struct mm_struct *mm;
    unsigned long start, end;
};

// On the receiver, the IPI handler runs:
static void do_flush_tlb_ipi(void *data)
{
    const struct flush_tlb_info *info = data;
    unsigned long addr;

    // x86: INVLPG for each page, or a full-mm flush for large ranges.
    for (addr = info->start; addr < info->end; addr += PAGE_SIZE)
        invlpg(addr);   // emits ASM: invlpg (%[addr])
    // With PCID, prefer INVPCID for narrower scope:
    //   invpcid_flush_one(pcid, addr);
}

void flush_tlb_range(struct vm_area_struct *vma,
                     unsigned long start, unsigned long end)
{
    struct mm_struct *mm = vma->vm_mm;
    struct flush_tlb_info info = { .mm = mm, .start = start, .end = end };

    if (cpumask_weight(mm_cpumask(mm)) == 1 &&
        cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm))) {
        // Single-CPU mm: just flush locally, no IPI.
        local_flush_tlb_range(vma, start, end);
        return;
    }
    // Broadcast: IPI to every other CPU in mm_cpumask
    // (smp_call_function_many() skips the caller), then wait for acks.
    smp_call_function_many(mm_cpumask(mm),
                           do_flush_tlb_ipi,
                           &info,
                           /*wait=*/true);
    local_flush_tlb_range(vma, start, end);
}
```
```sh
# Inspect TLB shootdown traffic on Linux.
# Per-CPU TLB shootdown count from /proc/interrupts.
grep TLB /proc/interrupts
# Example output: TLB: 142 156 134 ... TLB shootdowns
# bpftrace one-liner: count shootdowns by code path.
sudo bpftrace -e '
kprobe:flush_tlb_mm_range { @[kstack(3)] = count(); }'
# perf: count TLB flush tracepoint events while the workload runs.
sudo perf stat -e tlb:tlb_flush -- ./workload
```
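The same counter can be read programmatically, for example to sample it before and after a benchmark phase. The sketch below assumes the x86 /proc/interrupts layout shown above (a `TLB:` label, one decimal counter per CPU, then a text description); the function names are invented for this example, and the 8 KiB line buffer is an assumption that may be too small on very large machines.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters on one /proc/interrupts row such as
 * "TLB:  142  156  134  TLB shootdowns".
 * Returns -1 if the row does not start with "<label>:". */
long long sum_irq_row(const char *line, const char *label)
{
    assert(line != NULL && label != NULL);
    while (isspace((unsigned char)*line))
        line++;
    size_t n = strlen(label);
    if (strncmp(line, label, n) != 0 || line[n] != ':')
        return -1;

    const char *p = line + n + 1;
    long long total = 0;
    for (;;) {
        char *end;
        unsigned long long v = strtoull(p, &end, 10);
        if (end == p)   /* no digits: reached the text description */
            break;
        total += (long long)v;
        p = end;
    }
    return total;
}

/* Total TLB shootdown interrupts across all CPUs, or -1 if the file
 * or the TLB row is unavailable. */
long long tlb_shootdowns_total(void)
{
    FILE *f = fopen("/proc/interrupts", "r");
    if (f == NULL)
        return -1;
    char buf[8192];
    long long total = -1;
    while (fgets(buf, sizeof buf, f) != NULL) {
        long long s = sum_irq_row(buf, "TLB");
        if (s >= 0) {
            total = s;
            break;
        }
    }
    fclose(f);
    return total;
}
```

Calling `tlb_shootdowns_total()` before and after a workload phase and taking the difference gives the number of shootdown interrupts received system-wide in that window, not just those caused by the workload.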
Performance numbers
- Base IPI cost: ~1 µs from APIC write to handler entry on the receiver.
- Per-receiver INVLPG: ~50 ns; ack memory write: ~150 ns; total per receiver: ~200 ns.
- Real measurements on a 64-core EPYC Milan: median 9 µs, p99 18 µs per range shootdown, the same order as the 1 + 0.2 × 63 ≈ 13.6 µs the model predicts.
- A workload doing 50 small mmap/munmap pairs per second from each of 32 threads on a 64-core box: 32 × 50 × 14 µs ≈ 22 ms of initiator spin per second of wall time, about 2.2% of one core in aggregate, plus 1,600 flush interrupts per second landing on every other core in the cpumask.
- Microsoft research (2020) on SQL Server: TLB shootdowns from buffer-pool page replacement cost up to 8% of throughput on 56-core SKUs.
- With 2 MiB huge pages, the same byte range is covered by 512× fewer translations, so the per-page invalidation work (and the shootdown count, when pages are released one at a time) can drop by up to the same factor.
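The arithmetic behind the last few bullets is simple enough to check in code. This is a back-of-the-envelope sketch with invented function names; the inputs are the article's own figures, and the result is only as good as the ~14 µs per-shootdown estimate.

```c
#include <assert.h>
#include <stdint.h>

/* Microseconds of initiator spin per second of wall time, summed over
 * all threads, given a per-shootdown cost in microseconds. */
uint64_t shootdown_us_per_sec(unsigned threads,
                              unsigned unmaps_per_thread_per_sec,
                              unsigned cost_us)
{
    return (uint64_t)threads * unmaps_per_thread_per_sec * cost_us;
}

/* Number of pages (and so of page-granular translations) needed to
 * cover `bytes` with pages of `page_size` bytes, rounding up. */
uint64_t pages_in_range(uint64_t bytes, uint64_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}
```

With 32 threads, 50 unmaps per thread per second, and 14 µs each, the first function returns 22,400 µs, the 22 ms/s quoted above. For a 1 GiB range the second returns 262,144 pages at 4 KiB and 512 pages at 2 MiB, which is where the 512× ratio comes from.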
Common pitfalls
- High-frequency mmap/munmap loops. Languages that allocate via mmap (Go's runtime, OpenJDK ZGC's metadata) historically hit shootdown walls at high core counts. The fix is internal pooling.
- Cross-socket scheduling without pinning. If a process's threads roam across sockets, the mm_cpumask grows to include both sockets, doubling the IPI broadcast cost.
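- Forgetting that batching stops at the syscall boundary. The kernel's mmu_gather defers and batches flushes within one call, not across calls; a workload that calls munmap in a hot loop should merge the ranges into a single munmap of the union where possible.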
- Misusing MADV_DONTNEED for "memory pressure" hinting. It's not free — every call triggers a shootdown on every core in the cpumask.
- Assuming huge pages always help. A 2 MiB shootdown is one IPI, but a transparent-huge-page split forced by mprotect-on-a-4KiB-subset can be more expensive than the un-promoted original.
- Missing the spin-wait on the initiator. The initiator does not yield — it busy-loops on the completion counter. Profiling tools often blame the syscall instead of the shootdown that's actually consuming the cycles.
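The "internal pooling" fix from the first pitfall can be sketched as a small free list of mapped buffers: a released buffer is parked and handed out again instead of being unmapped, so the steady state issues no munmap at all. The names below are hypothetical, the pool is not thread-safe, and recycled buffers keep their old contents, so a real allocator would need locking (or per-thread pools) and a policy for clearing data.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

enum { POOL_SLOTS = 8 };

/* Fixed-size pool of equally sized anonymous mappings. */
struct buf_pool {
    size_t buf_len;
    void  *free_list[POOL_SLOTS];
    int    nfree;
};

void pool_init(struct buf_pool *p, size_t buf_len)
{
    assert(p != NULL && buf_len > 0);
    p->buf_len = buf_len;
    p->nfree = 0;
}

/* Reuse a parked buffer if one exists; otherwise map a new one.
 * Returns NULL if mmap fails. */
void *pool_get(struct buf_pool *p)
{
    if (p->nfree > 0)
        return p->free_list[--p->nfree];
    void *m = mmap(NULL, p->buf_len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return m == MAP_FAILED ? NULL : m;
}

/* Park the buffer for reuse. Only when the pool is full does this
 * fall back to munmap (and the shootdown that comes with it). */
void pool_put(struct buf_pool *p, void *buf)
{
    assert(buf != NULL);
    if (p->nfree < POOL_SLOTS) {
        p->free_list[p->nfree++] = buf;
        return;
    }
    munmap(buf, p->buf_len);
}

/* Release everything at shutdown. */
void pool_destroy(struct buf_pool *p)
{
    while (p->nfree > 0)
        munmap(p->free_list[--p->nfree], p->buf_len);
}
```

A get/put/get sequence returns the same address the second time, which is the point: the mapping never changed, so no translation went stale.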
Frequently asked questions
What is a TLB shootdown?
Coordinated invalidation of stale TLB entries across all CPUs after one CPU changes a PTE. Hardware doesn't snoop TLBs — software has to broadcast an IPI asking every other core to INVLPG.
When does the kernel issue a TLB shootdown?
munmap, mprotect, MADV_DONTNEED, COW resolution, NUMA balancing, THP split, swap-out, KSM merges. Linux's flush_tlb_mm and flush_tlb_range orchestrate the broadcast.
How expensive is one TLB shootdown?
~1 µs base + ~200 ns per receiver. 4 cores ~1.6 µs; 64 cores ~14 µs; 256 cores 50+ µs. The initiator spins until the last ack arrives — full serialization.
Why do all cores have to flush, not just the ones that touched the page?
Hardware doesn't track per-core TLB residency. The kernel can scope to the mm's cpumask but every CPU in that mask must be told. PCID helps with context-switch flush, not with shootdowns.
What's INVPCID and how does it help?
An x86 instruction (Haswell+) that invalidates TLB entries scoped to a Process-Context Identifier. Combined with PCID-tagged TLBs, it cuts post-KPTI flush cost significantly.
What about hardware shootdown offload?
Arm's TLBI instructions broadcast over the coherent interconnect; AMD's INVLPGB/TLBSYNC and Intel's Remote Action Request play the same role on x86. All of them move the broadcast from software IPIs to hardware, reportedly 5–10× cheaper at large core counts.
How do I avoid TLB shootdowns in my code?
Batch unmaps; use MAP_FIXED reservations; prefer huge pages (2 MiB cuts shootdown count 512×); use io_uring registered buffers and persistent mmap with reuse-via-reset.
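For the huge-page suggestion, one common pattern is to map a 2 MiB-aligned region once and ask the kernel for transparent huge pages. The helper below is a minimal sketch with an invented name. It assumes a 2 MiB huge-page size (true on x86-64) and treats MADV_HUGEPAGE as a hint: the call can fail if THP is not enabled, and the region still works with 4 KiB pages in that case.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HUGE_2M ((size_t)2 << 20)

/* Map `len` bytes (a multiple of 2 MiB) at a 2 MiB-aligned address and
 * request transparent huge pages. Returns NULL if mmap fails. If
 * `hinted` is non-NULL it is set to 1 when the kernel accepted the
 * hint and 0 otherwise. */
void *map_huge_aligned(size_t len, int *hinted)
{
    assert(len > 0 && len % HUGE_2M == 0);

    /* Over-allocate by one huge page so an aligned start must exist. */
    size_t span = len + HUGE_2M;
    unsigned char *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    uintptr_t start = ((uintptr_t)raw + HUGE_2M - 1)
                      & ~((uintptr_t)HUGE_2M - 1);
    size_t head = start - (uintptr_t)raw;
    size_t tail = span - head - len;

    /* Trim the unaligned slack. Nothing has been touched yet, so this
     * is one-time setup work rather than steady-state unmapping. */
    if (head > 0)
        munmap(raw, head);
    if (tail > 0)
        munmap((void *)(start + len), tail);

    int ok = 0;
#ifdef MADV_HUGEPAGE
    ok = (madvise((void *)start, len, MADV_HUGEPAGE) == 0);
#endif
    if (hinted != NULL)
        *hinted = ok;
    return (void *)start;
}
```

A long-lived region obtained this way pairs naturally with buffer reuse: map once, recycle the memory internally, and unmap only at shutdown.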