Question 1

What is a TLB shootdown?

Accepted Answer

A TLB shootdown is the coordinated invalidation of stale Translation Lookaside Buffer entries on other CPUs after one CPU changes a page-table entry. Each core has its own private TLB, so modifying a PTE on one CPU does not update the translations cached on the others. On x86, the initiating CPU sends each affected CPU an inter-processor interrupt (IPI); the handler invalidates the stale entries (for example with INVLPG) and acknowledges. On Arm, the broadcast forms of TLBI are propagated to the other cores by hardware, so the common case needs no IPI.

Question 2

When does the kernel issue a TLB shootdown?

Accepted Answer

Whenever a TLB might still hold a translation that no longer matches the page tables. munmap (the page is now unmapped), mprotect (permissions changed), MADV_DONTNEED (the page is freed), page migration, copy-on-write breaking, kernel huge-page splits, swap-out, NUMA balancing, and KSM merges all trigger them. In Linux, flush_tlb_mm and flush_tlb_range orchestrate the broadcast.
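
To watch these events on an x86 Linux machine, /proc/interrupts carries a per-CPU "TLB" row labeled "TLB shootdowns". The sketch below parses that row; the function name and the sample text are made up for illustration, and the row may be absent on other architectures.

```python
def tlb_shootdown_counts(interrupts_text):
    """Return per-CPU TLB shootdown counts from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "TLB:":
            # Numeric columns are per-CPU counters; trailing words are the label.
            return [int(f) for f in fields[1:] if f.isdigit()]
    return []  # no TLB row (non-x86 kernel or unexpected format)


# Made-up sample in the x86 /proc/interrupts layout, for illustration only.
SAMPLE = """\
           CPU0       CPU1
 LOC:     123456     120000   Local timer interrupts
 TLB:        812        790   TLB shootdowns
"""

counts = tlb_shootdown_counts(SAMPLE)
print(counts, sum(counts))  # [812, 790] 1602
```

On a live system, pass the contents of /proc/interrupts instead of SAMPLE and take two readings a second apart to get a shootdown rate.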

Question 3

How expensive is one TLB shootdown?

Accepted Answer

Roughly 1 microsecond of base cost for the IPI mechanics, plus about 200 nanoseconds per receiver CPU. On a 4-core system (3 receivers): ~1.6 µs end-to-end. On a 64-core EPYC: ~14 µs. On a 256-core HPC node: ~50+ µs. Worse, the initiating CPU spins until the last ack arrives, so the shootdown is a hard serialization point: that core does no other work while it waits.
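
The figures above follow from a simple linear model. Here is a minimal sketch of it, assuming the mask includes the initiator (so receivers = CPUs - 1); the constant and function names are illustrative, and real latencies depend on topology and on how busy the receivers are.

```python
BASE_US = 1.0          # fixed IPI cost quoted above
PER_RECEIVER_US = 0.2  # about 200 ns per receiving CPU


def shootdown_cost_us(cpus_in_mask):
    """Estimated latency of one shootdown under the linear model."""
    receivers = max(cpus_in_mask - 1, 0)  # the initiator flushes locally
    return BASE_US + PER_RECEIVER_US * receivers


for n in (4, 64, 256):
    print(n, round(shootdown_cost_us(n), 1))
# 4 1.6
# 64 13.6
# 256 52.0
```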

Question 4

Why do all cores have to flush, not just the ones that touched the page?

Accepted Answer

Because the hardware doesn't track which cores actually have the entry cached. The kernel records, per mm_struct, which CPUs have run that address space (the mm's cpumask), so the broadcast can be limited to that mask. Every CPU in the mask still has to be told, even one that never walked the page table for that address. ASID/PCID tagging lets a core keep entries from multiple processes across context switches without re-flushing, but the shootdown still has to reach every CPU that might hold the entry.
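
As a toy illustration of that targeting rule, the sketch below picks the CPUs that must be notified: every member of the mask except the initiator, which flushes its own TLB locally. The CPU numbers and the function name are hypothetical; this is not kernel code.

```python
def shootdown_targets(mm_cpumask, initiator):
    """CPUs that must receive the flush request for this address space."""
    return sorted(cpu for cpu in mm_cpumask if cpu != initiator)


# Hypothetical 6-CPU machine where threads of this mm have run on CPUs 0, 2, 5.
mm_cpumask = {0, 2, 5}
print(shootdown_targets(mm_cpumask, initiator=2))  # [0, 5]
```

CPUs 1, 3, and 4 are left alone because they are not in the mask, while CPUs 0 and 5 are interrupted whether or not they ever cached the affected translation.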

Question 5

What's INVPCID and how does it help?

Accepted Answer

INVPCID is an x86 instruction (Haswell+) that invalidates TLB entries tagged with a given Process-Context Identifier (PCID). Unlike INVLPG, which targets one page in the current context, INVPCID can drop a single address, a whole PCID, or all contexts, without reloading CR3 to switch into the target address space. Combined with PCID-tagged TLBs, this lets the kernel switch between user and kernel page tables with fewer full flushes. That matters most under KPTI, where without PCID every kernel entry and exit reloads CR3 and discards the non-global TLB entries.
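
To check whether a given x86 Linux machine advertises these features, look for the pcid and invpcid flags in /proc/cpuinfo. The sketch below does that on text passed in; the function name and sample line are illustrative.

```python
def tlb_features(cpuinfo_text):
    """Report whether the 'pcid' and 'invpcid' CPU flags are present."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = set(line.split(":", 1)[1].split())
            return {name: name in flags for name in ("pcid", "invpcid")}
    return {"pcid": False, "invpcid": False}  # no flags line found


# Shortened, made-up flags line in the /proc/cpuinfo layout.
SAMPLE_CPUINFO = "flags\t\t: fpu vme pcid sse4_2 invpcid\n"
print(tlb_features(SAMPLE_CPUINFO))
```

On a real system, pass the contents of /proc/cpuinfo; the first "flags" line is enough when all cores are the same model.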

Question 6

What are LASI, RMP, and hardware shootdown offload?

Accepted Answer

Recent hardware designs and research proposals add direct memory-mapped TLB invalidation primitives: Arm's TLBI broadcasts travel over the coherent interconnect, AMD offers range-based RMP invalidations, and Intel has Remote Action Request. The trend is to move the broadcast from software-orchestrated IPIs to hardware-implemented coherent operations. These cut shootdown cost by 5-10× on large core counts but require coordinated software changes.

Question 7

How do I avoid TLB shootdowns in my code?

Accepted Answer

Batch unmappings: one big munmap is cheaper than many small ones because it pays for the broadcast once. Reserve an address range up front (a MAP_FIXED reservation) instead of repeatedly calling mmap and munmap. Prefer huge pages (2 MiB or 1 GiB) so each shootdown covers more memory. For high-frequency mmap/munmap workloads, the io_uring registered buffer API and a persistent mapping that is reset and reused sidestep shootdowns. Database engines use ring-buffer page caches for the same reason.
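
The persistent-mapping pattern can be sketched as a small slot pool: map one anonymous region once, hand out fixed-size slices, and recycle them without ever unmapping. The MappedPool class and its method names are hypothetical, and this is a user-level illustration of the idea rather than a tuned allocator.

```python
import mmap


class MappedPool:
    """Map one anonymous region up front and recycle fixed-size slots from it."""

    def __init__(self, slot_size, slots):
        self.slot_size = slot_size
        self._map = mmap.mmap(-1, slot_size * slots)  # the only mmap call
        self._free = list(range(slots))

    def acquire(self):
        """Return (slot_index, writable view); no mmap/munmap happens here."""
        idx = self._free.pop()
        start = idx * self.slot_size
        return idx, memoryview(self._map)[start:start + self.slot_size]

    def release(self, idx, view):
        view.release()          # drop the view so the mapping can be closed later
        self._free.append(idx)  # slot is reusable; the mapping stays in place

    def close(self):
        self._map.close()       # the only unmap, at shutdown


pool = MappedPool(slot_size=4096, slots=4)
idx, buf = pool.acquire()
buf[:5] = b"hello"
pool.release(idx, buf)
idx2, buf2 = pool.acquire()          # same slot comes back, still mapped
print(idx2 == idx, bytes(buf2[:5]))  # True b'hello'
pool.release(idx2, buf2)
pool.close()
```

A recycled slot still holds its old contents, so callers that need clean memory must reset it themselves. That reset is an ordinary write and does not change any page-table entry.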

Cores in mm cpumask	Total shootdown cost	Critical path	Throughput impact
2	~1.2 µs	1 IPI + 1 ack	~1k shootdowns/s before pain
4	~1.6 µs	3 IPIs in parallel	~600 shootdowns/s/core
16	~4 µs	15 IPIs, ack wait	~250 shootdowns/s/core
64	~14 µs	63 IPIs, longest tail wins	~70 shootdowns/s/core
128 (2× 64-core EPYC)	~28 µs	IPI traverses inter-socket	~35 shootdowns/s/core
256+ (HPC node)	~50–100 µs	Coherence + IPI fanout	~10 shootdowns/s/core
