What is a TLB shootdown?

A TLB shootdown is the coordinated invalidation of stale Translation Lookaside Buffer entries on every CPU that may have cached them after one CPU changes a page-table entry. Because each core has a private TLB, modifying a PTE on one CPU does not update the other cores' TLBs. On x86, the issuing CPU sends each of them an inter-processor interrupt (IPI) asking it to execute INVLPG and acknowledge; on Arm, the TLBI instruction can broadcast the invalidation in hardware instead.

When does the kernel issue a TLB shootdown?

Any time a page-table entry might be cached as something that no longer matches reality. munmap (page is now unmapped), mprotect (permissions changed), MADV_DONTNEED (page is freed), page migration, copy-on-write breaking, kernel huge-page splits, swap-out, NUMA balancing, and KSM merges all trigger them. Linux's flush_tlb_mm and flush_tlb_range orchestrate the broadcast.

How expensive is one TLB shootdown?

Roughly 1 microsecond of base cost for the IPI mechanics, plus about 200 nanoseconds per receiver CPU. On a 4-core system: ~1.6 µs end-to-end. On a 64-core EPYC: ~14 µs. On a 256-core HPC node: ~50+ µs. Worse, the initiating CPU spins until the last ack arrives, so the shootdown is a hard serialization point — no other work happens on that core.

Why do all cores have to flush, not just the ones that touched the page?

Because the hardware doesn't track which cores actually have the entry cached. The kernel tracks which CPUs have run each address space (mm_struct), so the broadcast can be limited to the mm's cpumask, but every CPU in that mask has to be told, even those that never walked the page table for that address. ASID/PCID tagging lets a core keep entries from multiple processes without flushing on every context switch, but the shootdown still has to reach each CPU that may hold the entry.

What's INVPCID and how does it help?

INVPCID is an x86 instruction (Haswell and later) that invalidates TLB entries tagged with a given Process-Context Identifier (PCID). INVLPG only drops one page's translation for the current PCID, and reloading CR3 flushes a whole address space; INVPCID can target a single address in any PCID, everything tagged with one PCID, or all PCIDs, without switching address spaces first. Combined with PCID-tagged TLBs, the kernel can switch between user and kernel page tables with fewer full flushes. This matters most under KPTI, where without PCID every kernel entry and exit reloads CR3 and flushes the non-global TLB entries.

What are LASI and RMP and hardware shootdown offload?

Recent hardware moves the broadcast out of software. Arm's TLBI instructions broadcast invalidations over the coherent interconnect, AMD's INVLPGB instruction broadcasts range-based invalidations to other cores, and Intel has published a Remote Action Request (RAR) specification. The trend is to replace software-orchestrated IPIs with hardware-implemented coherent operations. These are reported to cut shootdown cost by 5-10× on large core counts, but they require coordinated software changes.

How do I avoid TLB shootdowns in my code?

Batch unmappings: one big munmap is cheaper than many small ones. Reserve address space once (for example with a MAP_FIXED reservation) and reuse it instead of repeatedly calling mmap and munmap. Prefer huge pages (2 MiB or 1 GiB) so each invalidation covers more memory. For high-frequency mmap/munmap workloads, io_uring registered buffers and long-lived mappings that are reset and reused sidestep shootdowns. Database engines use ring-buffer page caches for the same reason.
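
The reuse-via-reset idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular engine's allocator: `MappingPool` and its method names are hypothetical, and it assumes fixed-size anonymous mappings created with the standard `mmap` module.

```python
import mmap


class MappingPool:
    """Hypothetical pool that recycles anonymous mappings instead of unmapping them."""

    def __init__(self, size):
        self.size = size
        self._free = []
        self.maps_created = 0   # how many times we actually called mmap

    def acquire(self):
        # Reuse a parked mapping when one exists: no mmap call, no new PTEs.
        if self._free:
            return self._free.pop()
        self.maps_created += 1
        return mmap.mmap(-1, self.size)   # anonymous, zero-filled mapping

    def release(self, mapping):
        # "Reset" instead of munmap: clear the contents and rewind.
        # The pages stay mapped, so no PTE changes and no shootdown.
        mapping[0:self.size] = bytes(self.size)
        mapping.seek(0)
        self._free.append(mapping)

    def close(self):
        # Unmap everything once, at shutdown.
        for mapping in self._free:
            mapping.close()
        self._free.clear()
```

A loop that acquires, fills, and releases a buffer many times creates only one mapping in this sketch, so the unmap cost is paid once in `close()` rather than on every iteration. The trade-off is that the memory stays resident while the pool holds it.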

TLB Shootdown — IPI Cost & Scaling

How a TLB shootdown unfolds

Every CPU caches its own copy of recent virtual-to-physical translations in a per-core Translation Lookaside Buffer. The TLB is fast — a hit is part of address-generation, effectively free — but it's also stale-prone. The page-table entry it caches can be modified by any thread that holds the same mm_struct: another core executing munmap, mprotect, or just servicing a copy-on-write fault.

Hardware doesn't snoop the TLB the way the cache coherence protocol snoops the L1. Once a translation lands in a TLB it stays there until something explicitly invalidates it. So when one core changes a PTE, software has to broadcast a "throw out your copy" message to every other core that might have it cached — that's the shootdown.

On Linux, the shootdown choreography is roughly:

1. The initiating CPU acquires the page-table lock and modifies the PTE.
2. It reads the mm_struct's cpumask, the set of CPUs that may hold TLB entries for this address space because they have run one of its threads.
3. It sends an Inter-Processor Interrupt (IPI) to every other CPU in that mask.
4. Each receiver enters an interrupt handler, executes INVLPG or INVPCID for the relevant address(es), and signals completion in shared memory.
5. The initiator spins on the completion counter until every recipient has acked.
6. The initiator releases the lock and continues.

The whole dance is serializing for the initiator and disruptive for the receivers — each one has to interrupt whatever user-space code it was running, take an interrupt, execute kernel code, and resume.
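
The steps above can be mimicked with a toy user-space simulation. This is only a sketch of the protocol's shape: the `Cpu` class and `shootdown` function are hypothetical names, threads stand in for receiver CPUs, a semaphore stands in for the shared completion counter, and the initiator blocks on it where a real kernel would busy-wait.

```python
import threading


class Cpu:
    """Toy CPU with a private TLB: virtual page -> physical frame."""

    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}


def shootdown(page_table, cpus, initiator, vpage, new_frame):
    """Change one mapping, then invalidate stale copies on every other CPU.

    Returns the number of receivers that acknowledged.
    """
    page_table[vpage] = new_frame            # modify the PTE
    initiator.tlb.pop(vpage, None)           # local invalidation

    receivers = [c for c in cpus if c is not initiator]
    acks = threading.Semaphore(0)            # stands in for the completion counter

    def ipi_handler(cpu):
        cpu.tlb.pop(vpage, None)             # the receiver's INVLPG
        acks.release()                       # signal completion

    workers = [threading.Thread(target=ipi_handler, args=(c,)) for c in receivers]
    for w in workers:                        # "send" one IPI per receiver
        w.start()
    for _ in receivers:                      # wait until every receiver has acked
        acks.acquire()
    for w in workers:
        w.join()
    return len(receivers)
```

With four toy CPUs that all cache the same page, one call to `shootdown` leaves no CPU holding the old translation and reports three acknowledgements, one per receiver. The initiator cannot return until the slowest receiver has responded, which is the serialization the text describes.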

Cost scaling with core count

Cores in mm cpumask	Total shootdown cost	Critical path	Approx. rate before overhead is noticeable
2	~1.2 µs	1 IPI + 1 ack	~1k shootdowns/s/core
4	~1.6 µs	3 IPIs in parallel	~600 shootdowns/s/core
16	~4 µs	15 IPIs, ack wait	~250 shootdowns/s/core
64	~14 µs	63 IPIs, longest tail wins	~70 shootdowns/s/core
128 (2× 64-core EPYC)	~28 µs	IPI traverses inter-socket	~35 shootdowns/s/core
256+ (HPC node)	~50–100 µs	Coherence + IPI fanout	~10 shootdowns/s/core

The pattern: base latency plus a fan-out cost. The "base" comes from APIC programming, interrupt delivery, and the kernel's interrupt-entry path; the per-receiver cost is the actual INVLPG plus the bus turnaround for the ack. Modern AMD EPYC and Intel Granite Rapids amortize a lot of this with broadcast IPI, but the trend with more cores per node is still upward.
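
The base-plus-fan-out pattern can be written down as a small model. This is a sketch that assumes the article's rough figures (about 1 µs base, about 0.2 µs per receiver); the function name and constants are illustrative, and the model deliberately ignores inter-socket latency and tail effects.

```python
BASE_US = 1.0            # assumed IPI base cost, from the article's estimate
PER_RECEIVER_US = 0.2    # assumed per-receiver cost, from the article's estimate


def shootdown_cost_us(cpus_in_mask, base_us=BASE_US, per_receiver_us=PER_RECEIVER_US):
    """Estimated remote shootdown cost in microseconds for a given cpumask size."""
    if cpus_in_mask < 1:
        raise ValueError("cpumask must contain at least the initiating CPU")
    receivers = cpus_in_mask - 1
    if receivers == 0:
        return 0.0       # single-CPU mm: local flush only, no IPI
    return base_us + per_receiver_us * receivers


for n in (2, 4, 16, 64):
    print(f"{n:>3} CPUs: ~{shootdown_cost_us(n):.1f} us")
```

For 2, 4, 16, and 64 CPUs this gives 1.2, 1.6, 4.0, and 13.6 µs, matching the first four table rows to rounding. For 128 CPUs the model gives 26.4 µs, slightly below the table's ~28 µs, because it has no term for the inter-socket hop.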

When shootdowns fire

munmap. The page is gone; any TLB caching it must be invalidated. Frequent in dynamic-linker workloads (DSO load/unload), database page replacement, and certain GC algorithms that release backing-store ranges.
mprotect / mremap. Permissions or layout changed, so TLB-cached protection bits or translations are wrong.
MADV_DONTNEED. The kernel drops the backing pages (anonymous memory reads back as zeros on the next touch), so TLB entries that mapped them must be invalidated.
Copy-on-write resolution. A write to a shared page gives the writer a private copy, so the writing process's PTE changes and other CPUs running its threads must drop the old translation.
NUMA balancing (auto-numa). The kernel migrates a page between nodes and remaps all PTEs pointing to it.
Transparent huge-page split/collapse. A 2 MiB page is split into 4 KiB pages (or 4 KiB pages are collapsed into one 2 MiB page), invalidating both the PMD-level and the constituent PTE-level TLB entries.
Reclaim, KSM, and kernel direct-map adjustments. Swap-out unmaps pages, KSM merges remap duplicates to a shared write-protected copy, and changes to the kernel's own mappings must be flushed too.

Code paths and intrinsics

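// Linux kernel: a TLB shootdown is roughly this.
// (Simplified sketch, not actual kernel source; the real path goes through
// mmu_gather batching.)

struct flush_info {                  // illustrative stand-in for struct flush_tlb_info
    unsigned long start, end;
};

static void do_flush_tlb_ipi(void *data);

void flush_tlb_range(struct vm_area_struct *vma,
                     unsigned long start, unsigned long end) {
    struct mm_struct *mm = vma->vm_mm;
    struct flush_info info = { .start = start, .end = end };
    int cpu = get_cpu();             // disable preemption so we cannot migrate mid-flush

    if (cpumask_weight(mm_cpumask(mm)) == 1 &&
        cpumask_test_cpu(cpu, mm_cpumask(mm))) {
        // Single-CPU mm: just flush locally, no IPI.
        local_flush_tlb_range(vma, start, end);
    } else {
        // Broadcast: IPI to every other CPU in mm_cpumask
        // (smp_call_function_many() skips the calling CPU), then wait for all acks.
        smp_call_function_many(mm_cpumask(mm),
                               do_flush_tlb_ipi,
                               &info,
                               /*wait=*/true);
        local_flush_tlb_range(vma, start, end);
    }
    put_cpu();
}

// On the receiver, the IPI handler runs:
static void do_flush_tlb_ipi(void *data) {
    struct flush_info *info = data;
    unsigned long addr;

    // x86: INVLPG for each page; real kernels switch to a full flush for large ranges.
    for (addr = info->start; addr < info->end; addr += PAGE_SIZE)
        invlpg(addr);  // illustrative helper; emits ASM: invlpg (%[addr])
    // With PCID, INVPCID can target one address in a specific PCID:
    // invpcid_flush_one(pcid, addr);
}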

# Inspect TLB shootdown traffic on Linux.

# Per-CPU TLB shootdown count from /proc/interrupts.
grep TLB /proc/interrupts
# Output: TLB:    142    156    134    ...   TLB shootdowns

# bpftrace one-liner — count shootdowns by code path.
sudo bpftrace -e '
  kprobe:flush_tlb_mm_range { @[kstack(3)] = count(); }'

# perf: count tlb:tlb_flush tracepoint hits (includes local flushes as well as remote shootdowns).
sudo perf stat -e tlb:tlb_flush -- ./workload
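
To track the `/proc/interrupts` counters over time rather than eyeballing them, the `TLB:` row can be parsed programmatically. The following is a minimal sketch; the function names are hypothetical, and it assumes the row layout shown in the sample output above (a `TLB:` label, one integer per CPU, then a text description).

```python
def tlb_shootdown_counts(interrupts_text):
    """Return the per-CPU counts from the 'TLB:' row, or [] if there is none."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "TLB:":
            continue
        counts = []
        for field in fields[1:]:
            if not field.isdigit():
                break            # reached the trailing description text
            counts.append(int(field))
        return counts
    return []


def read_tlb_shootdown_counts(path="/proc/interrupts"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return tlb_shootdown_counts(f.read())
```

Sampling `read_tlb_shootdown_counts()` twice and subtracting the totals gives a shootdown rate for the interval, which is usually more informative than the absolute counts since boot.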

Performance numbers

Base IPI cost: ~1 µs from APIC write to handler entry on the receiver.
Per-receiver INVLPG: ~50 ns; ack memory write: ~150 ns; total per receiver: ~200 ns.
Real measurements on a 64-core EPYC Milan: median 9 µs, p99 18 µs per range shootdown, which brackets the ~13.6 µs that the (1 + 0.2 × 63) µs model predicts.
A workload doing 50 small mmap/munmap pairs per second from each of 32 threads on a 64-core box: ~32 × 50 × 14 µs ≈ 22 ms/s of pure shootdown serialization summed over the initiating threads (about 0.7 ms/s each), equivalent to 2.2% of one core. On top of that, every core in the cpumask takes up to 1,600 shootdown interrupts per second.
Microsoft research (2020) on SQL Server: TLB shootdowns from buffer-pool page replacement cost up to 8% of throughput on 56-core SKUs.
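With 2 MiB huge pages, the number of pages covering a given byte range drops 512×, and so does the number of per-page invalidations. Workloads that issue one shootdown per page see shootdown overhead fall by roughly the same factor; a single call that already covers the whole range still costs one round of IPIs.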
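
The arithmetic behind the workload estimate and the 512× page-count ratio is easy to check. The helpers below are a sketch with hypothetical names; the inputs (32 threads, 50 operations per second, 14 µs per shootdown) are the article's own figures.

```python
def shootdown_budget(threads, ops_per_thread_per_s, cost_us):
    """Aggregate initiator-side time spent in shootdowns, per wall-clock second."""
    shootdowns_per_s = threads * ops_per_thread_per_s
    total_ms_per_s = shootdowns_per_s * cost_us / 1000.0
    return {
        "shootdowns_per_s": shootdowns_per_s,
        "total_ms_per_s": total_ms_per_s,
        "per_thread_ms_per_s": total_ms_per_s / threads,
        "fraction_of_one_core": total_ms_per_s / 1000.0,
    }


def pages_in_range(nbytes, page_size):
    """Number of pages needed to cover nbytes (ceiling division)."""
    return -(-nbytes // page_size)


budget = shootdown_budget(threads=32, ops_per_thread_per_s=50, cost_us=14.0)
ratio = pages_in_range(1 << 30, 4096) // pages_in_range(1 << 30, 2 << 20)
```

Here `budget` works out to 1,600 shootdowns per second and 22.4 ms/s in total, or 0.7 ms/s per initiating thread, and `ratio` is 512 for a 1 GiB range.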

Common pitfalls

High-frequency mmap/munmap loops. Languages that allocate via mmap (Go's runtime, OpenJDK ZGC's metadata) historically hit shootdown walls at high core counts. The fix is internal pooling.
Cross-socket scheduling without pinning. If a process's threads roam across sockets, the mm_cpumask grows to include both sockets, doubling the IPI broadcast cost.
Unbatched unmaps. The kernel batches invalidations within a single call (mmu_gather) but not across calls, so a hot loop of small munmap calls pays for one shootdown each; unmap the union in a single call if possible.
Misusing MADV_DONTNEED for "memory pressure" hinting. It's not free: each call that drops mapped pages triggers a shootdown on every other core in the cpumask.
Assuming huge pages always help. A 2 MiB shootdown is one IPI, but a transparent-huge-page split forced by mprotect-on-a-4KiB-subset can be more expensive than the un-promoted original.
Missing the spin-wait on the initiator. The initiator does not yield — it busy-loops on the completion counter. Profiling tools often blame the syscall instead of the shootdown that's actually consuming the cycles.
