TLB Shootdown

Inter-processor flush — the scaling tax on every munmap

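When one CPU modifies a page table entry, every other CPU that may hold a stale TLB copy must flush it. The IPI plus INVLPG round trip costs about 1–10 µs on machines with up to a few dozen cores and grows with core count, reaching tens of microseconds on large nodes.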

  • Base IPI cost: ~1 µs
  • Per-receiver cost: ~200 ns
  • 4-core total: ~1.6 µs
  • 64-core total: ~14 µs
  • 256-core total: ~50 µs
  • x86 instructions: INVLPG, INVPCID

How a TLB shootdown unfolds

Every CPU caches its own copy of recent virtual-to-physical translations in a per-core Translation Lookaside Buffer. The TLB is fast — a hit is part of address-generation, effectively free — but it's also stale-prone. The page-table entry it caches can be modified by any thread that holds the same mm_struct: another core executing munmap, mprotect, or just servicing a copy-on-write fault.

Hardware doesn't snoop the TLB the way the cache coherence protocol snoops the L1. Once a translation lands in a TLB it stays there until something explicitly invalidates it. So when one core changes a PTE, software has to broadcast a "throw out your copy" message to every other core that might have it cached — that's the shootdown.

On Linux, the shootdown choreography is roughly:

  1. The initiating CPU acquires the page-table lock and modifies the PTE.
  2. It reads the mm_struct's cpumask: the set of CPUs that may hold TLB entries for this address space, typically those running (or that recently ran) one of its threads.
  3. It sends an Inter-Processor Interrupt (IPI) to every other CPU in that mask.
  4. Each receiver enters an interrupt handler, executes INVLPG or INVPCID for the relevant address(es), and signals completion in shared memory.
  5. The initiator spins on the completion counter until every recipient has acked.
  6. The initiator releases the lock and continues.

The whole dance is serializing for the initiator and disruptive for the receivers — each one has to interrupt whatever user-space code it was running, take an interrupt, execute kernel code, and resume.
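
Steps 3 to 5 boil down to a counter that receivers decrement and the initiator polls. The following is a minimal user-space sketch of that handshake using C11 atomics, not the kernel's actual mechanism; `struct shootdown_req` and the `shootdown_*` helpers are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request record shared between initiator and receivers. */
struct shootdown_req {
    unsigned long start;   /* first address to invalidate */
    unsigned long end;     /* one past the last address */
    atomic_int pending;    /* receivers that have not acked yet */
};

/* Initiator: record the range and how many receivers must ack. */
static void shootdown_begin(struct shootdown_req *req, unsigned long start,
                            unsigned long end, int receivers)
{
    assert(receivers >= 0 && start <= end);
    req->start = start;
    req->end = end;
    /* Publish the count only after the range has been written. */
    atomic_store_explicit(&req->pending, receivers, memory_order_release);
}

/* Receiver: called after it has invalidated [start, end) locally. */
static void shootdown_ack(struct shootdown_req *req)
{
    atomic_fetch_sub_explicit(&req->pending, 1, memory_order_release);
}

/* Initiator: true once every receiver has acked. */
static bool shootdown_done(struct shootdown_req *req)
{
    return atomic_load_explicit(&req->pending, memory_order_acquire) == 0;
}

/* Initiator: busy-wait as in step 5 (no sleeping, no yielding). */
static void shootdown_wait(struct shootdown_req *req)
{
    while (!shootdown_done(req))
        ;  /* spin */
}
```

In a real system each receiver would call `shootdown_ack` from its interrupt handler on another CPU. The sketch shows why the initiator's latency is set by the slowest receiver: `shootdown_wait` cannot return until the last decrement lands.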

Cost scaling with core count

Cores in mm cpumask     | Total shootdown cost | Critical path              | Throughput impact
2                       | ~1.2 µs              | 1 IPI + 1 ack              | ~1k shootdowns/s before pain
4                       | ~1.6 µs              | 3 IPIs in parallel         | ~600 shootdowns/s/core
16                      | ~4 µs                | 15 IPIs, ack wait          | ~250 shootdowns/s/core
64                      | ~14 µs               | 63 IPIs, longest tail wins | ~70 shootdowns/s/core
128 (2× 64-core EPYC)   | ~28 µs               | IPI traverses inter-socket | ~35 shootdowns/s/core
256+ (HPC node)         | ~50–100 µs           | Coherence + IPI fanout     | ~10 shootdowns/s/core

The pattern: base latency plus a fan-out cost. The "base" comes from APIC programming, interrupt delivery, and the kernel's interrupt-entry path; the per-receiver cost is the actual INVLPG plus the bus turnaround for the ack. Modern AMD EPYC and Intel Granite Rapids amortize a lot of this with broadcast IPI, but the trend with more cores per node is still upward.
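
The base-plus-fan-out pattern can be written down as a one-line linear model. This sketch assumes the two constants quoted in this article (~1 µs base, ~200 ns per receiver); the function name is made up for illustration, and the model deliberately ignores inter-socket hops and tail latency.

```c
#include <assert.h>

/* Assumed constants, taken from the figures quoted in this article. */
enum {
    BASE_IPI_NS     = 1000,  /* ~1 us: APIC write to handler entry */
    PER_RECEIVER_NS = 200    /* ~200 ns: INVLPG plus ack per receiver */
};

/* Estimated cost in nanoseconds of one shootdown when `cpus` CPUs are
 * set in the mm cpumask (initiator included). */
static long shootdown_cost_ns(int cpus)
{
    assert(cpus >= 1);
    if (cpus == 1)
        return 0;  /* local flush only, no IPI is sent */
    return BASE_IPI_NS + (long)PER_RECEIVER_NS * (cpus - 1);
}
```

For example, `shootdown_cost_ns(4)` returns 1600 and `shootdown_cost_ns(64)` returns 13600, in line with the ~1.6 µs and ~14 µs rows. For 128 CPUs the model gives 26400 ns, slightly below the ~28 µs row, which is consistent with the extra inter-socket cost that the model leaves out.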

When shootdowns fire

  • munmap. The page is gone; any TLB caching it must be invalidated. Frequent in dynamic-linker workloads (DSO load/unload), database page replacement, and certain GC algorithms that release back-store ranges.
  • mprotect / mremap. Permissions or layout changed, so TLB-cached protection bits are wrong.
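  • MADV_DONTNEED. The kernel drops the backing pages (anonymous memory reads back as zeros on the next touch), so TLB entries that mapped them must be invalidated.
  • Copy-on-write resolution. A write to a page shared copy-on-write gives the writer a private copy, so the writer's PTE now points at a different frame and other CPUs running threads of that process must drop the old translation.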
  • NUMA balancing (auto-numa). The kernel migrates a page between nodes and remaps all PTEs pointing to it.
  • Transparent huge-page split/collapse. A 2 MiB page split into 4 KiB pages, or 4 KiB pages collapsed into one 2 MiB page, invalidates both the PMD-level and the constituent PTE-level TLB entries.
  • Kernel direct-map adjustments. Reclaim, page-stealer activity in the slab allocator, KSM merges.
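
From user space, the first three triggers in the list above are ordinary system calls. The sketch below maps one anonymous page, faults it in, and then calls mprotect, madvise(MADV_DONTNEED), and munmap in turn; the function name is hypothetical. In a single-threaded process these calls only flush the local TLB. IPIs are sent only when other CPUs are in the mm's cpumask.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS and madvise on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one page, touch it, then change or remove the mapping three ways.
 * Returns 0 on success, -1 on the first failure. */
static int exercise_shootdown_paths(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 1;  /* fault the page in so a PTE (and a TLB entry) exists */

    int rc = -1;
    if (mprotect(p, (size_t)page, PROT_READ) != 0)     /* permission change */
        goto out;
    if (madvise(p, (size_t)page, MADV_DONTNEED) != 0)  /* drop backing page */
        goto out;
    rc = (p[0] == 0) ? 0 : -1;  /* anonymous memory reads back as zero */
out:
    if (munmap(p, (size_t)page) != 0)                  /* mapping removed */
        rc = -1;
    return rc;
}
```

Running the same sequence while sibling threads are active on other CPUs is what turns each of these calls into a cross-CPU shootdown.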

Code paths and intrinsics

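// Linux kernel: a TLB shootdown is roughly this.
// (Simplified sketch, not the real x86 source; the real path goes through
// mmu_gather batching and flush_tlb_mm_range. local_flush_tlb_range() and
// invlpg() stand in for the arch-specific local-flush helpers.)

static void do_flush_tlb_ipi(void *info);

void flush_tlb_range(struct vm_area_struct *vma,
                     unsigned long start, unsigned long end) {
    struct mm_struct *mm = vma->vm_mm;
    // Range descriptor handed to every receiver.
    struct flush_tlb_info info = { .mm = mm, .start = start, .end = end };

    if (cpumask_weight(mm_cpumask(mm)) == 1 &&
        cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm))) {
        // Single-CPU mm: just flush locally, no IPI.
        local_flush_tlb_range(vma, start, end);
        return;
    }
    // Broadcast: IPI to every other CPU in mm_cpumask
    // (smp_call_function_many skips the calling CPU).
    // wait=true: spin until every receiver has run the handler.
    smp_call_function_many(mm_cpumask(mm),
                           do_flush_tlb_ipi,
                           &info,
                           /*wait=*/true);
    local_flush_tlb_range(vma, start, end);
}

// On the receiver, the IPI handler runs:
static void do_flush_tlb_ipi(void *info) {
    const struct flush_tlb_info *f = info;
    unsigned long addr;

    // x86: INVLPG for each page, or a full-mm flush for large ranges.
    for (addr = f->start; addr < f->end; addr += PAGE_SIZE)
        invlpg(addr);  // emits ASM: invlpg (%[addr])
    // With PCID, prefer INVPCID for narrower scope:
    // invpcid_flush_one(pcid, addr);
}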

# Inspect TLB shootdown traffic on Linux.

# Per-CPU TLB shootdown count from /proc/interrupts.
grep TLB /proc/interrupts
# Example output: TLB:    142    156    134    ...   TLB shootdowns

# bpftrace one-liner: count flush requests by calling code path
# (top 3 kernel stack frames).
sudo bpftrace -e '
  kprobe:flush_tlb_mm_range { @[kstack(3)] = count(); }'

# perf: count tlb_flush tracepoint hits while the workload runs.
sudo perf stat -e tlb:tlb_flush -- ./workload
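
To track the /proc/interrupts counter from a program rather than by eye, the per-CPU columns on the TLB line can be summed. This is a small sketch that assumes the line layout shown in the example output above (a "TLB:" label, one number per CPU, then a text description); `sum_tlb_counts` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters on a "TLB:" line from /proc/interrupts.
 * Returns -1 if the line does not start with the TLB label. */
static long sum_tlb_counts(const char *line)
{
    assert(line != NULL);
    while (*line == ' ' || *line == '\t')
        line++;
    if (strncmp(line, "TLB:", 4) != 0)
        return -1;
    line += 4;

    long total = 0;
    for (;;) {
        char *end;
        long n = strtol(line, &end, 10);
        if (end == line)  /* no more numbers: reached the description */
            break;
        total += n;
        line = end;
    }
    return total;
}
```

Sampling this sum twice and dividing the difference by the interval gives a system-wide shootdown rate, which can then be compared against the per-core budgets in the scaling table.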

Performance numbers

  • Base IPI cost: ~1 µs from APIC write to handler entry on the receiver.
  • Per-receiver INVLPG: ~50 ns; ack memory write: ~150 ns; total per receiver: ~200 ns.
  • Real measurements on a 64-core EPYC Milan: median 9 µs, p99 18 µs per range shootdown. The (1 + 0.2 × 63) µs ≈ 13.6 µs model estimate falls between the two, so the linear model is a rough guide rather than an exact fit.
  • A workload doing 50 small mmap/munmap pairs per second from each of 32 threads on a 64-core box: ~32 × 50 × 14 µs ≈ 22 ms/s of initiator-side shootdown serialization, about 2.2% of one core in aggregate. On top of that, every core in the cpumask takes roughly 1,600 shootdown interrupts per second.
  • Microsoft research (2020) on SQL Server: TLB shootdowns from buffer-pool page replacement cost up to 8% of throughput on 56-core SKUs.
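  • With 2 MiB huge pages, the same byte range needs 512× fewer translations (2 MiB / 4 KiB = 512). Shootdown overhead falls by a similar factor only for workloads that invalidate page by page; a range flush that already covers the region in one IPI gains less.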
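
The arithmetic behind the workload estimate and the 512× huge-page factor is simple enough to check in code. The helpers below are illustrative only, with made-up names; they multiply out the figures quoted in the list above and do not measure anything.

```c
#include <assert.h>

/* Microseconds per second that initiators spend waiting on shootdowns:
 * threads x shootdowns per thread per second x cost per shootdown. */
static double initiator_wait_us_per_s(int threads, double per_thread_rate,
                                      double cost_us)
{
    assert(threads >= 0 && per_thread_rate >= 0.0 && cost_us >= 0.0);
    return threads * per_thread_rate * cost_us;
}

/* Number of page-sized translations covering `bytes`, rounded up. */
static unsigned long pages_covering(unsigned long bytes,
                                    unsigned long page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}
```

With the numbers from the text, `initiator_wait_us_per_s(32, 50.0, 14.0)` yields 22400 µs, which is 22.4 ms of every second. `pages_covering(2UL << 20, 4096)` yields 512, against a single translation for the same range when it is backed by one 2 MiB page.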

Common pitfalls

  • High-frequency mmap/munmap loops. Languages that allocate via mmap (Go's runtime, OpenJDK ZGC's metadata) historically hit shootdown walls at high core counts. The fix is internal pooling.
  • Cross-socket scheduling without pinning. If a process's threads roam across sockets, the mm_cpumask grows to include both sockets, doubling the IPI broadcast cost.
  • Forgetting deferred unmaps (mmu_gather). Some workloads call munmap in a hot loop; batch them into a single munmap of the union if possible.
  • Misusing MADV_DONTNEED for "memory pressure" hinting. It's not free — every call triggers a shootdown on every core in the cpumask.
  • Assuming huge pages always help. A 2 MiB shootdown is one IPI, but a transparent-huge-page split forced by mprotect-on-a-4KiB-subset can be more expensive than the un-promoted original.
  • Missing the spin-wait on the initiator. The initiator does not yield — it busy-loops on the completion counter. Profiling tools often blame the syscall instead of the shootdown that's actually consuming the cycles.
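
The "internal pooling" fix from the first pitfall can be as simple as keeping unmapped-in-spirit buffers mapped and handing them out again. The following is a minimal, single-threaded sketch under assumed names (`struct buf_pool`, `pool_get`, `pool_put`, and the `POOL_SLOTS` capacity are all hypothetical); a real allocator would add locking or per-thread caches.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define POOL_SLOTS 8  /* assumed small fixed capacity for the sketch */

/* Fixed-size buffer pool: buffers are mapped once and then recycled,
 * so steady-state get/put makes no mmap/munmap calls. */
struct buf_pool {
    size_t buf_size;
    void  *free_list[POOL_SLOTS];
    int    free_count;
    long   mmap_calls;  /* counts real mmap() calls, for illustration */
};

static void pool_init(struct buf_pool *p, size_t buf_size)
{
    assert(buf_size > 0);
    p->buf_size = buf_size;
    p->free_count = 0;
    p->mmap_calls = 0;
}

/* Reuse a cached buffer if one exists; otherwise map a new one. */
static void *pool_get(struct buf_pool *p)
{
    if (p->free_count > 0)
        return p->free_list[--p->free_count];
    void *buf = mmap(NULL, p->buf_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    p->mmap_calls++;
    return buf;
}

/* Keep the mapping for reuse; only unmap when the cache is full. */
static void pool_put(struct buf_pool *p, void *buf)
{
    if (p->free_count < POOL_SLOTS)
        p->free_list[p->free_count++] = buf;
    else
        munmap(buf, p->buf_size);
}
```

Because `pool_put` usually leaves the mapping in place, no PTE changes and no shootdown is needed. The trade-off is that recycled buffers keep their old contents and their memory stays resident until the pool overflows.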

Frequently asked questions

What is a TLB shootdown?

Coordinated invalidation of stale TLB entries across all CPUs after one CPU changes a PTE. Hardware doesn't snoop TLBs — software has to broadcast an IPI asking every other core to INVLPG.

When does the kernel issue a TLB shootdown?

munmap, mprotect, MADV_DONTNEED, COW resolution, NUMA balancing, THP split, swap-out, KSM merges. Linux's flush_tlb_mm and flush_tlb_range orchestrate the broadcast.

How expensive is one TLB shootdown?

~1 µs base + ~200 ns per receiver. 4 cores ~1.6 µs; 64 cores ~14 µs; 256 cores 50+ µs. The initiator spins until the last ack arrives — full serialization.

Why do all cores have to flush, not just the ones that touched the page?

Hardware doesn't track per-core TLB residency. The kernel can scope to the mm's cpumask but every CPU in that mask must be told. PCID helps with context-switch flush, not with shootdowns.

What's INVPCID and how does it help?

An x86 instruction (Haswell+) that invalidates TLB entries scoped to a Process-Context Identifier. Combined with PCID-tagged TLBs, it cuts post-KPTI flush cost significantly.

What about hardware shootdown offload?

Arm's TLBI instructions broadcast over the coherent interconnect, AMD provides INVLPGB (paired with TLBSYNC), and Intel has specified Remote Action Request. All three move the broadcast from software IPIs into hardware, which can be 5–10× cheaper at large core counts.

How do I avoid TLB shootdowns in my code?

Batch unmaps; use MAP_FIXED reservations; prefer huge pages (2 MiB cuts shootdown count 512×); use io_uring registered buffers and persistent mmap with reuse-via-reset.