False Sharing

Two threads write to neighboring bytes — cache coherence forces invalidation pings

False sharing happens when two cores each write to logically independent variables that reside on the same cache line (typically 64 bytes on x86). Even though the variables are unrelated, the cache coherence protocol (MESI or a variant) treats the entire line as a unit: every write on one core invalidates the line in the other core's cache, forcing that core to re-fetch it from another core's cache, the shared L3, or memory. The result can be a 10×–100× slowdown versus the same code with the variables on separate lines. It is detected with Linux perf c2c and fixed by aligning or padding the data to the line size, for example with alignas(64) or __attribute__((aligned(64))). It is a well-known problem in concurrent counters and lock-free queues.

  • Cache line: 64 bytes (x86, modern ARM)
  • Slowdown: 10-100×
  • Detection: perf c2c
  • Fix: Align to cache line
  • Coherence protocol: MESI/MESIF/MOESI
  • Common in: Per-thread counter arrays

Why false sharing matters

  • Parallel scalability cliffs. Code that "should" scale linearly with cores instead flat-lines or regresses past 4-8 threads. False sharing is a top suspect when nothing else explains the lack of speedup.
  • Lock-free queue throughput. The producer pushes and the consumer pops; head and tail indices that sit on the same cache line can cause a 10× slowdown. The Disruptor (LMAX, 2010) pads its head, tail, and producer sequence onto separate lines, a design choice credited for much of its reported throughput (on the order of 100M ops/sec).
  • Per-thread counter arrays. The textbook example: int counts[NUM_THREADS]. With 4-byte ints, 16 counters fit in one 64-byte cache line; 16 threads bouncing the same line is catastrophic. Java's LongAdder pads its cells to avoid this, and Go's runtime pads per-P stats for the same reason.
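  • Atomic reference counters. std::shared_ptr's control block keeps the strong and weak counts adjacent in memory, so threads that touch only one of them still contend on the same line. High-contention shared_ptr code can spend a large share of CPU time (30-50% in bad cases) on cache-line bouncing.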
  • Lock acquire fast paths. A spinlock word that shares a line with other lock state or with the data it protects makes waiters' polling collide with the holder's writes. Well-tuned lock implementations keep the lock word on its own line.
  • Network packet processing. Per-CPU statistics (DPDK, kernel network stack) must be padded; otherwise NIC RX queue stats invalidate scheduler stats.

How false sharing happens — step by step

Suppose you write a histogram with one counter per worker thread:

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counters[16];
// each thread does: counters[my_id]++;

A std::atomic<uint64_t> is 8 bytes, so 8 counters fit in one 64-byte line and 16 fit in two. With 16 threads each incrementing one counter:

  1. Thread 0 on CPU 0 increments counters[0]. Cache line A is fetched into CPU 0's L1, state = Modified.
  2. Thread 1 on CPU 1 increments counters[1]. CPU 1 sends a Read-For-Ownership to CPU 0, which invalidates its line and forwards the data. CPU 1 now holds line A in Modified.
  3. Thread 2 on CPU 2 increments counters[2]. Same dance: CPU 1 invalidates, ships to CPU 2.
  4. Each increment costs 30-100 ns of cross-core coherence traffic instead of the 4 ns a hot L1 atomic should cost. Throughput collapses.
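
The grouping behind the steps above can be checked with a little arithmetic. The following is a minimal sketch, assuming 64-byte lines and an 8-byte atomic; the names kLineSize, line_of, and lines_touched are illustrative and not part of any library.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineSize = 64;  // assumption: 64-byte lines, as in the walkthrough

using Counter = std::atomic<std::uint64_t>;
static_assert(sizeof(Counter) == 8, "sketch assumes an 8-byte atomic counter");

// Line index of element i, for an array that starts exactly on a line boundary.
constexpr std::size_t line_of(std::size_t i) {
    return (i * sizeof(Counter)) / kLineSize;
}
static_assert(line_of(0) == line_of(7), "counters 0..7 share the first line");
static_assert(line_of(8) == 1 && line_of(15) == 1, "counters 8..15 share the second");

// For a real array, which is only guaranteed the counter's own alignment:
// the number of distinct lines its n elements touch.
inline std::size_t lines_touched(const Counter* arr, std::size_t n) {
    assert(n > 0);
    const auto first = reinterpret_cast<std::uintptr_t>(&arr[0]) / kLineSize;
    const auto last = reinterpret_cast<std::uintptr_t>(&arr[n - 1]) / kLineSize;
    return static_cast<std::size_t>(last - first) + 1;
}
```

Because the unpadded array is not necessarily line-aligned, lines_touched(counters, 16) may report three lines rather than two; either way, several threads end up writing to each line.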

The fix:

struct alignas(64) PaddedCounter {
    std::atomic<uint64_t> value;
};
PaddedCounter counters[16];

Each PaddedCounter now occupies its own 64-byte line. Throughput jumps 10-100×.
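
To see the difference on your own hardware, a small timing harness is enough. This is a rough sketch rather than a rigorous benchmark: PackedCounter and time_increments are hypothetical names, the 64-byte alignment is an assumption about the target, and the measured ratio depends heavily on the machine and on where the scheduler places the threads.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

struct PackedCounter {                 // 8 bytes: neighbours share a line
    std::atomic<std::uint64_t> value{0};
};

struct alignas(64) PaddedCounter {     // 64 bytes: one counter per line
    std::atomic<std::uint64_t> value{0};
};

// Starts `threads` workers, each incrementing only its own counter `iters`
// times, and returns the elapsed wall-clock time in milliseconds.
template <typename CounterT>
double time_increments(unsigned threads, std::uint64_t iters) {
    std::vector<CounterT> counters(threads);  // C++17 allocators honour alignas(64)
    std::vector<std::thread> workers;
    const auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&counters, t, iters] {
            for (std::uint64_t i = 0; i < iters; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    const auto stop = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < threads; ++t)
        assert(counters[t].value.load() == iters);  // no increments were lost
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_increments<PackedCounter>(16, 10000000) and then time_increments<PaddedCounter>(16, 10000000) and comparing the two results gives a first impression; build with optimizations and thread support enabled (for example -O2 -pthread with GCC or Clang).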

Detection in practice

Three production-grade tools:

  • Linux perf c2c. Run perf c2c record ./your_binary, then perf c2c report. It surfaces HITM events (loads that hit a line held Modified in another core's cache) grouped by cache line and source code location. Watch the per-line HITM totals (high = bad) and the per-line breakdown by offset and source line: different code locations writing different offsets within the same line indicate false sharing.
  • Intel VTune. The Memory Access analysis uses HITM events to highlight contended lines. It also runs on Windows, where perf c2c is not available.
  • Code review. Look for: arrays of per-thread state without padding; structs with adjacent fields written by different threads; lock-free queues whose head/tail are members of the same struct.

A rough sanity check: if perf stat -e cache-misses,cache-references shows a miss rate above about 2% on a workload that mostly touches its own per-thread state, suspect false sharing and confirm with perf c2c.

Benchmark numbers

From a published microbenchmark on Intel Xeon (Cascade Lake, 2.5 GHz):

  • 16 threads, 16 counters, no padding (8-byte spacing): 12M increments/sec aggregated.
  • 16 threads, 16 counters, 64-byte padding: 1,400M increments/sec aggregated.
  • 16 threads, 16 counters, 128-byte padding (covers prefetcher): 1,500M increments/sec aggregated.
  • 16 threads, thread-local counters (no shared array): ~1,800M increments/sec — the upper bound.

The 64-byte fix recovers roughly 117× throughput (12M to 1,400M increments/sec); 128-byte padding (covering prefetched-pair lines) wins another 7%; thread-local counters are the clean ceiling.

The 128-byte mystery

Why do some libraries (folly, JCTools, Disruptor) pad to 128 bytes when the cache line is 64? Two reasons:

  • Adjacent line prefetcher. Intel CPUs (documented for Sandy Bridge and later) have a spatial prefetcher that tries to complete each fetched line into an aligned 128-byte pair. A write to line N causes a coherence transaction, and the neighboring line of the pair can be pulled along by the prefetcher and suffer secondary invalidation. 128-byte padding sidesteps this.
  • Apple Silicon and POWER. Some architectures use 128-byte lines outright. Cross-platform code defensively pads to 128 to be portable.

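std::hardware_destructive_interference_size (C++17, declared in <new>) is a compile-time constant giving the implementation's recommended minimum offset between objects to avoid false sharing. It is typically 64 on x86-64; on targets with larger lines, such as POWER and Apple Silicon, implementations may report 128 or more. Check the value your compiler actually provides, since some standard libraries added the constant well after C++17 shipped.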
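
Since not every toolchain ships the constant, a guarded definition is a common workaround. The sketch below uses the standard feature-test macro __cpp_lib_hardware_interference_size; the names kDestructiveSize, LineIsolated, and check_isolation are illustrative, and the fallback of 64 is an assumption that suits most x86-64 parts but not 128-byte-line targets.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Prefer the standard constant when the library provides it; otherwise fall
// back to an assumed 64 bytes.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kDestructiveSize = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kDestructiveSize = 64;
#endif

// One counter per destructive-interference unit.
struct alignas(kDestructiveSize) LineIsolated {
    std::atomic<std::uint64_t> value{0};
};

static_assert(alignof(LineIsolated) == kDestructiveSize, "aligned to the unit");
static_assert(sizeof(LineIsolated) % kDestructiveSize == 0, "size is a whole number of units");

// Runtime check that two array neighbours never land in the same unit.
inline void check_isolation() {
    LineIsolated pair[2];
    const auto a = reinterpret_cast<std::uintptr_t>(&pair[0]);
    const auto b = reinterpret_cast<std::uintptr_t>(&pair[1]);
    assert(a % kDestructiveSize == 0);
    assert(b - a == sizeof(LineIsolated));
    assert(a / kDestructiveSize != b / kDestructiveSize);
}
```

Code that must be safe on 128-byte-line hardware regardless of what the compiler reports can simply take the larger of kDestructiveSize and 128.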

Common misconceptions

  • "Padding wastes memory." A per-thread counter padded from 8 to 64 bytes wastes 56 bytes per counter; at 64 cores the padded array occupies 4 KB, of which 3.5 KB is padding. That is a net win for any counter hot enough to be touched more than a few thousand times.
  • "Only x86 problem." Modern ARM cores (Cortex-A76+, Graviton) use 64-byte cache lines with the same MESI-derivative protocols. Apple M-series and POWER use 128-byte lines and false-share in the same way. Even RISC-V SoCs are converging on 64-byte lines.
  • "Compiler fixes it." The compiler aligns single objects per their type alignment but cannot detect cross-thread access patterns. It will pack adjacent fields tightly without any sharing analysis. The fix must come from the programmer via alignas or library annotations.
  • "Reads can't false-share." Two cores reading the same line is fine — both hold it in Shared state. False sharing requires writes. But a single occasional write by one core periodically invalidates the line on all other cores, costing the readers re-fetch latency. Mostly-read scenarios are still vulnerable.
  • "std::atomic prevents it." std::atomic guarantees atomicity but does not enforce alignment beyond the natural type alignment. std::atomic<int> in an array still false-shares with its neighbors. You need alignas.
  • "Hyperthreaded cores share L1." True — sibling hyperthreads share L1, so they don't ping each other. But they ping any other physical core's L1 through the L3/interconnect. False sharing usually appears when threads land on different physical cores.
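
Two of the points above, the memory cost of padding and the fact that std::atomic alone does not isolate a line, can be verified directly by the compiler. This is a small sketch assuming 64-byte lines and 64 cores; Packed, Padded, and neighbours_share_line are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLine = 64;   // assumed cache-line size
constexpr std::size_t kCores = 64;  // core count used in the memory estimate

struct Packed { std::atomic<std::uint64_t> v{0}; };
struct alignas(kLine) Padded { std::atomic<std::uint64_t> v{0}; };

// "Padding wastes memory": the arithmetic, checked at compile time.
static_assert(sizeof(Packed) == 8 && sizeof(Padded) == 64, "8-byte counter, 64-byte slot");
static_assert(sizeof(Padded[kCores]) == 4096, "padded array: 4 KB in total");
static_assert(sizeof(Padded[kCores]) - sizeof(Packed[kCores]) == 3584,
              "56 bytes of padding for each of 64 counters");

// "std::atomic prevents it": atomics only get their natural alignment.
static_assert(alignof(std::atomic<int>) < kLine, "std::atomic<int> is not line-aligned");

// Two neighbouring atomics in an array that starts on a line boundary.
inline bool neighbours_share_line() {
    alignas(kLine) std::atomic<int> pair[2];
    const auto a = reinterpret_cast<std::uintptr_t>(&pair[0]);
    const auto b = reinterpret_cast<std::uintptr_t>(&pair[1]);
    assert(a % kLine == 0);          // the array itself is line-aligned
    return a / kLine == b / kLine;   // true: both elements sit in one line
}
```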

Mitigation patterns

  • Pad to std::hardware_destructive_interference_size. Portable C++17 idiom.
  • @Contended (Java). Apply to high-contention fields in concurrent classes.
  • Per-CPU partitioning. Each CPU maintains its own data; aggregate on read. Linux per_cpu, Go runtime mcache.
  • RCU / copy-on-write. Eliminate shared writes entirely by versioning; readers only ever read shared state.
  • Striped locks. Replace one hot lock with N (for example Guava's Striped<Lock> in Java); each thread hashes to one of N, distributing cache-line load.
  • Lock-free with separated head/tail. Producer/consumer queues pad head, tail, and any sequence counters to separate lines (Disruptor, JCTools' MpscArrayQueue).
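
As an illustration of the last pattern, here is a minimal single-producer/single-consumer ring buffer whose head and tail indices live on separate lines. It is a sketch under stated assumptions (64-byte lines, exactly one producer thread and one consumer thread, power-of-two capacity) and is not the layout used by Disruptor or JCTools; SpscQueue and its members are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLine = 64;  // assumed line size; some targets need 128

template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");
public:
    // Call from the producer thread only.
    bool try_push(const T& item) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail - head == Capacity) return false;         // full
        slots_[tail & (Capacity - 1)] = item;
        tail_.store(tail + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Call from the consumer thread only.
    bool try_pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) return false;                    // empty
        out = slots_[head & (Capacity - 1)];
        head_.store(head + 1, std::memory_order_release);  // free the slot
        return true;
    }

private:
    alignas(kLine) std::atomic<std::size_t> head_{0};  // written by the consumer
    alignas(kLine) std::atomic<std::size_t> tail_{0};  // written by the producer
    alignas(kLine) T slots_[Capacity];                 // payload on its own lines
};

// Single-threaded usage example that also checks FIFO order and the layout.
inline void spsc_smoke_test() {
    SpscQueue<int, 4> q;
    int v = -1;
    bool ok = q.try_pop(v);
    assert(!ok);                                  // starts empty
    for (int i = 0; i < 4; ++i) { ok = q.try_push(i); assert(ok); }
    ok = q.try_push(99);
    assert(!ok);                                  // full at capacity
    ok = q.try_pop(v);
    assert(ok && v == 0);                         // oldest item comes out first
    static_assert(alignof(SpscQueue<int, 4>) == kLine, "queue is line-aligned");
    static_assert(sizeof(SpscQueue<int, 4>) >= 3 * kLine, "head, tail, slots on separate lines");
    (void)ok;
}
```

Without the two alignas specifiers on the indices, head_ and tail_ would sit 8 bytes apart and every push would invalidate the line the consumer needs for its next pop.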

Frequently asked questions

What is a cache line?

A cache line is the unit of transfer between main memory and CPU caches. On modern x86 (since Pentium 4) and most modern ARM cores (Cortex-A76+, Graviton), it is 64 bytes. Some POWER and ARM designs, including Apple Silicon, use 128 bytes. When the CPU reads any byte, the entire line-aligned block is fetched into L1. Subsequent reads of any byte in that line hit cache (~1-4 cycles); reads of bytes in other lines miss unless those lines are already cached. Cache coherence operates at line granularity: the line is the smallest unit the protocol can invalidate, share, or transfer.

Why does MESI cause this?

MESI tracks each cache line in one of four states: Modified, Exclusive, Shared, Invalid. To write a line, a core must hold it in the Exclusive or Modified state, meaning no other core may hold a valid copy. When core A writes its variable on line L, the protocol sends invalidation messages to all other cores holding L, putting their copies into the Invalid state, even if those cores never touched A's variable. When core B then tries to write its own variable on the same line L, it must re-fetch L (now a miss, ~30-100 ns) and in turn invalidate A's copy. The ping-pong continues for as long as both cores keep writing.
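
The state transitions described above can be modelled in a few lines. This is a deliberately simplified toy, not a faithful model of any real protocol: it tracks one line across a fixed number of private caches and ignores write-backs, forwarding states, and timing. LineModel and its members are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

enum class State { Modified, Exclusive, Shared, Invalid };

// Toy MESI model of a single cache line seen by `Cores` private caches.
template <std::size_t Cores>
struct LineModel {
    std::array<State, Cores> state;
    std::size_t misses = 0;         // accesses that found the local copy Invalid
    std::size_t invalidations = 0;  // remote copies killed by a write

    LineModel() { state.fill(State::Invalid); }

    void read(std::size_t core) {
        assert(core < Cores);
        if (state[core] != State::Invalid) return;  // hit in M, E, or S
        ++misses;
        bool others_hold_copy = false;
        for (std::size_t c = 0; c < Cores; ++c) {
            if (c != core && state[c] != State::Invalid) {
                state[c] = State::Shared;           // owner downgrades to Shared
                others_hold_copy = true;
            }
        }
        state[core] = others_hold_copy ? State::Shared : State::Exclusive;
    }

    void write(std::size_t core) {
        assert(core < Cores);
        if (state[core] == State::Invalid) ++misses;  // read-for-ownership
        for (std::size_t c = 0; c < Cores; ++c) {
            if (c != core && state[c] != State::Invalid) {
                state[c] = State::Invalid;            // invalidate every other copy
                ++invalidations;
            }
        }
        state[core] = State::Modified;
    }
};
```

In this model, two cores that alternate writes for 100 rounds record 200 misses and 199 invalidations, while a single core issuing the same 200 writes records one miss and no invalidations. That gap is the ping-pong in miniature.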

How do I detect false sharing?

Linux perf c2c (cache-to-cache) is the gold standard. Run 'perf c2c record' on your workload, then 'perf c2c report'; it shows HITM (hit-modified) events grouped by cache line, the source code lines hitting each, and which CPUs participate. A high HITM rate (as a rough guide, more than 1% of memory accesses) together with multiple cores reading and writing nearby addresses signals false sharing. On Intel, VTune's Memory Access analysis surfaces it as 'remote cache hit' or 'HITM'. On macOS, the closest equivalent is Instruments' Counters template, optionally combined with dtrace probes.

What is alignas(64) in C++?

alignas(64) is the portable C++11 spelling of 'place this on a 64-byte boundary'. It is equivalent to GCC's __attribute__((aligned(64))) and MSVC's __declspec(align(64)). The standard library exposes std::hardware_destructive_interference_size (C++17, in <new>) as the recommended minimum separation for false-sharing avoidance. A common pattern is to pad per-thread counters to a full line, as in struct Counter { alignas(64) std::atomic<uint64_t> value; };, so that each thread's value occupies its own cache line.

How does Java's @Contended annotation help?

Java's sun.misc.Contended (jdk.internal.vm.annotation.Contended since JDK 9) instructs the JVM to pad annotated fields with 128 bytes on each side, so that they do not share a line with anything else. It is used in LongAdder, ForkJoinPool, and ConcurrentHashMap's counter cells. User code must run with -XX:-RestrictContended for the annotation to take effect. The 128-byte padding is deliberately paranoid: it covers the Pentium 4 sectored cache (two adjacent 64-byte lines transferred together) and prefetchers that pull adjacent lines. Net memory cost: 128 bytes of padding on each side of every @Contended field or field group.

Is true sharing always a bigger problem?

Not necessarily. True sharing (multiple threads accessing the same variable) is usually intentional and can be optimized via partitioning, atomics, or RCU. False sharing is invisible: the code appears thread-local but secretly shares cache lines. In accounts of real production performance bugs at companies such as Google and Facebook, false sharing is reportedly one of the top five causes of unexpected scalability cliffs. The fix is mechanical (add padding), but locating the problem requires profiling tools that many engineers have not learned.