Systems

Cache Coherence

How private L1 caches conspire to look like one shared memory.

Every core on a modern CPU has its own L1 cache. Without coordination, two cores could hold two different versions of the same memory address. Cache coherence protocols — MESI, MOESI, MESIF — define how caches snoop on each other's traffic, transition between states, and ensure the whole system agrees on a single value per address. The cost of getting this wrong (or of making compilers do it for you accidentally) is false sharing: throughput collapsing because two unrelated variables share a 64-byte cache line.

  • Cache line size64 B (x86), 128 B (Apple M)
  • Invalidation cost (intra-socket)~30 cycles
  • Cross-socket coherence~100–300 cycles
  • L1 hit~4 cycles
  • MESI statesM, E, S, I

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

Why we need a protocol at all

An L1 data cache is private to one core. A line in core 0's L1 cache costs ~4 cycles to read; the same data fetched from main memory costs ~200. Without private caches, every memory access on every core would hit DRAM, and modern multicores would be impossibly slow.

The price of private caches is that the same address can live in many places. Suppose core 0 reads x into L1, then core 1 writes x in its own L1. If core 0 reads x again from L1, it sees the stale value. The hardware needs a protocol to detect this and either invalidate or update copies before the program can observe the inconsistency. That protocol is cache coherence.

MESI: the canonical 4-state protocol

Each line in each cache is in one of four states. Transitions are driven by local CPU operations and by snooped messages from peer caches.

StateMeaningOther caches?Memory in sync?
ModifiedThis cache has the only copy, dirtyNoneNo (must writeback)
ExclusiveOnly copy, cleanNoneYes
SharedMultiple clean copies existOne or moreYes
InvalidLine is stale; must refetch on access

The state diagram in plain prose:

  • A core reading a line not in its cache snoops the bus. If no peer has it: I → E. If peers have it Shared: I → S, peers stay S.
  • A core writing a line in S broadcasts an "Invalidate" — peer S → I, local S → M.
  • A core writing a line in E upgrades silently: E → M. No bus traffic. (This is why hot per-thread variables are fast.)
  • A peer's read of a line you hold in M: you supply the data and writeback. M → S (or M → O in MOESI).
  • Eviction of a line in M writes back to memory.

MESI variants — MOESI, MESIF, and the rest

ProtocolStatesUsed byWhy
MSIM, S, ITextbooksSimplest correct protocol; over-broadcasts compared with MESI
MESIM, E, S, IIntel pre-NehalemAdds Exclusive — silent E→M upgrade for thread-private data
MOESIM, O, E, S, IAMD, ARMOwned state lets a dirty line be shared without writing back
MESIFM, E, S, I, FIntel Nehalem+, QPI/UPIForward state — exactly one Shared copy is the responder, avoiding redundant supply
Dragon / FireflyUpdate-based variantsHistorical RISCUpdates rather than invalidates; rare today
Directory-basedVariableNUMA, multi-socketPer-line ownership directory; scales beyond shared-bus snoop

Real CPUs blur these labels. Modern Intel server chips combine snooping with a directory in the L3 to track which cores hold each line; AMD's Infinity Fabric uses MOESI on top of a directory for cross-CCD traffic. The textbook protocol is rarely what's actually shipping.

False sharing: the most common coherence bug in user code

Coherence operates on cache lines (64 bytes on x86, 128 on Apple Silicon), not on individual variables. If two unrelated variables sit on the same line and two cores hammer them, every write invalidates the peer's line — even though, logically, the variables are independent. The line ping-pongs between caches.

Demonstrate it with a trivial benchmark in C:

// false_sharing.c — gcc -O2 -lpthread false_sharing.c -o fs
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>

#define N 100000000
struct shared { uint64_t a; uint64_t b; };               // same line
struct padded { uint64_t a; char _[56]; uint64_t b; };   // separate lines

void* hammer(void* p) { volatile uint64_t* x = p; for (int i=0;i<N;i++)(*x)++; return 0; }

static double bench(uint64_t* a, uint64_t* b) {
  struct timespec t0, t1; pthread_t t;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  pthread_create(&t, 0, hammer, a); hammer(b); pthread_join(t, 0);
  clock_gettime(CLOCK_MONOTONIC, &t1);
  return (t1.tv_sec-t0.tv_sec)+(t1.tv_nsec-t0.tv_nsec)/1e9;
}

int main(void) {
  struct shared s = {0,0}; struct padded q = {0,{0},0};
  printf("shared: %.2f s\npadded: %.2f s\n", bench(&s.a,&s.b), bench(&q.a,&q.b));
}

Typical result on a Skylake desktop: shared ~3.5 s, padded ~0.5 s. A 6–10× slowdown from a 56-byte difference in struct layout.

The fix is to align hot, mutable, per-thread fields to a full cache line:

// C++17
struct alignas(64) Counter { std::atomic<uint64_t> value; };
Counter counters[NUM_THREADS];   // each on its own line

// C / GCC
struct __attribute__((aligned(64))) Counter { _Atomic uint64_t value; };

The same principle in Python uses numpy arrays with strided layouts, or simply enough padding when sharing memory between processes via multiprocessing.shared_memory:

import numpy as np
# Each thread's counter on its own 64-byte line
arr = np.zeros(num_threads * 8, dtype=np.uint64)   # 8 uint64 = 64 bytes
# Thread i writes to arr[i*8], leaving arr[i*8+1..i*8+7] as padding

JavaScript on shared memory follows the same rule. Inside a SharedArrayBuffer shared between Workers, two atomic counters at consecutive offsets ping-pong. Keep at least 64 bytes of stride:

// Bad: counters at offsets 0 and 8 — same cache line
const sab = new SharedArrayBuffer(16);
const view = new BigUint64Array(sab);
// Good: stride by 8 BigUint64s = 64 B per counter
const padded = new BigUint64Array(new SharedArrayBuffer(16 * 8));
const counterFor = i => i * 8;

Snoopy vs directory protocols

Snoopy

Every cache controller listens on a shared bus or ring. When core 0 writes a line, the message reaches every other cache in the same broadcast domain; each one checks if it holds a copy and reacts. Simple, low-latency for small core counts. The catch: bandwidth scales linearly with cores while bus capacity does not. Past about 16 cores per domain, snoopy breaks.

Directory

A directory entry per memory line records which caches hold copies and in what state. On a write, only the involved caches receive point-to-point invalidate messages. Scales to hundreds of cores. Cost: an extra hop through the directory (often co-located with the L3 home agent), and substantial silicon for the directory itself. Modern AMD EPYC and Intel Xeon server chips are directory-based; consumer multicores still mostly snoop within a die.

Costs you should remember

  • Hit in L1, line in E or M: ~4 cycles. The fast path. No bus traffic.
  • Read upgrade I → S (line in another core's L1): ~30 cycles within socket.
  • Write upgrade S → M (Read-For-Ownership broadcast): ~30 cycles within socket; ~150 cycles cross-socket.
  • Cache-line ping-pong: two cores writing the same line take ~30 cycles per round trip — sustained throughput collapses to bus-limited. A counter under contention is roughly 100× slower than the same counter unshared.
  • Cross-socket NUMA coherence: the worst case at 100–300 cycles; this is why pinning hot threads to one socket can double throughput.

Common pitfalls

  • False sharing of hot counters. Per-thread stats arrays without padding. Use alignas(64) or pad to a full line.
  • Atomics aren't free. std::atomic<int> increments cost ~30 cycles when contended, ~5 uncontended. The keyword doesn't change the coherence cost.
  • Read-modify-write of large structs. Modifying one field invalidates the whole line — every other field on that line is now cold for any other reader.
  • Flag-plus-data on one line. A "ready" flag adjacent to the data it advertises is fine on x86 (TSO) but races on weakly ordered ARM. Use explicit barriers or split lines.
  • NUMA-blind allocation. A page first-touched on socket 0 lives in socket 0; threads on socket 1 pay cross-socket coherence forever. Use numactl --interleave.
  • Cache-line size assumptions. Apple M-series uses 128-byte lines. alignas(64) isn't enough — use std::hardware_destructive_interference_size.

When to care

If you're writing typical application code, the answer is "rarely" — the JIT, allocator, and language runtime hide it. If you're writing lock-free data structures, high-throughput counters, NUMA-aware servers, or anything that benchmarks at 100s of millions of ops/sec, cache coherence becomes the dominant cost model. Profile with perf c2c on Linux to see exactly which lines are bouncing between which cores.

Frequently asked questions

What does cache coherence guarantee?

That every core sees a single, agreed-upon value for any given memory address — eventually. A write by one core will be observed by reads from every other core, in some order consistent across the system. Coherence does not by itself guarantee a particular ordering between writes to different addresses; that's the job of the memory consistency model (TSO on x86, weaker on ARM).

What is MESI?

MESI is the canonical 4-state cache coherence protocol: Modified (this core has the only dirty copy), Exclusive (only copy, clean), Shared (multiple clean copies exist), Invalid (line is stale, must refetch). Each state transition responds to local reads/writes and to bus messages from peer caches.

What is false sharing?

Two threads modify two different variables that happen to live on the same 64-byte cache line. From the CPU's perspective, every write invalidates the line in the other core's cache, even though logically the variables are independent. Throughput collapses to memory-bus speed. Pad hot per-thread variables to a full cache line, or use alignas(64) / __cacheline_aligned.

What's the difference between snoopy and directory protocols?

In a snoopy protocol every cache listens to a shared bus and reacts to read/write transactions from peers. Simple and fast for small core counts but doesn't scale past ~16 cores because the bus becomes a bottleneck. Directory protocols keep a directory entry per memory line that tracks which caches hold copies, so messages are sent point-to-point. Used in big-iron NUMA systems and modern AMD/Intel server interconnects.

Roughly how expensive is a cache-line invalidation?

Within a single socket: about 30–40 cycles when the source line is in another core's L1 (about 10 ns at 3.5 GHz). Cross-socket NUMA traffic costs 100–300 cycles. Compare with an L1 hit at ~4 cycles and you can see why bouncing a hot cache line ('cache-line ping-pong') is one of the most punishing performance pathologies in concurrent code.

What does MOESI add over MESI?

MOESI adds an Owned state for AMD and ARM systems: the line is dirty (modified) but other caches may also hold (potentially stale) Shared copies. The owner is responsible for writeback. It saves a write-to-memory hop compared with MESI's 'must transition to Modified, then writeback before sharing.'