Computer Architecture
MESI Cache Coherence Protocol
Four states, one shared truth — how multicore caches avoid lying to each other
MESI is the cache coherence protocol that keeps multicore CPU caches consistent by tagging every cache line as Modified, Exclusive, Shared, or Invalid, and snooping the bus to drive transitions as cores read and write.
- States per line4 (M, E, S, I)
- Coherence granularity1 cache line (64 B)
- Invalidation policywrite-invalidate
- Snoop bus latency≈ tens of cycles
- State tag overhead2 bits per line
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The problem: four cores, one number, four answers
Every core on a modern CPU has its own private L1 (and usually L2) cache. That's the whole point — a core should never have to cross the chip to fetch a value it just used. But it creates an immediate hazard. If core 0 caches the variable balance = 100 and core 1 also caches it, and then core 0 writes balance = 90 into its cache, core 1 is now holding a stale 100. Without intervention, the two cores disagree about the value of a single memory address. Money disappears.
MESI is the hardware state machine that prevents this. It was introduced by Mark Papamarcos and Janak Patel in 1984 (the "Illinois protocol," after the University of Illinois) and is, in some descendant form, in nearly every multicore chip shipping today. The core idea: a cache line can never be in a state where two caches both believe they may write it, and a write to a line that anyone else holds must first knock out their copies.
The unit of coherence is not the variable — it's the cache line, a fixed-size block (64 bytes on x86-64 and most ARM64) that the cache loads and tracks as one indivisible unit. Every line in every cache carries two state bits identifying which of the four MESI states it's in.
The four states
Each state is a precise claim about (a) whether this cache holds the only copy, and (b) whether the copy matches main memory.
| State | Others may hold it? | Matches memory? | Can write silently? | Meaning |
|---|---|---|---|---|
| Modified | No | No (dirty) | Yes | This cache owns the only valid copy and has changed it. Memory is stale; this cache must write back before anyone else reads. |
| Exclusive | No | Yes (clean) | Yes → becomes M | This cache holds the only copy, unchanged. It can write without telling anyone, silently transitioning to Modified. |
| Shared | Yes | Yes (clean) | No | One of possibly several read-only copies. To write, must first invalidate every other copy. |
| Invalid | — | — | No | The line is empty / not present. A read or write here is a cache miss. |
The two invariants that make MESI correct fall straight out of this table:
- Single-writer invariant: at most one cache can hold a line in M or E. Those are the only states that permit a write, so two simultaneous writers are structurally impossible.
- Clean-shared invariant: a line in S is guaranteed to match memory, so any sharer can be dropped at any time without a write-back.
How transitions actually fire
Transitions are driven by two kinds of events. Processor events are what the local core does: PrRd (read) and PrWr (write). Bus events are what a cache observes another core do, by snooping the shared bus: BusRd (someone wants to read), BusRdX (someone wants to read-with-intent-to-modify), and BusUpgr (someone holding S wants to promote to M).
The decisive moment is a write. When a core writes a line it holds in S, it must broadcast a BusUpgr; every other cache snoops it and drops its copy to Invalid. This is write-invalidate: the writer wins by killing all other copies rather than pushing the new value to them (that alternative, write-update, exists but lost — broadcasting every byte is more traffic than the occasional re-read).
The walk through the canonical sequence:
Core 0 reads X (no one else has it) → I → E (only copy, clean)
Core 0 writes X → E → M (silent! no bus traffic)
Core 1 reads X → BusRd snooped by C0 → C0: M → S, writes back; C1: I → S
Core 1 writes X → BusUpgr/BusRdX → C0: S → I; C1: S → M
Core 2 reads X → BusRd → C1: M → S, writes back; C2: I → S
Notice the E → M step costs nothing on the bus. That single optimization is the entire reason MESI exists rather than the simpler three-state MSI. A private counter that one thread reads then increments stays entirely off the bus after its first load.
Also notice the M → S step on core 0 triggers a write-back (or a cache-to-cache forward): the dirty value has to reach the new reader and memory before the reader is allowed to see it. This is the moment a "free" read by one core forces a memory transaction caused by a write on another.
Why every architect lives with it
MESI isn't something you choose — if you write multithreaded code on a coherent shared-memory machine (which is essentially all mainstream CPUs), you are already running on it. What you choose is how to work with it:
- Keep per-thread data on separate cache lines. The single biggest performance lever. Padding hot per-thread counters to 64 bytes turns a ping-ponging shared line into independent E/M lines.
- Prefer read-mostly sharing. A line read by many cores happily sits in S across all of them with zero bus traffic. It's writes that serialize.
- Batch writes to shared state. A thousand increments to a shared atomic is a thousand round trips through M; a local accumulate-then-one-write is one.
- Understand that atomics aren't free. A
compare-and-swapon a contended line is a full RFO (read-for-ownership = BusRdX) plus the write-back of whoever held it — tens to hundreds of cycles, not the single cycle of an uncontended store.
MESI vs its variants
| MSI | MESI | MOESI | MESIF | Dragon (write-update) | |
|---|---|---|---|---|---|
| States | M, S, I | M, E, S, I | M, O, E, S, I | M, E, S, I, F | E, Sc, Sm, M |
| Read-then-write a private line | 2 bus ops | 1 bus op | 1 bus op | 1 bus op | 1 bus op |
| Dirty line shared without write-back | No | No (must flush to S) | Yes (Owned state) | No | Yes (updates pushed) |
| Who answers a read of a shared line | memory | memory (or any sharer) | owner | exactly one (Forwarder) | any sharer |
| Invalidate vs update | invalidate | invalidate | invalidate | invalidate | update |
| Real-world use | teaching baseline | classic Intel, many ARM | AMD64, some ARM | Intel QPI / UPI multi-socket | rare (high write traffic) |
MOESI adds Owned: a dirty line can be shared with other caches while one cache stays responsible for it, so a write-back to memory is deferred until the line is finally evicted. MESIF adds Forward: when many caches hold a line in S, exactly one is tagged F and is the sole responder to a new read request, so the requester isn't buried under N identical replies. Both are MESI plus one state to cut traffic in the case MESI handles bluntly.
What it costs in real cycles
- L1 hit: about 4 cycles (≈1 ns at 4 GHz). This is the case MESI is fighting to preserve.
- Coherence miss / cache-to-cache transfer: roughly 40–80 cycles on the same die when another core has the line in M and must forward it — an order of magnitude slower than an L1 hit, and that's the good case (same socket).
- Cross-socket coherence miss: 100–300+ cycles once the request crosses a socket interconnect (UPI / Infinity Fabric) — the reason NUMA-aware placement matters.
- False sharing: documented slowdowns of 3× to over 100× on tight loops, depending on contention. Two threads incrementing two ints in the same struct can run slower than one thread doing both, because the line ping-pongs Modified between cores every iteration.
- Tag overhead: 2 bits per 64-byte line = 0.39% of cache capacity spent on state — a rounding error against the bandwidth it saves.
JavaScript model of the state machine
A faithful simulator of the MESI transition table for a single line shared across N caches. It tracks per-cache state and drives the global invariants — exactly what the visualization above animates.
const M = 'M', E = 'E', S = 'S', I = 'I';
class MesiLine {
constructor(numCaches) {
this.state = Array(numCaches).fill(I); // per-cache state for ONE line
this.bus = []; // log of bus transactions
}
// Is any OTHER cache holding a valid (non-I) copy?
othersHold(self) {
return this.state.some((s, c) => c !== self && s !== I);
}
read(c) {
if (this.state[c] !== I) return; // hit: M/E/S all serve reads
const shared = this.othersHold(c);
this.bus.push(`C${c} BusRd`);
// Any owner in M must write back / forward before downgrading to S.
this.state = this.state.map((s, i) =>
i !== c && (s === M || s === E) ? S : s);
this.state[c] = shared ? S : E; // alone ⇒ E, else S
}
write(c) {
if (this.state[c] === M) return; // already writable, silent
if (this.state[c] === E) { this.state[c] = M; return; } // E→M silent upgrade
// S→M needs BusUpgr; I→M needs BusRdX. Either way: invalidate everyone else.
this.bus.push(`C${c} ${this.state[c] === S ? 'BusUpgr' : 'BusRdX'}`);
this.state = this.state.map((s, i) => (i === c ? M : I));
}
// Coherence invariant check: never two writers, M/E are unique.
assertCoherent() {
const writable = this.state.filter(s => s === M || s === E).length;
if (writable > 1) throw new Error('MESI violated: multiple owners');
return true;
}
}
// Walk the canonical sequence
const line = new MesiLine(3);
line.read(0); // I→E (C0 alone)
line.write(0); // E→M (silent)
line.read(1); // C0 M→S (writeback), C1 I→S
line.write(1); // C1 S→M (BusUpgr), C0 S→I
console.log(line.state); // [ 'I', 'M', 'I' ]
line.assertCoherent(); // true
The key correctness details: a read that finds the line in I checks whether anyone else holds it to decide between E (alone) and S (shared); and any write from S or I invalidates every other cache's copy in a single sweep, which is the write-invalidate policy made literal.
Python: the full transition table
Same machine, expressed as an explicit (state, event) → (new_state, bus_action) table — closer to how a textbook draws the MESI state diagram and how a verifier would model it.
from enum import Enum
class St(Enum):
M = 'Modified'; E = 'Exclusive'; S = 'Shared'; I = 'Invalid'
# Local-processor transitions: (state, op, others_hold) -> (new_state, bus_op)
def local(state, op, others_hold):
if op == 'read':
if state != St.I:
return state, None # hit, no bus
return (St.S, 'BusRd') if others_hold else (St.E, 'BusRd')
if op == 'write':
if state == St.M:
return St.M, None # already owner
if state == St.E:
return St.M, None # silent E -> M
if state == St.S:
return St.M, 'BusUpgr' # invalidate sharers
return St.M, 'BusRdX' # I -> M, read-for-ownership
# Snooped-bus transitions: how an OBSERVER reacts to another core's bus op
def snoop(state, bus_op):
if bus_op == 'BusRd': # someone wants to read
if state == St.M: return St.S, 'Flush' # write back / forward dirty
if state == St.E: return St.S, None
return state, None # S stays S, I stays I
if bus_op in ('BusRdX', 'BusUpgr'): # someone wants to write
action = 'Flush' if state == St.M else None
return St.I, action # we must invalidate
class Cache:
def __init__(self): self.state = St.I
class System:
def __init__(self, n): self.caches = [Cache() for _ in range(n)]
def others_hold(self, me):
return any(c.state != St.I for i, c in enumerate(self.caches) if i != me)
def op(self, core, kind):
others = self.others_hold(core)
new, bus = local(self.caches[core].state, kind, others)
if bus: # broadcast to every other cache
for i, c in enumerate(self.caches):
if i != core:
c.state, _ = snoop(c.state, bus)
self.caches[core].state = new
# invariant: at most one M/E across the system
writers = sum(c.state in (St.M, St.E) for c in self.caches)
assert writers <= 1, "coherence violated"
return bus
sys = System(3)
for core, kind in [(0,'read'),(0,'write'),(1,'read'),(1,'write'),(2,'read')]:
bus = sys.op(core, kind)
print(core, kind, '->', [c.state.name for c in sys.caches], '| bus:', bus)
Separating local() (what the owning core does) from snoop() (how observers react) mirrors the real hardware split: the processor side issues requests, and every other cache controller independently decides its reaction by watching the bus.
Variants and the messy reality
MOESI (AMD). Adds Owned — a line can be dirty and shared. The owner holds the canonical dirty copy and supplies it to readers cache-to-cache, deferring the write-back to memory until eviction. Saves a memory write whenever a modified line is read by another core.
MESIF (Intel). Adds Forward — among several S-state sharers, exactly one is the designated forwarder. Without it, a read request to a widely-shared line can be answered by every sharer at once, wasting interconnect bandwidth. F is a "Shared, and it's your job to reply" tag.
Snooping vs directory. MESI is the state machine; how coherence requests are routed is orthogonal. Snooping broadcasts on a shared bus — dead simple, but the bus is a single serialization point that saturates past roughly 8–16 cores. A directory stores, per line, a bitvector of which caches hold it and sends targeted messages, scaling to hundreds of cores at the cost of directory SRAM and an extra hop of latency. Big servers are directory-based; small dies often still snoop.
Store buffers and invalidate queues. Real hardware doesn't block the core while a BusRdX completes. Writes drain into a store buffer and invalidations queue up to be applied later. This makes MESI faster but means a core can see its own writes before others do — which is precisely why x86 is TSO and why you need memory barriers, not just coherence, to reason about ordering.
Common misconceptions and pitfalls
- Thinking coherence equals consistency. MESI guarantees every core agrees on each address eventually and serializes per-address writes. It says nothing about ordering across different addresses. That's the memory model — and store buffers in front of MESI are exactly why you still need fences.
- Ignoring false sharing. The variables don't have to be related. Two unrelated globals laid out 8 bytes apart share a 64-byte line, and two cores writing them ping-pong the line as if they were fighting over the same data. Pad to a line, or use per-CPU storage.
- Assuming reads are always free. A read that downgrades another core's M-state line forces that core to flush. The reader is cheap; the write-back it triggers on the neighbor is not.
- Forgetting the E → M silent transition. When modeling or reasoning, a write to an Exclusive line emits no bus traffic. Treating every write as a bus event overcounts coherence cost on private data dramatically.
- Confusing the protocol with the topology. "MESI doesn't scale" usually means "snooping doesn't scale." The same four states run fine over a directory to hundreds of cores.
- Benchmarking a single thread and extrapolating. Coherence cost is invisible until contention appears. A microbenchmark with no sharing will never show the cliff a contended atomic falls off in production.
Frequently asked questions
What do the four letters in MESI stand for?
Modified, Exclusive, Shared, and Invalid — the four possible states of any cache line in a core's private cache. Modified means this cache owns the only valid copy and it differs from memory (it's dirty). Exclusive means this cache holds the only copy but it matches memory (clean). Shared means several caches may hold identical clean copies. Invalid means the line holds no usable data.
Why does MESI add an Exclusive state instead of just using MSI?
Exclusive lets a core write to a line it already holds clean and alone without broadcasting anything on the bus. Under plain MSI, every first write to a freshly-read line must issue a bus upgrade even when no other cache holds the line. The Exclusive optimization silently promotes E to M, eliminating that bus transaction in the very common read-then-write pattern of private variables.
What is the difference between a snooping and a directory protocol?
MESI is the state machine; snooping and directories are two ways to implement it. Snooping broadcasts every coherence request on a shared bus and each cache watches (snoops) it — simple but the bus saturates past 8 to 16 cores. A directory keeps a per-line list of which caches hold a copy and sends point-to-point messages only to them, scaling to hundreds of cores at the cost of directory storage and lookup latency.
How does MESI cause false sharing?
Coherence tracks whole cache lines (64 bytes on x86 and ARM), not individual variables. If two cores write to two different variables that happen to land on the same line, each write invalidates the other core's copy even though the data never overlaps. The line ping-pongs Modified between caches, turning what looks like independent work into a serialized cache-miss storm — often a 10× to 100× slowdown.
Does MESI guarantee a memory consistency model like sequential consistency?
No. MESI guarantees coherence — every core eventually agrees on a single value per address, and writes to one address are serialized. It says nothing about the ordering of writes to different addresses, which is the memory consistency model's job. Store buffers and invalidate queues sitting in front of MESI are exactly why x86 is TSO rather than sequentially consistent, and why ARM is weaker still.
What replaced MESI in modern Intel and AMD chips?
Extensions, not replacements. Intel uses MESIF, adding a Forward state so exactly one sharer answers read requests instead of all of them flooding the requester. AMD uses MOESI, adding an Owned state so a dirty line can be shared without first writing it back to memory. Both keep the MESI core and add one state to cut bus or interconnect traffic.