Concurrency

Memory Consistency Models

The contract that decides whether your two threads agree on what happened

A memory consistency model is the contract between hardware and software defining which reorderings of reads and writes one thread may observe in another, ranging from strict sequential consistency to relaxed models such as weak ordering and release-acquire.

  • Strongest: Linearizability
  • Textbook ideal: Sequential consistency
  • x86 hardware: TSO
  • ARM / POWER: Weak / relaxed
  • SC cost vs relaxed: 2×–10×

Why two threads can disagree about the past

Write a simple flag-and-data handshake and reason about it the way you learned in school: line by line, top to bottom. Thread 1 sets data = 42, then sets ready = 1. Thread 2 spins until it sees ready == 1, then reads data. Obviously it reads 42. Except that it can read 0, the stale value: the compiler may reorder the two plain stores, and weakly ordered hardware such as ARM or POWER may make them visible out of order, often enough to corrupt production systems. The reason is that there is no single global view of "the program" that both threads observe. Each thread sees the other threads' memory operations through a model that permits certain reorderings.

A memory consistency model (or memory model) is the precise specification of which reorderings are allowed. It answers one question: given the writes one thread performs, what set of values may another thread's reads legally return? Everything else — store buffers, out-of-order execution, cache hierarchies, compiler optimizations — is an implementation detail hidden behind that contract. The model is the only thing a portable program is allowed to depend on.

There are really two memory models stacked on top of each other, and both must hold for code to be correct. The hardware memory model describes what the CPU may reorder (x86's Total Store Order, ARM and POWER's weak ordering). The language memory model describes what the compiler may reorder and what the source language promises (the C++11 model, the Java Memory Model of 2004, Go, Rust). The compiler is free to reorder anything the language model permits, and the CPU is free to reorder anything the hardware model permits, so the effective order you observe is the composition of both.

The hierarchy: from sequential consistency down to weak

Leslie Lamport defined the canonical strong model in 1979. A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors were executed in some single sequential order, and the operations of each individual processor appear in that sequence in the order specified by its program. Two clauses: (1) a single total order exists, and (2) it respects each thread's program order. There is no notion of real time — only the existence of a legal interleaving.

Linearizability (Herlihy and Wing, 1990) strengthens this with a real-time axiom: if operation A completes before operation B begins in wall-clock time, A must precede B in the total order. Sequential consistency permits a read to return a value that the wall clock says is already overwritten; linearizability forbids it. Linearizability is also composable — combine two linearizable objects and the whole is still linearizable — which sequential consistency is not, and that is why concurrent data-structure correctness is almost always stated as linearizability.

Below sequential consistency sit the relaxed models, each named for the reordering it permits. The classic taxonomy is by which store→load, store→store, load→load, and load→store pairs may be reordered:

  • Total Store Order (TSO) — what x86 actually provides. Each core has a FIFO store buffer; its own stores can be delayed past its own later loads (store→load reordering), but nothing else is reordered and all cores agree on the order of stores. This single relaxation is what breaks the naive Dekker / store-buffer litmus test.
  • Partial Store Order (PSO) — also lets independent stores to different locations reorder (store→store). Historical SPARC mode.
  • Release Consistency / weak ordering — ARM, POWER, RISC-V. Almost everything reorders freely between synchronization points; you regain order only by issuing explicit fences or release/acquire operations.
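
The store-buffer litmus test named in the TSO bullet can be sketched in C++ as follows. This is a minimal illustration rather than a measurement harness; the helper name run_store_buffer_once and the variables x, y, r1, r2 are assumptions made for the example. Every access uses memory_order_seq_cst, under which the outcome r1 == 0 and r2 == 0 is forbidden. Weakening all four accesses to memory_order_relaxed would permit it, which is how the store→load relaxation shows up at the language level.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical helper: runs one round of the store-buffer (Dekker-style)
// litmus test and returns the values loaded by the two threads.
std::pair<int, int> run_store_buffer_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);   // store own flag
        r1 = y.load(std::memory_order_seq_cst);  // then load the other flag
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);   // store own flag
        r2 = x.load(std::memory_order_seq_cst);  // then load the other flag
    });
    t1.join();
    t2.join();  // join() orders the writes to r1 and r2 before the return below
    return {r1, r2};
}
```

Because seq_cst places all four operations in one total order, whichever load comes last in that order must see the other thread's store, so at least one of the returned values is 1.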

The crucial modern abstraction is the happens-before relation. It is the partial order built from program order within a thread plus synchronization edges between threads (a release store paired with an acquire load that reads it, or a lock release paired with the next acquire of the same lock). If two conflicting accesses are not ordered by happens-before, you have a data race. In C and C++ a data race is undefined behavior: the model makes no promise at all. Java still defines the behavior of racy programs, but so weakly that reads may return stale or surprising values. Correctness is therefore reframed: don't ask "what order do my writes appear in," ask "is there a happens-before edge."
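
The lock-based synchronization edge can be illustrated with a short C++ sketch. The function name guarded_sum and its parameter are assumptions for the example. The counter is a plain long with no atomics; the only thing ordering the two threads' accesses is that each unlock of the mutex happens-before the next lock of it.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical example: a plain (non-atomic) counter guarded by a mutex.
long guarded_sum(int iterations) {
    long total = 0;   // plain variable, deliberately not atomic
    std::mutex m;

    auto worker = [&] {
        for (int i = 0; i < iterations; ++i) {
            std::lock_guard<std::mutex> lock(m);  // lock: acquires after the previous unlock
            ++total;                              // ordered by happens-before, so no data race
        }                                         // lock_guard unlocks at the end of each iteration
    };

    std::thread a(worker), b(worker);
    a.join();
    b.join();
    return total;
}
```

Calling guarded_sum(10000) yields 20000 on every run. Removing the lock_guard would leave the two increments unordered, which is a data race on total.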

Choosing an ordering in practice

  • Reach for seq_cst first. It is the C++ default, it is the only model most people can reason about correctly, and the cost is invisible until profiling says otherwise. Premature relaxation is the single biggest source of broken lock-free code.
  • Use release/acquire to publish data. The producer writes the payload, then a release store flips the flag; the consumer does an acquire load of the flag, and now sees the payload. This is the correct, portable replacement for the broken naive handshake above.
  • Use relaxed only for standalone counters. A statistics counter or a reference count's increment (not the final decrement) needs atomicity but publishes no other state, so memory_order_relaxed is safe and fast.
  • Use a full fence sparingly. A standalone atomic_thread_fence(seq_cst) is the heavy hammer for the rare store-buffer pattern that release/acquire cannot express.
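
As a sketch of the relaxed-counter case from the list above, the following C++ counts events from several threads. The name count_events and its parameters are assumptions for the example. The increments need atomicity but publish no other data, so memory_order_relaxed is sufficient; the final read is made safe by join(), not by the ordering on the counter.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical statistics counter: atomicity matters, ordering against
// other data does not.
long count_events(int threads, int events_per_thread) {
    std::atomic<long> hits{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&hits, events_per_thread] {
            for (int i = 0; i < events_per_thread; ++i) {
                hits.fetch_add(1, std::memory_order_relaxed);  // atomic RMW, no ordering edge
            }
        });
    }
    for (auto& th : pool) th.join();  // join() supplies the happens-before edge for the final read
    return hits.load(std::memory_order_relaxed);
}
```

No increment is lost because fetch_add is a single read-modify-write, so count_events(4, 25000) returns 100000. If the counter were also used as a "data is ready" flag, relaxed would no longer be enough.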

The models compared

| Property | Linearizable | Sequential (SC) | TSO (x86) | Release/Acquire | Relaxed | Weak (ARM) |
|---|---|---|---|---|---|---|
| Single total order? | Yes + real time | Yes | Stores only | No | No | No |
| Respects program order? | Yes | Yes | Except store→load | At sync points | Per-location only | Only with fences |
| Store→load reorder? | No | No | Yes | Yes | Yes | Yes |
| Composable? | Yes | No | No | No | No | |
| Typical cost | Highest | High (full fence) | Cheap loads, fenced RMW | 1 cheap barrier | Cheapest | Cheapest + manual fences |
| Where it lives | Concurrent DS spec | Reasoning ideal | x86-64 hardware | C++/Java atomics | C++ counters | ARM/POWER/RISC-V HW |

The headline relationship: linearizable ⊂ sequentially consistent ⊂ TSO ⊂ relaxed in terms of how much behavior each one permits. Stronger models permit fewer executions and are easier to reason about; weaker models permit more executions and run faster. The language models (release/acquire, relaxed) are how you opt into a chosen point on that spectrum independent of the hardware underneath.

What the numbers actually say

  • x86 store→load latency is the whole game. A store sits in the per-core store buffer (typically 56–72 entries on recent Intel and AMD cores) and can retire before it reaches L1, while later loads proceed. A mfence or lock-prefixed instruction drains it, costing on the order of 20–40 cycles even when uncontended.
  • seq_cst vs relaxed on a hot atomic: 2×–10×. On x86, a relaxed load is a plain mov; a seq_cst store compiles to xchg (an implicit lock) or store+mfence. On ARM the gap is wider: ARMv7 needs dmb ish barriers for seq_cst, and AArch64 uses ldar/stlr, where a relaxed access is a plain ldr/str with no barrier at all.
  • Naive flag handshakes fail at measurable rates. The classic store-buffer litmus test (two threads each storing then loading the other's flag) returns the "impossible" r1 == r2 == 0 outcome on real x86 hardware — not rarely, but reliably enough to be the first thing the litmus7 tool demonstrates.
  • One acquire on AArch64 is one ldar. Release/acquire maps to a single dedicated instruction per access (stlr/ldar) rather than a general dmb fence, which is why publishing data through release/acquire is much cheaper than bracketing it with full fences.

JavaScript: the store-buffer race, made observable

JavaScript's SharedArrayBuffer + Atomics implement a memory model closely modeled on C++11. Plain (non-atomic) writes to a shared buffer may be reordered and cached; atomic operations create the happens-before edges. The snippet below shows the broken handshake and its fix.

// shared = new Int32Array(new SharedArrayBuffer(8))
// index 0 = data, index 1 = ready

// ── BROKEN: plain writes, no ordering edge ──
function producerBad(shared) {
  shared[0] = 42;          // plain store — may be reordered after the next line
  shared[1] = 1;           // plain store — consumer may see this first
}
function consumerBad(shared) {
  while (shared[1] === 0) {} // plain load — may even be hoisted out of the loop!
  return shared[0];          // can legally observe 0
}

// ── CORRECT: the flag is the synchronization point ──
function producerGood(shared) {
  shared[0] = 42;                 // payload, written first in program order
  Atomics.store(shared, 1, 1);    // release-style publish of the flag
  Atomics.notify(shared, 1);      // wake any consumer parked in Atomics.wait
}
function consumerGood(shared) {   // run in a worker: browsers disallow Atomics.wait on the main thread
  while (Atomics.load(shared, 1) === 0) {  // acquire-style observe of the flag
    Atomics.wait(shared, 1, 0);            // park until notified; returns at once if the flag is not 0
  }
  return shared[0];               // happens-before guarantees we see 42
}

The fix does not change what is written — it changes the edge. Atomics.store followed by a matching Atomics.load establishes happens-before, so the prior plain write to shared[0] is guaranteed visible. Note also the broken consumer's while loop on a plain variable: the engine may cache shared[1] in a register and spin forever, the JavaScript equivalent of the missing-volatile bug.

C++: the same handshake with explicit orderings

C++11 atomics name each ordering explicitly, which makes them the canonical setting for the same handshake. (CPython, by contrast, has traditionally run under the GIL and exposes no fine-grained memory model to programs.) This is the message-passing idiom every memory model textbook starts with.

#include <atomic>
#include <thread>
#include <cassert>

int   payload = 0;              // plain, non-atomic
std::atomic<bool> ready{false}; // the synchronization variable

void producer() {
    payload = 42;                                  // (A) plain write
    ready.store(true, std::memory_order_release);  // (B) release: A cannot move after B
}

void consumer() {
    // (C) acquire: nothing after C can move before it
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    assert(payload == 42);  // guaranteed: A happens-before this read
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
}

Three orderings appear in practice. release on the store and acquire on the load form the happens-before edge that publishes payload safely. Had both used memory_order_relaxed, the assert could fire: the flag would still be atomic, but it would carry no ordering for the surrounding plain write. Had both used the default memory_order_seq_cst, the code would also be correct — just with a stronger, more expensive guarantee than this one-way handoff needs.

The notorious counter-example is independent reads of independent writes (IRIW): four threads, two writing distinct flags and two reading both flags in opposite orders. Under release/acquire the two readers can disagree about which write happened first; only seq_cst on every store and load forces them to agree, because only seq_cst supplies a single total order.
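
The IRIW shape can be written down as a small C++ sketch. The helper name iriw_readers_disagree and the variable names are assumptions for the example. All stores and loads use the default memory_order_seq_cst, so the disagreeing outcome is forbidden and the function should never return true; with release stores and acquire loads instead, the language model would allow it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: one round of IRIW with seq_cst on every access.
// Returns true if the two readers observed the two writes in opposite orders.
bool iriw_readers_disagree() {
    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0, r3 = 0, r4 = 0;

    std::thread w1([&] { x.store(1); });   // default ordering is seq_cst
    std::thread w2([&] { y.store(1); });
    std::thread rd1([&] { r1 = x.load(); r2 = y.load(); });  // reads x, then y
    std::thread rd2([&] { r3 = y.load(); r4 = x.load(); });  // reads y, then x
    w1.join(); w2.join(); rd1.join(); rd2.join();

    // Disagreement: rd1 saw x set but not y, while rd2 saw y set but not x.
    return r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0;
}
```

Whether the relaxed variant actually misbehaves depends on the hardware: x86's TSO keeps all cores in agreement on store order, so the disagreement is typically only observable on weaker machines such as POWER.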

Variants and real-world models worth knowing

The Java Memory Model (JSR-133, 2004). The first mainstream language model defined rigorously in terms of happens-before. volatile fields get release/acquire (and, since Java 5, seq_cst-style) semantics; final fields get special publication guarantees so safely-constructed immutable objects are visible without synchronization.

The C++11 / C11 model. Introduced the explicit memory_order enum and the sequenced-before / synchronizes-with / happens-before machinery, plus the troublesome memory_order_consume (dependency ordering) that no compiler implements well and is effectively deprecated.

Release Consistency (Gharachorloo et al., 1990). The academic model that split synchronization into acquire and release halves and directly inspired the C++ names. It powered the DASH multiprocessor and lazy-release-consistency software DSM systems.

Eventual / causal consistency. The distributed-systems cousins. Eventual consistency (Dynamo, many NoSQL stores) promises only that replicas converge if writes stop; causal consistency preserves the happens-before order of causally related operations while allowing concurrent ones to be seen in any order — the distributed analogue of release/acquire.

Linearizability vs serializability. Easy to confuse: linearizability is a guarantee about single objects under real time; serializability is a database transaction guarantee that a set of transactions is equivalent to some serial schedule, with no real-time clause. Strict serializability is their combination.

Common bugs and edge cases

  • Assuming sequential consistency on relaxed hardware. Code that "works on my x86 laptop" routinely breaks on ARM because TSO masked a missing barrier. Test on weak hardware or with a model checker, not just on Intel.
  • The double-checked locking bug. Publishing a pointer to a half-constructed object: if the pointer is stored with a plain or relaxed write, the consumer can see the non-null pointer before the constructor's writes (the payload). The fix is a release store of the pointer paired with an acquire load.
  • A data race is undefined behavior, not "a stale read." In C and C++, once you race a non-atomic, the compiler may fuse, tear, or invent reads. The bug is not a wrong value; it is that all bets are off. Make it atomic.
  • Treating volatile (C/C++) as a thread barrier. C/C++ volatile prevents compiler caching for memory-mapped I/O but imposes no inter-thread ordering and is not atomic. Use std::atomic. (Java volatile is the exception — it does order.)
  • Relaxed used where release/acquire was needed. A relaxed flag is atomic but publishes nothing; the consumer can see the flag set and still read stale payload. Relaxed is for self-contained counters only.
  • Forgetting that seq_cst is not mutual exclusion. Two threads each doing a seq_cst load, add, seq_cst store still lose updates. You need an atomic read-modify-write (compare-and-swap or fetch-add), not a stronger ordering.
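
The double-checked locking fix from the list above can be sketched as follows. The Config type, the globals g_instance and g_init_mutex, and the function get_config are all hypothetical names for the example. The object is fully constructed before the release store publishes the pointer, and the fast path reads the pointer with an acquire load.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

struct Config { int value = 42; };   // hypothetical payload type

std::atomic<Config*> g_instance{nullptr};
std::mutex g_init_mutex;

Config* get_config() {
    Config* p = g_instance.load(std::memory_order_acquire);   // fast path: acquire pairs with the release below
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(g_init_mutex);
        p = g_instance.load(std::memory_order_relaxed);       // re-check; the mutex already orders this load
        if (p == nullptr) {
            p = new Config();                                 // construct fully first (never freed in this sketch)
            g_instance.store(p, std::memory_order_release);   // then publish the pointer
        }
    }
    return p;
}
```

In new code a function-local static or std::call_once usually expresses the same intent with less room for error; the explicit version is shown only to make the release/acquire pairing visible.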

Frequently asked questions

What is the difference between sequential consistency and linearizability?

Both require a single global order of operations that respects each thread's program order. Linearizability adds a real-time constraint: if operation A finishes before operation B starts in wall-clock time, A must come before B in the global order. Sequential consistency has no such clock — it only preserves per-thread order, so a read can legally return a value that is already stale by the time it returns.

Why don't modern CPUs just provide sequential consistency?

Enforcing sequential consistency would force every store to drain the store buffer and every load to wait, killing the out-of-order and store-buffering optimizations that give CPUs most of their throughput. Measurements on x86 and ARM show full sequential consistency can cost 2× to 10× on synchronization-heavy code, so vendors expose relaxed models (TSO on x86, very weak ordering on ARM and POWER) and let software insert fences only where it actually needs them.

What is a data race and why is it undefined behavior in C++?

A data race is two accesses to the same non-atomic location from different threads, at least one a write, with no happens-before edge ordering them. The C++11 model declares a data race undefined behavior precisely so the compiler and CPU are free to reorder, fuse, and cache non-atomic accesses. (Java differs: the Java Memory Model keeps racy programs defined, though with very weak guarantees.) The fix is not a stronger model; it is to make the conflicting accesses atomic or to separate them with a synchronizing operation.

What does release-acquire ordering actually guarantee?

A release store and an acquire load of the same atomic variable that reads the stored value create a happens-before edge: everything the releasing thread wrote before the release becomes visible to the acquiring thread after the acquire. It is strictly weaker than sequential consistency because it does not impose a single total order across all atomics: two independent release-acquire pairs can be observed in different orders by different threads.

What is the difference between memory_order_seq_cst and memory_order_relaxed?

seq_cst is the default in C++ atomics: it gives release-acquire semantics plus a single total order over all seq_cst operations. On x86 a seq_cst store is usually compiled to a locked instruction or a store followed by a full fence, while a seq_cst load stays a plain mov. relaxed guarantees only atomicity and per-location modification order, with no ordering relative to other variables, so it is the cheapest, suitable for counters and flags where you do not publish other data through them.

Does a sequentially consistent model prevent all concurrency bugs?

No. Sequential consistency only removes surprising reorderings; it does not remove atomicity violations or races at the algorithm level. Two threads each running read-modify-write as separate load and store can still lose updates under perfect sequential consistency. You still need atomic read-modify-write instructions or locks for correctness — the model governs visibility and ordering, not mutual exclusion.
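
A compare-and-swap loop is one way to turn a separate load and store into a single atomic read-modify-write. The sketch below uses hypothetical names (cas_increment, run_two_incrementers) and keeps the default seq_cst ordering throughout, to underline that the ordering was never the problem.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: increment via an explicit compare-and-swap loop.
void cas_increment(std::atomic<int>& counter) {
    int seen = counter.load();
    // The exchange succeeds only if counter still equals `seen`.
    // On failure, compare_exchange_weak writes the current value into `seen` and we retry.
    while (!counter.compare_exchange_weak(seen, seen + 1)) {
    }
}

// Two threads increment the same counter; no update is lost.
int run_two_incrementers(int iterations) {
    std::atomic<int> counter{0};
    auto worker = [&] {
        for (int i = 0; i < iterations; ++i) {
            cas_increment(counter);
            // By contrast, counter.store(counter.load() + 1) is two separate
            // seq_cst operations, and another thread can slip in between them.
        }
    };
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    return counter.load();
}
```

For a plain increment, fetch_add does the same job in one call; the loop form matters when the new value is an arbitrary function of the old one.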