Concurrency

Memory Barriers (Fences)

Stop the CPU and compiler from rearranging your code across critical boundaries

A memory barrier (or fence) is an instruction or compiler directive that prevents the CPU or compiler from reordering memory operations across it. Modern CPUs (x86, ARM, POWER) and compilers freely reorder loads and stores for performance, but this breaks lock-free code that depends on a specific observation order. Barriers come in several flavors: full barriers (MFENCE on x86), load-only and store-only barriers (LFENCE and SFENCE on x86), one-way acquire and release barriers, and the C11/C++11 memory_order_* family. x86 implements TSO (Total Store Order), which is relatively strong by default; ARM and POWER are weakly ordered and require explicit barriers (dmb on ARM, sync or lwsync on POWER) or acquire/release instructions for safety.

Types: Acquire, release, full
x86 instructions: LFENCE, SFENCE, MFENCE
ARM: DMB ISH
POWER: SYNC
x86 ordering: TSO (strong)
ARM ordering: Weakly ordered

Why CPUs reorder in the first place

Imagine a CPU executing this:

store data, 42
store flag, 1

The naive expectation is that another core sees the writes in program order. But modern out-of-order processors don't execute in program order: they execute instructions when operands are ready and retire stores into a store buffer that drains to the L1 cache lazily. On weakly ordered architectures such as ARM and POWER, the two stores may become visible to other cores in either order (x86 keeps stores in order but lets later loads overtake them). Compilers, meanwhile, may swap independent statements to improve instruction scheduling or register allocation.

Single-threaded code never notices, because each thread sees its own writes in program order. But across threads, the consumer thread might observe flag == 1 before data == 42 — and read garbage.

Memory barriers tell the CPU and compiler: "Hold on. Don't move anything across this line."

x86: Total Store Order (strong by default)

x86 has a strong memory model called TSO (Total Store Order). The rules:

Loads are not reordered with other loads.
Stores are not reordered with other stores.
A load from address X is not reordered before an earlier store to the same address X (the load is satisfied from the store buffer).
BUT: a later load from X may be reordered before an earlier store to a different address Y. (The store sits in the store buffer while the later load, to a different cache line, completes from cache.)

That last rule is what makes x86 not sequentially consistent. The store-buffering litmus test, which is the core of Dekker's algorithm, exposes it: two threads each write their own flag and then read the other's flag. On x86 both can read 0, even though in any sequentially consistent execution at least one thread must read 1.
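The store-buffering test described above can be sketched in portable C++. This is a minimal illustration, not a rigorous litmus harness: the function name count_both_zero and its parameters are made up for this example. With seq_cst on every access the C++ memory model forbids the "both read 0" outcome. With release stores and acquire loads the outcome is allowed, although a harness this simple may rarely observe it because thread start-up time dominates each round.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Store-buffering (Dekker-style) litmus test.
// Runs `rounds` independent trials and returns how many ended with BOTH
// threads reading 0 from the other thread's flag.
int count_both_zero(int rounds,
                    std::memory_order store_order,
                    std::memory_order load_order) {
    // Stores may not use acquire-type orders; loads may not use release-type.
    assert(store_order == std::memory_order_relaxed ||
           store_order == std::memory_order_release ||
           store_order == std::memory_order_seq_cst);
    assert(load_order == std::memory_order_relaxed ||
           load_order == std::memory_order_acquire ||
           load_order == std::memory_order_seq_cst);

    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1, store_order);   // write my flag
            r1 = y.load(load_order);   // then read the other flag
        });
        std::thread t2([&] {
            y.store(1, store_order);
            r2 = x.load(load_order);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Calling count_both_zero(1000, std::memory_order_seq_cst, std::memory_order_seq_cst) must return 0 on any conforming implementation. Passing memory_order_release and memory_order_acquire instead removes that guarantee, which is the store-load reordering that TSO permits.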

x86 fences:

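SFENCE: store fence. Stores before SFENCE become globally visible before stores after it. Ordinary stores are already ordered under TSO, so SFENCE matters mainly for non-temporal (streaming) stores and write-combining memory.
LFENCE: load fence. Loads before LFENCE complete before loads after it. Rarely needed for ordering because loads aren't reordered with each other anyway.
MFENCE: full fence. Forbids store-load reordering. Used to make a Dekker-style algorithm correct.
LOCK-prefixed instructions (and XCHG with a memory operand, which is implicitly locked): act as full fences for ordinary memory.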

For most lock-free code on x86, only seq_cst stores need a fence, which compilers typically emit as either MOV followed by MFENCE or a single XCHG; acquire/release ordering compiles to plain MOV.

ARM and POWER: weakly ordered

ARM and POWER allow virtually any reordering not forbidden by data dependencies. Two unrelated loads on ARM can be reordered. Two unrelated stores can be reordered. Stores can pass loads. The hardware exploits this for scheduling flexibility.

This means lock-free code that happened to work on x86 can break when it runs on ARM. The Linux kernel had a multi-year bug-hunt in the 2010s as ARM-specific weak-memory bugs surfaced on Android phones.

ARM fences:

DMB ISH: Data Memory Barrier, inner shareable. Full memory barrier within the multi-core domain. The default lock-free fence.
DMB ISHLD: orders earlier loads before later loads and stores (acquire-style).
DMB ISHST: orders earlier stores before later stores only. It does not order earlier loads, so on its own it is weaker than a release barrier; a release fence needs DMB ISH.
LDAR / STLR: Load-Acquire / Store-Release (ARMv8). Single instructions that combine the load/store with the appropriate one-way barrier. Usually cheaper than a separate DMB.

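POWER uses SYNC (also written hwsync) for a full barrier, LWSYNC for acquire/release-class ordering (everything except store-load), and ISYNC, an instruction-stream barrier that is paired with a conditional branch to build acquire loads. The semantics are roughly analogous to ARM's family.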
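In portable C++ the closest analogue to a standalone DMB or LWSYNC is std::atomic_thread_fence. The sketch below is an illustration under that assumption; the function name fence_message_passing is invented here. It passes a value between two threads using relaxed atomics plus explicit fences. Compilers typically lower the release fence to dmb ish on ARM and lwsync on POWER, and to no instruction at all on x86, though the exact output depends on the compiler and target.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message passing with standalone fences instead of ordered atomics.
// Returns the payload value the consumer observed.
int fence_message_passing() {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};   // accessed with relaxed ordering only
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        // Release fence: the write above cannot move below the store that follows.
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Acquire fence: the read below cannot move above the load that saw true.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = payload;
    });

    producer.join();
    consumer.join();
    assert(ready.load(std::memory_order_relaxed));
    return seen;
}
```

A fence orders all earlier (or later) memory operations, not just one atomic, which is why it is the heavier tool. When only a single flag is involved, a release store and acquire load on the flag itself are usually the better choice.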

Acquire-release: the universal pattern

Almost every cross-thread synchronization can be expressed as an acquire-release pair. Producer:

data = 42;                                // ordinary (non-atomic) write
flag.store(1, std::memory_order_release); // release store: publishes the write above

Consumer:

while (flag.load(std::memory_order_acquire) == 0) { /* spin */ }
// the acquire load that read 1 synchronizes with the release store
std::printf("%d\n", data); // prints 42

The release barrier on the store guarantees the producer's prior write to data is visible to any thread whose acquire load sees flag == 1. The acquire barrier on the load guarantees the consumer's later read of data cannot be performed before the flag is read. Together they form a happens-before relationship, the foundation of every lock and lock-free algorithm.

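On x86 this compiles to plain MOVs (TSO already provides the ordering). On ARMv8 (AArch64), the store becomes STLR and the load becomes LDAR, single instructions with built-in one-way barrier semantics that are usually cheaper than a separate DMB. Older ARMv7 cores use a plain load or store combined with DMB ISH.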
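Put together as a complete program fragment, the producer/consumer handshake might look like the sketch below. The Message struct and the pass_message function are hypothetical names chosen for this illustration; the point is that several ordinary fields can be published behind a single release store.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical message type used only for this illustration.
struct Message {
    int id;
    int body;
};

// Publishes one Message from a producer thread to a consumer thread
// and returns the copy the consumer read.
Message pass_message() {
    Message slot{0, 0};          // ordinary, non-atomic data
    std::atomic<int> flag{0};    // 0 = empty, 1 = published
    Message received{-1, -1};

    std::thread producer([&] {
        slot.id = 7;                                // plain writes
        slot.body = 42;
        flag.store(1, std::memory_order_release);   // publish both fields
    });
    std::thread consumer([&] {
        while (flag.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();              // spin politely
        }
        received = slot;   // happens-after the producer's writes to slot
    });

    producer.join();
    consumer.join();
    assert(flag.load(std::memory_order_relaxed) == 1);
    return received;
}
```

Changing both orders to memory_order_relaxed would turn the read of slot into a data race, which is undefined behavior in C++ even on hardware where it happens to work.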

C++11 memory orders

C++11 standardized six memory orders so portable code doesn't have to know fence instructions:

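relaxed: atomicity only, no ordering. Cheapest.
consume: data-dependency ordering. In practice the major compilers treat it as acquire because tracking dependencies precisely is hard, and the standard has discouraged its use since C++17.
acquire: for loads (and the load half of a read-modify-write). Later reads and writes in the same thread cannot be reordered before it.
release: for stores (and the store half of a read-modify-write). Earlier reads and writes in the same thread cannot be reordered after it.
acq_rel: for read-modify-write operations. Combines both.
seq_cst: sequentially consistent. Acquire/release plus a single total order over all seq_cst operations. Strongest, and the default.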

The compiler emits the right barriers for each order on each architecture. Java has a similar (slightly stricter) model with volatile, AtomicX, and the JMM. The lesson: think in memory orders, not in fence instructions, unless you're writing inline assembly.
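As a small illustration of the weakest order in the list above, the sketch below uses memory_order_relaxed for a shared event counter. The function name count_relaxed is made up for this example. Relaxed is enough here because only the atomicity of each increment matters; no other data is being published through the counter.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each worker increments a shared counter `per_thread` times using
// relaxed ordering. Returns the final count.
long count_relaxed(int n_threads, int per_thread) {
    assert(n_threads > 0 && per_thread >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;

    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                // Atomic read-modify-write: no increment is lost,
                // but no ordering with other memory is implied.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();   // thread completion synchronizes with join()
    }
    return counter.load(std::memory_order_relaxed);
}
```

The final load can also be relaxed because join() already establishes happens-before with everything each worker did. If the counter were used as a "data is ready" signal, it would need release on the writer and acquire on the reader instead.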

Why barriers matter

Lock-free correctness. Every lock-free queue, stack, and hash map relies on acquire/release pairs to publish data safely.
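JVM volatile. Since the revised Java Memory Model (JSR-133, Java 5), a volatile field behaves essentially like a seq_cst atomic. Under the hood, the JIT emits a full fence on x86 (MFENCE or a LOCK-prefixed instruction) and DMB or LDAR/STLR on ARM where needed.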
Kernel locking. Linux uses smp_mb(), smp_rmb(), smp_wmb() macros that compile to architecture-specific fences. A missing barrier in spinlock code is one of the hardest classes of kernel bugs to diagnose.
Weak-memory ARM bugs. Code written and tested on x86 may silently break on ARM/Android. The Android kernel team found dozens of such bugs in the 2010s.
Compiler optimization. Without barriers, the compiler can hoist loads out of loops, sink stores, or reorder independent operations. These transformations are always valid for single-threaded code but can be wrong across threads.
Hardware-software contract. Memory models are how the architecture vendor and language vendor agree on what programmers can rely on.
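The compiler-optimization point can be made concrete with a stop flag. In the sketch below (the function name run_worker_until_stopped is invented for this example), the flag is a std::atomic<bool>, so the compiler must reload it on every iteration. If stop were a plain bool, the compiler would be entitled to read it once, hoist the load out of the loop, and spin forever.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Starts a worker that spins until told to stop, then returns how many
// iterations it completed.
long run_worker_until_stopped() {
    std::atomic<bool> stop{false};
    std::atomic<long> iterations{0};

    std::thread worker([&] {
        // The atomic load cannot be hoisted out of the loop, even when relaxed.
        while (!stop.load(std::memory_order_relaxed)) {
            iterations.fetch_add(1, std::memory_order_relaxed);
        }
    });

    // Wait until the worker has made visible progress, then stop it.
    while (iterations.load(std::memory_order_relaxed) == 0) {
        std::this_thread::yield();
    }
    stop.store(true, std::memory_order_relaxed);
    worker.join();

    long total = iterations.load(std::memory_order_relaxed);
    assert(total > 0);
    return total;
}
```

Relaxed ordering suffices because the flag carries no data with it. The moment the flag is used to hand over other memory, the store should become a release and the load an acquire.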

Common misconceptions

x86 needs no barriers. Wrong. x86 still needs MFENCE (or a LOCK-prefixed instruction) for store-load ordering, e.g., the read-after-store in Dekker's algorithm. Most acquire/release patterns are free, but seq_cst stores are not.
Fences flush caches. No. Cache coherence between cores is automatic and continuous, so caches never need flushing for correctness on ordinary memory. Fences only constrain the order in which a core's operations become visible to other cores; they do not force data out to main memory.
volatile is a fence. In C/C++, no: volatile only stops the compiler from eliding or reordering volatile accesses relative to each other. It emits no CPU fence, does not order surrounding non-volatile accesses, and does not make an access atomic. In Java since 1.5, volatile does provide sequentially consistent (and therefore acquire/release) semantics. Don't carry the Java intuition into C++ or vice versa.
Acquire and release are independent of atomicity. They're tied: only atomic operations and fences (std::atomic_thread_fence) take a memory_order. A non-atomic write before a release store is ordered by it, but the non-atomic write itself still isn't atomic.
Fences are slow. Rough figures that vary by microarchitecture: on x86, MFENCE is ~30 cycles and LOCK-prefixed instructions are similar; on ARM, DMB ISH is 5-15 cycles and LDAR/STLR are 1-3 cycles each. Not free, but not catastrophic. In practice the cost of contention on the cache line dwarfs the fence cost.
Compiler barriers are CPU barriers. A compiler barrier (asm volatile("" ::: "memory") in GCC and Clang) is purely compile-time and generates no instruction. A CPU barrier is a real instruction. You need both to fully forbid reordering; std::atomic operations and std::atomic_thread_fence give you both.

Frequently asked questions

What is the difference between a CPU and compiler barrier?

A compiler barrier prevents the compiler from moving memory operations across the barrier in the generated code, but generates no instruction. GCC's asm volatile ("" ::: "memory") is a pure compiler barrier. A CPU (hardware) barrier is an actual instruction (MFENCE, DMB, SYNC) that prevents the processor's out-of-order execution and write-buffering hardware from reordering across it. You usually need both, because a release/acquire pattern requires that neither the compiler nor the CPU reorder past the barrier. C++11 atomic operations with non-relaxed ordering apply both: they restrict the compiler and emit whatever hardware barrier the target needs, which on x86 may be none.
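Standard C++ exposes the same distinction without inline assembly: std::atomic_signal_fence restricts only the compiler, while std::atomic_thread_fence restricts the compiler and also emits a hardware barrier where the target needs one. The sketch below contrasts the two; the Mailbox type and the three function names are hypothetical and exist only for this illustration.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical one-slot mailbox used only for this illustration.
struct Mailbox {
    int value = 0;
    std::atomic<int> ready{0};
};

// Compiler-only barrier: stops the optimizer from sinking the write to
// `value` below the flag store, but emits no CPU fence. Sufficient only
// when the reader runs on the same thread (for example a signal handler).
void publish_compiler_only(Mailbox& m, int v) {
    assert(v >= 0);   // -1 is reserved by try_read for "empty"
    m.value = v;
    std::atomic_signal_fence(std::memory_order_release);
    m.ready.store(1, std::memory_order_relaxed);
}

// Compiler + CPU barrier: also orders the two writes as seen by other cores.
void publish_cross_thread(Mailbox& m, int v) {
    assert(v >= 0);
    m.value = v;
    std::atomic_thread_fence(std::memory_order_release);
    m.ready.store(1, std::memory_order_relaxed);
}

// Reader side for the cross-thread case. Returns -1 if nothing is published.
int try_read(Mailbox& m) {
    if (m.ready.load(std::memory_order_relaxed) == 0) return -1;
    std::atomic_thread_fence(std::memory_order_acquire);
    return m.value;
}
```

On x86 both publish functions usually compile to the same two MOVs, which is exactly why code that relies on the compiler-only version can appear correct there and then fail on ARM or POWER.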

Why does x86 not need read barriers in most cases?

x86 has TSO (Total Store Order): loads are not reordered with other loads, stores are not reordered with other stores, and loads are not reordered with prior stores to the same location. The only reordering allowed is store-load (a load can complete before an earlier store to a different address, since the store sits in the store buffer). Acquire-on-load and release-on-store therefore need no fence on x86 — plain MOVs suffice. Only seq_cst (which forbids store-load reorder) needs MFENCE or a LOCKed instruction. ARM and POWER allow far more reordering and need explicit barriers.

What is acquire-release semantics?

An acquire load is a one-way barrier: reads and writes that follow it in program order cannot be reordered before it. A release store is the mirror image: reads and writes that precede it cannot be reordered after it, so everything the thread did before the release store is visible to a reader whose acquire load observes the stored value. Together they implement a producer/consumer handshake: producer writes data, then release-stores a flag; consumer acquire-loads the flag, then reads the data and sees the producer's writes. This is the foundation of message passing in lock-free code.

How does C++11's memory_order map to barriers?

memory_order_relaxed: no fence, just atomicity. memory_order_acquire (loads): on x86, no fence; on ARMv8, LDAR (or a load followed by DMB ISHLD); on ARMv7, a load followed by DMB ISH. memory_order_release (stores): on x86, no fence; on ARMv8, STLR; on ARMv7, DMB ISH followed by the store. memory_order_acq_rel (RMW): combines both. memory_order_seq_cst: on x86, loads stay plain MOV while stores use XCHG or MOV plus MFENCE; on ARMv8, LDAR/STLR are already strong enough; on ARMv7, DMB ISH around the access. Compilers also refrain from reordering their own generated code across any non-relaxed atomic. The standard guarantees the orderings; the actual instructions are platform-specific and can differ between compilers.

What is the double-checked locking bug?

Classic singleton pattern: if (instance == null) { lock { if (instance == null) instance = new Singleton(); } }. The bug: the writes that initialize the object and the assignment to instance can be reordered, and the unlocked first check does nothing to order the reader. Another thread checking instance != null may see a non-null pointer to a partially constructed object. This was famously broken in Java before JDK 1.5 and fixed by declaring instance as volatile, which under the new memory model makes the publish a release and the check an acquire. C++11 fixes it with std::atomic, using memory_order_release for the publish and memory_order_acquire for the check.
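A C++11 version of the corrected pattern might look like the following sketch. The Singleton type, the g_ globals, get_instance, and the all_threads_agree helper are all names invented for this illustration, and the instance is deliberately never freed to keep the example short.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical singleton type used only for this illustration.
struct Singleton {
    int value = 7;
};

std::atomic<Singleton*> g_instance{nullptr};
std::mutex g_init_mutex;
std::atomic<int> g_constructions{0};   // counts constructor runs, for checking

Singleton* get_instance() {
    // Fast path: the acquire load pairs with the release store below, so a
    // non-null pointer implies the object's fields are fully visible.
    Singleton* p = g_instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(g_init_mutex);
        // Second check under the lock; the mutex already orders this load.
        p = g_instance.load(std::memory_order_relaxed);
        if (p == nullptr) {
            p = new Singleton();   // intentionally leaked in this sketch
            g_constructions.fetch_add(1, std::memory_order_relaxed);
            g_instance.store(p, std::memory_order_release);   // publish
        }
    }
    assert(p != nullptr);
    return p;
}

// Calls get_instance() from n threads and reports whether they all
// received the same pointer.
bool all_threads_agree(int n) {
    std::vector<Singleton*> seen(n, nullptr);
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        threads.emplace_back([&seen, i] { seen[i] = get_instance(); });
    }
    for (auto& t : threads) t.join();
    for (int i = 1; i < n; ++i) {
        if (seen[i] != seen[0]) return false;
    }
    return true;
}
```

In real C++11 code a function-local static (static Singleton s; return &s;) is usually simpler, since the language guarantees its initialization is thread-safe. The explicit version is shown here only because it makes the acquire/release pairing visible.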

Are barriers the same as memory ordering?

Barriers are the implementation; memory ordering is the language-level model. You write code with std::atomic operations and memory_order parameters; the compiler emits whatever barriers and special atomic instructions the target architecture needs to honor that order. On x86 (TSO), most ordering needs no barrier instruction. On ARM (weakly ordered), the same C++ code emits DMB and LDAR/STLR. The C++11/Java/C# memory models give you a portable abstraction over the hardware barrier zoo. You should think in terms of memory orders, not specific fences, except when writing inline assembly.