Concurrency
Atomic Operations
Read-modify-write that completes indivisibly — fetch-add, compare-and-swap, load-link/store-conditional
An atomic operation is a CPU instruction (or sequence) that completes as an indivisible unit — no other thread can observe an intermediate state. Common atomics: load, store, fetch-add (atomic increment), exchange, compare-and-swap (CAS), and load-linked/store-conditional (LL/SC) on RISC architectures. They form the basis of lock-free programming and the C11/C++11 <atomic> library. On x86, atomicity is enforced by the LOCK prefix (~10-30 cycles); on ARMv8, by LDXR/STXR pairs. Atomic operations are the only way to coordinate threads without taking a lock — but they're slower than non-atomic memory access by ~10× under contention.
- Latency: 10-30 cycles (uncontended)
- Type: Read-modify-write
- x86: LOCK prefix
- ARM: LDXR/STXR pair
- Standardized: C11, C++11
- Compose: Not for free (linearization rules)
What "atomic" actually means
The dictionary definition is "indivisible." In concurrency, an operation is atomic if no thread can ever observe it half-done. A non-atomic 64-bit store on a 32-bit machine, by contrast, can be observed mid-flight: another core might see the lower 32 bits updated and the upper 32 bits stale. Atomicity rules out exactly that.
The C11/C++11 standards formalize a small alphabet of atomic operations:
- Atomic load. Read a value with no torn reads.
- Atomic store. Write a value with no torn writes.
- Exchange. Atomically swap a memory location with a value, returning the old value.
- Compare-and-swap (CAS). Conditional swap based on expected value.
- Fetch-add / fetch-sub / fetch-or / fetch-and / fetch-xor. Atomic read-modify-write that returns the old value.
- Wait/notify. Block until a value changes — added in C++20.
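Each read-modify-write operation is a single instruction on x86 (with the LOCK prefix, which XCHG implies) or a small LL/SC sequence on ARM/POWER/RISC-V. Aligned loads and stores are ordinary memory accesses that the hardware already performs atomically. Wait/notify is the exception: it is a library facility that typically relies on OS support rather than on a single instruction. The standard library wraps all of them so portable code looks identical regardless of architecture.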
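As a rough illustration of that portable surface, the sketch below runs each operation from the list once on a local std::atomic. The function name atomic_alphabet_demo and the specific values are made up for this example; every call uses the default seq_cst ordering.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical demo: exercises the portable atomic operations in sequence
// on one variable and returns the final value.
std::uint32_t atomic_alphabet_demo() {
    std::atomic<std::uint32_t> v{5};

    std::uint32_t seen = v.load();        // atomic load: reads 5
    v.store(seen + 1);                    // atomic store: v is now 6
    std::uint32_t old = v.exchange(10);   // swap in 10, get back 6

    // CAS: v equals `expected`, so v becomes old + 14 = 20.
    std::uint32_t expected = 10;
    bool swapped = v.compare_exchange_strong(expected, old + 14);
    if (!swapped) return 0;               // not reached in this single-threaded demo

    v.fetch_add(3);      // 20 -> 23 (0x17)
    v.fetch_or(0x40);    // 0x17 -> 0x57
    v.fetch_and(0xF0);   // 0x57 -> 0x50
    v.fetch_xor(0x0F);   // 0x50 -> 0x5F
    v.fetch_sub(5);      // 0x5F -> 0x5A (90)
    return v.load();
}
```

Each fetch_* call also returns the previous value; the demo discards it to keep the sequence easy to follow.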
x86 LOCK vs ARM LL/SC
Two architectural philosophies dominate.
x86: The LOCK prefix on instructions like ADD, OR, XCHG, CMPXCHG turns a non-atomic instruction into an atomic one. The hardware acquires exclusive cache-line ownership, performs the operation, and releases. XCHG on x86 has implicit LOCK semantics — there's no non-atomic exchange. Intel guarantees that a LOCK-prefixed instruction is atomic with respect to all other LOCK-prefixed instructions on any core, regardless of cache-line alignment (though aligned access is much faster).
ARM (and other RISCs): ARM uses load-linked/store-conditional (LL/SC), a pair of instructions: LDXR (load exclusive) and STXR (store exclusive). LDXR reads and "reserves" the line; STXR writes only if the reservation is still intact. If anything between the LDXR and STXR could have invalidated the reservation (another core's write, a context switch, an interrupt), STXR fails and you must retry. This is more general, since you can compute any function between LDXR and STXR, but also more error-prone, since spurious failures are possible. ARMv8.1 added direct compare-and-swap instructions (such as CASAL) alongside LL/SC, recognizing that LL/SC retry loops scaled poorly under heavy contention.
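In portable C++ the "compute any function, then retry on failure" pattern is usually written as a compare_exchange_weak loop, which a compiler may lower to LL/SC on ARM and to LOCK CMPXCHG on x86. The sketch below is one such loop for an atomic maximum; fetch_max is a hypothetical helper name, not a standard member function in C++11.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: atomically raise `target` to at least `candidate`.
// Returns the value that was in `target` before any update took effect.
std::uint64_t fetch_max(std::atomic<std::uint64_t>& target,
                        std::uint64_t candidate) {
    std::uint64_t observed = target.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously (mirroring a lost LL/SC
    // reservation), so it always sits inside a retry loop. On failure it
    // refreshes `observed` with the current value.
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate,
                                         std::memory_order_relaxed,
                                         std::memory_order_relaxed)) {
        // Retry with the refreshed `observed`.
    }
    return observed;
}
```

The loop exits either because the stored value is already large enough or because the swap succeeded, so a spurious failure costs one extra iteration rather than a wrong result.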
Memory ordering matters as much as atomicity
Atomicity prevents partial observations of one location. But threads care about the order of operations across multiple locations — a producer wants the consumer to see its data write before its flag write. C++11 expressed this with six memory orders, ordered from weakest to strongest:
- memory_order_relaxed — atomicity only, no ordering. Cheapest, used for counters where you don't need to synchronize-with anything else.
- memory_order_consume — data-dependency ordering. Almost universally implemented as acquire because compilers can't reliably track dependencies.
- memory_order_acquire — used on the load side of synchronization. Subsequent reads/writes can't be reordered before the acquire load.
- memory_order_release — used on the store side. Prior reads/writes can't be reordered after the release store.
- memory_order_acq_rel — both. Used for read-modify-write that participates in synchronization on both sides.
- memory_order_seq_cst — sequentially consistent. Strongest; full barriers; default for std::atomic operations.
On x86, relaxed and acquire loads, and relaxed and release stores, all compile to plain MOV, because TSO already provides those guarantees; only seq_cst stores need something stronger (an XCHG or a fence). On ARM, weaker orders save real fence cycles. The C++ default of seq_cst is portable but conservative; performance-tuned lock-free code uses acquire/release or relaxed where the algorithm allows.
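The producer/consumer case mentioned above can be sketched with one release store and one acquire load. The Mailbox type and the function names are illustrative assumptions; the point is only that the plain write to payload becomes visible to any thread whose acquire load observes ready == true.

```cpp
#include <atomic>
#include <thread>

// Hypothetical one-shot mailbox: plain data published through an atomic flag.
struct Mailbox {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};   // flag that publishes the data
};

void produce(Mailbox& m, int value) {
    m.payload = value;                                // 1. write the data
    m.ready.store(true, std::memory_order_release);   // 2. publish it
}

int consume(Mailbox& m) {
    // Spin until the flag is set; the acquire load pairs with the release store.
    while (!m.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return m.payload;   // guaranteed to see the producer's write
}

// Runs one producer and one consumer and returns what the consumer saw.
int run_handoff(int value) {
    Mailbox m;
    int seen = 0;
    std::thread consumer([&] { seen = consume(m); });
    std::thread producer([&] { produce(m, value); });
    producer.join();
    consumer.join();
    return seen;
}
```

With memory_order_relaxed on both sides the flag would still be atomic, but the read of payload would be a data race.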
fetch-add and atomic counters
Atomic counters use fetch_add directly:
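#include <atomic>
#include <cstdint>
std::atomic<std::uint64_t> counter{0};
// Returns the counter value from before this event was recorded.
std::uint64_t observe_event() {
    return counter.fetch_add(1, std::memory_order_relaxed);
}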
On x86, this compiles to lock xadd — one instruction. On ARM (pre-v8.1), it compiles to a small LL/SC loop:
retry:
    ldxr  x0, [counter]      // load exclusive ([counter] stands for a register holding the address)
    add   x1, x0, #1
    stxr  w2, x1, [counter]  // store conditional; w2 is 0 on success
    cbnz  w2, retry          // failed? retry
ARMv8.1 has LDADD for direct fetch-add, comparable to x86. Either way, the cost is dominated by cache-coherence traffic, not instruction count.
This is also where the high-contention scalability problem appears. A counter shared across N cores serializes — only one core at a time owns the cache line. At 32 cores, throughput plateaus or goes backwards. Mitigations: per-thread counters with periodic sum, sharded counters across multiple cache lines, or padded array with one counter per core.
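One way to sketch the sharded-counter mitigation is shown below. The shard count, the 64-byte line size, and the names PaddedCounter, ShardedCounter, and count_in_parallel are assumptions chosen for illustration; a real implementation would pick the shard from the current CPU or thread ID.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size
constexpr std::size_t kShards = 8;      // assumed shard count

// One counter per cache line, so shards never share a line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

class ShardedCounter {
public:
    // `shard_hint` is any per-thread number, for example a thread index.
    void add(std::size_t shard_hint, std::uint64_t n = 1) {
        shards_[shard_hint % kShards].value.fetch_add(
            n, std::memory_order_relaxed);
    }
    // The sum is approximate while writers are still running.
    std::uint64_t sum() const {
        std::uint64_t total = 0;
        for (const auto& s : shards_) {
            total += s.value.load(std::memory_order_relaxed);
        }
        return total;
    }
private:
    std::array<PaddedCounter, kShards> shards_;
};

// Hypothetical driver: each thread hammers its own shard.
std::uint64_t count_in_parallel(std::size_t threads, std::uint64_t per_thread) {
    ShardedCounter counter;
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, t, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) counter.add(t);
        });
    }
    for (auto& th : pool) th.join();
    return counter.sum();  // exact here, because all writers have finished
}
```

The trade-off is that reads become a loop over all shards and are only exact once writers are quiescent, which is usually acceptable for statistics.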
Why atomic operations matter
- Counters and statistics. Web servers tally requests with atomic counters. JVM tracks GC events. Linux per-CPU counters go a step further: each CPU updates its own local count, and the counts are merged periodically or summed on read.
- Lock-free queues. Michael-Scott queue, MPSC ring buffers, and disruptor patterns are built from atomic loads, stores, and CAS.
- Reference counting. Rust's Arc, C++ shared_ptr, and Swift's reference counts all use atomic increment/decrement.
- Kernel data. Linux atomic_t, futex word, RCU pointer flips. The kernel uses thousands of atomics; their overhead is critical.
- Lock implementations. Mutex acquire is typically a CAS on the lock word; the kernel-mediated wait path is slow, but the fast path is one atomic.
- Sequence numbers. Distributed systems often use a process-local atomic counter as the seqno source.
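To make the "fast path is one atomic" point about lock implementations concrete, here is a minimal spinlock sketch. SpinLock is a made-up class for illustration; a production mutex would park the thread in the kernel instead of spinning when the CAS fails.

```cpp
#include <atomic>
#include <thread>

// Hypothetical spinlock: acquiring an uncontended lock is a single CAS.
class SpinLock {
public:
    // Fast path: one compare-and-swap from "unlocked" to "locked".
    bool try_lock() {
        bool expected = false;
        return locked_.compare_exchange_strong(
            expected, true,
            std::memory_order_acquire,    // success: see the previous owner's writes
            std::memory_order_relaxed);   // failure: nothing to synchronize with
    }
    // Slow path stand-in: yield and retry instead of sleeping in the kernel.
    void lock() {
        while (!try_lock()) std::this_thread::yield();
    }
    // Release store publishes the critical section's writes to the next owner.
    void unlock() {
        locked_.store(false, std::memory_order_release);
    }
private:
    std::atomic<bool> locked_{false};
};
```

Because it provides lock(), unlock(), and try_lock(), the class can be used with std::lock_guard.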
Common misconceptions
- Always lock-free. The C++ standard allows std::atomic to be implemented with a mutex if hardware doesn't support the size. atomic<T>::is_lock_free() tells you for sure. 64-bit atomics on 32-bit ARM may use a lock internally.
- Atomic implies ordered. Atomicity and ordering are separate properties. memory_order_relaxed gives atomicity with no ordering; a release/acquire pair on a different variable is needed to synchronize a producer/consumer protocol.
- Atomic is slow. Uncontended atomics on cache-resident lines cost 10-30 cycles — single-digit nanoseconds. The slowness is contention, not the operation itself.
- volatile is enough. volatile in C/C++ stops the compiler from eliding accesses or reordering them relative to other volatile accesses, but provides no atomicity (a 64-bit volatile read on a 32-bit machine can tear) and no inter-thread ordering. Use atomic.
- Two atomics combine. Two separate atomic operations are atomic individually, not jointly. Reading atomic x and atomic y gives you values that were each instantaneously consistent but may correspond to different points in time.
- Atomic adds always linearize FIFO. Hardware doesn't guarantee fairness. The order in which contending CAS or fetch-add operations succeed is not specified; a core that retries aggressively can briefly starve others under heavy contention.
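For the lock-free misconception in the list above, a quick check might look like the following sketch. The function names are hypothetical; is_lock_free() and the ATOMIC_*_LOCK_FREE macros are standard since C++11, and the macro value 2 means "always lock-free" for that type.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical probes: ask the implementation whether a given width is
// handled by hardware atomics or by an internal lock.
bool word32_is_lock_free() {
    std::atomic<std::uint32_t> x{0};
    return x.is_lock_free();
}

bool word64_is_lock_free() {
    std::atomic<std::uint64_t> x{0};
    return x.is_lock_free();   // may be false on some 32-bit targets
}

// Compile-time view: 0 = never, 1 = sometimes, 2 = always lock-free.
int int_lock_free_level() { return ATOMIC_INT_LOCK_FREE; }
int llong_lock_free_level() { return ATOMIC_LLONG_LOCK_FREE; }
```

Code that depends on lock-freedom (for example, atomics shared with a signal handler) can refuse to build or fall back to another strategy when these report anything less than "always".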
Frequently asked questions
What's the difference between atomic and volatile?
In C/C++, volatile prevents the compiler from optimizing away accesses to a variable, or reordering them relative to other volatile accesses, but provides no atomicity or thread-synchronization guarantees. atomic (C11/C++11) provides true atomic read-modify-write plus the memory ordering you specify. In Java, volatile guarantees atomic reads/writes for primitive types and provides happens-before ordering, which makes it closer to a C++ atomic with sequentially-consistent ordering. The lesson: in C/C++, volatile is not for threads; use atomic. In Java, volatile is fine for simple flags but doesn't provide atomic increment; use AtomicInteger for counters.
Why are atomics 10x slower under contention?
An atomic operation forces the cache line into the executing core's L1 cache in modified state, invalidating every other core's copy. Uncontended, this costs about 10-30 cycles. Under contention, the cache line ping-pongs across the inter-core network: each operation invalidates other cores, takes ownership, completes, and releases, only for another core to do the same. Round-trip cache-coherence latency between cores is 50-100 ns; between sockets, 100-200 ns. With 8 contending cores, throughput collapses to a single shared-line update every ~100 ns across the whole machine, instead of each core completing one every 10-30 cycles on a line it already owns.
What is LL/SC and how does it differ from CAS?
Load-Linked/Store-Conditional is a pair of instructions used by RISC architectures (ARM, POWER, RISC-V, MIPS, Alpha). LL reads a memory location and marks it as monitored. SC writes back, but only succeeds if no other core has written that location since the LL. Difference from CAS: CAS checks values, LL/SC checks reservation. LL/SC is more general — it lets you compute arbitrary functions between LL and SC, not just compare-replace. But it's slightly weaker: SC may fail spuriously (no other writer, but the reservation was lost — e.g., to a context switch or interrupt). ARMv8.1 added a true CAS as an additional primitive.
What does memory_order_relaxed allow?
memory_order_relaxed gives you atomicity (no torn writes, no half-observed values) but zero ordering guarantees with respect to other memory accesses. The compiler and CPU may reorder a relaxed atomic operation freely with respect to surrounding non-atomic operations. It's appropriate for monotonic counters where you don't care about ordering, such as statistics counters or reference counts on increment (decrement still needs acquire-release). Relaxed loads and stores compile to a regular MOV on x86 (which is already TSO) and to LDR/STR with no fence on ARM, making them the fastest atomics; relaxed read-modify-writes still need a LOCK-prefixed instruction or an LL/SC loop.
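The reference-count case can be sketched as follows. RefCount is a hypothetical class, not the implementation used by any particular smart pointer; it only illustrates why the increment can be relaxed while the decrement uses acq_rel.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical intrusive reference count, starting with one owner.
class RefCount {
public:
    // Taking another reference needs no ordering: the caller already holds
    // a valid reference, so the object cannot disappear underneath it.
    void acquire() {
        count_.fetch_add(1, std::memory_order_relaxed);
    }

    // Dropping a reference returns true for the last owner. acq_rel makes
    // every earlier owner's writes visible to whoever destroys the object.
    bool release() {
        return count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }

    std::uint32_t use_count() const {
        return count_.load(std::memory_order_relaxed);
    }
private:
    std::atomic<std::uint32_t> count_{1};
};
```

A caller would typically write `if (rc.release()) delete object;`, relying on the acquire half of the decrement to order the deletion after all other owners' last accesses.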
Can atomic operations compose?
Two atomic operations are not jointly atomic. Reading atomic x then atomic y gives you values that were each individually consistent but not necessarily simultaneous: between your two reads, both could change. To get composed atomicity you need a lock, transactional memory, or a hand-rolled CAS protocol like the one underlying Michael-Scott queue. This is why atomic counters are easy but atomic linked lists are hard. Multi-step protocols also weaken progress guarantees: a CAS retry loop is lock-free but not wait-free, so an individual thread can in principle retry indefinitely under adversarial scheduling.
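When the two values are small enough, one simple workaround (swapped in here as an illustration, not named in the text above) is to pack both into a single 64-bit atomic so that one CAS covers them jointly. AtomicPair, Pair, and transfer are hypothetical names for this sketch.

```cpp
#include <atomic>
#include <cstdint>

struct Pair {
    std::uint32_t x;
    std::uint32_t y;
};

// Hypothetical wrapper: two 32-bit fields stored in one 64-bit atomic word,
// so readers always see an (x, y) snapshot from a single point in time.
class AtomicPair {
public:
    AtomicPair(std::uint32_t x, std::uint32_t y) : bits_(pack(Pair{x, y})) {}

    Pair load() const {
        return unpack(bits_.load(std::memory_order_acquire));
    }

    // Atomically moves `amount` from x to y; x + y never changes.
    void transfer(std::uint32_t amount) {
        std::uint64_t old_bits = bits_.load(std::memory_order_relaxed);
        std::uint64_t new_bits;
        do {
            Pair p = unpack(old_bits);
            p.x -= amount;
            p.y += amount;
            new_bits = pack(p);
            // On failure old_bits is refreshed and the update is recomputed.
        } while (!bits_.compare_exchange_weak(old_bits, new_bits,
                                              std::memory_order_acq_rel,
                                              std::memory_order_relaxed));
    }
private:
    static std::uint64_t pack(Pair p) {
        return (static_cast<std::uint64_t>(p.x) << 32) | p.y;
    }
    static Pair unpack(std::uint64_t b) {
        return Pair{static_cast<std::uint32_t>(b >> 32),
                    static_cast<std::uint32_t>(b)};
    }
    std::atomic<std::uint64_t> bits_;
};
```

This only works while the combined state fits in one atomically updatable word; anything larger needs a lock or a pointer-swap scheme.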
Why are 128-bit atomics rare?
128 bits exceeds the native register width that ordinary atomic instructions operate on. Hardware implements atomicity by routing the operation through one core's L1 in modified state, so supporting a 128-bit atomic means the line layout, alignment, and bus protocol must all guarantee no torn 64-bit halves. Intel x86 has supported CMPXCHG16B (16-byte CAS, often used for tagged pointers and ABA prevention) since 2005. AArch64 offers the exclusive pair LDXP/STXP and, from ARMv8.1, CASP; plain LDP/STP are not guaranteed to be atomic as a pair on baseline ARMv8. Most languages expose only 64-bit atomics; 128-bit is opt-in via hardware-specific intrinsics.