Concurrency
Compare-and-Swap (CAS)
One CPU instruction reads, compares, and conditionally writes — atomically
Compare-and-swap (CAS) is a hardware-level atomic instruction that takes a memory location, an expected value, and a new value: it writes the new value only if the location currently holds the expected value. The operation is atomic — no other thread can observe a partial state. Available since Intel 80486 (1989) as CMPXCHG, ARMv8 (CASAL), and SPARC (CAS). It is the universal building block for lock-free algorithms — Treiber stack, Michael-Scott queue, lock-free hash maps, and atomic counters all rely on a CAS loop: read, compute new value, CAS — retry on failure.
- Latency~10-30 cycles
- x86 instructionCMPXCHG (LOCK prefix)
- ARMv8CASAL
- First inIBM 370 1973, x86 80486 1989
- Lock-free primitiveYes
- ABA problemYes (tagged pointers, hazard pointers)
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How CAS works
The semantics fit in a few lines of pseudocode:
function CAS(addr, expected, new):
atomic {
if *addr == expected:
*addr = new
return true
else:
return false
}
The atomic block is the entire point — between the comparison and the write, no other thread can observe or modify *addr. CAS reports success or failure so callers can react: success means "I did the update"; failure means "someone got there first, retry."
The instruction has been around since IBM System/370 in 1973. Intel introduced CMPXCHG on the 80486 in 1989, with the LOCK prefix for multi-CPU bus locking. ARM took a different route initially with load-linked/store-conditional pairs (LDREX/STREX) but added a true CAS in ARMv8.1 (CASAL) when atomic-heavy workloads became common.
The CAS loop pattern
Almost every use of CAS lives inside a retry loop. The pattern: read a value, compute a new value based on it, CAS to commit. If CAS fails, someone else updated the location — reread and retry.
function atomic_increment(counter):
loop:
old = *counter
new = old + 1
if CAS(counter, old, new): return new
// else retry
This is optimistic concurrency control at the hardware level. You assume no contention, do the work, and only pay if your assumption fails. Under low contention, the loop almost never retries. Under high contention, it may retry many times — but progress is guaranteed: at least one thread always succeeds, so the system never deadlocks.
Treiber's lock-free stack
R. Kent Treiber's 1986 IBM technical report introduced one of the first lock-free data structures, a stack built on CAS:
struct Node { value: T; next: *Node }
top: *Node = null
function push(value):
n = new Node(value, null)
loop:
n.next = top
if CAS(&top, n.next, n): return
function pop():
loop:
old = top
if old == null: return null
if CAS(&top, old, old.next): return old.value
One CAS per operation. No locks, no waiting. Under low contention, both push and pop run in tens of nanoseconds. Treiber's stack is a teaching reference — almost every textbook on concurrent data structures starts here. It also has a famous flaw: the ABA problem.
The ABA problem
Consider a Treiber stack with top = A, A.next = B. Thread T1 reads old = A and is preempted before its CAS. Meanwhile T2 pops A (now top = B), pops B (top = null), then pushes a fresh A (top = A again). T1 resumes and runs CAS(&top, A, A.next). The CAS sees the value is A — match! It succeeds, setting top to A.next, which is the freed B. The stack is now corrupt.
The general lesson: equal pointer values do not imply unchanged structure. CAS only checks bits.
Mitigations:
- Tagged pointers. Pack a 16- or 32-bit version counter alongside the pointer. Every push/pop increments the counter. CAS the entire word — old A with version 5 won't match new A with version 7.
- Hazard pointers. Each thread publishes which pointers it's about to dereference. Reclamation skips any pointer currently a hazard.
- Epoch-based reclamation. Track global epochs; defer freeing until no thread is in an old epoch.
- RCU (read-copy-update). Linux kernel's pattern: defer reclamation until a quiescent state when no reader holds the old pointer.
What CAS costs in hardware
A bus-locked CMPXCHG forces the cache line into the modified state in the executing core's L1 cache, evicting copies from other cores. On modern x86 with cache-coherent NUMA:
- Uncontended (line already modified locally): ~10-30 cycles — comparable to a regular store with a fence.
- Lightly contended (line is in shared state): ~50-100 cycles — invalidate other cores, take exclusive ownership.
- Heavily contended (8+ cores hammering the same line): hundreds of cycles per CAS, and the CAS loop retries multiple times — effective throughput collapses.
- Cross-socket (different NUMA node owns the line): 200-500 cycles — the line traverses the inter-socket link.
This is why hot atomic counters are a notorious scalability anti-pattern. Sharded counters, per-thread accumulators with periodic flush, and lock combiners all aim to remove the single point of contention.
Why CAS matters
- Lock-free programming. Treiber stack, Michael-Scott queue, Harris linked list, Cliff Click's lock-free hash map — all built from CAS loops.
- Atomic counters.
std::atomic<int>::fetch_add, Java'sAtomicLong.incrementAndGet, Go'satomic.AddInt64— under the hood, a CAS loop or a hardware fetch-add (which is essentially CAS internally). - Kernel concurrency. Linux uses CAS for futexes, RCU pointer swaps, and per-CPU counter flushes. Without CAS, the kernel's scalability would collapse on many-core machines.
- JVM AtomicX classes. The entire
java.util.concurrent.atomicpackage is a thin layer over hardware CAS, exposed for application use. - Garbage collectors. Concurrent and parallel GCs use CAS to coordinate marking, claim work units, and update pointers without stopping the world.
- Database concurrency. Optimistic concurrency control protocols use CAS-like version-check-then-write, sometimes implemented as a hardware CAS on a row's version counter.
Common misconceptions
- Free of contention. CAS is fast uncontended but expensive contended — under heavy contention a well-tuned mutex with backoff can outperform a naive CAS loop.
- Obsoletes locks. Only for single-word operations. Multi-word atomicity still needs locks, transactional memory, or careful protocol design.
- Lock-free is wait-free. Different progress guarantees. Lock-free: at least one thread makes progress system-wide. Wait-free: every thread makes progress in bounded steps. CAS loops are lock-free but generally not wait-free — a thread can in principle retry indefinitely.
- CAS implies memory ordering. Hardware CAS implementations differ. x86 CMPXCHG is sequentially consistent. ARMv8 CAS comes in relaxed (
CAS), acquire (CASA), release (CASL), and full (CASAL) variants — pick the right one or use C++11's memory_order_seq_cst. - ABA only matters for memory reclamation. ABA can also corrupt counters and version-checked optimistic protocols, anywhere equal-bits doesn't mean unchanged-history.
- CAS is always a single instruction. On older ARM, CAS is emulated via LDXR/STXR pair — two instructions and a possible spurious failure (the store-conditional can fail even without contention).
Frequently asked questions
What is the ABA problem?
ABA happens when a value changes from A to B and back to A between your read and your CAS. CAS sees A and succeeds — but the world has shifted underneath. In a lock-free stack, thread T1 reads top = A; thread T2 pops A then pops B then pushes A back; T1 CAS succeeds, but A's next pointer now points at freed memory. Mitigations: tagged pointers (pack a 16-bit version counter alongside the address; CAS the whole word), hazard pointers, epoch-based reclamation, or 128-bit double-CAS where the upper 64 bits hold a version counter.
How does Treiber's stack use CAS?
Treiber's lock-free stack (1986) uses one CAS per push and per pop. To push: allocate a new node with next pointing at the current top; CAS top from old to new. If CAS fails (another thread pushed), reread top and retry. To pop: read top; read top.next; CAS top from old to old.next. Returns the popped value. Both operations are wait-free for the writer if there's no contention, and lock-free under contention. The structure is one of the simplest lock-free designs and remains a reference in concurrent programming textbooks.
Why is CAS atomic but not transactional?
CAS atomically updates exactly one memory location, conditional on its current value. It does not span multiple locations or roll back on failure — failure simply means the memory wasn't what you expected, and your code must retry. A transaction lets you bundle multiple memory accesses with all-or-nothing semantics (think hardware transactional memory like Intel TSX). CAS is a primitive on which transactions can be built, but it is not itself a transaction. This is why a CAS loop is the lock-free equivalent of optimistic-concurrency-control retry, not of a database transaction.
What is double-CAS (DCAS)?
Double-CAS atomically updates two memory locations conditional on their current values. It was proposed by Greenwald and Cheriton in 1996 because many lock-free algorithms (e.g., Harris's lock-free linked list, deque designs) need to swing two pointers at once. No major architecture has implemented arbitrary DCAS — it would require a multi-cache-line atomic, which is hardware-expensive. x86 offers CMPXCHG16B (compares and swaps 16 contiguous bytes — useful for tagged pointers) but not arbitrary DCAS. Software DCAS via lock or transactional memory exists but is slower than two single CAS operations.
What's the cost on contention vs uncontended?
An uncontended CAS on a cache line already in modified state costs roughly 10-30 cycles on modern x86. Under contention, the cache line ping-pongs between cores — each CAS forces the line into the executing core's L1 cache exclusively, evicting it from other cores. With 8 contending cores, a single CAS can cost hundreds of cycles, and the loop may retry dozens of times. This is why under heavy contention a well-tuned mutex (which spins briefly then sleeps) can outperform a naive CAS loop. Hot atomic counters are a classic anti-pattern in scalability.
How does Java's AtomicReference.compareAndSet map to hardware?
Java's AtomicReference.compareAndSet(expected, newRef) compiles down to a single CMPXCHG with the LOCK prefix on x86, or a CASAL on ARMv8, or LDXR/STXR pair on older ARM. The JIT inlines the operation — there's no method call overhead beyond the bus-locking instruction itself. Java's AtomicLong and AtomicInteger are similarly thin wrappers. The Unsafe class (and now VarHandle) exposes raw CAS for advanced use. JVM's intrinsic compilers ensure that a CAS with implicit volatile semantics on a hot field generates exactly one bus-locked instruction with appropriate memory barriers.