Question 1

What is the ABA problem?

Accepted Answer

ABA happens when a value changes from A to B and back to A between your read and your CAS. The CAS sees A and succeeds, but the world has shifted underneath. In a lock-free stack, thread T1 reads top = A and A.next = B, intending to pop A; thread T2 then pops A, pops B (freeing it), and pushes A back; T1's CAS from A to B succeeds, and top now points at freed memory. Mitigations: tagged pointers (pack a version counter, for example 16 bits, alongside the address and CAS the whole word; a narrow counter can eventually wrap), hazard pointers or epoch-based reclamation (both keep a node from being freed and reused while another thread may still hold a reference to it), or a 128-bit double-width CAS where the upper 64 bits hold a version counter.
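The version-counter idea can be sketched in Java with the JDK's AtomicStampedReference, which pairs a reference with an int stamp and compares both. This is a minimal, single-threaded simulation of the T1/T2 interleaving described above, not a real race; the class name AbaSketch and the string values are illustrative assumptions. Note that AtomicStampedReference is implemented with an internal pair object rather than a packed machine word, so it illustrates the logic of a tagged pointer, not its memory layout.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicStampedReference;

// Hypothetical demo class: replays the A -> B -> A sequence on one thread.
final class AbaSketch {

    /** Plain CAS: succeeds even though the value changed A -> B -> A in between. */
    static boolean plainCasAfterAba() {
        String a = "A", b = "B";
        AtomicReference<String> ref = new AtomicReference<>(a);

        String seen = ref.get();              // "T1" reads A
        ref.compareAndSet(a, b);              // "T2" changes A -> B
        ref.compareAndSet(b, a);              // "T2" changes B -> A
        return ref.compareAndSet(seen, "C");  // "T1" CAS succeeds: ABA goes unnoticed
    }

    /** Stamped CAS: the stamp moved 0 -> 1 -> 2, so the stale CAS is rejected. */
    static boolean stampedCasAfterAba() {
        String a = "A", b = "B";
        AtomicStampedReference<String> ref = new AtomicStampedReference<>(a, 0);

        int[] stampHolder = new int[1];
        String seen = ref.get(stampHolder);   // "T1" reads (A, stamp 0)
        int seenStamp = stampHolder[0];

        ref.compareAndSet(a, b, 0, 1);        // "T2" changes A -> B, stamp 0 -> 1
        ref.compareAndSet(b, a, 1, 2);        // "T2" changes B -> A, stamp 1 -> 2

        // "T1" still expects stamp 0, but the current stamp is 2, so this fails.
        return ref.compareAndSet(seen, "C", seenStamp, seenStamp + 1);
    }
}
```

In this sketch the plain CAS returns true and the stamped CAS returns false, which is the behavior the mitigation relies on.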

Question 2

How does Treiber's stack use CAS?

Accepted Answer

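Treiber's lock-free stack (1986) uses one successful CAS per push and per pop. To push: allocate a new node whose next points at the current top, then CAS top from the old node to the new one. If the CAS fails (another thread pushed or popped in the meantime), reread top and retry. To pop: read top; if it is null, the stack is empty; otherwise read top.next, CAS top from old to old.next, and return the popped node's value, retrying if the CAS fails. Without contention, each operation completes with a single CAS. Under contention the stack is lock-free but not wait-free: some thread always makes progress, but an individual thread may have to retry an unbounded number of times. The structure is one of the simplest lock-free designs and remains a reference in concurrent programming textbooks.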
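A minimal Java sketch of the push and pop steps above, using AtomicReference for top. The class name TreiberStackSketch and the demo helper are assumptions made for illustration. Because Java is garbage-collected and each push allocates a fresh immutable node, this version avoids the memory-reuse form of ABA; a C or C++ version would need one of the reclamation schemes mentioned under Question 1.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a Treiber-style stack; names are illustrative, not from a library.
final class TreiberStackSketch<T> {

    private static final class Node<T> {
        final T value;
        final Node<T> next;
        Node(T value, Node<T> next) { this.value = value; this.next = next; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>();

    void push(T value) {
        while (true) {
            Node<T> oldTop = top.get();                     // read current top
            Node<T> newTop = new Node<>(value, oldTop);     // new node points at it
            if (top.compareAndSet(oldTop, newTop)) return;  // one successful CAS
            // CAS failed: another thread changed top; reread and retry.
        }
    }

    /** Returns the popped value, or null if the stack is empty. */
    T pop() {
        while (true) {
            Node<T> oldTop = top.get();
            if (oldTop == null) return null;                // empty stack
            if (top.compareAndSet(oldTop, oldTop.next)) return oldTop.value;
        }
    }

    /** Hypothetical demo helper: pushes from several threads, then counts pops. */
    static int demo(int threads, int perThread) {
        TreiberStackSketch<Integer> stack = new TreiberStackSketch<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) stack.push(i);
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        int count = 0;
        while (stack.pop() != null) count++;
        return count;                                       // no push may be lost
    }
}
```

Calling TreiberStackSketch.demo(4, 1000) should count 4000 elements, since a failed CAS only causes a retry and never drops a push.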

Question 3

Why is CAS atomic but not transactional?

Accepted Answer

CAS atomically updates exactly one memory location, conditional on its current value. It does not span multiple locations, and there is nothing to roll back on failure: a failed CAS writes nothing and simply means the memory wasn't what you expected, so your code must reread and retry. A transaction lets you bundle multiple memory accesses with all-or-nothing semantics (think hardware transactional memory like Intel TSX). CAS is a primitive on which transactions can be built, but it is not itself a transaction. This is why a CAS loop is the lock-free equivalent of optimistic-concurrency-control retry, not of a database transaction.
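One common way to work within the single-location limit is to put all related fields into one immutable object and CAS the reference to it. The sketch below shows that pattern together with the optimistic retry loop; the names CasRetrySketch, Bounds, and widen are hypothetical, chosen only for this example.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch: two fields updated "together" by swapping one reference with CAS.
final class CasRetrySketch {

    /** Immutable snapshot holding both values; never mutated after construction. */
    static final class Bounds {
        final int lo;
        final int hi;
        Bounds(int lo, int hi) { this.lo = lo; this.hi = hi; }
    }

    private final AtomicReference<Bounds> bounds =
            new AtomicReference<>(new Bounds(0, 0));

    /** Widens both bounds by delta; readers never observe a half-updated pair. */
    void widen(int delta) {
        while (true) {
            Bounds current = bounds.get();                    // optimistic read
            Bounds next = new Bounds(current.lo - delta, current.hi + delta);
            if (bounds.compareAndSet(current, next)) return;  // publish if unchanged
            // A failed CAS wrote nothing, so there is nothing to undo:
            // discard 'next', reread the current snapshot, and try again.
        }
    }

    Bounds snapshot() { return bounds.get(); }
}
```

This still CASes a single location (the reference); it trades an allocation per update for the ability to treat several fields as one value.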

Question 4

What is double-CAS (DCAS)?

Accepted Answer

Double-CAS (DCAS) atomically updates two independent memory locations, conditional on both holding their expected values. Greenwald and Cheriton advocated it in 1996 because some lock-free structures (e.g., deques and doubly linked lists) become much simpler when two pointers can be swung at once. The Motorola 68020 family provided it as the CAS2 instruction, but no current mainstream architecture implements arbitrary DCAS, since it would require a multi-cache-line atomic, which is hardware-expensive. x86 offers CMPXCHG16B, which compares and swaps 16 contiguous, aligned bytes and is useful for tagged pointers; that is double-width CAS on a single location, not arbitrary DCAS. Software DCAS built on locks, transactional memory, or multi-word constructions over single-word CAS exists, but it is slower than two independent single CAS operations.

Question 5

What does CAS cost under contention versus uncontended?

Accepted Answer

An uncontended CAS on a cache line that the executing core already holds in Modified state costs roughly 10-30 cycles on modern x86. Under contention, the cache line ping-pongs between cores: each CAS must first acquire the line exclusively in the executing core's L1 cache, invalidating the copies held by other cores. With 8 contending cores, a single CAS can cost hundreds of cycles, and the loop may retry dozens of times. This is why under heavy contention a well-tuned mutex (which spins briefly then sleeps) can outperform a naive CAS loop. Hot atomic counters are a classic anti-pattern in scalability.
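For the hot-counter case specifically, one option in Java is the JDK's LongAdder, which spreads updates across several internal cells so that threads contend on different cache lines, at the cost of more memory and a sum that is computed on read. The sketch below puts a naive CAS-loop counter next to a LongAdder; the class and method names are assumptions, and it checks only correctness, not speed (timings depend heavily on hardware and would need a proper benchmark harness).

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Sketch: two ways to count from many threads; names are illustrative.
final class HotCounterSketch {

    /** Every thread CASes the same word, so the cache line ping-pongs under load. */
    static long countWithCasLoop(int threads, int perThread) {
        AtomicLong counter = new AtomicLong();
        runAll(threads, () -> {
            for (int i = 0; i < perThread; i++) {
                long current;
                do {
                    current = counter.get();
                } while (!counter.compareAndSet(current, current + 1)); // retry on failure
            }
        });
        return counter.get();
    }

    /** LongAdder stripes updates across cells and sums them when read. */
    static long countWithLongAdder(int threads, int perThread) {
        LongAdder adder = new LongAdder();
        runAll(threads, () -> {
            for (int i = 0; i < perThread; i++) adder.increment();
        });
        return adder.sum(); // all workers have been joined, so the sum is exact here
    }

    private static void runAll(int threads, Runnable body) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(body);
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }
}
```

Both methods return threads * perThread; the difference shows up in throughput as the thread count grows, which is where striping tends to help.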

Question 6

How does Java's AtomicReference.compareAndSet map to hardware?

Accepted Answer

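Java's AtomicReference.compareAndSet(expected, newRef) is treated as an intrinsic by the JIT compiler. On x86 it compiles down to a single CMPXCHG with the LOCK prefix; on ARMv8.1 and later with the LSE atomics it can compile to a CASAL; on ARMv8.0 it becomes an LDAXR/STLXR retry loop (LDREX/STREX on 32-bit ARMv7). The JIT inlines the operation, so there's no method call overhead beyond the atomic instruction itself, plus any reference barriers the garbage collector in use requires. Java's AtomicLong and AtomicInteger are similarly thin wrappers. The Unsafe class (and, since Java 9, VarHandle) exposes raw CAS for advanced use. compareAndSet has volatile read and write memory semantics; on x86 the LOCK-prefixed instruction already acts as a full barrier, so a CAS on a hot field needs no separate fence. Despite the common "bus lock" wording, modern x86 processors normally lock only the cache line when the operand is aligned and cacheable.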
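As an illustration of the VarHandle route, here is a minimal sketch that performs CAS directly on a plain volatile int field, without wrapping it in an AtomicInteger. It assumes Java 9 or later; the class VarHandleCasSketch, the state field, and the tryAcquire/release methods are hypothetical names for a simple try-lock flag, not a JDK API.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Sketch: CAS on an ordinary field through a VarHandle (Java 9+).
final class VarHandleCasSketch {

    private volatile int state;   // 0 = free, 1 = held (hypothetical encoding)

    private static final VarHandle STATE;
    static {
        try {
            // Look up a handle to the 'state' field of this class.
            STATE = MethodHandles.lookup()
                    .findVarHandle(VarHandleCasSketch.class, "state", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Attempts to flip state from 0 to 1 with a single CAS; no retry loop. */
    boolean tryAcquire() {
        return STATE.compareAndSet(this, 0, 1);
    }

    /** Resets state to 0 with release ordering. */
    void release() {
        STATE.setRelease(this, 0);
    }

    int state() { return state; }
}
```

The handle is stored in a static final field so the JIT can treat it as a constant, which is the usual arrangement when the goal is for the call to reduce to the underlying atomic instruction.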

Compare-and-Swap (CAS)

How CAS works

The CAS loop pattern

Treiber's lock-free stack

The ABA problem

What CAS costs in hardware

Why CAS matters

Common misconceptions

Frequently asked questions