Concurrency

Lock-Free Data Structures

Q: How does the Michael-Scott queue work?

Maglev's predecessor for FIFO queues. Two atomic pointers: head and tail. Enqueue reads tail, attempts CAS to swing tail's next from null to the new node, then CAS swings tail itself. If another thread interleaved, the second CAS fails and the new thread helps complete the partial enqueue before retrying. Dequeue CAS-swings head forward. The clever part is the helping protocol: any thread observing a stale tail must advance it before proceeding. Published in PODC 1996 by Maged Michael and Michael Scott; used in Java's ConcurrentLinkedQueue.

Guarantee global progress with atomic CAS — Treiber stack, Michael-Scott queue, hash maps

A lock-free data structure guarantees that some thread always makes progress — no thread can be blocked indefinitely by another's stalling. It avoids mutexes entirely, relying on atomic compare-and-swap (CAS) loops. Canonical examples: the Treiber stack (1986, IBM), the Michael-Scott FIFO queue (1996), and lock-free hash tables (Java ConcurrentHashMap, Cliff Click 2007). Strictly stronger than wait-free (every thread completes in bounded steps) but weaker than blocking. Lock-free code is hard — must handle ABA, hazard pointers, or epoch-based reclamation for safe memory management.

ProgressLock-free (system-wide)
StrongerWait-free (per-thread)
Memory reclamationHazard pointers, RCU, epoch
Treiber stack1986 IBM
Michael-Scott queue1996
ABA problemTagged pointers, hazard pointers

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

Why lock-free matters

Low-latency trading. Order books and matching engines need predictable latency without scheduler-induced jitter; a blocked thread holding a mutex can stall the entire pipeline for milliseconds.
Kernel data structures. Linux uses RCU for the routing table, dentry cache, and security policy because read traffic dominates and a single writer must not block thousands of readers.
Message queues. Single-producer single-consumer (SPSC) ring buffers run at 100M+ ops/sec on modern CPUs; multi-producer multi-consumer (MPMC) variants like Vyukov's bounded queue hit 30-50M ops/sec.
OS schedulers. Per-CPU run queues with lock-free work stealing (Tokio, Go runtime, Rayon) avoid the cache-line contention of a global lock.
Real-time systems. Priority inversion is impossible with lock-free code — there is no priority to invert. RT-kernels often mandate lock-free or at least priority-inheriting locks.
Database concurrency. InnoDB's adaptive hash index, MongoDB's WiredTiger transaction state, RocksDB's MemTable use lock-free or lock-light structures to scale across cores.

How lock-free actually works

Every lock-free operation reduces to a CAS loop: read the current state, compute the new state, atomically commit only if no one else modified it. Failure means retry. The atomic primitive is compare-and-swap, available as cmpxchg on x86 (one byte to 16 bytes) and via load-linked / store-conditional on ARM and POWER.

The Treiber stack push is the canonical example:

void push(Node* n) {
    do {
        n->next = top.load(memory_order_relaxed);
    } while (!top.compare_exchange_weak(
        n->next, n,
        memory_order_release, memory_order_relaxed));
}

Pop is harder because of ABA: thread A reads top = X, plans to CAS top from X to X.next. Thread B pops X, pops X.next, then pushes X again. A's CAS succeeds (top is still X) but the structure is corrupted because X.next is stale. Solutions: tagged pointers (16-bit counter + 48-bit pointer in one 64-bit word, requires cmpxchg16b), hazard pointers, or epoch-based reclamation.

The memory reclamation problem

The hardest part of lock-free programming is not the algorithm — it's deciding when memory can be safely freed. Three production techniques:

Hazard pointers (Michael, 2004). Each thread publishes pointers it's actively dereferencing. A reclaimer scans these before free(). Memory bound: O(P × K × R) where P = threads, K = max hazards per op, R = reclaim batch size. Per-read overhead: 2-3 atomic stores + memory barrier.
RCU (Linux kernel, McKenney 1998+). Readers don't publish anything — they just disable preemption. Writers wait for a grace period (all CPUs to schedule once) before freeing. Read overhead: zero on x86 (just compiler barriers). Write overhead: milliseconds.
Epoch-based (Fraser 2004). Global epoch counter incremented periodically. Threads pin to current epoch when entering a critical section. Memory freed when no thread is pinned to its origin epoch. Used in Crossbeam (Rust), Folly's UnboundedQueue.

The ABA problem in detail

ABA bit IBM in their original Treiber stack. Suppose threads share a stack with nodes A → B → C. Thread T1 begins pop: reads top = A, reads A.next = B, intends to CAS top from A to B. Before T1 commits, T2 pops A, pops B, then pushes A back (perhaps with new payload but same address — a freelist allocator commonly recycles addresses). Stack is now A → C. T1's CAS sees top = A and succeeds, setting top to B. But B has been freed; the stack is corrupted.

Three production fixes:

Tagged pointers / DWCAS. Pack a 32-64 bit counter into the pointer word; increment on every CAS. The chance of full counter rollover within a critical section is astronomically small.
Hazard pointers. A is hazardous to T1, so T2 cannot recycle A's memory while T1 holds the pointer.
Garbage-collected languages. Java's GC keeps A reachable as long as T1 holds it, so the address won't be recycled. This is one reason Java ConcurrentLinkedQueue is much shorter than its C++ analog.

Real-world benchmarks

Numbers from Folly's MPMCQueue benchmark on a 2024 24-core Xeon:

std::queue + std::mutex (1 producer, 1 consumer): 4.2M ops/sec; 4 producers, 4 consumers: 1.1M ops/sec (contention destroys throughput).
boost::lockfree::queue (Michael-Scott): 12M ops/sec at 1+1; 18M at 4+4.
Folly MPMCQueue (bounded, slot-based): 65M ops/sec at 4+4; scales with cores.
SPSC ring buffer (Vyukov): 200-300M ops/sec.

The pattern: lock-based queues degrade with contention; lock-free queues scale (sub-linearly because of cache-line bouncing on the head/tail pointers).

Common misconceptions

"Lock-free is always faster than locked." Under low contention, a futex-based mutex held for 50 ns is hard to beat. Lock-free's CAS loop pays atomic-instruction cost on every retry, plus memory ordering barriers; if the mutex never sleeps, locked code wins.
"Lock-free has no memory issues." The opposite: memory reclamation is the hardest part. Naive "just use shared_ptr" works in C++ but reference counting is itself a synchronization problem and degrades under contention.
"x86 ordering is enough." Code that compiles cleanly and runs correctly on x86 may fail on ARM because x86 has strong TSO ordering. Always specify memory_order explicitly; test on weakly ordered hardware before shipping.
"std::atomic guarantees lock-free." std::atomic<T> for large T is allowed to use a hidden mutex. Check is_lock_free(); on most platforms 1, 2, 4, 8-byte atomics are lock-free, 16-byte is sometimes (with cmpxchg16b), larger almost never.
"Wait-free is strictly better." Wait-free is theoretically stronger but typically has higher constant factors (helping arrays, fast/slow paths). Lock-free is the sweet spot for most production code.
"Compiler reordering doesn't matter under atomic." Wrong memory_order lets the compiler reorder non-atomic operations across atomic ones. memory_order_relaxed gives almost no ordering guarantees; the algorithm must explicitly synchronize.

When to choose lock-free vs alternatives

Use a mutex when: contention is bursty, critical section is small, you don't need real-time guarantees. This is the default for most application code.
Use a reader-writer lock when: 95%+ of accesses are reads and reads are non-trivial work.
Use lock-free when: profiling shows the lock as a bottleneck, a battle-tested algorithm exists (Michael-Scott, Treiber, Harris linked list), latency matters more than peak throughput.
Use RCU when: kernel-style read-mostly with rare slow writes, e.g. config reload, routing table.
Use transactional memory (Intel TSX, where available) when: critical section touches multiple disjoint locations and you can tolerate retry under abort.

Frequently asked questions

What's the difference between lock-free and wait-free?

Lock-free guarantees that at least one thread makes progress in bounded steps — system-wide forward progress. A specific thread can still be starved indefinitely if it keeps losing CAS races. Wait-free is strictly stronger: every individual thread completes its operation in bounded steps regardless of other threads. Wait-free implementations typically have higher constant factors (helping mechanisms, fast/slow paths). Most production lock-free queues and stacks are lock-free but not wait-free; Herlihy proved wait-free implementations exist for any object but are often impractical.

How does the Michael-Scott queue work?

Two atomic pointers: head and tail. Enqueue reads tail, attempts CAS to swing tail's next from null to the new node, then CAS swings tail itself. If another thread interleaved, the second CAS fails and the new thread helps complete the partial enqueue before retrying. Dequeue CAS-swings head forward. The clever part is the helping protocol: any thread observing a stale tail must advance it before proceeding. Published in PODC 1996 by Maged Michael and Michael Scott; used in Java's ConcurrentLinkedQueue.

What are hazard pointers and why do you need them?

Memory reclamation problem: thread A reads a pointer P from a shared structure, plans to dereference. Before A dereferences, thread B removes the node and frees memory. A reads garbage or segfaults. Hazard pointers (Maged Michael, 2004) solve this: each thread publishes its currently-accessed pointers in per-thread slots. Before freeing a node, a reclaimer scans all hazard pointer slots; if the node is hazardous, defer reclamation. Bounded memory overhead — at most O(threads × max-hazards-per-op) unreclaimed nodes.

How does RCU compare?

Read-Copy-Update (Linux kernel, 1998-onward) is a different memory reclamation discipline. Readers run in critical sections that can't sleep; writers create new copies of data, atomically swap pointers, then wait for all in-flight readers to finish (a grace period) before freeing the old copy. Reader cost: zero — just memory barriers, no atomic ops. Writer cost: synchronize_rcu can take milliseconds. Trade-off: extreme read scaling (used for kernel routing tables, dcache, security policy) but writes are slow. Hazard pointers have lower latency for writes but per-read overhead.

When should you NOT use lock-free?

Several cases. Low contention: a mutex held briefly costs ~25 ns uncontended; lock-free overhead (CAS retry, memory ordering) often costs more. Complex invariants: lock-free is feasible only for operations with well-defined linearization points — multi-word atomic updates require LL/SC, transactional memory, or DCAS. Code complexity: lock-free maintenance is dramatically harder; bugs are timing-dependent and rare. Memory pressure: hazard pointers and RCU defer reclamation, raising peak memory. As a rule: use lock-free only when profiling shows a contention bottleneck and a known good algorithm exists.

What's the role of memory ordering?

x86 has strong memory ordering (TSO) — most reads and writes become naturally ordered. ARM/POWER are weakly ordered: a thread can see another thread's writes in a different order than they were issued. Lock-free code must specify memory_order on every atomic op: relaxed (no ordering), acquire (reads after this see prior writes), release (writes before this are visible to acquirers), seq_cst (total order). Wrong ordering produces correct-looking code that fails on ARM but works on x86. Java uses volatile + happens-before; C++ has std::atomic with explicit memory_order.