Concurrency
Read-Copy-Update (RCU)
The trick that makes a million readers cost nothing — and the writer wait
Read-Copy-Update (RCU) is a synchronization mechanism that lets readers run lock-free while a writer updates a private copy and atomically swaps in a new pointer, deferring the old version's reclamation until a grace period proves every reader has finished.
- Reader cost~1 load, no atomics
- Writer costcopy + grace period
- Grace period (Linux)~ms to tens of ms
- Best whenreads ≫ writes
- InventedMcKenney & Slingwine, 1990s
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How RCU works
Imagine a routing table read millions of times a second by every packet that flows through a server, but updated only when an admin adds a route — maybe once an hour. A reader-writer lock would still make every one of those millions of reads take and release a lock, bouncing a shared cache line between cores and blocking the rare write. Read-Copy-Update throws that overhead away. Readers take nothing. The price moves entirely onto the writer.
The name spells out the algorithm. A writer that wants to change a node does three things:
- Read the current version through the shared pointer.
- Copy it into a fresh, private allocation and modify the copy. The live structure is untouched, so concurrent readers see no change.
- Update the shared pointer with a single atomic store, atomically publishing the new version. From this instant new readers see the copy; readers that already loaded the old pointer keep using the old version.
The hard part is the fourth, unspoken step: reclaim. The writer cannot immediately free() the old version, because some readers may still hold a pointer to it. It must wait until every reader that could have grabbed the old pointer has finished. That waiting interval is the grace period, and it is the whole reason RCU is hard to build and easy to use.
Publish and subscribe — and the memory barrier nobody sees
The atomic pointer swap is more subtle than "store a word." When the writer publishes, it must guarantee that the contents of the new node are visible to other CPUs before the new pointer is. Otherwise a reader on a weakly-ordered CPU (ARM, POWER) could follow the new pointer and read garbage from a not-yet-flushed node.
RCU enforces this with a publish/subscribe pair of barriers:
- Publish —
rcu_assign_pointer(p, v)emits a release-store: all prior writes to the node complete before the pointer becomes visible. - Subscribe —
rcu_dereference(p)emits a dependency-ordered load: the reader is guaranteed to see the node contents that existed when the pointer was published.
This is why you never just write global_ptr = new_node by hand. The compiler and the CPU are both free to reorder a naive store, and that reordering is exactly the bug RCU's macros exist to prevent. On x86 the subscribe load is a plain MOV (the hardware already orders dependent loads); on ARM it lowers to a load-acquire or relies on address dependency. Either way the reader pays no lock and, on most ISAs, no fence.
The grace period — RCU's beating heart
A grace period is the minimum interval after publication such that every pre-existing RCU read-side critical section has completed. The elegant insight from Paul McKenney and John Slingwine (who patented the idea at Sequent in the 1990s, with kernel RCU landing in Linux 2.5 in 2002) is that you do not need to track which readers hold the old pointer. You only need to know when they have all gone away.
Classic kernel RCU exploits a deliberate restriction: a reader inside rcu_read_lock() may not sleep or block. So if every CPU passes through a quiescent state — a context switch, an idle loop, or a return to user space — at least once after the swap, then no CPU can still be inside the old critical section. The grace period ends when the last CPU reports a quiescent state. No per-reader bookkeeping, no shared counter on the read path — that is what makes reads free.
The writer can wait synchronously with synchronize_rcu(), which blocks until the grace period closes, or register an asynchronous callback with call_rcu(node, free_fn) that fires once it is safe to reclaim. Asynchronous reclaim keeps the writer off the critical path entirely.
When to use RCU
- Read-mostly data where reads outnumber writes by 10× or far more — routing tables, the kernel's
dentrycache, SELinux policy, module lists, NMI-safe lookups. - Latency-critical readers that cannot tolerate a lock's cache-line bounce or the priority-inversion risk of blocking.
- Pointer-traversed structures — linked lists, hash tables, trees — where an update can be expressed as "swap one pointer."
- Contexts that can't lock at all, like interrupt or NMI handlers, where RCU's read side is safe because it never blocks.
Avoid RCU when writes are frequent (grace-period overhead dominates), when readers must mutate, or when an update touches many pointers atomically — those want a lock or a transactional structure. RCU also does not arbitrate between concurrent writers; you still need a spinlock or compare-and-swap on the write side.
RCU vs other synchronization primitives
| RCU | Reader-writer lock | Seqlock | Mutex | Reference counting | MVCC | |
|---|---|---|---|---|---|---|
| Reader cost | 1 load, no atomic | atomic lock/unlock | 2 seq reads + retry loop | atomic lock/unlock | atomic inc/dec | snapshot lookup |
| Readers block writers? | No | Yes | No (writers block readers) | Yes | No | No |
| Writer cost | copy + grace period | lock + in-place edit | bump seq + in-place edit | lock + edit | atomic refcount churn | version chain + GC |
| Reader can mutate? | No | No | No | Yes (as writer) | Yes | No |
| Cache-line contention on read | None | High (shared lock word) | Low | High | High (shared counter) | Low |
| Stale reads possible? | Yes (old version) | No | No (retries) | No | No | Yes (snapshot) |
| Reclamation | Deferred (grace period) | Immediate | Immediate | Immediate | When count hits 0 | Background vacuum |
| Real-world use | Linux kernel, liburcu | glibc rwlock, DB latches | Linux jiffies, timekeeping | std::mutex everywhere | shared_ptr, CPython | Postgres, MySQL InnoDB |
The headline trade is symmetry. A reader-writer lock spreads cost evenly; RCU shoves nearly all of it onto the writer. If you have a thousand readers per writer, that asymmetry is exactly what you want. RCU and MVCC are spiritual cousins — both let readers see a consistent older version while a writer prepares a new one — but RCU defers reclamation by waiting out readers, where MVCC garbage-collects versions no transaction can see.
What the numbers actually say
- Read side: zero atomics. In a non-preemptible kernel build,
rcu_read_lock()andrcu_read_unlock()compile to nothing at all — empty barriers the compiler can elide. Compare that to a reader-writer lock, where each read pays a locked atomic instruction costing ~10–30 ns plus a cache-line bounce of ~100+ ns when contended across cores. - The contended-lock blow-up. A single locked read-modify-write on a shared cache line costs roughly 100 ns when the line is owned by another socket — so a hot rwlock read at 10 million reads/sec across cores can burn ~1 second of CPU per wall-clock second purely on coherency traffic. RCU's read path generates zero coherency traffic, so it scales linearly with cores.
- Grace-period latency. A synchronous
synchronize_rcu()typically returns in single-digit to tens of milliseconds — an eternity for a writer, but irrelevant when writes are rare and reclamation is asynchronous viacall_rcu(). - Memory overhead. RCU holds onto each freed object for one grace period, so under a heavy update burst you can have thousands of objects pending reclamation — measurable memory pressure that
rcu_barrier()and callback batching are designed to bound.
JavaScript implementation
JavaScript is single-threaded inside one event-loop turn, so we can't show true CPU-level RCU. But the shape of the algorithm — read via a shared reference, copy-modify-publish, defer reclaim until readers drain — models cleanly across asynchronous "readers." This version tracks in-flight readers and only runs the deferred free once they have all completed a grace period.
class RCUCell {
constructor(value) {
this._ptr = Object.freeze(value); // the published version (immutable)
this._readers = 0; // readers currently in a critical section
this._reclaimQueue = []; // { version, free } awaiting a grace period
}
// SUBSCRIBE: returns the current published version. Caller must call
// readUnlock() when done so the reclaimer knows the version is in use.
readLock() {
this._readers++;
return this._ptr; // a plain reference — no copy, no lock
}
readUnlock() {
this._readers--;
if (this._readers === 0) this._drain();
}
// PUBLISH: install a fresh copy, then schedule the old version's free
// once every reader present at publication time has left.
update(mutator) {
const oldVersion = this._ptr;
const copy = structuredClone(oldVersion);
mutator(copy); // modify the private copy only
this._ptr = Object.freeze(copy); // atomic-ish publish (single assignment)
this._reclaimQueue.push({ version: oldVersion, snapshot: this._readers });
this._drain();
}
// GRACE PERIOD: an entry is reclaimable once no reader from its era remains.
_drain() {
if (this._readers > 0) return; // some reader still in flight — wait
while (this._reclaimQueue.length) {
const { version } = this._reclaimQueue.shift();
// safe to free: every reader that could hold `version` has left
// (in JS, GC does the actual free once we drop the reference)
void version;
}
}
}
// Usage — millions of cheap reads, one rare copy-modify-publish:
const routes = new RCUCell({ '10.0.0.0/8': 'eth0' });
function lookup(prefix) {
const table = routes.readLock(); // grab the snapshot
try { return table[prefix]; } // pure read, no contention
finally { routes.readUnlock(); }
}
routes.update(t => { t['192.168.0.0/16'] = 'eth1'; }); // writer copies + publishes
Two details mirror real RCU. First, readLock() hands back the live reference with no copy — that's the whole point of the cheap read. Second, the old version is only dropped in _drain() after the reader count hits zero, which is JavaScript's stand-in for a grace period.
Python implementation
Python's threading gives us real readers, and the GIL makes a single reference assignment effectively atomic — a convenient match for RCU's publish step. We approximate the grace period with an epoch barrier: a writer waits until every thread that entered before the swap has bumped its epoch counter.
import threading, copy, time
class RCUCell:
def __init__(self, value):
self._ptr = value # published version
self._lock = threading.Lock() # serializes WRITERS only
self._global_epoch = 0
self._reader_epochs = {} # thread_id -> epoch it entered at
# SUBSCRIBE — readers take no lock; a plain read of the reference.
def read_lock(self):
self._reader_epochs[threading.get_ident()] = self._global_epoch
return self._ptr # snapshot — cheap, contention-free
def read_unlock(self):
self._reader_epochs.pop(threading.get_ident(), None)
# PUBLISH + grace period. Writers serialize; readers never block.
def update(self, mutator):
with self._lock: # one writer at a time
new = copy.deepcopy(self._ptr)
mutator(new)
self._ptr = new # atomic publish (GIL-protected store)
self._global_epoch += 1
cutoff = self._global_epoch
# synchronize_rcu(): wait out readers from the previous epoch
while any(e < cutoff for e in list(self._reader_epochs.values())):
time.sleep(0.0005)
# grace period over: the old version is now unreachable, free to GC
# Usage
cell = RCUCell({"count": 0})
def reader():
snap = cell.read_lock()
try:
_ = snap["count"] # read the snapshot
finally:
cell.read_unlock()
cell.update(lambda d: d.__setitem__("count", d["count"] + 1))
Note how the writer's while loop is the explicit grace period: it spins until no reader is still tagged with a pre-swap epoch. Real kernel RCU does the same thing far more efficiently by watching for quiescent states instead of polling per-reader counters.
Variants worth knowing
Classic (non-preemptible) RCU. The original. Read-side critical sections cannot sleep; grace periods end when every CPU passes through a quiescent state. Read lock/unlock compile to nothing. Fastest reads, strictest rules.
Preemptible RCU (PREEMPT_RT / Tree RCU). On preemptible kernels, a reader can be scheduled out mid-critical-section, so RCU tracks per-task nesting counters to detect when a preempted reader resumes and finishes. Slightly heavier reads, usable on low-latency kernels.
SRCU (Sleepable RCU). Permits blocking inside the read-side critical section by giving each SRCU domain its own pair of reader counters. Grace periods are slower and the read side pays a memory barrier, but you can sleep while holding a reference — useful for code that does I/O under RCU protection.
Tasks RCU. A grace period ends only when every task has voluntarily blocked or reached user space. Built for tracing/eBPF trampolines, where you must wait for code that has no explicit read-side markers at all.
Userspace RCU (liburcu). Brings the same primitives to ordinary applications. Flavors include QSBR (quiescent-state-based, fastest, app must announce quiescent states), memory-barrier, and signal-based — trading read-side cost for how much the application must cooperate.
Common bugs and edge cases
- Dereferencing outside a read-side critical section. If you hold a pointer obtained via
rcu_dereference()afterrcu_read_unlock(), a grace period can elapse and the object can be freed under you — a classic use-after-free. - Plain assignment instead of
rcu_assign_pointer(). A naivep = newlets the compiler or a weakly-ordered CPU publish the pointer before the node's fields are visible, so a reader sees uninitialized memory. - Sleeping inside classic
rcu_read_lock(). The grace-period machinery assumes readers are short; a sleeping reader can stall reclamation forever. Use SRCU if you must block. - Forgetting to wait before module unload. If callbacks registered via
call_rcu()point into a module's code, you must callrcu_barrier()before freeing the module — otherwise a pending callback jumps into unmapped memory. - Assuming readers see the latest write. RCU is not linearizable for reads: a reader can keep using the old version for the duration of its critical section. If you need every read to observe the newest value immediately, RCU is the wrong tool.
- Using RCU to coordinate multiple writers. RCU only protects readers from writers, not writers from each other. Concurrent updates still need their own lock or atomic CAS.
Frequently asked questions
How can RCU readers be lock-free and still safe?
A reader only ever dereferences a pointer; it never modifies shared state, so there is nothing for it to lock. The writer never frees the old version while readers might still hold it — it waits out a grace period first. Readers pay no atomic instruction and take no lock, so on most architectures the read side compiles to a plain load.
What is an RCU grace period?
A grace period is an interval after a pointer is swapped during which the system waits until every CPU that was inside an RCU read-side critical section has left at least once. Once that happens, no reader can still hold a reference to the old version, so it is safe to free. In the Linux kernel a typical grace period is a few milliseconds to tens of milliseconds.
What's the difference between RCU and a reader-writer lock?
A reader-writer lock still forces readers to take and release a lock, so they contend on a shared cache line and block writers. RCU readers take nothing — they cost a single load — but writers pay more: they must copy, publish atomically, and wait a grace period before freeing. RCU wins overwhelmingly when reads vastly outnumber writes.
Why does RCU update a copy instead of editing in place?
Editing in place would let a reader observe a half-updated structure. By building a fully-formed new version privately and swapping the pointer with a single atomic store, every reader sees either the entire old version or the entire new one — never a torn intermediate. This is the "copy" and "update" in Read-Copy-Update.
Can RCU protect any data structure?
RCU works best for pointer-linked structures read far more than written — linked lists, hash tables, radix trees, routing tables. It is awkward for structures with many cross-links or where readers must mutate, and it does not by itself coordinate multiple concurrent writers; those still need a lock or a compare-and-swap among themselves.
What happens if a reader sleeps inside an RCU critical section?
In classic kernel RCU, sleeping inside rcu_read_lock() is forbidden — the grace period detection assumes readers are short and non-blocking, so a sleeping reader could stall reclamation indefinitely. SRCU (Sleepable RCU) is a variant that explicitly allows blocking by tracking per-domain reader counters at the cost of slower grace periods.