Concurrency

RCU — Read-Copy-Update

Lock-free readers, copy-on-write writers, grace-period reclaim

RCU lets readers run with zero overhead while writers publish new versions atomically. Old data is freed after every reader has moved on — a grace period away.

  • Reader cost~0 ns (preempt-disable)
  • Reader scalingPerfect — no shared state
  • Writer costAllocation + grace period
  • Grace period~1-10 ms typical
  • Introduced (Linux)2.5.43 (2002)
  • Read:write ratio sweet spot100:1 or higher

Interactive visualization

Press play. Watch readers hold old versions, the writer publish a new one, and the grace period bar fill before reclamation.

Open visualization fullscreen

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How RCU works

Imagine a routing table consulted by every packet your kernel forwards — millions of reads per second, occasional updates when a route changes. A reader-writer lock would let many readers run, but the very act of acquiring the rwlock pings a shared cache line across every CPU. That's the kind of overhead a router cannot afford.

RCU eliminates the shared cache line. Readers do not touch any synchronization state at all. The rules:

  • Readers mark a critical section with rcu_read_lock() / rcu_read_unlock(). Inside, they may dereference RCU-protected pointers and use the data they point to. The pair compiles to nothing more than preempt-disable in classic kernel RCU.
  • Writers never modify data in place. They allocate a copy, modify the copy, and use rcu_assign_pointer() (an atomic store with a write barrier) to atomically swap the new copy in.
  • Reclamation waits for a grace period — a period long enough that any reader holding the old pointer has finished its critical section. Then the old copy is safe to free, via synchronize_rcu() or call_rcu(cb).

The three steps for an update look like:

// 1. Read: build a new version
new = kmalloc(...)
*new = *old
new->field = new_value

// 2. Copy: atomically publish
rcu_assign_pointer(global_ptr, new)

// 3. Update: wait for old readers, then free
synchronize_rcu()         // ← block until grace period passes
kfree(old)

Read-Copy-Update — the name itself describes the sequence. Read the current value, make a copy with the updates, swap in the new pointer.

The grace period — the heart of RCU

The grace period is what makes lock-free reads safe. The invariant: before freeing old data, every CPU must have passed through a quiescent state — a moment when it cannot possibly be holding an RCU read-side lock.

In classic Linux kernel RCU, quiescent states include:

  • A voluntary context switch (the running thread cannot have rcu_read_lock() held — preempt is disabled inside the critical section).
  • The CPU going idle.
  • The CPU running in user mode (kernel critical sections are always entered from kernel mode).

The grace-period detection logic conservatively waits for one quiescent state on every CPU. Typical wall-clock duration: 1-10 ms in busy kernels, much longer if the system is mostly idle (an idle CPU still counts as quiescent on next entry/exit). The cost of detection is amortized across many concurrent grace-period requests — synchronize_rcu calls within the same period share a single wait.

RCU vs reader-writer lock

RCUrwlock
Read lock cost~0 ns (preempt-disable)~25-100 ns (atomic RMW)
Reader scalingPerfect (no shared state)Degrades — cache-line bouncing
Reader blocks writerNoYes — must drain readers
Writer blocks readerNo (readers see old version)Yes — readers must wait
Writer costAllocation + grace period (ms)~100 ns acquire + work
Memory overheadOld + new during graceNone beyond lock state
ConsistencyEventual (transient old reads)Strong (no stale reads)
Best for read:write ratio100:1 or higher10:1 to 50:1

The defining trade-off is consistency. An rwlock guarantees no reader sees stale data — once the writer commits, the next reader reads the new value. RCU permits a window in which different readers see different versions. For most kernel data — routing decisions, dcache lookups, security policies — eventual consistency is fine because the system is already racing against external events; a half-millisecond of staleness is invisible.

When RCU shines

  • Read-mostly data structures. If reads outnumber writes by 100× or more, RCU's read-side speed wins overwhelmingly. Routing tables, address-resolution caches, security policy lookups.
  • Sharing across many cores. rwlock cache-line bouncing dominates at 32+ cores. RCU has no shared cache line on the read path, so reader throughput scales linearly with core count.
  • List traversal that can't tolerate iterator invalidation. RCU lists let writers add/remove nodes without breaking concurrent traversals — readers see a consistent snapshot of the list as it was when they started.
  • Hot paths where every nanosecond counts. Network packet receive, system-call dispatch, page-fault handlers — paths so hot that 25 ns of lock overhead per call is unacceptable.

RCU is the wrong choice for write-heavy workloads, for data structures where memory cost matters (you double allocations during the grace period), or for data that absolutely must not be read in a transient old state.

Pseudo-code: linked list with RCU

// Reader — lock-free traversal
find(key):
    rcu_read_lock()
    p = rcu_dereference(head)
    while p != null:
        if p.key == key:
            value = p.value           // safe — old version pinned
            rcu_read_unlock()
            return value
        p = rcu_dereference(p.next)
    rcu_read_unlock()
    return null

// Writer — copy-on-update
update(key, new_value):
    lock(writer_mutex)                // serialize writers
    p = head
    while p != null and p.key != key:
        prev = p
        p = p.next
    if p == null:
        unlock(writer_mutex)
        return
    new_node = clone(p)
    new_node.value = new_value
    rcu_assign_pointer(prev.next, new_node)
    unlock(writer_mutex)
    synchronize_rcu()                 // wait for readers
    kfree(p)                          // safe — no reader holds it

C: Linux kernel RCU usage

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct config {
    int  timeout_ms;
    int  retry_count;
    char hostname[64];
};

static struct config __rcu *current_config;

// READER — zero-cost in classic RCU
int get_timeout(void) {
    struct config *c;
    int t;
    rcu_read_lock();
    c = rcu_dereference(current_config);
    t = c->timeout_ms;
    rcu_read_unlock();
    return t;
}

// WRITER — copy, modify, publish, reclaim
static DEFINE_MUTEX(config_mutex);

int update_timeout(int new_timeout) {
    struct config *old, *new;
    new = kmalloc(sizeof(*new), GFP_KERNEL);
    if (!new) return -ENOMEM;

    mutex_lock(&config_mutex);
    old = rcu_dereference_protected(current_config,
                                    lockdep_is_held(&config_mutex));
    *new = *old;
    new->timeout_ms = new_timeout;
    rcu_assign_pointer(current_config, new);    // publish
    mutex_unlock(&config_mutex);

    synchronize_rcu();                          // wait for readers
    kfree(old);                                 // safe to free
    return 0;
}

Userspace RCU (liburcu)

#define _LGPL_SOURCE
#include <urcu.h>
#include <urcu/rcuhlist.h>

static struct route_table __rcu *routes;

void *reader_thread(void *arg) {
    rcu_register_thread();
    while (running) {
        rcu_read_lock();
        struct route_table *t = rcu_dereference(routes);
        struct route *r = lookup(t, packet.dst);
        forward(packet, r);
        rcu_read_unlock();
    }
    rcu_unregister_thread();
    return NULL;
}

void update_routes(struct route_table *new_table) {
    struct route_table *old = rcu_xchg_pointer(&routes, new_table);
    synchronize_rcu();
    free_route_table(old);
}

Common pitfalls

  • Dereferencing without rcu_read_lock(). The whole guarantee depends on readers being in critical sections. Outside one, the grace-period mechanism cannot account for you, and the data may be freed under your feet.
  • Sleeping inside an RCU read-side critical section. Classic RCU disables preemption — sleeping there is a kernel bug. SRCU (sleepable RCU) is the variant if you need to sleep while protected.
  • Forgetting rcu_assign_pointer. A plain store doesn't include the write barrier that ensures readers see fully-initialized data. A reader might dereference a half-constructed object.
  • Forgetting rcu_dereference. Without it, compiler or CPU reordering can cause the reader to read object fields before the pointer load completes. rcu_dereference includes a data-dependency barrier (relevant on Alpha; consume-barrier elsewhere).
  • Mixing RCU with traditional locking incorrectly. If writers serialize on a mutex but readers use only RCU, fine. If both readers and writers use RCU, you need extra atomic ops or the writers stomp each other.
  • Treating grace period as fast. synchronize_rcu can take milliseconds. Calling it on a hot path (per-request) destroys throughput. Use call_rcu() to defer reclamation asynchronously.

Performance

RCU's signature performance property is read-side scalability. Some measured numbers:

  • Linux kernel RCU read on a 3 GHz CPU: under 1 ns per rcu_read_lock + dereference + rcu_read_unlock
  • Read throughput on a 64-core server: 64× single-core throughput, perfectly linear (no shared cache lines)
  • Same workload with rwlock: 5-10× single-core throughput at 64 cores, due to cache-line bouncing on the lock state
  • Writer cost: typically 1-10 ms for synchronize_rcu, dominated by waiting for the slowest CPU to context-switch
  • Memory overhead: peak ~2× the size of the protected structure during the grace period

The Linux kernel's dcache (directory entry cache) was the first major user; it converted from a per-bucket spinlock to RCU around 2002 and saw a 30% improvement on lookup-heavy workloads even on small machines. On large NUMA systems the gain is multiples. Modern Linux uses RCU across the network stack, security modules, namespaces, the page-fault handler, and dozens more subsystems.

Frequently asked questions

What does RCU stand for and what does it do?

Read-Copy-Update. It's a synchronization technique where readers run lock-free — no atomic operations, no waiting — and writers make changes by creating a new copy of the data, modifying it, and atomically swapping the pointer. Old copies are freed only after a grace period during which all current readers have finished. The Linux kernel uses RCU for read-mostly data: routing tables, the dcache, security modules, lots more.

What is a grace period?

A grace period is the wait between unpublishing an old data version and freeing it. It must last long enough that every reader that could still hold the old pointer has finished its read-side critical section. In classic RCU, a grace period ends when every CPU has performed a voluntary context switch — proving no reader is mid-critical-section on that CPU. synchronize_rcu() blocks the caller until a grace period elapses; call_rcu() schedules a callback to fire after one.

How fast are RCU readers really?

In classic Linux kernel RCU, rcu_read_lock() and rcu_read_unlock() are zero-cost — they compile to nothing or to disabling preemption (one instruction). A reader is essentially the same speed as accessing unprotected data. On a 1 GHz CPU, the entire read path can be under 1 ns per RCU operation, versus 25-100 ns for an rwlock acquire. Reads scale perfectly across cores because there's no shared state to contend on.

What's the trade-off versus an rwlock?

Writers are slower under RCU (they pay the grace period — typically milliseconds before old memory is freed) and use more memory (old copies linger until reclaim). Readers see eventual consistency — they might read either the old or new version during the changeover. In exchange, readers are dramatically faster and don't block on writers. RCU wins anywhere reads dominate writes by 100× or more, which describes most kernel data structures.

Why is the grace period the hard part?

Because there's no central registry of who's reading. The grace-period mechanism must conservatively prove that no reader could still see the old version. Linux kernel RCU does this by waiting until every CPU has been observed in a quiescent state (a context switch, idle, or user-mode — none of which can happen mid-critical-section because preemption is disabled inside rcu_read_lock). The detection logic is intricate but the readers stay free.

Where is RCU used outside the kernel?

Userspace RCU (liburcu) is a library by Mathieu Desnoyers that brings the technique to user programs. It's used in DPDK packet processing, scaling databases like ScyllaDB, real-time data systems, and high-performance game servers. Outside Linux, similar ideas live in epoch-based reclamation (Crossbeam in Rust) and hazard pointers — different mechanisms, same goal of safe lock-free memory reclamation.

Why can't I just use atomic reference counting?

Atomic refcounting works but every reader contends on the same counter — incrementing and decrementing causes cache-line bouncing across all CPUs. Under heavy read traffic this becomes the bottleneck. RCU avoids the contention by having readers do nothing observable — the grace-period mechanism handles reclamation independently. For 95% read workloads, RCU outperforms refcounting by 10-100× as core count rises.