Concurrency
Read-Write Lock
RWLock — many readers can hold the lock simultaneously; writers exclude all
A read-write lock (RWLock) is a synchronization primitive that allows concurrent shared (read) access by multiple threads while ensuring exclusive (write) access by only one thread. Designed for read-heavy workloads — common in caches, configuration data, and DBMS metadata. Performance trade-off: a regular mutex serializes 100% of accesses; a RWLock with 90% reads can serve those 9× faster (limited by writer starvation, cache-line bouncing on the lock state, and overhead of the read counter increment). Variants: reader-preference, writer-preference, fair (FIFO). Pthread pthread_rwlock_t, Java ReadWriteLock, Go sync.RWMutex.
- Read concurrencyUnlimited
- WriteExclusive
- VariantsRead-pref, write-pref, fair
- Reader cost~50-200 ns
- Writer cost100-500 ns
- Use case80%+ read workloads
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
Why RWLock matters
- Database indexes. A B-tree index is read on every query and modified only on insert/update/delete. With 10:1 read:write ratio (typical OLTP), RWLock cuts contention dramatically vs a plain mutex on the index root.
- Configuration objects. Production services read a config snapshot millions of times per second; reload happens once per minute. RWLock or its read-mostly cousin RCU is the canonical pattern.
- Cache layers. An LRU cache exposes get(), put(), evict(); reads are 10-100× more frequent. RWLock allows many concurrent gets while serializing puts. Higher-throughput caches partition the lock per shard or go fully lock-free.
- Copy-on-write structures. Persistent data structures pair RWLock with structural sharing — readers see a stable snapshot while a writer builds the next version atomically.
- File-system metadata. Directory reads (
readdir, stat) vastly outnumber writes (mkdir, unlink). VFS layer in Linux uses per-inode RWLocks (now mostly RCU after the dcache rewrite). - Network routing tables. Per-packet route lookups are reads; topology updates are rare writes. Most kernels and userspace routing daemons use either RWLock or RCU.
How RWLock is implemented
The naive design is a single 64-bit atomic word with a writer flag in the top bit and reader count in the rest:
// Reader acquire
do {
state = lock.load();
if (state & WRITER_BIT) wait();
} while (!lock.compare_exchange(state, state + 1));
// Writer acquire
do {
state = lock.load();
if (state != 0) wait();
} while (!lock.compare_exchange(0, WRITER_BIT));
Real implementations (glibc, jemalloc, Java's ReentrantReadWriteLock) are far more complex because of:
- Wakeup discipline. When a writer releases, do you wake one writer (FIFO) or wake all readers (throughput)? Glibc uses a futex-based wait queue with explicit policy.
- Recursive locking. Some RWLocks allow a thread holding read to acquire read again (recursive read), or upgrade read-to-write under specific conditions. Adds per-thread state tracking.
- Cancellation. Pthread cancellation, async signals, and process-shared semantics complicate the wait/wake logic.
- Cache-friendliness. The hot reader-count word becomes a contention magnet. Per-CPU sharded RWLocks (Linux's percpu_rwsem) eliminate this for reads at the cost of slow writes.
The starvation problem
Reader-preference policy: while reader-count > 0, new readers may enter; writers wait. Sounds throughput-friendly but produces an unbounded waiting time for writers when readers continuously arrive. Documented production bug: a Cassandra-like store had a stat-update writer waiting 30+ seconds during query bursts. Symptoms: stale stats, eventual cluster-wide thread pool exhaustion as writes piled up.
Three production fixes:
- Writer-preference. Once a writer queues, the lock blocks new readers. Existing readers drain, writer proceeds, then a batch of waiting readers can enter together. Preferred for write-correctness-critical code.
- FIFO fairness. Readers and writers form a single queue in arrival order. Java
new ReentrantReadWriteLock(true). Eliminates starvation in both directions but lowers reader throughput because contiguous readers can't bypass intermediate writers in the queue. - Bounded reader window. Reader count caps at K; the K+1th reader waits even if no writer queued. Bounds the worst-case writer wait to K read durations.
RCU as a RWLock alternative
If reads need to scale across many cores, RWLock's reader-counter is itself a bottleneck — every reader pings the same cache line, costing 100+ ns on a multi-socket machine. RCU sidesteps this entirely: readers do not modify shared state.
Numbers from the Linux kernel folks (paulmck.livejournal.com): on a 64-core machine, naive pthread_rwlock_rdlock peaks around 5M reads/sec aggregated; per-CPU rwsem hits 600M/sec; classic RCU has effectively no upper bound (limited only by per-thread work).
The trade-off: RCU writers must wait a "grace period" — typically 1-10 ms — before freeing the old version. So RCU is for genuinely read-mostly data with copyable semantics; RWLock fits when writes need predictable latency and the data can't easily be copied.
Performance benchmarks
From a 2023 benchmark on 16-core x86 (numbers approximate):
- std::shared_mutex, 100% reads, 16 threads: 18M ops/sec.
- std::shared_mutex, 90/10 mix: 12M ops/sec.
- std::shared_mutex, 50/50 mix: 4M ops/sec.
- std::mutex equivalents: 6M, 6M, 5M ops/sec respectively.
- folly::SharedMutex (writer-priority, reader-fast-path): 80M ops/sec at 100% reads.
Takeaway: high-quality RWLocks beat mutexes from ~80% reads up; bog-standard pthread_rwlock can lose to mutex on shorter critical sections because of implementation overhead.
Common misconceptions
- "Always faster than mutex." Wrong on low-contention or short critical sections. The reader-counter atomic adds ~10-30 ns; an uncontended mutex acquire is ~25 ns. Profile before assuming.
- "Fair by default." Pthread_rwlock is reader-preference on Linux glibc, writer-preference on BSD/macOS, configurable. Java's default ReentrantReadWriteLock is non-fair (better throughput, possible starvation).
- "Read-to-write upgrade is safe." Generally not — two readers attempting upgrade simultaneously deadlock. Use upgrade-mode locks (boost::upgrade_mutex) or release-and-reacquire with re-validation.
- "Reader cost is free." Every reader still does an atomic increment + decrement, costing 10-30 ns on x86 and more on cache-cold cases. Hot-path code measures this.
- "Solves the cache problem." RWLock distributes reads across cores but writes still invalidate the lock's own cache line and the protected data's lines. Per-CPU shards or RCU scale further.
- "More readers always faster." The reader-count cache line bouncing across NUMA nodes can make 32-thread RWLock reads slower than 4-thread. Sharded RWLocks or RCU are needed at scale.
When to pick something other than RWLock
- Mutex when: reads are short or rare; simpler reasoning; less code complexity.
- RCU / hazard pointers when: reads must scale to dozens of cores; writes are infrequent and tolerate millisecond latency.
- Lock-free data structure when: a published algorithm exists for your operation set (queue, hash, list).
- MVCC / snapshot isolation when: readers want a consistent point-in-time view rather than a fresh value, common in databases.
- Copy-on-write when: writes can be infrequent and atomic-pointer-swap suffices.
- Per-CPU data + IPI when: counter-style data, where each CPU maintains its own value and reads aggregate.
Frequently asked questions
When is RWLock better than mutex?
When reads dominate (typically >80% of accesses) and reads do non-trivial work (>1 microsecond per read). The win comes from parallel readers; if 8 threads each read for 10 microseconds, a mutex serializes them at 80 microseconds total while a RWLock completes in ~10 microseconds. But the crossover point is high: short critical sections (<200 ns) lose to a plain mutex because the reader-counter increment is itself a contended atomic. Cache hash tables, configuration objects, route tables, and database catalog metadata are typical RWLock wins.
What is writer starvation?
If readers continually arrive and the lock prefers readers (so a writer waits while reader count > 0), an unbounded reader stream blocks the writer forever. Real production bug: a config-update writer waiting tens of seconds while clients hammered reads. Three fixes: (1) writer-preference policy — once a writer is queued, no new readers may acquire; (2) FIFO fairness — readers and writers queue in arrival order; (3) bounded queue length. Pthread rwlock is reader-preference by default on Linux but writer-preference on macOS/BSD; Java ReentrantReadWriteLock has a fair-mode flag.
How is the reader-counter implemented atomically?
Single 32-64 bit word. Common encoding: top bit = writer flag, remaining bits = reader count. Reader acquire: atomic increment if writer bit clear; on contention, retry CAS loop. Writer acquire: CAS from 0 to writer-bit-set. Release is similarly atomic. The single-word design avoids the chicken-and-egg of needing a sub-lock to protect the counter. Glibc's pthread_rwlock uses a more complex multi-field structure to handle wakeups via futex; the high-frequency path is still a single atomic.
What is an upgradable lock?
Some implementations (boost::shared_mutex, Java ReentrantReadWriteLock with a twist, Postgres) offer an intermediate state: held shared but reservable for upgrade to exclusive. Only one thread at a time can hold the upgrade-mode lock; concurrent readers are still permitted. Avoids the deadlock pattern where two readers each try to upgrade — without upgrade-mode, both wait for the other to release, deadlocking. With upgrade-mode, the first claims it; the second blocks at acquire. Use case: 'check then update' patterns in caches.
How does RCU differ from RWLock?
RCU has zero reader overhead — readers don't touch any shared state, just disable preemption (in kernel) or pin to an epoch (in userspace). Writers create a new version, atomically swap pointer, then wait a grace period before freeing the old. RWLock readers must increment/decrement a counter, paying ~10 ns even uncontended and ~hundreds of ns under contention. RCU's downside: writers can block for milliseconds; the data must support copy-on-write semantics. Linux kernel uses RCU for routing, dentry, security policy; RWLock for filesystem locks and superblock state.
Why is RWLock sometimes slower than mutex on low contention?
Three reasons. First, the reader-counter increment is itself a contested atomic operation; even a single reader pays ~15 ns for the atomic, vs ~25 ns for an uncontended mutex. Second, RWLocks are larger (typically 32-56 bytes vs 24-40 for a mutex) so they're more likely to span cache lines or share lines with other data. Third, the implementation is more complex — multiple branches and a longer code path. The break-even point is somewhere around 70-80% reads with critical sections >500 ns.