Operating Systems
Futex
Fast Userspace muTEX — the syscall you only pay when contended
Futex is the Linux primitive (kernel 2.5.7, 2002) that puts the lock-word in userspace and only enters the kernel when a thread must sleep. The substrate beneath pthread_mutex.
- Uncontended acquire~25 ns (one CAS)
- Contended (syscall)~3-5 µs round-trip
- Added inLinux 2.5.7 (2002)
- Backspthread_mutex, sem_t, cond_var
- EquivalentsmacOS __ulock, Win WaitOnAddress
- Speedup vs SysV semaphore10-100× uncontended
Interactive visualization
Press play. Watch threads CAS the lock word in userspace, and only descend into the kernel when contention forces a sleep.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How a futex works
Before 2002, every mutex acquisition on Linux was a syscall. Even if nobody was contending, the call into sys_semop or sys_msgsnd cost hundreds of nanoseconds — context switch overhead, kernel mutex bouncing, lots of dead time for a lock that was free anyway. Locks were so expensive that "lock contention" usually meant "your locks themselves are slow," not "your threads are actually fighting."
The futex solved this by splitting the implementation in two:
- Userspace fast path. The lock word is a regular 32-bit integer in user memory. Acquire is a compare-and-swap (CAS): atomically set 0 → 1. Release is a store: 1 → 0. Both are single instructions, ~5-25 ns. No syscall.
- Kernel slow path. Only invoked when a thread needs to block. The futex syscall takes an address and a value: "wait at this address only if the value is still X." The kernel queues the thread on a waitqueue keyed by physical-page address. Other threads call FUTEX_WAKE on the same address to wake the queue.
The genius is the atomicity guarantee in FUTEX_WAIT. The waiter first checks the value in userspace, then calls the syscall passing the expected value. The kernel re-checks under its own lock — if the value has changed, the kernel returns immediately without sleeping. This eliminates the lost-wakeup race that a naive userspace-check-then-syscall would have.
Anatomy of a futex-based mutex
// Lock state values:
// 0 = unlocked
// 1 = locked, no waiters
// 2 = locked, possible waiters
lock(m):
// Fast path: try to take the lock without anyone waiting
if CAS(&m, 0, 1) succeeds:
return // ~25 ns total — no syscall
// Slow path: there's contention
while true:
// Mark as "locked with waiters" so unlocker knows to wake us
c = xchg(&m, 2)
if c == 0:
return // grabbed it — done
// Sleep in kernel until someone signals
futex(FUTEX_WAIT, &m, 2)
unlock(m):
// Fast path: no waiters → just clear the lock
if atomic_dec(&m) != 1:
// Was 2 (had waiters) — wake one
m = 0
futex(FUTEX_WAKE, &m, 1)
Note the three states. The "locked with possible waiters" state (2) is what lets unlock skip the FUTEX_WAKE syscall when nobody is sleeping — also a fast path. The unlocker only enters the kernel when it sees evidence that someone parked.
User space vs kernel — when do we cross?
| Operation | Path | Time | Syscall? |
|---|---|---|---|
| Acquire uncontended | Userspace CAS | ~25 ns | No |
| Release uncontended | Userspace store | ~5 ns | No |
| Acquire contended (no waiters) | CAS + xchg + syscall | ~200 ns + context switch | FUTEX_WAIT |
| Release with waiters | store + FUTEX_WAKE syscall | ~200 ns + waker latency | FUTEX_WAKE |
| Wait → wake → reacquire cycle | Full slow path | ~3-5 µs (context switch dominates) | Yes |
| SysV semaphore (pre-futex) | Always kernel | ~1-2 µs even uncontended | Always |
The headline: under contention, futex is no faster than other syscall-based locks. Under no contention, it is 50-100× faster. Real workloads spend the vast majority of acquire-release pairs with no waiter present — so the average cost is dominated by the fast path.
What futex builds
- pthread_mutex — the canonical futex client; glibc rewrote it in 2003.
- pthread_cond — condition variables sleep on a futex address (the wait counter); pthread_cond_signal does a FUTEX_WAKE.
- pthread_barrier — uses a futex on the arrival counter for wakeups.
- pthread_rwlock — uses two futexes (readers and writers).
- sem_t — POSIX semaphores are a thin shim over a futex on the count.
- std::mutex in libstdc++, std::sync::Mutex in Rust — all wrap futex on Linux.
- Java's java.util.concurrent — the HotSpot JVM uses futex for Object.wait/notify and for ReentrantLock park/unpark.
- Go runtime — sync.Mutex and channel waits use futex via runtime/sema.go.
Almost every modern locking primitive on Linux ultimately bottoms out in a futex syscall. The kernel maintains a single hashed table of waitqueues keyed by virtual address (FUTEX_PRIVATE) or physical address (cross-process shared memory). FUTEX_WAKE walks the queue and wakes up to N threads.
When you'd use futex directly
Almost never in application code — pthread_mutex and std::mutex are the right interfaces. You'd touch futex directly for:
- Custom locking schemes with semantics pthreads doesn't expose (e.g., FIFO-fair locks, biased locks).
- Lock-free / wait-free data structures that need to occasionally park a thread without the overhead of a full mutex.
- Language runtimes implementing their own threading primitives.
- Library writers who must avoid the LD_PRELOAD complexity of intercepting pthread calls.
Pseudo-code: contended lock
// Full three-state mutex (Drepper, "Futexes Are Tricky" 2003)
lock(m):
c = cmpxchg(&m, 0, 1)
if c == 0: return // FAST PATH: was unlocked, now locked
if c != 2:
c = xchg(&m, 2)
while c != 0:
futex(&m, FUTEX_WAIT, 2)
c = xchg(&m, 2)
unlock(m):
if atomic_dec(&m) != 1: // was 2 → had waiters
m = 0
futex(&m, FUTEX_WAKE, 1)
C: raw futex syscall
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>
static long futex(_Atomic int *uaddr, int op, int val,
const struct timespec *to) {
return syscall(SYS_futex, uaddr, op, val, to, NULL, 0);
}
// Drepper-style three-state mutex.
typedef _Atomic int mutex_t;
void m_lock(mutex_t *m) {
int expected = 0;
if (atomic_compare_exchange_strong(m, &expected, 1))
return; // fast path
if (expected != 2)
expected = atomic_exchange(m, 2);
while (expected != 0) {
futex(m, FUTEX_WAIT_PRIVATE, 2, NULL); // sleep
expected = atomic_exchange(m, 2);
}
}
void m_unlock(mutex_t *m) {
if (atomic_fetch_sub(m, 1) != 1) { // was 2
atomic_store(m, 0);
futex(m, FUTEX_WAKE_PRIVATE, 1, NULL); // wake one
}
}
Common pitfalls
- Lost wakeups from missing the three-state pattern. A two-state mutex (0 = unlocked, 1 = locked) can lose a wakeup: thread A unlocks (1→0, no FUTEX_WAKE since no flag of waiters), thread B was already in FUTEX_WAIT. B sleeps forever. The "2 = locked with waiters" state fixes this.
- Calling FUTEX_WAIT without checking the value first. If the value already changed before you syscalled, FUTEX_WAIT correctly returns -EAGAIN and you must re-check. Looping is mandatory.
- Mixing FUTEX_PRIVATE and FUTEX_SHARED. Private futexes use the virtual address (faster lookup, single-process). Shared futexes use the physical page (works across processes, slower). Mixing the two operations on the same futex is a silent bug — wakes miss waiters.
- Forgetting to align the futex word. Linux requires 4-byte alignment for the futex word. Misalignment returns EINVAL.
- Spinning before WAIT without bounds. Adaptive locks spin briefly before calling FUTEX_WAIT. Spinning too long wastes CPU; too little misses the latency win. glibc uses ~100 iterations.
- Owner-died races without robust futexes. A thread that crashes while holding a non-robust futex deadlocks all waiters. Use FUTEX_LOCK_PI / OWNER_DIED for kernel-assisted recovery.
Performance: the win
Concrete numbers from a 2024 measurement on an AMD Zen 4 server (3.5 GHz):
- Uncontended pthread_mutex_lock → 25 ns (one CAS + one store, no syscall)
- FUTEX_WAIT syscall round-trip (woken immediately) → 350 ns
- Full block-on-FUTEX-WAIT cycle (sleep then wake) → 3.2 µs (context switch dominates)
- Pre-futex SysV semaphore acquire (always syscall) → 1100 ns even uncontended
The 44× speedup on the uncontended path is exactly why futex was a turning point. Multithreaded applications that previously avoided locks because of overhead can now lock liberally — the only cost is contention, which is a real workload property, not an artifact of the synchronization layer.
The cost of a context switch (1-3 µs) dominates the contended path. Inside that window: TLB flushes, register save/restore, scheduler choosing the next thread, cache misses on the new thread's working set. Five µs is hard to optimize past — but that's why spinlocks exist for very-short critical sections, and why fine-grained locks (one lock per bucket, per row, per cache line) beat coarse-grained ones.
Frequently asked questions
What does futex stand for and when was it added?
Futex stands for Fast Userspace muTEX. It was added to Linux kernel 2.5.7 in 2002 by Hubertus Franke, Matthew Kirkwood, Ingo Molnár, and Rusty Russell, to solve the cost of System V semaphores — which always trapped to the kernel even when nobody was contending. Futex made the uncontended path syscall-free.
Why is the futex fast path so fast?
Because there's no syscall. The lock state lives in a regular 32-bit integer in userspace. Acquire is a compare-and-swap from 0 (unlocked) to 1 (locked); release is a store of 0. Both are single atomic instructions — about 5-25 ns on a modern x86. Only if the CAS fails (someone already holds the lock) does the kernel get involved via the futex syscall.
What are FUTEX_WAIT and FUTEX_WAKE?
FUTEX_WAIT(&addr, expected) atomically checks whether the value at addr equals expected; if so, parks the calling thread in the kernel until someone wakes it. FUTEX_WAKE(&addr, n) wakes up to n threads parked on that address. The atomicity prevents lost wakeups: if the value changes between the userspace check and the kernel queueing, WAIT returns immediately rather than sleeping.
Why does pthread_mutex use futex?
Before futex, every pthread_mutex_lock issued a syscall regardless of contention — costing 100s of nanoseconds even when the lock was free. Linux glibc rewrote pthread_mutex on top of futex starting in 2003. Now the uncontended lock costs ~25 ns (one CAS); only contention pays the syscall cost. This made fine-grained locking practical in real applications.
How much does a contended futex actually cost?
A FUTEX_WAIT syscall round-trip is roughly 100-200 ns each way. Add a context switch when the thread blocks (~1-3 µs in cache pollution costs). FUTEX_WAKE that unblocks one waiter is similar. End-to-end, a contended lock acquire-wait-wake-reacquire cycle is 3-5 µs. Compare to uncontended (25 ns) — a 100-200× slowdown under contention is exactly why people avoid global locks.
Is futex Linux-only?
Yes — futex is a Linux kernel ABI. Other systems have equivalents: macOS / iOS have __ulock (since 10.12) and ulock_wait/wake; Windows has WaitOnAddress / WakeByAddressSingle (since Windows 8). FreeBSD has _umtx_op. The pattern is identical — userspace fast path + kernel waitqueue keyed on an address — but the syscall interfaces differ.
What's a robust futex and why does it exist?
If a thread dies while holding a futex-based lock, the lock stays held forever and any waiters deadlock. A robust futex (FUTEX_LOCK_PI and the FUTEX_OWNER_DIED flag) lets the kernel detect the dead owner and mark the lock as inconsistent. The next acquirer learns the previous holder died, can run recovery code, and re-establish lock invariants. Used by glibc's robust pthread_mutex variant.