Concurrency

Spinlock

Q: When does a spinlock beat a mutex?

A spinlock beats a mutex when the critical section is shorter than the cost of a context switch. On Linux, a context switch costs 1-5 microseconds (saving registers, switching page tables, scheduler invocation, cache pollution). A typical sleeping mutex's slow path involves at least one context switch on contention. If your critical section is under 1 microsecond — for example, updating a per-CPU statistic, or a quick pointer swap — spinning wastes less CPU than sleeping. Rule of thumb in the Linux kernel: spinlocks for sub-microsecond sections, mutexes for everything longer.

Q: What is a ticket lock?

A ticket lock is a fair spinlock — threads acquire in FIFO order. The lock has two counters: next_ticket (incremented on acquire) and now_serving (incremented on release). To acquire, a thread atomically fetches and increments next_ticket — that's its ticket number. It then spins reading now_serving until now_serving equals its ticket. To release, it increments now_serving by 1, waking the next waiter. Used in the Linux kernel from 2008-2014 (replaced by qspinlock for better cache behavior). FIFO fairness prevents starvation but every release invalidates the cache line on every waiter — O(N) cache traffic per handoff.

Q: What is an MCS lock and why does it scale?

MCS lock (Mellor-Crummey, Scott, 1991) is a queue-based spinlock where each waiter spins on its own cache line, not on a global lock variable. Each thread has a Node with a next pointer and a locked flag; on acquire, it atomically appends its node to the lock's tail and spins on its own node's locked flag. On release, the holder hands off by writing 0 to its successor's flag. Cache traffic is O(1) per handoff regardless of how many threads are waiting — each waiter's cache line is touched only twice (once when spinning starts, once when woken). The Linux kernel's qspinlock (since 2014) is a 32-bit MCS variant.

Q: Why does the PAUSE instruction help?

PAUSE (x86) is a hint to the CPU that this is a busy-wait loop. It does three things: (1) reduces speculative reads of the lock variable, preventing memory-order misspeculation when the lock is finally released; (2) saves power by deactivating execution units; (3) yields hyperthread resources to the sibling thread. Without PAUSE, a tight spin loop hammers the lock cache line with speculative loads and may be 10-50x slower at recovery when the lock is released. Modern Intel chips made PAUSE last 100+ cycles (was around 10 in older chips) — Skylake bumped it to 140. ARM has YIELD with similar intent.

Q: Can a spinlock cause priority inversion?

Yes — and worse than mutexes. If a low-priority thread holds a spinlock and a high-priority thread on the same CPU spins on it, the high-priority thread runs forever (busy-waiting), starving the low-priority thread, which can never release. This is unbounded priority inversion. Real-time systems either disable preemption around spinlocks (Linux preemptible kernel does this — spin_lock disables preemption on the holder's CPU), use priority-inheriting mutexes instead, or pin lock holders. Linux's RT-PREEMPT patchset converts most kernel spinlocks to PI-mutexes for predictable latency.

Q: Why does Linux use spinlocks in interrupts?

Interrupt handlers cannot sleep. The kernel runs them in a special context (interrupt context) that has no associated process — there's no thread to put to sleep, and no scheduler can be invoked. So any synchronization needed inside an interrupt handler must be non-blocking: spinlocks are the only option (apart from atomics or RCU). The catch: if process context holds the same spinlock and an interrupt fires on the same CPU while it's held, deadlock. Linux solves this with spin_lock_irqsave: disables interrupts on the local CPU, takes the spinlock, restores interrupts on release.

A lock that loops checking instead of sleeping — useful for short critical sections in kernel code

A spinlock is a mutual-exclusion primitive that, when contended, busy-waits in a tight loop ("spins") rather than putting the thread to sleep. Implemented with an atomic test-and-set or compare-and-swap on a single boolean. On modern multicores, the basic spinlock is augmented with PAUSE/YIELD instructions, exponential backoff, and ticket/MCS variants for fairness. Used in Linux kernel for very short critical sections (<1 µs) where context switch overhead (~1-5 µs) would dominate. Wrong choice for long sections: spinning wastes CPU, and a single owner sleeping while holding it is catastrophic — hence kernel rules forbid sleeping while holding a spinlock.

Acquire (uncontended)~30 cycles
Context switch cost1-5 µs
PAUSE cycles~10-100
Ticket variantFIFO fairness
MCS variantCache-line per waiter
Kernel ruleNever sleep while holding

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The simplest spinlock

A spinlock is a single boolean (or sometimes a single integer). To acquire, atomically swap or CAS — if the lock was 0, you set it to 1 and you own the lock; otherwise, retry until it becomes 0.

// Test-and-set spinlock — the textbook version
typedef atomic_bool spinlock_t;

void spin_lock(spinlock_t *lock) {
  while (atomic_exchange(lock, true)) {
    while (atomic_load(lock)) cpu_relax();  // PAUSE
  }
}

void spin_unlock(spinlock_t *lock) {
  atomic_store(lock, false);
}

The inner loop is the "test-test-and-set" optimization — read with plain loads (cheap, doesn't acquire cache line) until the lock looks free, then attempt the atomic exchange. This avoids cache-line ping-ponging when many threads are waiting; only the load is performed in the hot loop, and the atomic op happens once per acquisition attempt.

The acquire latency uncontended is roughly 30 cycles (one atomic exchange on a cache-resident line). The release is a plain store — single-digit nanoseconds.

Spinlock vs sleeping mutex

The choice between spinlock and mutex hinges on critical-section length:

Sleeping mutex. On contention, the kernel parks the thread (futex_wait on Linux, WaitForSingleObject on Windows). The thread is removed from the runqueue. When the lock releases, the kernel wakes a waiter. Total contended cost: 1-5 microseconds for the context switch round trip plus the kernel call overhead.
Spinlock. On contention, busy-wait. No kernel involvement, no context switch. Contended cost is just the cache-coherence latency to detect the release — typically tens of nanoseconds. But you burn 100% CPU while waiting.

If the critical section is under 1 microsecond, spinning is cheaper than the context-switch cost — so a spinlock wins. If it's longer (say, doing I/O, computing something nontrivial), sleeping is cheaper because you let other work proceed.

The Linux kernel uses both. spinlock_t for sub-microsecond regions (e.g., updating per-CPU stats, modifying small in-memory data structures), mutex or semaphore for longer regions where sleeping is acceptable.

Ticket lock: fairness

The basic test-and-set spinlock is unfair. Whichever thread happens to win the next CAS gets the lock — there's no FIFO order. Under heavy contention, some threads may starve while others repeatedly win.

The ticket lock fixes this. The lock has two counters:

struct ticket_lock {
  atomic_uint32 next_ticket;
  atomic_uint32 now_serving;
};

void ticket_lock(ticket_lock *t) {
  uint32_t my = atomic_fetch_add(&t->next_ticket, 1);
  while (atomic_load(&t->now_serving) != my) cpu_relax();
}

void ticket_unlock(ticket_lock *t) {
  atomic_fetch_add(&t->now_serving, 1);
}

To acquire: atomically increment next_ticket, returning your own number — that's your "ticket." Then spin on now_serving until it equals your ticket. To release: increment now_serving by 1, releasing the next-numbered waiter. Threads acquire in strict FIFO order — no starvation.

The Linux kernel used ticket locks from 2008 to 2014. The downside: every release invalidates the cache line on every waiter, since they're all spinning on now_serving. Cache traffic is O(N) per handoff. At 100+ cores, this becomes the bottleneck.

MCS lock: scalable fairness

The MCS lock (Mellor-Crummey and Scott, 1991) is the queue-based answer. Each waiter spins on its own cache line, not a global one. Cache traffic is O(1) per handoff.

struct mcs_node {
  atomic<mcs_node*> next;
  atomic_bool       locked;
};

struct mcs_lock {
  atomic<mcs_node*> tail;
};

void mcs_lock(mcs_lock *L, mcs_node *me) {
  me->next   = nullptr;
  me->locked = true;
  mcs_node *prev = atomic_exchange(&L->tail, me);
  if (prev) {
    prev->next = me;            // splice me onto chain
    while (me->locked) cpu_relax();
  }
}

void mcs_unlock(mcs_lock *L, mcs_node *me) {
  if (!me->next) {
    if (CAS(&L->tail, me, nullptr)) return;  // we were tail
    while (!me->next) cpu_relax();           // wait for splice
  }
  me->next->locked = false;                  // wake successor
}

To acquire, atomically append your node to the tail of a linked list. If the list was empty, you own the lock immediately. Otherwise, point the previous tail at you and spin on your own node's locked flag. To release, write 0 to your successor's flag — exactly one cache line touched, owned by exactly one waiter.

Linux's qspinlock (queued spinlock) since 2014 is a clever 32-bit MCS variant. It encodes the lock state in 4 bytes — no per-thread node allocation in the common case (low contention) — and falls back to MCS-style queueing under heavy contention. It scales well to hundreds of cores.

PAUSE and YIELD: telling the CPU "I'm spinning"

A naive spin loop has two performance problems on modern superscalar CPUs:

Memory-order misspeculation. The CPU speculatively executes loads of the lock variable, expecting them to remain unchanged. When another core finally writes the lock, the speculation is invalidated and a pipeline flush costs 50+ cycles per recovery.
Hyperthreading starvation. A spinning thread holds execution units that its sibling logical core could use.

x86 PAUSE (also written as rep nop) hints to the CPU "this is a busy-wait loop." It deactivates speculation and yields hyperthread resources. Skylake increased PAUSE latency to ~140 cycles (was ~10 on older chips), making the hint cheaper to ignore but more effective when used.

ARM has YIELD with similar semantics. Both are emitted by cpu_relax() in Linux, std::this_thread::yield() rough equivalent in C++, and std::hint::spin_loop() in Rust.

Why spinlocks matter

Kernel scheduling. Linux uses spinlocks throughout the kernel: scheduler runqueue, page table updates, IRQ handling, per-CPU statistics. Replacing them with mutexes would slow the kernel by orders of magnitude.
Real-time systems. RTOS kernels need predictable, bounded acquire times. Sleeping mutexes have unbounded wakeup latency. Priority-inheriting spinlocks are the standard tool.
Low-latency networking. DPDK, kernel-bypass NICs, and HFT systems use spinlocks because every microsecond of context-switch overhead matters.
Userspace adaptive locks. glibc's pthread_mutex starts by spinning a short while, then falls back to a futex sleep — combining both strategies.
Database engines. Page latches, buffer-pool latches, and transaction-table latches are typically spinlocks because they're held for sub-microsecond windows.
Lock-free fallback. Many lock-free data structures use a spinlock-protected "stop the world" path for rare slow-path operations (e.g., resizing a hash table).

Common misconceptions

Always faster than mutex. Only for short critical sections. For longer sections (microseconds and up), spinning wastes CPU and a sleeping mutex is faster.
Fair by default. The basic test-and-set spinlock is unfair — threads acquire in arbitrary order, and starvation is possible. Need ticket or MCS variants for fairness.
Spinlocks scale infinitely. The basic spinlock collapses under heavy contention because every waiter hammers the same cache line. MCS solves this; basic test-and-set does not.
OK to sleep while holding. A thread that sleeps while holding a spinlock can deadlock the system: other waiters spin, the holder doesn't run, no progress. Linux kernel rule: no sleeping while holding a spinlock — checked by lockdep.
OK in user code. Userspace spinlocks are usually wrong: the OS may preempt the holder for an arbitrary time, leaving waiters spinning unproductively. Use a futex-based mutex or std::mutex unless you have specific reason and can disable preemption.
PAUSE is a no-op. Old PAUSE was ~10 cycles. Modern Skylake+ PAUSE is 100+ cycles — a real CPU yield. Tight spin loops without PAUSE are 10-50× worse on lock release recovery.

Frequently asked questions

When does a spinlock beat a mutex?

A spinlock beats a mutex when the critical section is shorter than the cost of a context switch. On Linux, a context switch costs 1-5 microseconds (saving registers, switching page tables, scheduler invocation, cache pollution). A typical sleeping mutex's slow path involves at least one context switch on contention. If your critical section is under 1 microsecond — for example, updating a per-CPU statistic, or a quick pointer swap — spinning wastes less CPU than sleeping. Rule of thumb in the Linux kernel: spinlocks for sub-microsecond sections, mutexes for everything longer.

What is a ticket lock?

A ticket lock is a fair spinlock — threads acquire in FIFO order. The lock has two counters: next_ticket (incremented on acquire) and now_serving (incremented on release). To acquire, a thread atomically fetches and increments next_ticket — that's its ticket number. It then spins reading now_serving until now_serving equals its ticket. To release, it increments now_serving by 1, waking the next waiter. Used in the Linux kernel from 2008-2014 (replaced by qspinlock for better cache behavior). FIFO fairness prevents starvation but every release invalidates the cache line on every waiter — O(N) cache traffic per handoff.

What is an MCS lock and why does it scale?

MCS lock (Mellor-Crummey, Scott, 1991) is a queue-based spinlock where each waiter spins on its own cache line, not on a global lock variable. Each thread has a Node with a next pointer and a locked flag; on acquire, it atomically appends its node to the lock's tail and spins on its own node's locked flag. On release, the holder hands off by writing 0 to its successor's flag. Cache traffic is O(1) per handoff regardless of how many threads are waiting — each waiter's cache line is touched only twice (once when spinning starts, once when woken). The Linux kernel's qspinlock (since 2014) is a 32-bit MCS variant.

Why does the PAUSE instruction help?

PAUSE (x86) is a hint to the CPU that this is a busy-wait loop. It does three things: (1) reduces speculative reads of the lock variable, preventing memory-order misspeculation when the lock is finally released; (2) saves power by deactivating execution units; (3) yields hyperthread resources to the sibling thread. Without PAUSE, a tight spin loop hammers the lock cache line with speculative loads and may be 10-50x slower at recovery when the lock is released. Modern Intel chips made PAUSE last 100+ cycles (was around 10 in older chips) — Skylake bumped it to 140. ARM has YIELD with similar intent.

Can a spinlock cause priority inversion?

Yes — and worse than mutexes. If a low-priority thread holds a spinlock and a high-priority thread on the same CPU spins on it, the high-priority thread runs forever (busy-waiting), starving the low-priority thread, which can never release. This is unbounded priority inversion. Real-time systems either disable preemption around spinlocks (Linux preemptible kernel does this — spin_lock disables preemption on the holder's CPU), use priority-inheriting mutexes instead, or pin lock holders. Linux's RT-PREEMPT patchset converts most kernel spinlocks to PI-mutexes for predictable latency.

Why does Linux use spinlocks in interrupts?

Interrupt handlers cannot sleep. The kernel runs them in a special context (interrupt context) that has no associated process — there's no thread to put to sleep, and no scheduler can be invoked. So any synchronization needed inside an interrupt handler must be non-blocking: spinlocks are the only option (apart from atomics or RCU). The catch: if process context holds the same spinlock and an interrupt fires on the same CPU while it's held, deadlock. Linux solves this with spin_lock_irqsave: disables interrupts on the local CPU, takes the spinlock, restores interrupts on release.