Concurrency

Spinlock

A lock that loops checking instead of sleeping — useful for short critical sections in kernel code

A spinlock is a mutual-exclusion primitive that, when contended, busy-waits in a tight loop ("spins") rather than putting the thread to sleep. Implemented with an atomic test-and-set or compare-and-swap on a single boolean. On modern multicores, the basic spinlock is augmented with PAUSE/YIELD instructions, exponential backoff, and ticket/MCS variants for fairness. Used in Linux kernel for very short critical sections (<1 µs) where context switch overhead (~1-5 µs) would dominate. Wrong choice for long sections: spinning wastes CPU, and a single owner sleeping while holding it is catastrophic — hence kernel rules forbid sleeping while holding a spinlock.

  • Acquire (uncontended)~30 cycles
  • Context switch cost1-5 µs
  • PAUSE cycles~10-100
  • Ticket variantFIFO fairness
  • MCS variantCache-line per waiter
  • Kernel ruleNever sleep while holding

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The simplest spinlock

A spinlock is a single boolean (or sometimes a single integer). To acquire, atomically swap or CAS — if the lock was 0, you set it to 1 and you own the lock; otherwise, retry until it becomes 0.

// Test-and-set spinlock — the textbook version
typedef atomic_bool spinlock_t;

void spin_lock(spinlock_t *lock) {
  while (atomic_exchange(lock, true)) {
    while (atomic_load(lock)) cpu_relax();  // PAUSE
  }
}

void spin_unlock(spinlock_t *lock) {
  atomic_store(lock, false);
}

The inner loop is the "test-test-and-set" optimization — read with plain loads (cheap, doesn't acquire cache line) until the lock looks free, then attempt the atomic exchange. This avoids cache-line ping-ponging when many threads are waiting; only the load is performed in the hot loop, and the atomic op happens once per acquisition attempt.

The acquire latency uncontended is roughly 30 cycles (one atomic exchange on a cache-resident line). The release is a plain store — single-digit nanoseconds.

Spinlock vs sleeping mutex

The choice between spinlock and mutex hinges on critical-section length:

  • Sleeping mutex. On contention, the kernel parks the thread (futex_wait on Linux, WaitForSingleObject on Windows). The thread is removed from the runqueue. When the lock releases, the kernel wakes a waiter. Total contended cost: 1-5 microseconds for the context switch round trip plus the kernel call overhead.
  • Spinlock. On contention, busy-wait. No kernel involvement, no context switch. Contended cost is just the cache-coherence latency to detect the release — typically tens of nanoseconds. But you burn 100% CPU while waiting.

If the critical section is under 1 microsecond, spinning is cheaper than the context-switch cost — so a spinlock wins. If it's longer (say, doing I/O, computing something nontrivial), sleeping is cheaper because you let other work proceed.

The Linux kernel uses both. spinlock_t for sub-microsecond regions (e.g., updating per-CPU stats, modifying small in-memory data structures), mutex or semaphore for longer regions where sleeping is acceptable.

Ticket lock: fairness

The basic test-and-set spinlock is unfair. Whichever thread happens to win the next CAS gets the lock — there's no FIFO order. Under heavy contention, some threads may starve while others repeatedly win.

The ticket lock fixes this. The lock has two counters:

struct ticket_lock {
  atomic_uint32 next_ticket;
  atomic_uint32 now_serving;
};

void ticket_lock(ticket_lock *t) {
  uint32_t my = atomic_fetch_add(&t->next_ticket, 1);
  while (atomic_load(&t->now_serving) != my) cpu_relax();
}

void ticket_unlock(ticket_lock *t) {
  atomic_fetch_add(&t->now_serving, 1);
}

To acquire: atomically increment next_ticket, returning your own number — that's your "ticket." Then spin on now_serving until it equals your ticket. To release: increment now_serving by 1, releasing the next-numbered waiter. Threads acquire in strict FIFO order — no starvation.

The Linux kernel used ticket locks from 2008 to 2014. The downside: every release invalidates the cache line on every waiter, since they're all spinning on now_serving. Cache traffic is O(N) per handoff. At 100+ cores, this becomes the bottleneck.

MCS lock: scalable fairness

The MCS lock (Mellor-Crummey and Scott, 1991) is the queue-based answer. Each waiter spins on its own cache line, not a global one. Cache traffic is O(1) per handoff.

struct mcs_node {
  atomic<mcs_node*> next;
  atomic_bool       locked;
};

struct mcs_lock {
  atomic<mcs_node*> tail;
};

void mcs_lock(mcs_lock *L, mcs_node *me) {
  me->next   = nullptr;
  me->locked = true;
  mcs_node *prev = atomic_exchange(&L->tail, me);
  if (prev) {
    prev->next = me;            // splice me onto chain
    while (me->locked) cpu_relax();
  }
}

void mcs_unlock(mcs_lock *L, mcs_node *me) {
  if (!me->next) {
    if (CAS(&L->tail, me, nullptr)) return;  // we were tail
    while (!me->next) cpu_relax();           // wait for splice
  }
  me->next->locked = false;                  // wake successor
}

To acquire, atomically append your node to the tail of a linked list. If the list was empty, you own the lock immediately. Otherwise, point the previous tail at you and spin on your own node's locked flag. To release, write 0 to your successor's flag — exactly one cache line touched, owned by exactly one waiter.

Linux's qspinlock (queued spinlock) since 2014 is a clever 32-bit MCS variant. It encodes the lock state in 4 bytes — no per-thread node allocation in the common case (low contention) — and falls back to MCS-style queueing under heavy contention. It scales well to hundreds of cores.

PAUSE and YIELD: telling the CPU "I'm spinning"

A naive spin loop has two performance problems on modern superscalar CPUs:

  • Memory-order misspeculation. The CPU speculatively executes loads of the lock variable, expecting them to remain unchanged. When another core finally writes the lock, the speculation is invalidated and a pipeline flush costs 50+ cycles per recovery.
  • Hyperthreading starvation. A spinning thread holds execution units that its sibling logical core could use.

x86 PAUSE (also written as rep nop) hints to the CPU "this is a busy-wait loop." It deactivates speculation and yields hyperthread resources. Skylake increased PAUSE latency to ~140 cycles (was ~10 on older chips), making the hint cheaper to ignore but more effective when used.

ARM has YIELD with similar semantics. Both are emitted by cpu_relax() in Linux, std::this_thread::yield() rough equivalent in C++, and std::hint::spin_loop() in Rust.

Why spinlocks matter

  • Kernel scheduling. Linux uses spinlocks throughout the kernel: scheduler runqueue, page table updates, IRQ handling, per-CPU statistics. Replacing them with mutexes would slow the kernel by orders of magnitude.
  • Real-time systems. RTOS kernels need predictable, bounded acquire times. Sleeping mutexes have unbounded wakeup latency. Priority-inheriting spinlocks are the standard tool.
  • Low-latency networking. DPDK, kernel-bypass NICs, and HFT systems use spinlocks because every microsecond of context-switch overhead matters.
  • Userspace adaptive locks. glibc's pthread_mutex starts by spinning a short while, then falls back to a futex sleep — combining both strategies.
  • Database engines. Page latches, buffer-pool latches, and transaction-table latches are typically spinlocks because they're held for sub-microsecond windows.
  • Lock-free fallback. Many lock-free data structures use a spinlock-protected "stop the world" path for rare slow-path operations (e.g., resizing a hash table).

Common misconceptions

  • Always faster than mutex. Only for short critical sections. For longer sections (microseconds and up), spinning wastes CPU and a sleeping mutex is faster.
  • Fair by default. The basic test-and-set spinlock is unfair — threads acquire in arbitrary order, and starvation is possible. Need ticket or MCS variants for fairness.
  • Spinlocks scale infinitely. The basic spinlock collapses under heavy contention because every waiter hammers the same cache line. MCS solves this; basic test-and-set does not.
  • OK to sleep while holding. A thread that sleeps while holding a spinlock can deadlock the system: other waiters spin, the holder doesn't run, no progress. Linux kernel rule: no sleeping while holding a spinlock — checked by lockdep.
  • OK in user code. Userspace spinlocks are usually wrong: the OS may preempt the holder for an arbitrary time, leaving waiters spinning unproductively. Use a futex-based mutex or std::mutex unless you have specific reason and can disable preemption.
  • PAUSE is a no-op. Old PAUSE was ~10 cycles. Modern Skylake+ PAUSE is 100+ cycles — a real CPU yield. Tight spin loops without PAUSE are 10-50× worse on lock release recovery.

Frequently asked questions

When does a spinlock beat a mutex?

A spinlock beats a mutex when the critical section is shorter than the cost of a context switch. On Linux, a context switch costs 1-5 microseconds (saving registers, switching page tables, scheduler invocation, cache pollution). A typical sleeping mutex's slow path involves at least one context switch on contention. If your critical section is under 1 microsecond — for example, updating a per-CPU statistic, or a quick pointer swap — spinning wastes less CPU than sleeping. Rule of thumb in the Linux kernel: spinlocks for sub-microsecond sections, mutexes for everything longer.

What is a ticket lock?

A ticket lock is a fair spinlock — threads acquire in FIFO order. The lock has two counters: next_ticket (incremented on acquire) and now_serving (incremented on release). To acquire, a thread atomically fetches and increments next_ticket — that's its ticket number. It then spins reading now_serving until now_serving equals its ticket. To release, it increments now_serving by 1, waking the next waiter. Used in the Linux kernel from 2008-2014 (replaced by qspinlock for better cache behavior). FIFO fairness prevents starvation but every release invalidates the cache line on every waiter — O(N) cache traffic per handoff.

What is an MCS lock and why does it scale?

MCS lock (Mellor-Crummey, Scott, 1991) is a queue-based spinlock where each waiter spins on its own cache line, not on a global lock variable. Each thread has a Node with a next pointer and a locked flag; on acquire, it atomically appends its node to the lock's tail and spins on its own node's locked flag. On release, the holder hands off by writing 0 to its successor's flag. Cache traffic is O(1) per handoff regardless of how many threads are waiting — each waiter's cache line is touched only twice (once when spinning starts, once when woken). The Linux kernel's qspinlock (since 2014) is a 32-bit MCS variant.

Why does the PAUSE instruction help?

PAUSE (x86) is a hint to the CPU that this is a busy-wait loop. It does three things: (1) reduces speculative reads of the lock variable, preventing memory-order misspeculation when the lock is finally released; (2) saves power by deactivating execution units; (3) yields hyperthread resources to the sibling thread. Without PAUSE, a tight spin loop hammers the lock cache line with speculative loads and may be 10-50x slower at recovery when the lock is released. Modern Intel chips made PAUSE last 100+ cycles (was around 10 in older chips) — Skylake bumped it to 140. ARM has YIELD with similar intent.

Can a spinlock cause priority inversion?

Yes — and worse than mutexes. If a low-priority thread holds a spinlock and a high-priority thread on the same CPU spins on it, the high-priority thread runs forever (busy-waiting), starving the low-priority thread, which can never release. This is unbounded priority inversion. Real-time systems either disable preemption around spinlocks (Linux preemptible kernel does this — spin_lock disables preemption on the holder's CPU), use priority-inheriting mutexes instead, or pin lock holders. Linux's RT-PREEMPT patchset converts most kernel spinlocks to PI-mutexes for predictable latency.

Why does Linux use spinlocks in interrupts?

Interrupt handlers cannot sleep. The kernel runs them in a special context (interrupt context) that has no associated process — there's no thread to put to sleep, and no scheduler can be invoked. So any synchronization needed inside an interrupt handler must be non-blocking: spinlocks are the only option (apart from atomics or RCU). The catch: if process context holds the same spinlock and an interrupt fires on the same CPU while it's held, deadlock. Linux solves this with spin_lock_irqsave: disables interrupts on the local CPU, takes the spinlock, restores interrupts on release.