Data Structures

Ring Buffer

Fixed-size circular queue, lock-free at 100M ops/s

A ring buffer is a fixed-size circular array with head and tail pointers that wrap modulo capacity. Producers write at head, consumers read at tail. Lock-free SPSC variants drive audio, kernel logs, and io_uring.

  • Enqueue / dequeueO(1) worst case
  • SpaceFixed at construction
  • Allocations per opZero
  • SPSC throughput (typical)~100M ops/s per pair
  • Capacity conventionPower of two (bitmask wrap)
  • Memory layoutOne contiguous array

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How a ring buffer works

Picture a circular conveyor belt with eight bins. A worker on one side drops widgets into the next empty bin — that's the producer at the head. A worker on the other side picks widgets out of the next non-empty bin — that's the consumer at the tail. Both walk around the circle. When the producer's hand passes where the consumer's hand is, the belt is full; when the consumer catches up to the producer, it's empty.

In code, the belt is just an array, and the two hands are two indices that increment modulo the array length:

data:     [ _, _, _, _, _, _, _, _ ]   capacity = 8
head = 0  ↑                              (next write position)
tail = 0  ↑                              (next read position)

push(A): data[head] = A; head = (head + 1) % 8
pop():   x = data[tail]; tail = (tail + 1) % 8; return x

That's the entire structure. Every push and every pop is two memory accesses and one increment — no allocation, no resize, no copying. The cost: the buffer's size is fixed. If you push faster than you pop, you either overwrite old data (a "lossy" ring) or refuse the push (a "blocking" ring). Both modes are useful for different problems.

The empty-vs-full ambiguity

The naive scheme above has a subtle bug: when head == tail, you can't tell empty from full. After eight pushes with no pops, head wraps around and lands on tail again — but the buffer is now packed, not drained. Three standard fixes:

  1. Sacrifice one slot. Define "full" as (head + 1) % capacity == tail. The buffer effectively holds capacity − 1 items. Simplest fix; the one taught in OS textbooks.
  2. Track count separately. Maintain a third atomic size field. Costs an extra memory operation per push/pop and a contention point between producer and consumer, so this is the worst choice for high-throughput lock-free designs.
  3. Use unbounded counters. Let head and tail be 64-bit monotonic integers that never wrap. Compute the slot as head & (capacity − 1). Empty: head == tail. Full: head − tail == capacity. Linux io_uring uses this. At 100M ops/s, a 64-bit counter overflows after ~5,800 years.

The lock-free single-producer single-consumer pattern

The killer use case for ring buffers is hand-off between exactly two threads with no kernel intervention. The recipe:

  • Producer thread only writes the head index and only reads the tail.
  • Consumer thread only writes the tail index and only reads the head.
  • Indices live on separate cache lines (padding) so updates don't trigger false sharing.
  • The producer's update to data must happen-before its update to head — release semantics. The consumer's read of head must come before its read of data — acquire semantics.

Because the two threads never touch the same word, there's no CAS retry loop, no contention, no spinning. On a 3 GHz x86 chip, a tuned SPSC ring hits 100 million transfers per second per core pair — about 10 nanoseconds per item, end to end.

Generalizations exist (MPSC, SPMC, MPMC) and they all need atomic CAS on the contested index. Throughput drops accordingly: a well-tuned MPMC ring like Vyukov's bounded MPMC queue does 20–40M ops/s, still vastly faster than mutex-protected designs.

When to use a ring buffer

  • Real-time audio. The audio callback runs on a high-priority thread that must not block, allocate, or take locks. A SPSC ring between the audio thread and the rest of the app is the universal pattern — JACK, PortAudio, CoreAudio all rely on it.
  • Kernel logs. Linux's printk writes to a static ring buffer. New messages overwrite oldest when the buffer fills, but you can read every recent message via dmesg without paying allocation costs in interrupt context.
  • io_uring. Both submission and completion queues are shared-memory ring buffers between userspace and kernel. Submitting an I/O is one write to the SQE slot and one store to the head index. Zero syscalls on the hot path.
  • Network packet rings. AF_PACKET MMAP rings, DPDK rx/tx queues, and every modern NIC driver use ring buffers to hand packets between the NIC and the CPU.
  • Streaming media buffers. Video decoders push decoded frames to a ring; the renderer pops them on vsync. Smooths jitter without blocking the decoder.
  • Bounded producer/consumer pipelines. Anywhere a fast producer might outrun a slow consumer and you want explicit backpressure rather than unbounded memory growth.

Ring buffer vs deque vs unbounded queue

Ring bufferDeque (block list)Mutex-locked queueLinked-list queue
Push / popO(1) worst caseO(1) amortizedO(1) + lockO(1) + allocation
Allocation per opZeroZero amortizedZero or oneOne per push
Bounded sizeYes (fixed)NoNoNo
Cache localityExcellent — one arrayGood — block of arraysGoodPoor — scattered
Lock-free SPSCYes, simpleHardNo, by definitionPossible (Michael-Scott)
Random accessYes (modulo arithmetic)Yes (block index)NoNo
BackpressureBuilt-in (full = reject)NoneNoneNone
Typical SPSC throughput~100M ops/s~30M ops/s~3M ops/s~10M ops/s

Ring buffers win whenever the workload has a known maximum burst size and the cost of allocation or locking matters. They lose when sizes are wildly variable or you genuinely need unbounded growth — there, a chunked deque or an unbounded MPMC queue is the right answer.

Pseudo-code (SPSC, power-of-two capacity)

// Capacity is a power of two; mask = capacity - 1.

push(value):  // producer thread
    h = head.load(relaxed)
    t = tail.load(acquire)
    if h - t == capacity: return FULL
    data[h & mask] = value
    head.store(h + 1, release)
    return OK

pop():  // consumer thread
    t = tail.load(relaxed)
    h = head.load(acquire)
    if h == t: return EMPTY
    v = data[t & mask]
    tail.store(t + 1, release)
    return v

Two memory ordering pairs ensure correctness: producer's release of head pairs with consumer's acquire of head; consumer's release of tail pairs with producer's acquire of tail. Other accesses can be relaxed.

JavaScript implementation

JavaScript doesn't have true threads, but you can use a ring buffer over a SharedArrayBuffer to pass data between a main thread and a Web Worker without serialization. The implementation below shows the single-threaded API; the SAB version replaces the indices with Atomics.

class RingBuffer {
  constructor(capacity) {
    // Round up to power of two for fast modulo.
    let n = 1; while (n < capacity) n <<= 1;
    this.capacity = n;
    this.mask = n - 1;
    this.data = new Array(n);
    this.head = 0;       // next write
    this.tail = 0;       // next read
  }

  push(value) {
    if (this.head - this.tail === this.capacity) return false;  // full
    this.data[this.head & this.mask] = value;
    this.head++;
    return true;
  }

  pop() {
    if (this.head === this.tail) return undefined;              // empty
    const v = this.data[this.tail & this.mask];
    this.data[this.tail & this.mask] = undefined;               // help GC
    this.tail++;
    return v;
  }

  get size()  { return this.head - this.tail; }
  get isEmpty() { return this.head === this.tail; }
  get isFull()  { return this.size === this.capacity; }
}

// SharedArrayBuffer SPSC variant — producer in main thread, consumer in worker.
class SABRing {
  constructor(sab, capacity) {
    this.mask = capacity - 1;
    this.idx  = new Int32Array(sab, 0, 2);     // [head, tail]
    this.data = new Float64Array(sab, 16);     // payload slots
  }
  push(v) {
    const h = Atomics.load(this.idx, 0);
    const t = Atomics.load(this.idx, 1);
    if (h - t === this.mask + 1) return false;
    this.data[h & this.mask] = v;
    Atomics.store(this.idx, 0, h + 1);  // release on x86; tighter fences on ARM
    return true;
  }
}

Python implementation

from collections import deque
from threading import Lock

class RingBuffer:
    """Single-threaded ring. For multi-thread, wrap with Lock or use queue.Queue."""

    def __init__(self, capacity):
        n = 1
        while n < capacity:
            n <<= 1
        self.capacity = n
        self.mask = n - 1
        self.data = [None] * n
        self.head = 0
        self.tail = 0

    def push(self, value):
        if self.head - self.tail == self.capacity:
            return False
        self.data[self.head & self.mask] = value
        self.head += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None
        v = self.data[self.tail & self.mask]
        self.data[self.tail & self.mask] = None
        self.tail += 1
        return v

    def __len__(self):
        return self.head - self.tail


# For lock-free SPSC across threads in CPython, the GIL handles atomicity of
# integer assignment. Use collections.deque(maxlen=N) for a built-in lossy ring:
ring = deque(maxlen=1024)
ring.append(1)  # auto-evicts oldest when full

For production multi-threaded Python work, prefer queue.Queue (mutex-protected) or multiprocessing.Queue (pipe-based). The lock-free design above only matters in compiled languages without a GIL.

Variants

  • Lossy ring (overwrite-on-full). Producer never blocks; new pushes overwrite the oldest unread slot. Used by Linux printk and Prometheus's WAL.
  • Bounded ring (reject-on-full). Push returns false when full. Provides built-in backpressure. Default in audio APIs.
  • Disruptor. LMAX's high-throughput ring with separate cursor-tracking sequences and pre-allocated slots reused in place. Powered LMAX Exchange at 6M+ messages/s on a single thread.
  • Multi-producer multi-consumer (MPMC). Vyukov's bounded queue uses a per-slot sequence number to coordinate writers and readers without a global lock. ~20–40M ops/s in well-tuned implementations.
  • Bipartite buffer (bipbuf). A linear-allocator-style ring that hands out contiguous regions of N bytes instead of single items. Critical for audio, where each callback wants a continuous block.
  • Lockless varint ring (printk). Variable-length records framed with length prefixes; consumer skips by reading the length. Used by Linux's kfifo and printk for log lines of differing sizes.

Performance notes — costed claims

  • SPSC throughput. A tuned SPSC ring with 64-bit indices on cache-line-padded structures reaches ~100 million transfers per second per CPU pair on x86 — confirmed by Disruptor, boost::lockfree::spsc_queue, and Rust's crossbeam::deque benchmarks.
  • Latency. End-to-end producer-to-consumer hand-off is 10–30 ns when the buffer is warm and both cores are on the same socket. Cross-socket pushes the cost to 80–200 ns due to coherence traffic.
  • False sharing. Without 64-byte padding between head and tail, throughput collapses by 5–10× — the two indices keep invalidating each other's cache line. Pad both atomics to a full cache line.
  • MPMC overhead. Replacing SPSC with MPMC drops throughput from ~100M to ~30M ops/s. CAS retries dominate under contention.
  • io_uring saturation. A pair of 4 KB-aligned ring buffers can drive 5 million 4 KiB read IOPS on modern NVMe — limited by the device, not the queue.

Common bugs and edge cases

  • Forgetting release/acquire fences. On x86, store-store ordering is guaranteed by hardware (TSO), so wrong code accidentally works. On ARM or POWER, the consumer reads stale data and silently corrupts state.
  • False sharing. If head and tail occupy the same cache line, every push invalidates the consumer's tail cache and vice versa. The fix is alignas(64) padding around each atomic.
  • Capacity not a power of two. x % 7 becomes a 20-cycle integer division on every push. With power-of-two capacity, x & (n - 1) is one cycle.
  • 32-bit index overflow. A 32-bit counter at 100M ops/s wraps in 43 seconds. Use 64-bit counters; the wrap is no longer a practical concern.
  • Reading partially-written items. Multi-word values pushed without a fence can be torn — consumer sees half the new value and half the old. Always update the head index after the data write, with a release fence.
  • Resize attempts. Ring buffers can't grow safely while concurrent producers and consumers run. If you need dynamic sizing, use a deque or chunked structure, not a ring.

Frequently asked questions

Why use a ring buffer instead of a regular queue?

Three reasons. First: no allocation per push. Memory is pre-reserved at construction, so producers never call malloc inside a real-time audio callback or a kernel interrupt handler. Second: cache locality. The entire buffer lives in one contiguous chunk, often pinned to a single page, so the CPU prefetcher loves it. Third: predictable latency. Every operation is O(1) wall-clock — no resizes, no GC pauses, no worst-case spikes.

How does a lock-free SPSC ring buffer actually work?

The producer owns the head index; the consumer owns the tail index. Neither ever writes the other's index. Each side reads the other side's index atomically (acquire load) and writes its own atomically (release store). Because there is one producer and one consumer, no two threads ever modify the same word, so no lock or CAS loop is needed. Throughput on modern x86 hits 100 million ops/s per pair.

How does the buffer distinguish empty from full?

When head == tail the buffer is empty. The classic trap is that when head wraps around and catches tail again, the buffer is now full but the indices look identical. Two common fixes: (1) keep one slot permanently empty so full means (head + 1) % capacity == tail, sacrificing one slot, or (2) store head and tail as 64-bit monotonic counters that never wrap and only do modulo when computing the array index. Linux io_uring uses the second technique.

Why does io_uring use ring buffers?

io_uring needs to submit and complete I/O without any syscall in the hot path. Submission queue and completion queue are both shared-memory ring buffers between userspace and the kernel. Userspace writes a submission queue entry, increments the producer index, and the kernel polls or sleeps on it. No copy, no syscall on submission once warmed up. This is how io_uring beats epoll for sustained I/O — saturating NVMe at 5M IOPS.

What's the right capacity for a ring buffer?

Always a power of two. That lets you replace 'index % capacity' with 'index & (capacity - 1)' — a single AND instruction instead of a division. For audio: 256–4096 samples typical. For kernel printk: 64 KB to 8 MB (the dmesg buffer). For io_uring: 64–8192 entries, sized to the burstiness of your workload. Bigger isn't always better — too big and the consumer's reads pull cold cache lines.

How does dmesg work?

The kernel reserves a static ring buffer (default 64 KB to 16 MB depending on config) and every printk call appends a line. When the buffer fills, new entries overwrite the oldest — there's no spill to disk during boot, when no filesystem exists yet. The dmesg command just reads this buffer through the syslog syscall. The 'ring' nature is what lets you see boot messages even if init fails.

Why must memory barriers go between data write and index update?

Modern CPUs and compilers reorder writes for performance. If a producer writes the data slot and then updates head, but the reordering lets head land in memory before the data, the consumer can read garbage. A release fence on the producer side and an acquire fence on the consumer side enforce the correct ordering. On x86 these are nearly free (TSO already guarantees most of it); on ARM and POWER you pay a few cycles per fence.