Operating Systems
io_uring
Two ring buffers, zero syscalls per operation, one million IOPS
io_uring (Linux 5.1, 2019) shares submission and completion ring buffers between userspace and kernel — batched, zero-copy I/O at over 1M IOPS per core.
- Introduced: Linux 5.1 (May 2019)
- Peak IOPS (single core): 5M+ with SQPOLL
- vs read() syscall: 5-10× throughput
- vs epoll (sockets): 2-3× throughput
- Syscalls in fast path: 0 (SQPOLL) or 1 batched
- Designed by: Jens Axboe (Meta)
How io_uring works
Traditional Linux I/O issues one syscall per operation: call read() into a buffer, pay the cost of trapping into the kernel, copy the bytes, return. On modern CPUs a syscall round trip costs roughly 100-300 ns. At 300 ns per call a single core tops out near 3.3M calls per second even if it does nothing else, so the kernel boundary itself becomes the bottleneck: your NVMe drive can do 5M IOPS, but you can only ask for about 3M.
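The ceiling quoted above is plain arithmetic, and a few lines make it easy to try other numbers. This is a minimal sketch that assumes the syscall round trip is the only cost on the core; the function name `max_ops_per_second` is illustrative and not part of any API.

```python
def max_ops_per_second(syscall_ns: float, ops_per_syscall: int = 1) -> float:
    """Upper bound on operations per second for one core when the only
    cost counted is the syscall round trip (a deliberate simplification)."""
    syscalls_per_second = 1e9 / syscall_ns
    return syscalls_per_second * ops_per_syscall


if __name__ == "__main__":
    for ns in (100, 300):
        print(f"{ns} ns per syscall: {max_ops_per_second(ns) / 1e6:.1f}M ops/s")
    # Putting 32 operations behind one syscall raises the ceiling 32 times.
    print(f"300 ns, batch of 32: {max_ops_per_second(300, 32) / 1e6:.1f}M ops/s")
```

The second parameter previews why batching matters: the per-call cost is unchanged, but it is divided across every operation in the batch.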
io_uring inverts the model. Instead of a syscall per operation, two ring buffers live in memory mapped between userspace and kernel:
- Submission Queue (SQ): userspace writes here. Each entry, an SQE (Submission Queue Entry), describes one operation: file descriptor, opcode (read, write, accept, ...), offset, buffer, and so on.
- Completion Queue (CQ): the kernel writes here. Each entry, a CQE (Completion Queue Entry), reports the result: the return value plus the original user_data tag, so userspace can correlate it with the request.
Both rings are SPSC (single-producer/single-consumer) circular buffers. The kernel is the consumer of SQ and producer of CQ; userspace is the producer of SQ and consumer of CQ. Synchronization is via memory-mapped head/tail indices with atomic operations — no shared lock.
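The index handling behind such a ring can be modeled in a few lines. The sketch below is a single-threaded illustration, not the kernel's implementation: the class name `SpscRing` is hypothetical, Python integers stand in for the wrapping 32-bit counters, and the acquire/release barriers a real ring needs cannot be expressed here.

```python
class SpscRing:
    """Single-threaded model of one io_uring-style ring.

    head and tail are free-running counters; a slot index is obtained by
    masking with (size - 1), which is why the size must be a power of two.
    """

    def __init__(self, size: int) -> None:
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.slots = [None] * size
        self.mask = size - 1
        self.head = 0  # advanced only by the consumer
        self.tail = 0  # advanced only by the producer

    def push(self, entry) -> bool:
        """Producer side: returns False when the ring is full."""
        if self.tail - self.head == len(self.slots):
            return False
        self.slots[self.tail & self.mask] = entry
        self.tail += 1  # publish; a release store in a real implementation
        return True

    def pop(self):
        """Consumer side: returns None when the ring is empty."""
        if self.head == self.tail:
            return None
        entry = self.slots[self.head & self.mask]
        self.head += 1  # hand the slot back to the producer
        return entry
```

Because each index has exactly one writer, neither side ever needs a lock: the producer only reads head, and the consumer only reads tail.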
The flow for a batch of N operations:
// Userspace:
for (i = 0; i < N; i++):
    sqe = get_next_sqe(sq)
    fill_sqe(sqe, opcode, fd, buf, ...)
    sq.tail++                     // advance tail (with release barrier)

// Kick the kernel (one syscall for the whole batch)
io_uring_enter(ring_fd, to_submit = N, min_complete = 0, flags = 0)

// Kernel processes asynchronously, fills CQEs.
// Userspace harvests results (no syscall):
while cq.head != cq.tail:         // indices are free-running; compare with !=, not >
    cqe = cq[cq.head & cq.mask]   // mask the counter to get the slot
    process(cqe)
    cq.head++                     // hand the slot back to the kernel
One syscall amortized across N operations — and with SQPOLL mode, even that syscall vanishes (the kernel runs a poller thread on the submission ring).
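To make the amortization concrete, here is a toy model that counts simulated kernel crossings. Everything in it is a stand-in: `ToyRing` is a hypothetical class, `enter()` plays the role of io_uring_enter, and the simulated kernel completes each operation immediately, which the real one does not.

```python
from collections import deque


class ToyRing:
    """Toy stand-in for an io_uring instance that counts boundary crossings."""

    def __init__(self) -> None:
        self.sq = deque()   # submission queue: (user_data, operation)
        self.cq = deque()   # completion queue: (user_data, result)
        self.syscalls = 0

    def queue(self, user_data, operation) -> None:
        self.sq.append((user_data, operation))  # a plain memory write

    def enter(self) -> int:
        """One simulated syscall that drains the whole submission queue."""
        self.syscalls += 1
        submitted = len(self.sq)
        while self.sq:
            user_data, operation = self.sq.popleft()
            self.cq.append((user_data, operation()))
        return submitted

    def reap(self):
        """Harvest completions without any simulated syscall."""
        while self.cq:
            yield self.cq.popleft()


def run_batch(n: int) -> tuple:
    """Queue n operations, submit them together, return (syscalls, results)."""
    ring = ToyRing()
    for i in range(n):
        ring.queue(i, lambda i=i: i * 2)  # stand-in for a read or write
    ring.enter()
    return ring.syscalls, list(ring.reap())
```

Calling `run_batch(64)` reports a single simulated syscall for 64 operations; a one-call-per-operation design would have recorded 64.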
io_uring vs epoll vs AIO
| Feature | io_uring | epoll | Linux AIO | read/write |
|---|---|---|---|---|
| Async on regular files | Yes (any FD) | No | O_DIRECT only | No (blocks) |
| Async on sockets | Yes | Readiness-based | No | No |
| Syscalls per op | 0 (SQPOLL) or batched | 1 per ready event | 2 (io_submit + io_getevents) | 1 |
| Submission batching | Native (ring) | No | io_submit list | No |
| Zero-copy buffers | Yes (registered) | No | No | No |
| Linked operations | Yes (chain SQEs) | No | No | No |
| Peak IOPS (single core) | 5M+ | ~1M | ~500K | ~300K |
| Available since | Linux 5.1 (2019) | 2.5.45 (2002) | 2.5 (2002) | Always |
The epoll vs io_uring distinction matters more than it sounds. epoll is readiness-based: you ask "is this socket ready to read?" and then issue read() yourself, which is a second syscall. io_uring is completion-based: you ask "read N bytes from this socket whenever you can" and the result arrives in the CQ. For sockets with high event rates, the completion model wins by eliminating that per-event read() or write() call.
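The readiness half of that comparison can be shown with the Python standard library alone. This sketch uses `selectors.DefaultSelector`, which is backed by epoll on Linux; the helper name `readiness_read` is illustrative. A real server would create and register the selector once rather than per call.

```python
import selectors
import socket


def readiness_read(sock: socket.socket, timeout: float = 1.0) -> bytes:
    """Readiness model: wait until the socket is readable, then read it.

    Two kernel crossings per event: the wait (epoll_wait on Linux) and the
    recv() that actually moves the bytes.
    """
    with selectors.DefaultSelector() as sel:
        sel.register(sock, selectors.EVENT_READ)
        events = sel.select(timeout)   # step 1: "is it ready?"
        if not events:
            raise TimeoutError("socket never became readable")
        return sock.recv(4096)         # step 2: do the I/O yourself


if __name__ == "__main__":
    a, b = socket.socketpair()
    with a, b:
        a.sendall(b"ping")
        print(readiness_read(b))
```

In the completion model, step 2 disappears from userspace: the receive is described once in an SQE and its outcome shows up later as a CQE.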
Anatomy of an SQE and CQE
struct io_uring_sqe {
    __u8  opcode;       // IORING_OP_READ, _WRITE, _ACCEPT, etc.
    __u8  flags;        // SQE flags (e.g. IOSQE_IO_LINK)
    __u16 ioprio;
    __s32 fd;           // file descriptor
    union {
        __u64 off;      // file offset
        __u64 addr2;
    };
    __u64 addr;         // pointer to buffer or struct
    __u32 len;          // byte count
    // ... opcode-specific fields ...
    __u64 user_data;    // returned verbatim in CQE
    // ... and more ...
};

struct io_uring_cqe {
    __u64 user_data;    // matches the SQE's user_data
    __s32 res;          // result (return value or -errno)
    __u32 flags;
};
The user_data field is the correlation token. You set it on the SQE; the kernel returns it unchanged on the CQE. With it you can submit hundreds of operations and identify each completion as it returns out-of-order.
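As an illustration of that correlation, the sketch below mirrors the three CQE fields with `ctypes` (16 bytes in the default layout) and matches completions to requests regardless of arrival order. The names `Cqe` and `correlate`, and the request descriptions, are hypothetical.

```python
import ctypes


class Cqe(ctypes.Structure):
    """Mirror of the three CQE fields shown above."""
    _fields_ = [
        ("user_data", ctypes.c_uint64),
        ("res", ctypes.c_int32),
        ("flags", ctypes.c_uint32),
    ]


def correlate(pending: dict, completions: list) -> dict:
    """Match completions to requests by user_data, in arrival order.

    pending maps user_data to a description of the request; entries are
    removed as their completions arrive.
    """
    results = {}
    for cqe in completions:
        request = pending.pop(cqe.user_data)
        results[request] = cqe.res
    return results


if __name__ == "__main__":
    pending = {1: "read block A", 2: "read block B", 3: "read block C"}
    # Completions arrive in a different order than the submissions.
    arrived = [Cqe(3, 4096, 0), Cqe(1, 4096, 0), Cqe(2, -5, 0)]  # -5 is -EIO
    print(ctypes.sizeof(Cqe), correlate(pending, arrived))
```

In C programs the tag is often a pointer to a per-request struct rather than a small integer, which removes the lookup table entirely.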
When to reach for io_uring
- Storage-heavy applications. Databases, message brokers, log shippers: anything pushing high IOPS on persistent storage. ScyllaDB, Ceph, and PostgreSQL (experimental) have io_uring I/O backends, and Redis has seen experimental io_uring work as well.
- Network servers at extreme scale. HTTP load balancers, proxy servers, RPC frameworks handling 100K+ connections per core. The fewer syscalls per request, the more throughput.
- Async runtimes. Rust's tokio (via tokio-uring), Glommio, Monoio. Even general-purpose Go runtimes are exploring io_uring backends for the network poller.
- Storage benchmarking. fio's --ioengine=io_uring is the standard for measuring NVMe limits, because it is the kernel I/O interface least likely to bottleneck before the device does.
- Anything previously stuck with epoll's limitations. Async file I/O is the headline win — epoll never worked on regular files.
Pseudo-code: batch of reads
ring = io_uring_queue_init(entries = 256, flags = 0)

// Queue N reads (N must not exceed the 256 SQ entries)
for i in 0..N:
    sqe = io_uring_get_sqe(ring)
    io_uring_prep_read(sqe, fd, buf[i], len, offset[i])
    sqe.user_data = i              // correlate completions

// Submit all at once (one syscall for the batch)
io_uring_submit(ring)

// Wait for completions
n_received = 0
while n_received < N:
    cqe = io_uring_wait_cqe(ring)
    handle(cqe.user_data, cqe.res)
    io_uring_cqe_seen(ring, cqe)
    n_received++
C with liburing
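#include <liburing.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(256, &ring, 0) < 0) return 1;

    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) {
        perror("open");
        io_uring_queue_exit(&ring);
        return 1;
    }
    char buf[4096];

    // Submit a single read (IORING_OP_READ needs Linux 5.6+)
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);  // NULL only if the SQ is full
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_sqe_set_data(sqe, (void *)(uintptr_t)42);   // correlation tag
    io_uring_submit(&ring);                              // one syscall

    // Wait for completion
    struct io_uring_cqe *cqe;
    int ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret == 0) {
        long tag = (long)(uintptr_t)io_uring_cqe_get_data(cqe);
        if (cqe->res < 0)   // res carries -errno on failure
            fprintf(stderr, "tag=%ld, read failed: %s\n", tag, strerror(-cqe->res));
        else
            printf("tag=%ld, read %d bytes\n", tag, cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    close(fd);
    io_uring_queue_exit(&ring);
    return ret < 0;
}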
Python with the liburing bindings
# Using the liburing Python bindings (pip install liburing).
# Function names follow the C API, but exact signatures differ between releases.
import os
import liburing

ring = liburing.io_uring()
liburing.io_uring_queue_init(64, ring, 0)

# Use os.open so the descriptor stays valid: open(...).fileno() on a temporary
# file object can be closed by the garbage collector before the read runs.
fd = os.open('data.bin', os.O_RDONLY)
buf = bytearray(4096)

sqe = liburing.io_uring_get_sqe(ring)
liburing.io_uring_prep_read(sqe, fd, buf, len(buf), 0)
sqe.user_data = 1
liburing.io_uring_submit(ring)

cqes = liburing.io_uring_cqes()
liburing.io_uring_wait_cqes(ring, cqes, 1)
for cqe in cqes:
    print(f'tag={cqe.user_data}, res={cqe.res}')
    liburing.io_uring_cqe_seen(ring, cqe)

liburing.io_uring_queue_exit(ring)
os.close(fd)
Common pitfalls
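- Lifetime of buffers and SQE arguments. The kernel reads and writes user buffers asynchronously, so a buffer must stay valid until its CQE arrives; freeing or unmapping it earlier causes silent corruption. Structures an SQE points to (iovec arrays, socket addresses) must stay valid at least until the SQE has been submitted, and until completion on kernels without IORING_FEAT_SUBMIT_STABLE (before 5.5). A pool of registered buffers that lives as long as the ring makes this easier; otherwise track lifetimes carefully.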
- Mis-handling backpressure. If the SQ fills up, io_uring_get_sqe returns NULL. Naive code keeps trying and busy-loops. Real code submits the current batch, waits for some completions, and retries.
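- Forgetting io_uring_cqe_seen. Until you mark a CQE as seen, its slot stays occupied. Drop CQEs on the floor and the CQ fills up. Kernels before 5.5 then dropped completions; newer kernels (IORING_FEAT_NODROP) hold the overflow internally and can fail further submissions with -EBUSY until you reap.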
- Assuming completion order matches submission order. The kernel processes asynchronously and may complete operations out of order. Use user_data to correlate; do not rely on FIFO unless you've chained SQEs with IOSQE_IO_LINK.
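- SQPOLL idle behavior. SQPOLL mode runs a kernel thread that polls your SQ ring. After sq_thread_idle milliseconds without work the thread parks and sets IORING_SQ_NEED_WAKEUP; waking it costs an io_uring_enter call with IORING_ENTER_SQ_WAKEUP (liburing's io_uring_submit handles this for you). Tune sq_thread_idle for your workload, or your "no syscalls" benchmark won't match production.
- Security defaults. After a series of CVEs, some distributions and container runtimes restrict or disable io_uring for unprivileged users. Check the kernel.io_uring_disabled sysctl (Linux 6.6+) and your seccomp profile before depending on it in production.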
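The backpressure pattern from the list above (submit the current batch, reap completions, retry) can be sketched against a toy ring. `BoundedRing` and `submit_all` are hypothetical names, and the simulated kernel completes a batch the moment it is submitted, so this shows only the control flow, not real timing.

```python
from collections import deque


class BoundedRing:
    """Toy ring with a fixed number of SQ slots; not the real API."""

    def __init__(self, sq_entries: int) -> None:
        self.sq_entries = sq_entries
        self.sq = []          # filled but not yet submitted
        self.cq = deque()     # completed, not yet reaped
        self.flushes = 0

    def get_sqe(self, user_data):
        """Return None when no SQ slot is free, like io_uring_get_sqe."""
        if len(self.sq) >= self.sq_entries:
            return None
        self.sq.append(user_data)
        return user_data

    def submit(self) -> None:
        """Hand the batch to the simulated kernel, which completes it at once."""
        self.flushes += 1
        self.cq.extend(self.sq)
        self.sq.clear()


def submit_all(ring: BoundedRing, requests) -> list:
    """Queue every request, flushing and reaping whenever the SQ is full."""
    done = []
    for user_data in requests:
        while ring.get_sqe(user_data) is None:
            ring.submit()                 # free SQ slots by submitting the batch
            while ring.cq:                # reap before queueing more work
                done.append(ring.cq.popleft())
    ring.submit()                         # flush the final partial batch
    while ring.cq:
        done.append(ring.cq.popleft())
    return done
```

With a 4-entry ring and 10 requests, `submit_all` flushes three times and never spins on a full queue.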
Performance
Measured numbers from a 2024 benchmark on AMD Zen 4 with a Samsung 990 Pro NVMe:
- read() syscall, 4KB random reads, single thread: ~280K IOPS, bottlenecked by syscall overhead
- Linux AIO (libaio), 4KB random reads, single thread: ~480K IOPS
- io_uring, default mode, 4KB random reads, single thread: ~1.4M IOPS
- io_uring + SQPOLL + registered buffers + polled I/O: ~5.2M IOPS (NVMe limit)
- Latency: read() syscall costs ~250 ns trap overhead per call; io_uring submission costs ~5 ns to write the SQE if no syscall needed
- Network: HTTP echo server, epoll baseline at 1.8M req/s, io_uring at 4.1M req/s on the same hardware
The headline result: io_uring is the first Linux I/O interface that doesn't bottleneck before the hardware. With modern NVMe at 5M+ IOPS and 100 Gbps NICs pushing 100M packets/s, the kernel has to get out of the way — and io_uring is how it does. Most new high-performance systems built for Linux now target io_uring as the primary I/O path.
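For readers who want the ratios behind those figures, the arithmetic is short. The values below are copied from the storage numbers listed in this section; the dictionary keys and the `speedup` helper are illustrative names only.

```python
# Single-thread 4KB random-read figures from this section, in IOPS.
measured = {
    "read()": 280_000,
    "libaio": 480_000,
    "io_uring": 1_400_000,
    "io_uring + SQPOLL": 5_200_000,
}


def speedup(name: str, baseline: str = "read()") -> float:
    """Throughput of one configuration relative to a baseline."""
    return measured[name] / measured[baseline]


if __name__ == "__main__":
    for name in measured:
        print(f"{name}: {speedup(name):.1f}x vs read()")
```

Default-mode io_uring works out to 5 times the read() baseline, which is the low end of the 5-10× range quoted elsewhere; the fully tuned configuration lands well above that range.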
Frequently asked questions
What is io_uring and why does it exist?
io_uring is Linux's modern async I/O interface, introduced in kernel 5.1 (May 2019) by Jens Axboe. It exists because epoll and Linux AIO were both insufficient: epoll only reports readiness on pollable descriptors such as sockets and pipes (not regular files), Linux AIO has historical limitations (only direct I/O, blocking on metadata operations), and both keep syscalls on the hot path. io_uring addresses all three: it works on any file descriptor, is fully asynchronous, and batches operations through shared ring buffers.
How do the submission and completion rings work?
Both rings are shared memory mapped between userspace and kernel via mmap. The submission queue (SQ) is a circular buffer where userspace writes SQEs — submission queue entries — describing operations. The completion queue (CQ) is where the kernel writes CQEs — completion queue entries — as operations finish. Userspace and kernel synchronize via memory-mapped head/tail indices using atomic ops, with no syscall needed in the fast path.
How is io_uring faster than read/write syscalls?
Three ways. First, batching: hundreds of operations can be submitted with one io_uring_enter syscall, versus one syscall per read/write. Second, registered buffers: pages are pinned and mapped once at registration, so the kernel skips that per-operation setup and, with direct I/O, can transfer straight into them. Third, SQPOLL mode: a kernel thread polls the submission ring, so userspace doesn't need any syscall to submit work, only shared-memory communication. Together: 5-10× more IOPS than blocking syscalls, 2-3× over epoll.
What is liburing?
liburing is the userspace library Jens Axboe maintains for using io_uring. The raw io_uring interface is powerful but easy to misuse — memory barriers, ring index management, mmap setup. liburing wraps all of that in clean C functions: io_uring_queue_init, io_uring_get_sqe, io_uring_submit, io_uring_wait_cqe. Almost all applications use liburing rather than the raw syscall.
What operations can io_uring perform?
The first release (5.1) supported vectored and fixed-buffer reads and writes, fsync, and poll. The set grew rapidly across kernels: sendmsg, recvmsg, accept, connect, openat, close, statx, splice, tee, fallocate, getxattr, recv/send, timeouts, even socket creation and madvise. As of Linux 6.x, nearly every blocking I/O syscall has an io_uring equivalent. There are also linked operations (chain SQEs so one fires only after another completes) and direct file descriptor registration for ultra-low-latency descriptor access.
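The semantics of a linked chain are easy to model: each operation starts only after the previous one succeeds, and once one fails the rest complete with -ECANCELED. The sketch below is a model of that rule rather than a use of the real API; `run_chain` is a hypothetical name, and it ignores variants such as hard links (IOSQE_IO_HARDLINK).

```python
import errno


def run_chain(operations) -> list:
    """Model of a chain of linked SQEs.

    operations is a list of callables returning an int result, where a
    negative value means -errno. Each runs only after the previous one
    succeeded; after a failure the remainder complete with -ECANCELED.
    """
    results = []
    failed = False
    for op in operations:
        if failed:
            results.append(-errno.ECANCELED)
            continue
        res = op()
        results.append(res)
        failed = res < 0
    return results


if __name__ == "__main__":
    # A read succeeds, the write after it fails, so the final fsync never runs.
    chain = [lambda: 4096, lambda: -errno.EIO, lambda: 0]
    print(run_chain(chain))
```

Every link still produces its own CQE, so the application sees exactly which step failed and which were cancelled.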
How much IOPS can io_uring really do?
Benchmarks from Axboe show 5M+ IOPS achievable on modern NVMe with io_uring on a single core — the limit is the SSD, not the kernel API. fio with --ioengine=io_uring routinely exceeds 1M IOPS on consumer hardware. Network applications using io_uring for accept+recv pipelines have shown 2-3× the throughput of epoll for short-connection HTTP serving.
What's the catch — why isn't everyone using it?
It's Linux-only (kernel 5.1+; many features require 5.5, 5.10, 5.15, or later), the API has evolved across kernel versions, and it is a sizeable attack surface: several CVEs have prompted some distributions and container runtimes to restrict or disable io_uring for unprivileged users. It's also genuinely complex; getting the memory model right requires understanding kernel ring buffer atomics. For applications without extreme I/O needs, epoll or async runtimes built on it are usually enough.
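A program can check both constraints at startup before choosing an I/O backend. This is a rough sketch, assuming a Linux host: the helper names are hypothetical, the version check only parses the leading "major.minor" of the release string, and the sysctl file exists only on Linux 6.6 and later (0 means enabled, 1 means restricted for unprivileged processes, 2 means disabled for everyone).

```python
import platform
import re
from pathlib import Path

SYSCTL = Path("/proc/sys/kernel/io_uring_disabled")


def kernel_at_least(major: int, minor: int, release: str = "") -> bool:
    """Compare a release string such as '6.8.0-45-generic' with major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release or platform.release())
    if not match:
        return False
    return (int(match[1]), int(match[2])) >= (major, minor)


def io_uring_disabled_setting(path: Path = SYSCTL):
    """Return the sysctl value (0, 1, or 2), or None if it is absent."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("kernel >= 5.1:", kernel_at_least(5, 1))
    print("io_uring_disabled:", io_uring_disabled_setting())
```

Neither check proves that a particular opcode works, and a seccomp filter can still block the syscalls, so the most reliable test remains creating a ring and handling the error if that fails.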