I/O Multiplexing (select, poll, epoll, kqueue)

How one thread serves ten thousand sockets without breaking a sweat.

I/O multiplexing is the kernel facility that lets a single thread wait on many file descriptors and learn which are ready. select scans every fd on every call — O(N). epoll and kqueue keep a long-lived registration and report only the ready ones — O(active). io_uring goes further and batches the actual reads and writes with the wait.

  • select / poll cost: O(N) per call
  • epoll / kqueue cost: O(active)
  • FD_SETSIZE (select): 1024 typical
  • epoll fds: tens of millions
  • io_uring per-op cost: no syscall (batched)

Why multiplexing exists

Naive thread-per-connection costs 8 to 16 KB of kernel stack plus a default 8 MB userspace stack reservation per thread; at 10,000 connections that's 80 GB of address space (virtual, not resident) and the scheduler is in tears (the C10K problem Dan Kegel posed in 1999). Multiplexing flips the model: one thread (or a small pool) registers many descriptors and asks the kernel "tell me when any are ready." Idle connections cost effectively nothing.
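As a rough illustration of that model, the sketch below registers 100 socket pairs with Python's selectors module and makes three of them readable. The function name ready_among_idle and the chosen indices are illustrative assumptions; the point is only that a single wait call covers every registered socket and hands back just the ready ones.

```python
import selectors
import socket

def ready_among_idle(total=100, active=(3, 42, 77)):
    """Register `total` socket pairs, make a few readable, return the ready labels."""
    sel = selectors.DefaultSelector()
    pairs = [socket.socketpair() for _ in range(total)]
    try:
        for i, (_, rx) in enumerate(pairs):
            rx.setblocking(False)
            sel.register(rx, selectors.EVENT_READ, data=i)  # data is our own label
        for i in active:
            pairs[i][0].send(b"ping")                       # only these become readable
        # One wait call covers all registered sockets; idle ones are not returned.
        return sorted(key.data for key, _ in sel.select(timeout=1.0))
    finally:
        sel.close()
        for tx, rx in pairs:
            tx.close()
            rx.close()

print(ready_among_idle())
```

The 97 idle sockets never appear in the result, which is the property a thread-per-connection design has to pay a whole thread to get.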

The five APIs

|  | select | poll | epoll | kqueue | io_uring |
|---|---|---|---|---|---|
| Platform | POSIX | POSIX | Linux 2.6+ | BSD, macOS | Linux 5.1+ |
| Model | Readiness | Readiness | Readiness | Readiness | Completion |
| Per-call cost | O(N) | O(N) | O(active) | O(active) | O(0) when busy |
| FD limit | FD_SETSIZE (~1024) | RLIMIT_NOFILE | None | None | None |
| Registration | Re-passed each call | Re-passed each call | Persistent | Persistent | Persistent ring |
| Edge / level | Level | Level | Both (EPOLLET) | Both (EV_CLEAR) | N/A (completion) |
| Beyond sockets | fds only | fds only | + signalfd/timerfd/eventfd | Sockets, files, signals, timers, processes | Sockets, files, splice, NVMe |
| Notable extras |  |  | EPOLLEXCLUSIVE, EPOLLONESHOT | EVFILT_PROC, EVFILT_TIMER | SQPOLL, fixed buffers, linked SQEs |

Readiness vs completion

The most common source of confusion. These APIs answer different questions:

  • Readiness (select, poll, epoll, kqueue): the kernel says "fd 7 has data buffered — your next non-blocking read will not EAGAIN." You then issue the read. Two syscalls per operation.
  • Completion (io_uring, Windows IOCP, POSIX AIO): you submit the read in advance; the kernel posts a completion with the bytes copied when done. No "is it ready?" round trip.

Completion scales better because you remove the syscall-per-op tax. It's harder to reason about because the buffer must remain valid until the kernel reports completion — possibly milliseconds later.
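The two shapes can be contrasted in a small sketch. Python's standard library has no io_uring binding, so the completion side here is emulated with a worker thread and a future; that emulation, and the helper names readiness_read and completion_read, are assumptions made purely for illustration. What carries over is the contract: the read is submitted before any data exists, and the buffer must stay alive until the result arrives.

```python
import selectors
import socket
from concurrent.futures import ThreadPoolExecutor

def readiness_read(rx):
    """Readiness shape: wait until readable, then issue the read yourself."""
    with selectors.DefaultSelector() as sel:
        sel.register(rx, selectors.EVENT_READ)
        sel.select(timeout=2.0)        # step 1: "is it ready?"
        return rx.recv(4096)           # step 2: the actual read

def completion_read(pool, rx, buf):
    """Completion shape (emulated with a thread): submit now, collect later."""
    # buf must stay alive and untouched until the future resolves.
    return pool.submit(rx.recv_into, buf)

def demo():
    a, b = socket.socketpair()
    with ThreadPoolExecutor(max_workers=1) as pool, a, b:
        a.send(b"first")
        got_ready = readiness_read(b)

        buf = bytearray(16)
        fut = completion_read(pool, b, buf)   # submitted before any data exists
        a.send(b"second")
        n = fut.result(timeout=2.0)           # completion: byte count, data already in buf
        return got_ready, bytes(buf[:n])

print(demo())
```

A real completion API avoids the extra thread entirely; the sketch only mirrors the ordering of submit, wait, and result.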

An edge-triggered epoll server

Edge-triggered is the usual choice for high-performance servers (epoll's actual default is level-triggered). The contract: every time epoll wakes you, you must drain every buffered byte for that fd, or you'll never hear about it again until more data arrives.

// epoll_server.c: minimal ET echo server (minimal error handling)
#define _GNU_SOURCE            // exposes accept4() in glibc
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#include <errno.h>

int main(void) {
  int lfd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
  int yes = 1; setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes);
  struct sockaddr_in a = { .sin_family = AF_INET, .sin_port = htons(8080) };
  bind(lfd, (struct sockaddr*)&a, sizeof a);
  listen(lfd, 4096);

  int ep = epoll_create1(EPOLL_CLOEXEC);
  struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = lfd };
  epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

  struct epoll_event events[256];
  for (;;) {
    int n = epoll_wait(ep, events, 256, -1);
    for (int i = 0; i < n; i++) {
      int fd = events[i].data.fd;
      if (fd == lfd) {
        // Drain ALL pending accepts — we're edge-triggered
        for (;;) {
          int c = accept4(lfd, NULL, NULL, SOCK_NONBLOCK);
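          if (c == -1) {
            if (errno == EINTR || errno == ECONNABORTED) continue;  // transient: retry
            break;  // EAGAIN (backlog drained) or a hard error such as EMFILE: stop, don't spin
          }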
          struct epoll_event e = { .events = EPOLLIN | EPOLLET | EPOLLRDHUP, .data.fd = c };
          epoll_ctl(ep, EPOLL_CTL_ADD, c, &e);
        }
      } else {
        char buf[4096];
        for (;;) {                                      // drain ALL bytes
          ssize_t r = read(fd, buf, sizeof buf);
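          if (r > 0) write(fd, buf, r);                 // sketch: a short or EAGAIN write drops bytes
          else if (r == -1 && errno == EINTR) continue; // interrupted: retry the read
          else if (r == 0 || (r == -1 && errno != EAGAIN)) { close(fd); break; }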
          else break;                                   // EAGAIN — done
        }
      }
    }
  }
}

Forget either drain loop and connections stall under load: unread bytes or unaccepted clients sit in the kernel with no further wakeup. It is the most common epoll bug.
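The drain contract itself is small enough to isolate. Below is a minimal sketch of a drain helper for a non-blocking socket; the name drain and its (data, peer_closed) return shape are assumptions for illustration, not part of any library.

```python
import socket

def drain(sock, chunk=4096):
    """Read a non-blocking socket until it would block.

    Returns (data, peer_closed).
    """
    parts = []
    while True:
        try:
            data = sock.recv(chunk)
        except BlockingIOError:      # EAGAIN / EWOULDBLOCK: kernel buffer is empty
            return b"".join(parts), False
        if not data:                 # recv returned b"": orderly shutdown by the peer
            return b"".join(parts), True
        parts.append(data)

a, b = socket.socketpair()
b.setblocking(False)
a.sendall(b"x" * 10000)
data, closed = drain(b, chunk=1024)  # takes several recv calls to empty the buffer
print(len(data), closed)
a.close()
b.close()
```

Stopping after the first recv would leave most of the 10,000 bytes buffered, and under edge-triggered notification nothing would announce them again.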

Python's selectors.DefaultSelector() picks the best API (epoll, kqueue, falling back to poll then select):

import selectors, socket
sel = selectors.DefaultSelector()
ls = socket.socket(); ls.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
ls.bind(('', 8080)); ls.listen(4096); ls.setblocking(False)
sel.register(ls, selectors.EVENT_READ, data='listen')
while True:
    for key, _ in sel.select(timeout=None):
        if key.data == 'listen':
            c, _ = key.fileobj.accept(); c.setblocking(False)
            sel.register(c, selectors.EVENT_READ, data='conn')
        else:
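            conn = key.fileobj
            try:
                d = conn.recv(4096)
                if d:
                    conn.sendall(d)   # sketch: can raise BlockingIOError if the send buffer fills
                    continue
            except BlockingIOError:
                continue              # not ready after all; keep the connection registered
            except ConnectionError:
                pass                  # reset or broken pipe: fall through and close
            sel.unregister(conn); conn.close()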

Node exposes no select or epoll binding because libuv hides them: net.createServer already runs on an event loop built on epoll, kqueue, or IOCP. The code shows no loop at all, yet it is fully multiplexed:

const net = require('node:net');
net.createServer(c => c.on('data', d => c.write(d))).listen(8080);
// On Linux, libuv pulls events from epoll; one thread services thousands of sockets.

Costs that decide which wins

  • select, 10,000 idle fds (possible only with FD_SETSIZE raised above the usual 1024): ~3.7 KB of bitmaps (three 1,250-byte sets) copied each call, kernel walks all 10,000, several hundred µs per call.
  • epoll_wait, same N idle fds: cost proportional only to ready fds. A few active among 100,000 idle costs the same as a few among 100.
  • Edge-triggered drain miss: forgetting to loop on EAGAIN leaves the connection stuck — a latency tail with no CPU symptom.
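  • io_uring: 4 KiB NVMe reads with SQPOLL are quoted at 7–10 M IOPS/core vs ~2 M for a syscall-per-read path. Treat both as order-of-magnitude figures that depend on hardware and kernel version; epoll itself cannot watch regular files (epoll_ctl fails with EPERM), so the baseline is the read syscalls, not the wait.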
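These costs can be probed with a rough measurement sketch (Linux only, since it uses select.epoll). It times zero-timeout waits over a set of idle sockets plus one ready socket, once through poll and once through epoll. The function name time_waits and its defaults are illustrative assumptions, and at small N the Python call overhead dominates, so treat the timings as a way to explore the trend as n_idle grows rather than as a benchmark.

```python
import select
import socket
import time

def time_waits(n_idle=200, rounds=2000):
    """Average seconds per wait call over n_idle quiet sockets plus one ready socket."""
    pairs = [socket.socketpair() for _ in range(n_idle + 1)]
    hot_fd = pairs[0][1].fileno()
    pairs[0][0].send(b"!")                              # exactly one fd is readable
    poller = select.poll()
    ep = select.epoll()
    try:
        for _, rx in pairs:
            poller.register(rx.fileno(), select.POLLIN)
            ep.register(rx.fileno(), select.EPOLLIN)    # level-triggered

        def clock(wait):
            start = time.perf_counter()
            for _ in range(rounds):
                ready = wait(0)                         # zero timeout: never blocks
            per_call = (time.perf_counter() - start) / rounds
            return per_call, [fd for fd, _ in ready]

        poll_s, poll_ready = clock(poller.poll)
        epoll_s, epoll_ready = clock(ep.poll)
    finally:
        ep.close()
        for tx, rx in pairs:
            tx.close()
            rx.close()
    return {"poll_s": poll_s, "epoll_s": epoll_s,
            "same_ready_set": poll_ready == epoll_ready == [hot_fd]}

print(time_waits())
```

Both calls report the same single ready descriptor; only the work done to find it differs. Raising n_idle far enough needs a higher RLIMIT_NOFILE, since each pair consumes two descriptors.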

Triggers, herds, and the fine print

Level-triggered vs edge-triggered

  • Level-triggered (epoll default, kqueue default): readiness is reported as long as the condition holds. Easy to use; you can stop reading mid-way and pick up later.
  • Edge-triggered (EPOLLET, EV_CLEAR): readiness is reported only on the transition. Faster — fewer wakes — but you must drain in a non-blocking loop. Pair with EPOLLONESHOT if you want to hand the fd to a worker thread without races.
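The difference is easy to observe directly. This Linux-only sketch (select.epoll) sends five bytes, then waits twice without reading anything in between; the helper name wakeups_without_reading is an assumption for illustration.

```python
import select
import socket

def wakeups_without_reading(flags):
    """Return how many events two consecutive waits report when nothing is read."""
    tx, rx = socket.socketpair()
    ep = select.epoll()
    try:
        ep.register(rx.fileno(), flags)
        tx.send(b"hello")
        first = ep.poll(0.2)     # data arrived: both modes report it
        second = ep.poll(0.2)    # still unread: only level-triggered repeats
        return len(first), len(second)
    finally:
        ep.close()
        tx.close()
        rx.close()

level = wakeups_without_reading(select.EPOLLIN)
edge = wakeups_without_reading(select.EPOLLIN | select.EPOLLET)
print("level:", level, "edge:", edge)   # level: (1, 1) edge: (1, 0)
```

The second edge-triggered wait times out even though five bytes are still buffered, which is exactly the situation the drain loop exists to prevent.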

EPOLLEXCLUSIVE

The thundering-herd cure. Each worker creates its own epoll instance and registers the same listener with EPOLLEXCLUSIVE; the kernel then wakes one or more of them per accept event instead of all. SO_REUSEPORT is the complementary approach: each worker binds its own listening socket and the kernel balances incoming connections across them.
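A minimal sketch of the per-worker registration, assuming Linux 4.5+ and a Python build that exposes select.EPOLLEXCLUSIVE. The function name accept_fanout and the thread-based workers are illustrative assumptions; a production server would more likely use processes. The kernel guarantees only that one or more waiters wake, so the sketch still treats a failed accept as normal.

```python
import select
import socket
import threading

def accept_fanout(workers=4):
    """Each worker owns an epoll instance watching the same listener."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))       # port 0: let the OS pick a free port
    listener.listen(16)
    listener.setblocking(False)

    woken, accepted = [], []
    armed = threading.Barrier(workers + 1)

    def worker(idx):
        ep = select.epoll()
        # EPOLLEXCLUSIVE must be given when the fd is first added (EPOLL_CTL_ADD).
        ep.register(listener.fileno(), select.EPOLLIN | select.EPOLLEXCLUSIVE)
        armed.wait()
        if ep.poll(1.0):
            woken.append(idx)
            try:
                conn, _ = listener.accept()
                accepted.append(idx)
                conn.close()
            except BlockingIOError:
                pass                       # another worker won the race
        ep.close()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    armed.wait()                           # every worker has registered
    client = socket.create_connection(listener.getsockname())
    for t in threads:
        t.join()
    client.close()
    listener.close()
    return len(woken), len(accepted)

print(accept_fanout())
```

The first number is how many workers woke for the single connection and the second is how many accepted it; the second is always one, while the first varies with timing.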

kqueue's filters

Where epoll watches fds, kqueue watches events. EVFILT_READ/WRITE mirror epoll; EVFILT_TIMER, EVFILT_PROC (child exit), EVFILT_SIGNAL, EVFILT_VNODE (file change) make it a Swiss-army event facility. macOS Grand Central Dispatch sits on top of it.

io_uring tricks

  • SQPOLL: kernel thread polls the submission ring, eliminating syscalls on busy paths.
  • Linked SQEs: chain accept → recv → send for one-submission requests.
  • Registered buffers / fixed files: pre-pin pages and fds, skipping per-op lookups.

Common pitfalls

  • Edge-triggered without draining. Read once and you wait forever for the next byte. Always loop until EAGAIN.
  • Lost-wakeup race. Userspace observes "no data," then the peer sends, then userspace registers — but the readiness edge already happened. Register before the first read.
  • Stale fd in the epoll set. Closing a duplicated fd doesn't remove it from epoll if other references survive. epoll_ctl(EPOLL_CTL_DEL) before close.
  • FD_SETSIZE truncation. FD_SET with an fd ≥ 1024 silently writes past the bitmap. Switch to poll or epoll above a few hundred fds.
  • Mixing blocking and non-blocking. A blocking write on a full socket buffer parks the whole event loop. Mark every multiplexed fd O_NONBLOCK.
  • Treating EPOLLRDHUP as error. It's informational — peer closed write. You may still have buffered bytes worth reading. Drain first, close after.
  • io_uring buffer lifetime. Submitting a read into a stack buffer popped before completion is UB. Use long-lived or registered buffers.
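The stale-fd pitfall is the least intuitive of these, so here is a small Linux-only reproduction using select.epoll. The helper name stale_registration is an assumption for illustration. It closes a registered descriptor while a duplicate keeps the underlying socket open, with and without unregistering first.

```python
import os
import select
import socket

def stale_registration(unregister_first):
    """Close a registered fd while a duplicate keeps the open file description alive."""
    tx, rx = socket.socketpair()
    ep = select.epoll()
    fd = rx.fileno()
    ep.register(fd, select.EPOLLIN)
    dup_fd = os.dup(fd)              # second reference to the same socket
    try:
        if unregister_first:
            ep.unregister(fd)        # EPOLL_CTL_DEL while the fd is still valid
        rx.close()                   # the fd number is gone, the socket is not
        tx.send(b"late")
        return ep.poll(0.2)          # any events for an fd we no longer own?
    finally:
        ep.close()
        tx.close()
        os.close(dup_fd)

stale = stale_registration(unregister_first=False)
clean = stale_registration(unregister_first=True)
print("without DEL:", len(stale), "event(s); with DEL:", len(clean))
```

Without the explicit unregister, epoll keeps reporting an event tagged with a descriptor number the program has already closed, and that number may since have been reused for something unrelated.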

Choosing

For portable code with a few hundred connections, use poll or your language's selectors abstraction. On Linux servers handling thousands of connections or more, use edge-triggered epoll with EPOLLEXCLUSIVE for accept fan-out. On BSD or macOS, use kqueue. For high-throughput storage stacks where syscalls dominate, learn io_uring, and accept that you'll re-read the docs at least three times.

Frequently asked questions

What's the difference between select and epoll?

select takes a fresh bitmap of every file descriptor on every call and scans them all in the kernel: O(N) per call, with a hard FD_SETSIZE limit of 1024 on most systems. epoll keeps a long-lived registration in the kernel and returns only the ready descriptors: O(active), no fixed limit. For 10,000 idle connections with a few hot ones, an epoll wait can be hundreds of times cheaper than the equivalent select call.

What's the difference between level-triggered and edge-triggered epoll?

Level-triggered (the default) reports a descriptor as ready every time you call epoll_wait while data is buffered. Edge-triggered (EPOLLET) reports readiness only on the transition from not-ready to ready — so you must drain the descriptor with a non-blocking loop until you get EAGAIN, or you'll never hear about the rest. Edge-triggered is faster but easy to get wrong.

What does EPOLLEXCLUSIVE solve?

When N workers each have their own epoll fd watching one shared listening socket, every accept-ready event used to wake all N: the thundering herd. EPOLLEXCLUSIVE (Linux 4.5+) tells the kernel to wake only one or a few of them per event, eliminating the herd at the cost of slightly less even distribution.

Is io_uring just a faster epoll?

It's a different model. epoll is readiness-based: the kernel says 'this fd is ready', userspace then issues read/write. io_uring is completion-based: userspace queues read/write requests in a shared ring and the kernel posts completions when done, so one io_uring_enter call can cover many operations (and SQPOLL can remove even that call). For high-throughput storage and networking, io_uring removes the 'system call per operation' tax. It's also harder to reason about and has had its share of CVEs.

What is kqueue and how does it compare to epoll?

kqueue is the BSD/macOS equivalent of epoll, but more general: a single kqueue can watch sockets, files, signals, timers, and process events through unified EVFILT_* filters. The performance characteristics are similar; the API is arguably cleaner. epoll watches only file descriptors, so Linux exposes signals, timers, and counters as fds (signalfd, timerfd, eventfd) to bring them into the same loop; kqueue handles them natively in one event facility.

Why isn't poll just a faster select?

poll uses an array of pollfd structs instead of three FD bitmaps, so it removes the FD_SETSIZE limit and is slightly more cache-friendly. But it still scans every registered fd in the kernel on each call — same O(N) tax as select. It's more a portability and ergonomics improvement than a scaling fix.