Operating Systems
epoll
O(1) event notification for tens of thousands of sockets — the engine behind nginx and Node.js
epoll is the Linux-specific scalable event notification mechanism — replacing the classical select/poll (O(n) per call) with O(1) per-event delivery. The application creates an epoll instance with epoll_create1(), registers file descriptors with epoll_ctl(EPOLL_CTL_ADD), and waits with epoll_wait() to receive only the FDs that have events ready. Two trigger modes: level-triggered (default) and edge-triggered (high-performance, requires draining the FD on each notification). Introduced in Linux 2.5.45 (2002). The engine behind nginx, HAProxy, Redis, Node.js libuv on Linux, and most modern C10K+ servers.
- Per-call: O(1)
- Predecessors: select (O(n)), poll (O(n))
- Trigger modes: LT (level), ET (edge)
- Introduced: Linux 2.5.45 (2002)
- C10K solver: yes
- Used in: nginx, HAProxy, Redis, libuv, Node.js
Why epoll matters
- Web servers. nginx serves 100,000+ concurrent connections per worker on a single core via epoll-edge-triggered loops. The C10K paper (1999, Dan Kegel) framed the problem; epoll was Linux's answer.
- Reverse proxies. HAProxy, Envoy, Traefik all use epoll on Linux. Per-connection cost is one entry in the kernel's red-black tree (roughly 160 bytes on a 64-bit kernel, per epoll(7)) plus userspace state.
- Message brokers. Kafka brokers, RabbitMQ, NATS multiplex thousands of long-lived TCP sessions through a small number of I/O threads driven by epoll.
- In-memory databases. Redis is single-threaded for command execution but uses epoll to multiplex client sockets. It can handle 50,000–100,000 ops/sec on one core.
- Linux event loops. libuv (Node.js), tokio (Rust), and asyncio (Python, optionally via uvloop): epoll sits at the bottom of nearly every Linux async runtime.
- Reduced syscall pressure. A poll-based loop on 10K FDs copies an 80 KB pollfd array (8 bytes per FD) into the kernel on each iteration, and select cannot even reach 10K FDs with the usual FD_SETSIZE of 1024; an epoll loop makes one O(R) copy of just the ready FDs. CPU cache stays warm.
The API in three calls
The entire epoll surface is small enough to memorize:
- int ep = epoll_create1(EPOLL_CLOEXEC); creates a new epoll FD (an interest list plus a ready list, in kernel memory).
- struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = sock }; epoll_ctl(ep, EPOLL_CTL_ADD, sock, &ev); adds sock to the interest list. Variants: MOD to change events, DEL to remove.
- struct epoll_event events[64]; int n = epoll_wait(ep, events, 64, timeout_ms); blocks until at least one FD is ready (or timeout); returns the number of ready entries.
- Loop: for each events[i], call the appropriate read()/write()/accept().
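The calls above can be combined into one small function. This is a minimal sketch rather than a reference implementation: the helper name wait_readable is hypothetical, it uses level-triggered mode for simplicity, and it creates and destroys an epoll instance per call, which a real server would not do.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd is readable within timeout_ms,
 * 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    /* Call 1: create the epoll instance. */
    int ep = epoll_create1(EPOLL_CLOEXEC);
    if (ep < 0)
        return -1;

    /* Call 2: add fd to the interest list (level-triggered here). */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(ep);
        return -1;
    }

    /* Call 3: block until something is ready or the timeout expires. */
    struct epoll_event events[64];
    int n = epoll_wait(ep, events, 64, timeout_ms);

    /* Loop over the ready entries only. */
    int ready = 0;
    for (int i = 0; i < n; i++) {
        if (events[i].data.fd == fd && (events[i].events & EPOLLIN))
            ready = 1;
    }
    close(ep);
    return n < 0 ? -1 : ready;
}
```

Calling wait_readable(fd, 0) polls without blocking; a timeout of -1 waits indefinitely.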
The kernel-side data structures are well-tuned: the interest list is a red-black tree keyed by FD (so epoll_ctl is O(log N)), and the ready list is a doubly-linked list updated by the wake-up path of every socket buffer when data arrives. epoll_wait moves the ready list to userspace under the epoll FD's mutex.
Level-triggered vs edge-triggered, in detail
Suppose 8 KB arrives on a socket and your buffer is 4 KB.
- Level-triggered. epoll_wait returns ready. You read() 4 KB. The remaining 4 KB still in the kernel buffer means the socket is still ready, so the next epoll_wait returns immediately with the same FD. Two wake-ups, two reads, simple to reason about.
- Edge-triggered. epoll_wait returns ready. You must read() in a loop until EAGAIN/EWOULDBLOCK, which signals that the socket's receive buffer is fully drained. One wake-up, two reads that return data plus a final read that returns EAGAIN, and no second epoll_wait call. The saving is in wake-ups rather than in raw syscall count.
The trap with ET: if you stop reading before EAGAIN (because, say, you split a packet across calls and waited for "more later"), the FD will not fire again until new data arrives. Connections silently stall. Production code uses ET only with disciplined drain-to-EAGAIN loops, often with a state machine per connection.
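A drain-to-EAGAIN loop might look like the sketch below. The names set_nonblocking and drain_fd are hypothetical, and the loop discards the data where a real handler would parse it; the point is only the exit conditions.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* ET only works on non-blocking descriptors: a blocking read() would hang
 * instead of returning EAGAIN once the buffer is empty. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Hypothetical helper: read until the kernel buffer is empty.
 * Returns the number of bytes consumed, or -1 on a real error.
 * *eof is set to 1 if the peer closed the connection. */
ssize_t drain_fd(int fd, int *eof)
{
    char buf[4096];
    ssize_t total = 0;

    *eof = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;          /* a real handler would process buf here */
            continue;
        }
        if (n == 0) {            /* orderly shutdown by the peer */
            *eof = 1;
            return total;
        }
        if (errno == EINTR)
            continue;            /* interrupted: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;        /* fully drained: safe to epoll_wait again */
        return -1;               /* genuine error */
    }
}
```

Returning only on EAGAIN, EOF, or a hard error is what keeps an edge-triggered connection from stalling.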
Tuning for high FD counts
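- ulimit -n. The soft limit defaults to 1024 on most distros. For 100K-connection servers, raise it (for example to 1048576) in /etc/security/limits.conf.
- fs.epoll.max_user_watches. Per-user cap on registered FDs across all epoll instances. The default scales with RAM: per epoll(7), it is 4% of available low memory divided by the per-registration cost (roughly 160 bytes on a 64-bit kernel).
- net.core.somaxconn. Listen backlog cap. Default 4096 on modern kernels (Linux 5.4+); raise to 65535 for accept-heavy servers.
- Worker count. nginx runs one worker process per CPU core, each with its own epoll FD. Workers that share one listening socket can still suffer a thundering herd at accept(), which nginx limits with accept_mutex or EPOLLEXCLUSIVE; SO_REUSEPORT (Linux 3.9+) goes further by giving each worker its own listening socket.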
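The FD limit can also be raised from inside the process at startup, up to the hard limit that limits.conf (or the service manager) allows. A minimal sketch, with the hypothetical helper name raise_nofile_limit:

```c
#include <sys/resource.h>

/* Hypothetical helper: lift the soft RLIMIT_NOFILE to the hard limit.
 * Returns 0 on success, -1 on failure. Raising the hard limit itself
 * needs privileges and is not attempted here. */
int raise_nofile_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;   /* soft limit := hard limit */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    return 0;
}
```

Servers typically call something like this once, before creating the epoll instance and listening sockets.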
Common misconceptions
- "epoll = async I/O." No. epoll only tells you when an FD can be read or written without blocking; the actual read()/write() still happens synchronously in your thread. True async I/O is io_uring or POSIX AIO. Non-blocking I/O combined with epoll feels async, but the data copy still happens in the calling thread.
- "ET is always faster." Only when you handle the drain correctly and have high data volume per FD. For small messages with many FDs, LT is comparable and far less error-prone.
- "io_uring obsoletes epoll." Not yet. Most production traffic still flows through epoll. io_uring adoption is gated by security review fatigue (multiple CVEs in 2022–2023), kernel version requirements, and the fact that epoll is good enough for the bulk of HTTP/TCP workloads.
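- "epoll works for any file." It does not. epoll_ctl(EPOLL_CTL_ADD) on a regular file or directory fails with EPERM, and select/poll simply report regular files as always ready, so none of them can tell you when a disk read will block. For disk async, use io_uring or thread pools.
- "You need a separate epoll FD per thread." Sharing an epoll FD across threads is supported and is usually paired with EPOLLONESHOT (or EPOLLET) so that one thread handles each event; giving each thread its own epoll FD and sharding connections across them by hashing is also common.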
- "epoll is portable." It is Linux-only. macOS/BSD have kqueue; Windows has IOCP/WSA. Cross-platform code uses libuv or libevent.
Minimal echo server
- Set the listening socket non-blocking (fcntl O_NONBLOCK).
- epoll_create1(EPOLL_CLOEXEC), register listen FD with EPOLLIN | EPOLLET.
- Loop on epoll_wait: on listen FD, accept4() with SOCK_NONBLOCK in a loop until EAGAIN; register each accepted FD with EPOLLIN | EPOLLRDHUP | EPOLLET.
- On client FD: read() in a loop until EAGAIN; echo via write() (handle EAGAIN on writes by registering EPOLLOUT).
- On hangup (EPOLLRDHUP | EPOLLHUP): close() the FD; the kernel auto-removes it from the epoll set once no duplicate of the descriptor remains open.
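The steps above can be sketched as follows. All function names are hypothetical, and several simplifications are assumed: the caller creates the epoll instance and registers the non-blocking listening socket with watch_fd; accept() plus fcntl() stands in for accept4() so that no feature-test macro is needed; send() with MSG_NOSIGNAL replaces write() to avoid SIGPIPE; and a short or failed write drops the client instead of buffering and registering EPOLLOUT.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Register fd for edge-triggered read and hangup events. */
int watch_fd(int ep, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP | EPOLLET,
                              .data.fd = fd };
    return epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
}

/* Accept until the backlog is empty; register each client non-blocking. */
static void accept_all(int ep, int lfd)
{
    for (;;) {
        int c = accept(lfd, NULL, NULL);
        if (c < 0) {
            if (errno == EINTR || errno == ECONNABORTED)
                continue;
            return;              /* EAGAIN (backlog drained) or hard error */
        }
        int fl = fcntl(c, F_GETFL, 0);
        if (fl < 0 || fcntl(c, F_SETFL, fl | O_NONBLOCK) < 0 ||
            watch_fd(ep, c) < 0)
            close(c);
    }
}

/* Drain one client to EAGAIN, echoing each chunk back. */
static void echo_client(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            /* Simplification: no EPOLLOUT buffering, drop on short write. */
            if (send(fd, buf, (size_t)n, MSG_NOSIGNAL) != n) {
                close(fd);
                return;
            }
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return;              /* drained: wait for the next edge */
        close(fd);               /* EOF or error; close() deregisters the FD */
        return;
    }
}

/* Run `rounds` iterations of the event loop. lfd is the listening socket
 * (pass -1 if there is none). Returns 0, or -1 if epoll_wait fails. */
int echo_loop(int ep, int lfd, int rounds, int timeout_ms)
{
    struct epoll_event events[64];
    while (rounds-- > 0) {
        int n = epoll_wait(ep, events, 64, timeout_ms);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == lfd)
                accept_all(ep, lfd);
            else
                echo_client(events[i].data.fd);
        }
    }
    return 0;
}
```

A real server would loop until shutdown rather than for a fixed number of rounds, and would keep per-connection state for pending writes.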
Frequently asked questions
Why is epoll O(1) and select O(n)?
select() and poll() require the caller to pass the entire FD set on every call; the kernel then walks all N descriptors to check readiness. Cost is O(N) per call regardless of how many are actually ready — for 10,000 idle sockets and one ready, you scan 10,000 entries. epoll inverts the model: epoll_ctl() registers FDs once into a kernel-side red-black tree, and the kernel maintains a ready list — a doubly-linked list of FDs that the kernel has flagged as having events. epoll_wait() walks only that ready list, copying just the ready entries to userspace. Cost is O(R) where R is the number of currently-ready FDs, independent of total registrations.
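The "only the ready list" behaviour is easy to observe. The sketch below registers a batch of idle pipes, makes exactly one of them readable, and returns how many entries epoll_wait hands back. The function name and the batch size NPIPES are illustrative assumptions.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define NPIPES 64   /* illustrative; keeps FD usage well under a 1024 limit */

/* Registers NPIPES pipes, writes to one, and returns the number of entries
 * epoll_wait reports (-1 on setup failure). */
int ready_among_idle(void)
{
    int fds[NPIPES][2];
    struct epoll_event events[NPIPES];
    int made = 0, result = -1;

    int ep = epoll_create1(EPOLL_CLOEXEC);
    if (ep < 0)
        return -1;

    while (made < NPIPES) {
        if (pipe(fds[made]) < 0)
            goto out;
        made++;
        struct epoll_event ev = { .events = EPOLLIN,
                                  .data.fd = fds[made - 1][0] };
        if (epoll_ctl(ep, EPOLL_CTL_ADD, fds[made - 1][0], &ev) < 0)
            goto out;
    }

    /* Only one of the registered read ends has data. */
    if (write(fds[NPIPES / 2][1], "x", 1) != 1)
        goto out;

    /* Room for NPIPES entries, but only the ready one is copied out. */
    result = epoll_wait(ep, events, NPIPES, 0);

out:
    for (int i = 0; i < made; i++) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    close(ep);
    return result;
}
```

With poll(), the same check would require passing all NPIPES descriptors in and scanning all NPIPES results on every call.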
What's the difference between level-triggered and edge-triggered?
Level-triggered (LT, the default) reports an FD as ready as long as the condition holds: a socket with 100 unread bytes will keep showing up in epoll_wait results until you read all 100. It mimics select/poll semantics. Edge-triggered (ET, set with EPOLLET) reports an FD only on a state transition, that is, when new data arrives. After the notification, the kernel will not notify again until more data arrives. This forces userspace to drain the FD completely on each notification (read in a loop until EAGAIN), but it eliminates redundant notifications, which reduces epoll_wait wake-ups in high-throughput servers. nginx and many production proxies use ET.
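The difference shows up when the data is deliberately left unread. In this sketch (the function name is hypothetical) one byte is written to a pipe and epoll_wait is called twice without reading in between; the return value counts how many of those calls reported the FD.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Pass 0 for level-triggered or EPOLLET for edge-triggered.
 * Returns how many of two back-to-back epoll_wait calls report the FD
 * while its data stays unread, or -1 on setup failure. */
int reports_without_read(uint32_t trigger_flag)
{
    int p[2], count = -1;

    if (pipe(p) < 0)
        return -1;

    int ep = epoll_create1(EPOLL_CLOEXEC);
    if (ep >= 0) {
        struct epoll_event ev = { .events = EPOLLIN | trigger_flag,
                                  .data.fd = p[0] };
        struct epoll_event out;

        if (epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
            write(p[1], "x", 1) == 1) {
            count = 0;
            for (int i = 0; i < 2; i++) {
                /* Timeout 0: poll the ready list without blocking. */
                if (epoll_wait(ep, &out, 1, 0) == 1)
                    count++;
            }
        }
        close(ep);
    }
    close(p[0]);
    close(p[1]);
    return count;
}
```

Level-triggered mode reports the FD on both calls; edge-triggered mode reports it once and then stays silent until more data arrives.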
When does ET save syscalls vs LT?
When the same FD has many bytes in flight. Under LT with one read per wakeup, a 64 KB recv buffer drained 16 KB at a time produces four epoll_wait wakeups for that FD, so eight syscalls (four epoll_wait plus four recv()). Under ET, you receive one wakeup and loop calling recv() until it returns EAGAIN: one epoll_wait plus five recv() calls (four returning data, one returning EAGAIN), six syscalls in total. The saving grows with the number of reads per wakeup, and for a server handling 100,000 connections at high data rate it can noticeably cut system CPU time. The cost is a stricter programming model: forget to drain to EAGAIN once and that connection stalls until the peer sends more data, which may be never.
How does epoll relate to kqueue (BSD) and IOCP (Windows)?
Same problem, three OS-specific solutions. kqueue (FreeBSD, macOS) was actually first (2000) and is more general — it can monitor file events, signals, timers, and process exits in one mechanism. epoll (Linux 2002) is socket/pipe-focused but cleaner on the readiness model. IOCP (Windows) is a different model entirely: completion-based rather than readiness-based — the kernel does the I/O and notifies you when it's done, similar to io_uring. Cross-platform libraries like libuv, libevent, and Boost.Asio abstract over all three; on Linux they pick epoll, on macOS kqueue, on Windows IOCP.
What is io_uring and is it replacing epoll?
io_uring (Linux 5.1, 2019) is a true async I/O interface using shared-memory submission and completion ring buffers: userspace places I/O requests in the SQ ring, the kernel completes them and posts results to the CQ ring, and many operations can be submitted with a single io_uring_enter() call (or with none at all when submission-queue polling is enabled). Unlike epoll (which only signals readiness), io_uring lets the kernel do the read/write itself. For raw throughput on NVMe and 100 Gbps NICs, io_uring is often reported to beat epoll by 2-5x, though the margin depends heavily on the workload. Adoption is real (RocksDB and Ceph use it, for example) but slow, because epoll is good enough for the C10K-class problem and io_uring requires new code paths plus security audits; Google disabled it in production in 2023 over CVE risk.
Why does epoll need EPOLLONESHOT for multi-threaded readers?
If thread A and thread B both call epoll_wait() on the same epoll FD and a level-triggered event fires on socket S, both can wake up and both may try to handle S. EPOLLONESHOT registers an FD such that after a single event delivery, the FD is automatically disabled until userspace re-arms it via epoll_ctl(EPOLL_CTL_MOD). This guarantees only one thread handles a given FD at a time. The pattern: thread receives event, processes it, re-arms via EPOLL_CTL_MOD with the desired event mask. This adds one syscall per event but removes the need for per-connection locking between worker threads. EPOLLEXCLUSIVE (kernel 4.5+) is a related newer flag for a different setup: when several epoll instances watch the same FD (typically a shared listening socket), it limits how many of them are woken per event, and it needs no re-arm.
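The disable-then-re-arm cycle can be seen in a single thread. In this sketch (the function name is hypothetical) a pipe is registered with EPOLLONESHOT, one byte is written and never read, and the results of three non-blocking epoll_wait calls are recorded: before re-arming, again before re-arming, and after EPOLL_CTL_MOD.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Fills results[] with the return values of three epoll_wait calls.
 * Returns 0 on success, -1 on setup or re-arm failure. */
int oneshot_demo(int results[3])
{
    int p[2], rc = -1;

    if (pipe(p) < 0)
        return -1;

    int ep = epoll_create1(EPOLL_CLOEXEC);
    if (ep >= 0) {
        struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT,
                                  .data.fd = p[0] };
        struct epoll_event out;

        if (epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
            write(p[1], "x", 1) == 1) {
            results[0] = epoll_wait(ep, &out, 1, 0);  /* delivered once */
            results[1] = epoll_wait(ep, &out, 1, 0);  /* FD now disabled */
            /* Re-arm with the same mask; the unread byte makes it ready. */
            rc = epoll_ctl(ep, EPOLL_CTL_MOD, p[0], &ev);
            results[2] = epoll_wait(ep, &out, 1, 0);  /* delivered again */
        }
        close(ep);
    }
    close(p[0]);
    close(p[1]);
    return rc;
}
```

In a worker pool, the thread that received the event would do its reads first and call EPOLL_CTL_MOD last, so no other thread can pick up the same FD mid-processing.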