Zero-Copy I/O
Two CPU copies you can stop paying for
Zero-copy I/O moves bytes between file descriptors without copying them through user-space buffers. Calls like sendfile, splice, and io_uring cut the read+write loop's four user/kernel transitions per chunk down to two, roughly doubling throughput on file servers and dropping CPU usage by 30–50% on saturated 10 GbE links.
- read+write copies: 4 (2 CPU, 2 DMA)
- sendfile copies: 2 (DMA only)
- User/kernel transitions: 4 → 2
- Throughput gain: ~2× on large files
- Linux since: 2.2 (sendfile), 5.1 (io_uring)
How zero-copy works
The textbook way to send a file over a socket is two syscalls: read(file_fd, buf, n) followed by write(socket_fd, buf, n). It looks innocent. Underneath, the kernel does four data movements and four context switches per chunk:
- Disk → page cache (DMA). The disk controller drops bytes straight into kernel memory.
- Page cache → user buffer (CPU copy). read() returns to user space; the CPU memcpy's every byte.
- User buffer → socket buffer (CPU copy). write() traps back into the kernel and the CPU memcpy's it again.
- Socket buffer → NIC (DMA). The network card pulls the bytes out of kernel memory and puts them on the wire.
The two CPU copies are pure tax. The user-space process never reads the bytes; it just hands them off. Zero-copy syscalls let the kernel skip those copies by passing page references instead of page contents. sendfile(out_fd, in_fd, off, n) tells the kernel: "the file you already have in cache, ship it directly to that socket." When the NIC supports scatter-gather DMA and checksum offload, it does exactly that: page cache → NIC, no user-space round-trip. Without those features the kernel still makes one CPU copy into the socket buffer, which is better than two. kTLS extends the same path to encrypted connections.
The gain is biggest when the file is already hot in the page cache and the link is fast. On a 10 GbE NIC serving 1 MB static files, read/write tops out around 600 MB/s with one CPU pegged; sendfile sustains 1.1 GB/s with the same core 40% idle.
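For reference, the textbook loop that the zero-copy calls replace could be sketched in Python as follows. The function name and the 64 KiB chunk size are assumptions made for illustration, not something the benchmarks above prescribe.

```python
import os

CHUNK = 64 * 1024  # hypothetical chunk size; tune for your workload


def copy_via_user_space(out_fd: int, in_fd: int) -> int:
    """Send everything from in_fd to out_fd with the classic read + write loop."""
    total = 0
    while True:
        buf = os.read(in_fd, CHUNK)           # page cache -> user buffer (CPU copy)
        if not buf:                           # empty result means EOF
            return total
        view = memoryview(buf)
        while view:                           # write() may accept fewer bytes than offered
            written = os.write(out_fd, view)  # user buffer -> socket buffer (CPU copy)
            view = view[written:]
            total += written
```

Every chunk crosses the user/kernel boundary twice in each direction, which is the overhead sendfile removes.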
When to use zero-copy
- Static file servers, CDNs, video streaming — bytes flow from disk to socket untouched.
- Proxies that forward bytes verbatim: splice + pipe between sockets.
- High-rate logging where the producer just dumps records: vmsplice or io_uring with registered buffers.
- Anywhere the user-space process never inspects, rewrites, or compresses the payload.
You cannot use plain zero-copy when you need to transform bytes — gzip, image resize, template substitution, application-layer encryption. Linux kTLS is the workaround for HTTPS specifically; everything else still pays the round-trip.
Zero-copy variants compared
| Variant | Copies | Syscalls | Restrictions | Best for |
|---|---|---|---|---|
| read + write | 4 (2 CPU) | 2 per chunk | None | Tiny files, transform pipelines |
| sendfile | 2 (DMA only) | 1 per chunk | in_fd must be a regular file; out_fd was socket-only pre-2.6.33 | Static HTTP, file → socket |
| splice | 0 in kernel (page refs) | 1–2 per chunk | One end must be a pipe | Socket → socket proxy via pipe |
| vmsplice | 0 (gifts user pages) | 1 | User pages must stay alive | Logging, batch ingest |
| MSG_ZEROCOPY (send) | 0 CPU on send | 1 + ack via errqueue | Async completion; user buffer pinned until kernel signals release | Large unidirectional sends |
| io_uring (with registered buffers) | 0 CPU | 0 (sqe ring) | Linux 5.1+; kernel polling for full no-syscall path | Database engines, async servers |
sendfile is the simplest win; nginx uses it whenever the sendfile directive is on, which is how most packaged configurations ship. splice generalizes the idea (pages move by reference inside the kernel) at the cost of needing a pipe in the middle. MSG_ZEROCOPY is the right choice when the source is already a user-space buffer (e.g. a memory-resident dataset). io_uring wraps all of these in a unified async interface; on Linux 6.x it can do hundreds of thousands of file→socket sends per second per core.
Python: sendfile
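import os
import socket


def serve_file(conn: socket.socket, path: str) -> None:
    """Send the whole file at `path` over `conn` using the sendfile syscall."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            # os.sendfile(out_fd, in_fd, offset, count) returns the bytes sent,
            # which can be fewer than requested, so loop until done.
            sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break  # EOF: the file shrank after fstat
            offset += sent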
Python exposes os.sendfile directly on Unix platforms. The socket.sendfile method also wraps it, falling back to a plain read/send loop where os.sendfile is unavailable (Windows, for example). On Linux, an HTTP file server using this loop will saturate a 1 Gb link at roughly 5% CPU compared to 25–30% with the naive read/write equivalent.
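A minimal sketch of the higher-level wrapper, assuming a blocking SOCK_STREAM socket (socket.sendfile rejects non-blocking sockets). The function name serve_file_portable is made up for this example.

```python
import socket


def serve_file_portable(conn: socket.socket, path: str) -> int:
    """Send a whole file with socket.sendfile and return the byte count."""
    with open(path, "rb") as f:  # must be a regular file opened in binary mode
        # Uses os.sendfile where available, otherwise falls back to send().
        return conn.sendfile(f)
```

This trades control over offsets and partial sends for portability: the same call works whether or not the platform has a kernel sendfile.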
Node.js: fs.copyFile fast path and stream pipelines
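import { createReadStream } from 'node:fs';
import { createServer } from 'node:http';
import { pipeline } from 'node:stream/promises';

createServer(async (req, res) => {
  res.writeHead(200, { 'Content-Type': 'video/mp4' });
  try {
    // pipeline() handles backpressure and cleanup. The bytes still pass
    // through user-space Buffers; Node does not hand this off to sendfile().
    await pipeline(createReadStream('./big.mp4'), res);
  } catch (err) {
    // Client disconnects and read errors land here instead of becoming
    // unhandled rejections.
    res.destroy();
  }
}).listen(8080);
Node's stream.pipeline is the idiomatic way to serve a file, but it is not a zero-copy path: the ReadStream reads chunks into user-space Buffers (64 KiB by default) and writes them to the socket, so underneath it is the read/write loop with backpressure handled for you. libuv does expose uv_fs_sendfile, but Node core does not use it for stream piping. Insert a transform (gzip, a custom slicer) and you pay the same user-space round-trip plus the transform itself. The in-kernel fast path Node does expose is fs.copyFile, which on Linux lets libuv copy inside the kernel rather than through a user-space buffer. With the COPYFILE_FICLONE flag it goes one better on copy-on-write filesystems (Btrfs, XFS reflinks, APFS): the kernel shares the extents and the "copy" is O(1) regardless of file size.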
C: splice for socket-to-socket proxy
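#define _GNU_SOURCE  // required for splice() and SPLICE_F_MOVE
#include <fcntl.h>
#include <unistd.h>

// Forward up to len bytes from in_sock to out_sock without ever touching them.
// Returns the number of bytes forwarded, or -1 if the pipe cannot be created.
ssize_t splice_proxy(int in_sock, int out_sock, size_t len) {
    int pipefd[2];
    if (pipe(pipefd) < 0) return -1;
    ssize_t total = 0;
    while (len > 0) {
        // Socket -> pipe: the kernel parks the data in the pipe.
        ssize_t n = splice(in_sock, NULL, pipefd[1], NULL, len, SPLICE_F_MOVE);
        if (n <= 0) break;  // 0 = peer closed, < 0 = error
        len -= (size_t)n;
        // Pipe -> socket: drain everything just queued, since splice may
        // return short and leftover bytes would otherwise be lost on close.
        while (n > 0) {
            ssize_t m = splice(pipefd[0], NULL, out_sock, NULL, (size_t)n, SPLICE_F_MOVE);
            if (m <= 0) goto done;
            n -= m;
            total += m;
        }
    }
done:
    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}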
The pipe is the trick. splice requires one end to be a pipe so the kernel has somewhere to park page references between the two halves. The data never enters user space — only descriptors and lengths cross the syscall boundary. HAProxy uses this exact pattern; it's why a single HAProxy core can proxy 40 Gb/s of TCP traffic.
Costed claims
- Context switches: read + write does 4 user/kernel transitions per chunk; sendfile does 2; io_uring with kernel-side polling can do 0 amortized over a batch.
- CPU per gigabyte: on a Skylake server-class CPU, a memcpy of 1 GB at L3 bandwidth costs ~30 ms of CPU time. Skipping two of them saves ~60 ms per GB transferred, which is a 5–8% throughput uplift on its own and more once cache pressure is factored in.
- Throughput on a saturated 10 GbE link: nginx benchmarks consistently show sendfile at ~1.1 GB/s vs ~0.6 GB/s for read/write on identical hardware, roughly 2× — that's where the "doubles throughput" rule of thumb comes from.
- Page size: sendfile and splice operate on 4 KB pages; 2 MB huge pages can amortize ref-count overhead further on multi-GB transfers.
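The CPU-per-gigabyte figure can be checked with a few lines of arithmetic. The inputs below are the numbers quoted in this article, not fresh measurements.

```python
MEMCPY_MS_PER_GB = 30.0   # from the text: one CPU copy of 1 GB at L3 bandwidth
COPIES_SKIPPED = 2        # page cache -> user buffer, user buffer -> socket buffer
SENDFILE_GB_PER_S = 1.1   # from the text: sendfile on a saturated 10 GbE link

saved_ms_per_gb = MEMCPY_MS_PER_GB * COPIES_SKIPPED
wall_ms_per_gb = 1000.0 / SENDFILE_GB_PER_S
fraction_saved = saved_ms_per_gb / wall_ms_per_gb

print(f"CPU time saved per GB: {saved_ms_per_gb:.0f} ms")
print(f"Wall time per GB at {SENDFILE_GB_PER_S} GB/s: {wall_ms_per_gb:.0f} ms")
print(f"Share of transfer time: {fraction_saved:.1%}")
```

With these inputs the saving is 60 ms against roughly 909 ms of wall time per gigabyte, about 6.6%, which sits inside the 5–8% range quoted above.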
Common bugs and edge cases
- Pre-2.6.33 sendfile was socket-only. Old code paths assumed out_fd had to be a socket; modern kernels accept any file. Distro-locked systems may still hit the old restriction.
- splice direction confusion. The pipe end must match the data direction: the write end (pipefd[1]) when the pipe is the destination, the read end (pipefd[0]) when it is the source. Swapping them fails with EBADF, and a call where neither descriptor is a pipe fails with EINVAL; neither error says which argument was wrong.
- MSG_ZEROCOPY user-buffer lifetime. The kernel takes a reference and only releases it via the socket error queue. If you reuse or free the buffer before the ack, you'll send garbage to the wire.
- kTLS feature gates. Older NICs don't offload TLS; the kernel falls back silently to software encryption, halving your gain. Check the device-offload counters in /proc/net/tls_stat (for example TlsTxDevice versus TlsTxSw).
- Truncation during sendfile. If the underlying file is truncated mid-transfer, sendfile returns short and the connection sees a partial response. Application protocols need a length prefix or chunked framing to detect it.
- SIGBUS on tail pages. When the source is mmap-backed and the file shrinks, accessing the truncated page raises SIGBUS. sendfile sees the same shrinkage, though as a short return rather than a signal; defensive servers either lock the file or record the size up front.
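One way to surface the truncation case is to record the size up front and treat an early EOF as an error instead of a quiet short send. This is a sketch under that assumption; SourceTruncated and sendfile_exact are hypothetical names, not part of any library.

```python
import os


class SourceTruncated(Exception):
    """Raised when the source file ends before the promised length."""


def sendfile_exact(out_fd: int, in_fd: int, expected: int) -> int:
    """Send exactly `expected` bytes with os.sendfile or raise SourceTruncated."""
    offset = 0
    while offset < expected:
        sent = os.sendfile(out_fd, in_fd, offset, expected - offset)
        if sent == 0:  # EOF before `expected`: the file shrank under us
            raise SourceTruncated(f"sent {offset} of {expected} bytes")
        offset += sent
    return offset
```

A caller would typically pass the st_size it advertised in Content-Length and close the connection on SourceTruncated, so the peer sees an aborted transfer rather than a response that looks complete.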
Frequently asked questions
What does zero-copy actually copy zero of?
Zero refers to user-space copies. The kernel still reads from disk and writes to NIC, but it never copies the bytes into a process buffer and back out. DMA hardware moves them directly between page cache and the device.
Why does read() + write() do four copies?
Disk → kernel page cache (DMA), page cache → user buffer (CPU), user buffer → kernel socket buffer (CPU), socket buffer → NIC (DMA). The two CPU copies are pure overhead when the user never inspects the bytes.
Is sendfile() always faster than read()/write()?
For files larger than the L3 cache, yes — typically 2× throughput. For tiny files (under a few KB) the syscall overhead dominates and the gap shrinks. If you need to transform bytes (gzip, encrypt, template) you have to read them into user space anyway.
What's the difference between sendfile and splice?
sendfile needs a file as its source (a socket will not do); the destination is usually a socket and, since Linux 2.6.33, can be any file. splice works between any two descriptors as long as one end is a pipe. It moves pages by reference inside the kernel, so you can chain pipe-mediated splices for things like proxying.
Does TLS break zero-copy?
Plain sendfile, yes — the kernel can't encrypt for you. Linux's kTLS pushes encryption into the kernel so sendfile keeps working with HTTPS. Without kTLS you fall back to read + encrypt + write, four copies again.