Zero-Copy I/O

Two CPU copies you can stop paying for

Zero-copy I/O moves bytes between file descriptors without copying them through user-space buffers. Calls like sendfile, splice and io_uring cut the four user/kernel transitions of the read+write loop down to two or fewer, roughly doubling throughput on file servers and dropping CPU usage by 30–50% on saturated 10 GbE links.

  • read+write copies: 4 (2 CPU, 2 DMA)
  • sendfile copies: 2 (DMA only)
  • User/kernel transitions: 4 → 2
  • Throughput gain: ~2× on large files
  • Linux since: 2.2 (sendfile), 5.1 (io_uring)

How zero-copy works

The textbook way to send a file over a socket is two syscalls: read(file_fd, buf, n) followed by write(socket_fd, buf, n). It looks innocent. Underneath, the kernel does four data movements and four context switches per chunk:

  1. Disk → page cache (DMA). The disk controller drops bytes straight into kernel memory.
  2. Page cache → user buffer (CPU copy). read() returns to user space; the CPU memcpy's every byte.
  3. User buffer → socket buffer (CPU copy). write() traps back into the kernel and the CPU memcpy's it again.
  4. Socket buffer → NIC (DMA). The network card drops bytes onto the wire.
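
The four steps above correspond to a loop like the one below. This is a minimal sketch, not a reference implementation: the 64 KiB chunk size is an assumption, and `copy_read_write` with its syscall counter is an illustrative name rather than a library function.

```python
import os
from typing import Tuple

CHUNK = 64 * 1024  # assumed chunk size; real servers tune this


def copy_read_write(in_fd: int, out_fd: int, chunk: int = CHUNK) -> Tuple[int, int]:
    """Copy in_fd to out_fd through a user-space buffer.

    Returns (bytes_copied, syscalls_made). Each syscall is one trip
    into the kernel and one trip back out.
    """
    total = 0
    syscalls = 0
    while True:
        buf = os.read(in_fd, chunk)           # page cache -> user buffer (CPU copy)
        syscalls += 1
        if not buf:                           # empty read means end of file
            break
        view = memoryview(buf)
        while view:                           # write() may accept fewer bytes than offered
            written = os.write(out_fd, view)  # user buffer -> kernel buffer (CPU copy)
            syscalls += 1
            view = view[written:]
            total += written
    return total, syscalls
```

Every chunk costs at least one read() and one write(), which is where the four transitions per chunk come from. The process never looks inside `buf`; it only carries it from one syscall to the next.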

The two CPU copies are pure tax. The user-space process never reads the bytes; it just hands them off. Zero-copy syscalls let the kernel skip those copies by passing page references instead of page contents. sendfile(out_fd, in_fd, off, n) tells the kernel: "the file you already have in cache, ship it directly to that socket." When the NIC supports scatter-gather DMA and checksum offload (plus kTLS if the connection is encrypted), it does exactly that: page cache → NIC, no user-space round-trip.

The gain is biggest when the file is already hot in the page cache and the link is fast. On a 10 GbE NIC serving 1 MB static files, read/write tops out around 600 MB/s with one CPU pegged; sendfile sustains 1.1 GB/s with the same core 40% idle.

When to use zero-copy

  • Static file servers, CDNs, video streaming — bytes flow from disk to socket untouched.
  • Proxies that forward bytes verbatim — splice + pipe between sockets.
  • High-rate logging where the producer just dumps records — vmsplice or io_uring with registered buffers.
  • Anywhere the user-space process never inspects, rewrites, or compresses the payload.

You cannot use plain zero-copy when you need to transform bytes — gzip, image resize, template substitution, application-layer encryption. Linux kTLS is the workaround for HTTPS specifically; everything else still pays the round-trip.

Zero-copy variants compared

| Variant | Copies | Syscalls | Restrictions | Best for |
|---|---|---|---|---|
| read + write | 4 (2 CPU) | 2 per chunk | None | Tiny files, transform pipelines |
| sendfile | 2 (DMA only) | 1 per chunk | in_fd must be a regular file; out_fd was socket-only pre-2.6.33 | Static HTTP, file → socket |
| splice | 0 in kernel (page refs) | 1–2 per chunk | One end must be a pipe | Socket → socket proxy via pipe |
| vmsplice | 0 (gifts user pages) | 1 | User pages must stay alive | Logging, batch ingest |
| MSG_ZEROCOPY (send) | 0 CPU on send | 1 + ack via errqueue | Async completion; user buffer pinned until kernel signals release | Large unidirectional sends |
| io_uring (with registered buffers) | 0 CPU | 0 (SQE ring) | Linux 5.1+; kernel polling for full no-syscall path | Database engines, async servers |

sendfile is the simplest win and is what nginx uses once sendfile on is set, as it is in the stock nginx.conf. splice generalizes the idea (pages move by reference inside the kernel) at the cost of needing a pipe in the middle. MSG_ZEROCOPY is the right choice when the source is already a user-space buffer (e.g. a memory-resident dataset). io_uring wraps all of these in a unified async interface; on Linux 6.x it can do hundreds of thousands of file→socket sends per second per core.

Python: sendfile

import os, socket

def serve_file(conn: socket.socket, path: str):
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break  # EOF or socket closed
            offset += sent

Python exposes os.sendfile directly. The socket.sendfile method also wraps it, falling back to plain send() where os.sendfile is unavailable (Windows, for example) or the file is not a regular file. On Linux, an HTTP file server using this loop will saturate a 1 Gb link at roughly 5% CPU compared to 25–30% with the naive read/write equivalent.
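
The higher-level wrapper can be sketched as follows. `send_whole_file` and `demo` are illustrative names, and the local socket pair stands in for a real client connection; treat it as a way to try the call, not as a server design.

```python
import os
import socket
import tempfile
import threading


def send_whole_file(conn: socket.socket, path: str) -> int:
    """Send the file at path over a blocking stream socket; returns bytes sent."""
    with open(path, "rb") as f:
        # socket.sendfile() uses os.sendfile where it can and falls back
        # to plain send() otherwise. It does not support non-blocking sockets.
        return conn.sendfile(f)


def demo() -> bool:
    """Round-trip a temporary file over a local socket pair."""
    payload = os.urandom(256 * 1024)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    left, right = socket.socketpair()
    received = bytearray()

    def drain() -> None:
        # Read until the sender closes its end.
        while chunk := right.recv(65536):
            received.extend(chunk)

    reader = threading.Thread(target=drain)
    reader.start()
    try:
        sent = send_whole_file(left, path)
    finally:
        left.close()
        reader.join()
        right.close()
        os.unlink(path)
    return sent == len(payload) and bytes(received) == payload
```

The wrapper hides the offset loop and the fallback, which makes it the more portable choice; the explicit os.sendfile loop is useful when you need control over offsets or non-blocking behaviour.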

Node.js: fs.copyFile fast path and stream pipelines

import { createReadStream } from 'node:fs';
import { createServer } from 'node:http';
import { pipeline } from 'node:stream/promises';

createServer(async (req, res) => {
  res.writeHead(200, { 'Content-Type': 'video/mp4' });
  // createReadStream reads chunks into user-space Buffers and pipeline
  // writes them to the socket: this is the read/write path, not sendfile().
  await pipeline(createReadStream('./big.mp4'), res);
}).listen(8080);

Node's stream.pipeline does not reach sendfile. createReadStream pulls chunks into user-space Buffers through libuv's thread pool, and the HTTP response writes them back into the kernel. What pipeline adds is backpressure and error propagation, so the example above is the well-behaved form of the read/write loop rather than a zero-copy path, with or without a Transform (gzip, custom slicer) in between. The kernel-side fast path Node does expose is fs.copyFile, which copies file to file without surfacing the bytes to JavaScript. With the COPYFILE_FICLONE flag it goes one better on copy-on-write filesystems (Btrfs, XFS reflinks, APFS): the kernel shares the extents and the "copy" is O(1) regardless of file size.

C: splice for socket-to-socket proxy

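#define _GNU_SOURCE   // splice() and SPLICE_F_MOVE are Linux extensions
#include <fcntl.h>
#include <unistd.h>

// Forward up to len bytes from in_sock to out_sock without ever touching them.
// Returns the number of bytes forwarded, or -1 on error.
ssize_t splice_proxy(int in_sock, int out_sock, size_t len) {
    int pipefd[2];
    if (pipe(pipefd) < 0) return -1;
    ssize_t total = 0;
    int failed = 0;
    while (len > 0 && !failed) {
        // Socket -> pipe: park page references in the pipe.
        ssize_t n = splice(in_sock, NULL, pipefd[1], NULL, len, SPLICE_F_MOVE);
        if (n < 0) failed = 1;
        if (n <= 0) break;  // 0 means the peer closed the connection
        len -= (size_t)n;
        // Pipe -> socket: the second splice may be short, so drain the
        // pipe completely before reading more from in_sock.
        while (n > 0) {
            ssize_t m = splice(pipefd[0], NULL, out_sock, NULL, (size_t)n, SPLICE_F_MOVE);
            if (m <= 0) { failed = 1; break; }
            n -= m;
            total += m;
        }
    }
    close(pipefd[0]); close(pipefd[1]);
    return failed ? -1 : total;
}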

The pipe is the trick. splice requires one end to be a pipe so the kernel has somewhere to park page references between the two halves. The data never enters user space; only descriptors and lengths cross the syscall boundary. HAProxy can forward TCP this way when its splice options (option splice-auto, splice-request, splice-response) are enabled, which is one reason a single HAProxy core can proxy tens of gigabits per second of TCP traffic.

Costed claims

  • Context switches: read + write does 4 user/kernel transitions per chunk; sendfile does 2; io_uring with kernel-side polling can do 0 amortized over a batch.
  • CPU per gigabyte: on a Skylake server class CPU, a CPU memcpy of 1 GB at L3 bandwidth costs ~30 ms of CPU. Skipping two of them saves ~60 ms per GB transferred — a 5–8% throughput uplift on its own, more once cache pressure is factored in.
  • Throughput on a saturated 10 GbE link: nginx benchmarks consistently show sendfile at ~1.1 GB/s vs ~0.6 GB/s for read/write on identical hardware, roughly 2× — that's where the "doubles throughput" rule of thumb comes from.
  • Page size: sendfile and splice operate on 4 KB pages; where 2 MB huge pages back the data, ref-count overhead can be amortized further on multi-GB transfers.
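
The CPU-per-gigabyte figure can be checked with a few lines of arithmetic. The copy bandwidth of about 33 GB/s is an assumption, chosen because it is what the ~30 ms per GB figure implies; the function names are illustrative.

```python
def memcpy_ms_per_gb(copy_bandwidth_gb_s: float) -> float:
    """Milliseconds of CPU time to copy 1 GB at the given bandwidth."""
    return 1000.0 / copy_bandwidth_gb_s


def copy_share_of_transfer(copies: int, copy_bandwidth_gb_s: float,
                           throughput_gb_s: float) -> float:
    """Fraction of the time to transfer 1 GB that the CPU copies account for."""
    copy_ms = copies * memcpy_ms_per_gb(copy_bandwidth_gb_s)
    transfer_ms = 1000.0 / throughput_gb_s
    return copy_ms / transfer_ms


# Assumed inputs: 33 GB/s copy bandwidth, 1.1 GB/s link throughput.
per_copy = memcpy_ms_per_gb(33.0)               # about 30.3 ms per GB
share = copy_share_of_transfer(2, 33.0, 1.1)    # about 0.067
```

Two copies at roughly 30 ms each come to about 60 ms per GB, and 1 GB takes about 909 ms at 1.1 GB/s, so the copies alone account for close to 6.7% of the transfer time. That sits inside the 5–8% range quoted above; the rest of the 2× gap has to come from fewer syscalls and less cache pressure.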

Common bugs and edge cases

  • Pre-2.6.33 sendfile was socket-only. Old code paths assumed out_fd had to be a socket; modern kernels accept any file. Distro-locked systems may still hit the old restriction.
  • splice direction confusion. The pipe end must match the data direction: the pipe's read end when the pipe is the source, its write end when the pipe is the sink. Swapping them fails with EBADF, which gives no hint about which descriptor was wrong.
  • MSG_ZEROCOPY user-buffer lifetime. The kernel takes a reference and only releases it via the socket error queue. If you reuse or free the buffer before the ack, you'll send garbage to the wire.
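  • kTLS feature gates. Older NICs don't offload TLS; the kernel falls back silently to software encryption, which keeps sendfile working but gives back part of the CPU saving. Compare the TlsTxSw and TlsTxDevice counters in /proc/net/tls_stat to see which path your connections took.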
  • Truncation during sendfile. If the underlying file is truncated mid-transfer, sendfile returns short and the connection sees a partial response. Application protocols need a length prefix or chunked framing to detect it.
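  • SIGBUS on tail pages. When the source is mmap-backed and the file shrinks, touching a page past the new end raises SIGBUS. sendfile does not fault this way: it returns a short count instead, which is the truncation case above. Defensive servers either lock the file or record the size up front and check that every byte was sent.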
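
The truncation case can be made loud rather than silent. The sketch below is one possible approach, not the only one: `sendfile_exact` and `TruncatedSource` are hypothetical names, and the caller is assumed to have captured the length (for example with os.fstat) before promising it to the client.

```python
import os


class TruncatedSource(OSError):
    """Raised when the source file ends before the promised length."""


def sendfile_exact(out_fd: int, in_fd: int, length: int) -> int:
    """Send exactly length bytes from the start of in_fd, or raise."""
    offset = 0
    while offset < length:
        sent = os.sendfile(out_fd, in_fd, offset, length - offset)
        if sent == 0:
            # sendfile() reports end of file with 0, not with an error.
            raise TruncatedSource(f"source ended at byte {offset} of {length}")
        offset += sent
    return offset
```

A server that has already sent a Content-Length header cannot repair a short body, so the usual reaction to `TruncatedSource` is to close the connection and let the client see the failure instead of a silently truncated response.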

Frequently asked questions

What does zero-copy actually copy zero of?

Zero refers to user-space copies. The kernel still reads from disk and writes to NIC, but it never copies the bytes into a process buffer and back out. DMA hardware moves them directly between page cache and the device.

Why does read() + write() do four copies?

Disk → kernel page cache (DMA), page cache → user buffer (CPU), user buffer → kernel socket buffer (CPU), socket buffer → NIC (DMA). The two CPU copies are pure overhead when the user never inspects the bytes.

Is sendfile() always faster than read()/write()?

For files larger than the L3 cache, yes — typically 2× throughput. For tiny files (under a few KB) the syscall overhead dominates and the gap shrinks. If you need to transform bytes (gzip, encrypt, template) you have to read them into user space anyway.

What's the difference between sendfile and splice?

sendfile needs a file-backed source: in_fd must support mmap-like access, so it cannot be a socket, and before Linux 2.6.33 out_fd had to be a socket. splice works between any two descriptors as long as one end is a pipe. It moves pages by reference inside the kernel, so you can chain pipe-mediated splices for things like proxying.

Does TLS break zero-copy?

Plain sendfile, yes — the kernel can't encrypt for you. Linux's kTLS pushes encryption into the kernel so sendfile keeps working with HTTPS. Without kTLS you fall back to read + encrypt + write, four copies again.