What does zero-copy actually copy zero of?

Zero refers to user-space copies. The kernel still reads from disk and writes to NIC, but it never copies the bytes into a process buffer and back out. DMA hardware moves them directly between page cache and the device.

Why does read() + write() do four copies?

Disk → kernel page cache (DMA), page cache → user buffer (CPU), user buffer → kernel socket buffer (CPU), socket buffer → NIC (DMA). The two CPU copies are pure overhead when the user never inspects the bytes.

Is sendfile() always faster than read()/write()?

Not always. For large files, where the copies rather than the syscalls dominate, it usually is, and roughly 2× throughput is a common rule of thumb. For tiny files (under a few KB) the syscall overhead dominates and the gap shrinks. If you need to transform bytes (gzip, encrypt, template), you have to read them into user space anyway, so sendfile does not apply.

What's the difference between sendfile and splice?

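sendfile always reads from a file (in_fd cannot be a socket). Before Linux 2.6.33 out_fd had to be a socket; since then it can be any file, though file → socket remains the typical use. splice works between any two descriptors as long as one end is a pipe. It moves pages by reference inside the kernel, so you can chain pipe-mediated splices for things like proxying.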

Does TLS break zero-copy?

Plain sendfile, yes — the kernel can't encrypt for you. Linux's kTLS pushes encryption into the kernel so sendfile keeps working with HTTPS. Without kTLS you fall back to read + encrypt + write, four copies again.

Zero-Copy I/O — How sendfile, splice & io_uring Skip User Space

How zero-copy works

The textbook way to send a file over a socket is two syscalls: read(file_fd, buf, n) followed by write(socket_fd, buf, n). It looks innocent. Underneath, the kernel does four data movements and four context switches per chunk:

Disk → page cache (DMA). The disk controller drops bytes straight into kernel memory.
Page cache → user buffer (CPU copy). read() returns to user space; the CPU memcpy's every byte.
User buffer → socket buffer (CPU copy). write() traps back into the kernel and the CPU memcpy's it again.
Socket buffer → NIC (DMA). The network card drops bytes onto the wire.
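
The four steps above correspond to a loop like the following. This is a minimal sketch of the baseline that zero-copy replaces, assuming a blocking socket; the function name copy_via_user_space and the 64 KiB chunk size are illustrative choices, not taken from any library.

```python
import socket
from typing import BinaryIO

CHUNK = 64 * 1024  # illustrative chunk size (assumption)


def copy_via_user_space(conn: socket.socket, f: BinaryIO) -> int:
    """Send f over conn with read + send; every byte passes through user space."""
    total = 0
    buf = bytearray(CHUNK)
    view = memoryview(buf)
    while True:
        n = f.readinto(buf)        # page cache -> user buffer (CPU copy)
        if not n:
            break                  # end of file
        conn.sendall(view[:n])     # user buffer -> socket buffer (CPU copy)
        total += n
    return total
```

Nothing in the loop ever looks at the contents of buf, which is exactly the situation where the two CPU copies buy nothing.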

The two CPU copies are pure tax: the user-space process never reads the bytes, it just hands them off. Zero-copy syscalls let the kernel skip those copies by passing page references instead of page contents. sendfile(out_fd, in_fd, off, n) tells the kernel: "the file you already have in cache, ship it directly to that socket." When the NIC supports scatter-gather DMA and checksum offload, it does exactly that: page cache → NIC, no user-space round-trip. Without that hardware support the kernel still makes one CPU copy into the socket buffer.

The gain is biggest when the file is already hot in the page cache and the link is fast. On a 10 GbE NIC serving 1 MB static files, read/write tops out around 600 MB/s with one CPU pegged; sendfile sustains 1.1 GB/s with the same core 40% idle.

When to use zero-copy

Static file servers, CDNs, video streaming — bytes flow from disk to socket untouched.
Proxies that forward bytes verbatim — splice + pipe between sockets.
High-rate logging where the producer just dumps records — vmsplice or io_uring with registered buffers.
Anywhere the user-space process never inspects, rewrites, or compresses the payload.

You cannot use plain zero-copy when you need to transform bytes — gzip, image resize, template substitution, application-layer encryption. Linux kTLS is the workaround for HTTPS specifically; everything else still pays the round-trip.
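
To make the transform case concrete, here is a minimal sketch of compressing on the fly with the standard zlib module. The name send_gzipped and the chunk size are hypothetical; the point is only that the compressor needs the bytes in user space, so the read is unavoidable.

```python
import socket
import zlib
from typing import BinaryIO


def send_gzipped(conn: socket.socket, f: BinaryIO, chunk: int = 64 * 1024) -> None:
    """Stream f over conn as a gzip stream; the bytes must pass through user space."""
    # wbits=31 selects the gzip container (16 + a 15-bit window).
    comp = zlib.compressobj(wbits=31)
    while True:
        data = f.read(chunk)          # kernel -> user space: unavoidable here
        if not data:
            break
        out = comp.compress(data)     # the transform runs on user-space bytes
        if out:
            conn.sendall(out)
    conn.sendall(comp.flush())        # emit the remaining data and gzip trailer
```

A common way around this for static assets is to compress once ahead of time and serve the compressed file with sendfile, which puts the payload back in the "untouched bytes" category.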

Zero-copy variants compared

Method	Copies	Syscalls	Restrictions	Best for
read + write	4 (2 CPU)	2 per chunk	None	Tiny files, transform pipelines
sendfile	2 (DMA only)	1 per chunk	in_fd must be a regular file; out_fd was socket-only pre-2.6.33	Static HTTP, file → socket
splice	0 in kernel (page refs)	1–2 per chunk	One end must be a pipe	Socket → socket proxy via pipe
vmsplice	0 (gifts user pages)	1	User pages must stay alive	Logging, batch ingest
MSG_ZEROCOPY (send)	0 CPU on send	1 + ack via errqueue	Async completion; user buffer pinned until kernel signals release	Large unidirectional sends
io_uring (with registered buffers)	0 CPU	0 (sqe ring)	Linux 5.1+; kernel polling for full no-syscall path	Database engines, async servers

sendfile is the simplest win and is what nginx defaults to. splice generalizes the idea — pages move by reference inside the kernel — at the cost of needing a pipe in the middle. MSG_ZEROCOPY is the right choice when the source is already a user-space buffer (e.g. a memory-resident dataset). io_uring wraps all of these in a unified async interface; on Linux 6.x it can do hundreds of thousands of file→socket sends per second per core.

Python: sendfile

import os, socket

def serve_file(conn: socket.socket, path: str):
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break  # EOF or socket closed
            offset += sent

Python exposes os.sendfile directly. The higher-level socket.sendfile method wraps it and falls back to plain send() where os.sendfile is unavailable (for example on Windows) or the file is not a regular file. On Linux, an HTTP file server using this loop can saturate a 1 Gb link at roughly 5% CPU, compared with 25–30% for the naive read/write equivalent; exact figures depend on hardware and kernel version.
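
For comparison, the socket.sendfile wrapper reduces the loop to a single call. This is a minimal sketch assuming a blocking SOCK_STREAM socket, which the method requires; serve_file_portable is a hypothetical name.

```python
import socket


def serve_file_portable(conn: socket.socket, path: str) -> int:
    """Send a whole file with socket.sendfile; returns the number of bytes sent."""
    # The file must be opened in binary mode.
    with open(path, "rb") as f:
        # Uses os.sendfile where available, otherwise falls back to send().
        return conn.sendfile(f)
```

The explicit os.sendfile loop is still useful when you need control over offsets and partial sends, for example to serve byte ranges.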

Node.js: fs.copyFile fast path and stream pipelines

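import { createReadStream } from 'node:fs';
import { createServer } from 'node:http';
import { pipeline } from 'node:stream/promises';

createServer(async (req, res) => {
  res.writeHead(200, { 'Content-Type': 'video/mp4' });
  try {
    // Chunks are read into user-space Buffers and then written to the socket.
    // pipeline handles backpressure and cleanup, but this is not sendfile().
    await pipeline(createReadStream('./big.mp4'), res);
  } catch (err) {
    // A client disconnect rejects the promise; pipeline has already
    // destroyed both streams, so there is nothing left to send.
    console.error('stream aborted:', err.code ?? err.message);
  }
}).listen(8080);

Despite appearances, this path is not zero-copy. Node's stream.pipeline moves data through user-space Buffers: createReadStream reads chunks (64 KiB by default) from the file and the HTTP response writes them to the socket, so it behaves like read/write with backpressure handled for you. Node does not expose sendfile for sockets in its public API, and inserting a Transform (gzip, custom slicer) changes nothing about the copy count. Where Node does reach a kernel fast path is fs.copyFile: libuv performs the copy inside the kernel where the platform allows, and with the COPYFILE_FICLONE flag it attempts a reflink on copy-on-write filesystems (Btrfs, XFS with reflinks, APFS), so the kernel shares the extents and the "copy" is O(1) regardless of file size.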

C: splice for socket-to-socket proxy

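#define _GNU_SOURCE  /* splice() is a Linux extension; must precede the includes */
#include <fcntl.h>
#include <unistd.h>

// Forward up to len bytes from in_sock to out_sock without copying them into
// user space. Returns the number of bytes forwarded, or -1 on error.
ssize_t splice_proxy(int in_sock, int out_sock, size_t len) {
    int pipefd[2];
    if (pipe(pipefd) < 0) return -1;
    ssize_t total = 0;
    int failed = 0;
    while (len > 0 && !failed) {
        // Socket -> pipe: the write end of the pipe is the sink.
        ssize_t n = splice(in_sock, NULL, pipefd[1], NULL, len, SPLICE_F_MOVE);
        if (n < 0) failed = 1;
        if (n <= 0) break;  // 0 means the peer closed the connection
        len -= (size_t)n;
        // Pipe -> socket: one call may move fewer than n bytes, so keep
        // draining until the pipe is empty again.
        while (n > 0) {
            ssize_t m = splice(pipefd[0], NULL, out_sock, NULL, (size_t)n, SPLICE_F_MOVE);
            if (m <= 0) { failed = 1; break; }
            n -= m;
            total += m;
        }
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return failed ? -1 : total;
}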

The pipe is the trick. splice requires one end to be a pipe so the kernel has somewhere to park page references between the two halves. The data never enters user space; only descriptors and lengths cross the syscall boundary. HAProxy offers this pattern through its splice options, and it is a large part of how a single HAProxy core can reportedly proxy around 40 Gb/s of TCP traffic.

Costed claims

Context switches: read + write does 4 user/kernel transitions per chunk; sendfile does 2; io_uring with kernel-side polling can do 0 amortized over a batch.
CPU per gigabyte: on a Skylake server-class CPU, a memcpy of 1 GB at L3 bandwidth costs ~30 ms of CPU time. Skipping two of them saves ~60 ms per GB transferred, a 5–8% throughput uplift on its own, and more once cache pressure is factored in.
Throughput on a saturated 10 GbE link: nginx benchmarks consistently show sendfile at ~1.1 GB/s vs ~0.6 GB/s for read/write on identical hardware, roughly 2× — that's where the "doubles throughput" rule of thumb comes from.
Page size: sendfile and splice operate on 4 KB pages; 2 MB huge pages can amortize reference-counting overhead further on multi-GB transfers.
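
The arithmetic behind the context-switch and CPU-per-gigabyte items can be checked in a few lines. This sketch only restates the figures quoted above; the decimal gigabyte, the variable names, and the transitions helper are assumptions made for illustration.

```python
GB = 1_000_000_000  # bytes; decimal gigabyte (assumption)

memcpy_ms_per_gb = 30.0        # figure quoted above
sendfile_bytes_per_s = 1.1e9   # figure quoted above

# Implied copy bandwidth: 1 GB in 30 ms is about 33 GB/s.
copy_bandwidth_gb_s = 1 / (memcpy_ms_per_gb / 1000)

# Two CPU copies are skipped per gigabyte transferred.
saved_ms_per_gb = 2 * memcpy_ms_per_gb

# Wall-clock time to move 1 GB at the sendfile rate: about 909 ms.
transfer_ms_per_gb = GB / sendfile_bytes_per_s * 1000

# Saved CPU time as a share of transfer time: about 6.6%.
saved_fraction = saved_ms_per_gb / transfer_ms_per_gb


def transitions(file_size: int, chunk: int, per_chunk: int) -> int:
    """User/kernel transitions needed to move file_size bytes in chunk-sized calls."""
    chunks = -(-file_size // chunk)  # ceiling division
    return chunks * per_chunk
```

For a 1 MiB file moved in 64 KiB chunks, transitions gives 64 for read + write (4 per chunk) and 32 for sendfile (2 per chunk).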

Common bugs and edge cases

Pre-2.6.33 sendfile was socket-only. Old code paths assumed out_fd had to be a socket; modern kernels accept any file. Distro-locked systems may still hit the old restriction.
splice direction confusion. Each pipe end works in one direction only: splice into the pipe through its write end (pipefd[1]) and out of it through its read end (pipefd[0]). Swapping them fails with EBADF, which gives no hint that the pipe ends are reversed.
MSG_ZEROCOPY user-buffer lifetime. The kernel takes a reference and only releases it via the socket error queue. If you reuse or free the buffer before the ack, you'll send garbage to the wire.
kTLS feature gates. Older NICs don't offload TLS; the kernel falls back silently to software encryption, which cuts into your gain. Compare the TlsTxSw and TlsTxDevice counters in /proc/net/tls_stat to see which path is in use.
Truncation during sendfile. If the underlying file is truncated mid-transfer, sendfile returns short and the connection sees a partial response. Application protocols need a length prefix or chunked framing to detect it.
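SIGBUS on tail pages. When the source is mmap-backed and the file shrinks, touching a page past the new end of file raises SIGBUS. sendfile avoids the signal but returns a short count instead, as in the previous item; defensive servers either lock the file or record the size up front and check it against the bytes sent.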
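
One way to act on the truncation item is to treat an early end of file as an error instead of silently stopping. The sketch below assumes the caller already knows the expected size (for example the Content-Length it has advertised); sendfile_exact and TruncatedFileError are hypothetical names.

```python
import os
import socket


class TruncatedFileError(Exception):
    """Raised when the file yields fewer bytes than the caller expected."""


def sendfile_exact(conn: socket.socket, in_fd: int, size: int) -> None:
    """Send exactly size bytes from in_fd, or raise if the file ends early."""
    offset = 0
    while offset < size:
        sent = os.sendfile(conn.fileno(), in_fd, offset, size - offset)
        if sent == 0:
            # End of file before the expected size: the file shrank.
            raise TruncatedFileError(f"sent {offset} of {size} bytes")
        offset += sent
```

What to do after the exception is protocol-specific; with a fixed Content-Length already on the wire, closing the connection is usually the only honest signal left.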
