Systems
Copy-on-Write
Pretend to copy. Pay for it only if someone writes.
Copy-on-write (CoW) shares one underlying buffer between two logical owners and lazily duplicates it the moment either attempts to mutate. The kernel uses it to make fork() cheap. Filesystems like ZFS and Btrfs use it to make snapshots free. Language runtimes used to use it for strings, until thread-safety made it unprofitable. The idea is the same everywhere; the failure modes — fork-explosion in big-heap apps, write amplification on COW filesystems — are very different.
- COW page fault~1 µs
- Eager-copy per page~1 µs
- x86 page size4 KB
- fork() (small process)~50 µs
- ZFS snapshot createO(1), microseconds
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The mechanism
Two parties want their "own copy" of a buffer. The eager solution memcpys at assignment — proportional to size even if neither side mutates. The lazy alternative:
- Make both parties point at the same physical buffer.
- Mark it read-only (or refcount it).
- On a write attempt, allocate a private buffer for the writer, copy the contents, let the write proceed.
If most copies are never modified — a fresh process that immediately execs, a parent and child sharing pages, snapshots that aren't re-read — this turns a linear cost into O(1). If most copies are modified, you've added overhead (page fault, allocator call, copy) without saving anything.
fork() — the textbook case
On Linux, fork() doesn't duplicate memory. It allocates a new task_struct, clones the parent's mm_struct, walks the page tables copying entries while clearing the write bit on every writable PTE in both parent and child, and returns. RSS-sized memory copied: zero.
The first write to a page in either process traps with a write-protection fault. The kernel's COW handler (do_wp_page) checks the refcount: if 1, restore the write bit; if > 1, allocate a fresh page, memcpy, point the faulting process at the new page, restore the write bit, return. ~5 µs, invisible to userspace.
// fork_cow.c — gcc -O2 fork_cow.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>
#include <time.h>
int main(void) {
size_t bytes = 256 << 20;
char* buf = malloc(bytes); memset(buf, 'a', bytes); // 256 MB resident
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
pid_t pid = fork();
clock_gettime(CLOCK_MONOTONIC, &t1);
if (pid == 0) {
printf("fork returned in %.3f ms (zero pages copied)\n",
(t1.tv_sec-t0.tv_sec)*1e3 + (t1.tv_nsec-t0.tv_nsec)/1e6);
buf[0] = 'b'; // COWs exactly one 4 KB page
return 0;
}
wait(NULL);
}
Typical output: fork returned in 2.4 ms. A 256 MB heap "duplicated" in milliseconds. Eager-copying 256 MB at modern memcpy speeds (~10 GB/s) would take ~25 ms — an order of magnitude more, and that assumes a single contiguous copy. Notice how the actual page-by-page work is shifted to whichever process happens to write each page first.
The Python equivalent reveals an interesting wrinkle:
import os, gc, resource
# 100 MB of zero bytes — fork is fast
big = bytearray(100 * 1024 * 1024)
pid = os.fork()
if pid == 0:
# CPython's reference counts live next to the object — every read of `big`
# increments a refcount, dirtying the page, COWing 4 KB. Reading half a
# GB of data in the child can copy half a GB even if you don't 'modify' it.
s = sum(big[::4096])
print("RSS:", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss, "KB")
os._exit(0)
os.waitpid(pid, 0)
This is the classic CPython COW gotcha: garbage collectors and refcounts look like reads but require writes. Forking a Python worker pool with a large pre-warmed model rapidly turns the shared cache into per-worker private copies. The same applies to JVM and V8 mark-phase GC — touching every object marks it.
Copy-on-write strings — the cautionary tale
From the late 1990s through GCC 4.x, std::string in libstdc++ was COW-based. s2 = s1; bumped a refcount; the first call to a non-const s2[i] triggered the actual copy.
The pattern broke under threading. Two threads holding their "own" string share a refcount, so every assignment, copy, and destruction must atomically inc/dec it — even when nothing is mutated. The cost of atomicity per touch (~25 cycles) outweighed savings for short strings, and C++11 required that iterators survive concurrent const access — which a refcounted COW design cannot guarantee. libstdc++ kept COW until GCC 5.1 (2015) for ABI; libc++ never had it. Modern strings use small-string optimization (inline ~22 chars) plus eager allocation past that.
You can still opt into COW with explicit shared types:
#include <memory>
#include <string>
class CowString {
std::shared_ptr<std::string> data_; // shared until written
public:
CowString(std::string s) : data_(std::make_shared<std::string>(std::move(s))) {}
const std::string& read() const { return *data_; }
std::string& write() { // call before mutating
if (data_.use_count() > 1) {
data_ = std::make_shared<std::string>(*data_); // detach
}
return *data_;
}
};
This is the explicit form most modern systems prefer: COW as a deliberate, opt-in pattern in user code rather than baked into a fundamental type.
CoW vs eager copy across the stack
| Layer | Eager copy | Copy-on-write | Best for | Cost |
|---|---|---|---|---|
| fork() heap dup | vfork / direct copy | Default Linux fork | fork+exec, snapshots | ~1 µs/page COWed |
| clone / posix_spawn | posix_spawn skips memory | CLONE_VM shares | Short-lived helpers | ~50 µs full |
| mmap MAP_PRIVATE | + msync | Faults on write | Read-mostly files | 0 until write |
| overlayfs (Docker) | Full image copy | Lower/upper layered | Container layers | O(written files) |
| Redis BGSAVE | Stop-the-world copy | fork + COW per modified page | Background persistence | Up to 2× peak RSS |
| ZFS / Btrfs | In-place rewrite | Out-of-place + tree update | Snapshots, clones | ~1 extra block / write |
| std::string | SSO + eager | Refcount + atomic per op | Read-mostly large strings | SSO ≪ COW for short |
| Persistent (Clojure) | Defensive copy | Path copying, shared subtrees | Functional, undo stacks | O(log n) per op |
Variants
Refcount snapshots vs full COW
A "refcount snapshot" copies just the top-level pointer and bumps a count — anything reachable through that pointer is shared. Subsequent mutations must check the count and detach. Cheap and simple but doesn't compose: every node along a deeply mutated path is touched. Full path-copying COW (Clojure persistent vectors, Git's tree objects) copies all nodes from the mutation point to the root, leaving the rest shared. O(log n) per write instead of O(n) detach.
COW filesystems
- ZFS: every block is written to a new location and parent pointers updated up to the uberblock. Snapshots cost almost nothing; defragmentation costs a lot.
- Btrfs: same model, plus reflinks (
cp --reflink) — share a file's extents until either side writes, paying COW per modified extent. - APFS: Apple's COW filesystem;
cpon a modern Mac is reflinked by default, which is why duplicating a 50 GB Xcode install is instant. - XFS / ext4 reflink: opt-in COW for individual files via
ioctl(FICLONE).
Process-level COW
fork() on Linux is the default; posix_spawn skips COW entirely (cheaper for fork+exec). vfork shares the parent's address space and suspends the parent until exec — historically faster than COW fork, but a footgun. Modern fast-fork research (Hu et al, OSDI '24) explores skipping page-table copy for the common fork+exec case.
Common pitfalls
- Fork copy-on-fork explosion. A 4 GB Redis worker forks for BGSAVE; the parent keeps writing. Every modified page COWs to a private child copy. Peak RSS can approach 2×. Tune
vm.overcommit_memory=1, watchpage-faultsinperf. - GC + COW = death. Mark-and-sweep collectors that visit every object dirty every page. A forked V8/JVM/CPython process can copy nearly the whole heap on its next GC cycle. Pre-fork before the heap fills, or use
posix_spawn+ IPC. - Madvise gotcha.
MADV_DONTFORKtells fork to skip a region. Forgetting it forces page-table duplication for memory the child will never touch. - Refcount thread-safety. Inc/dec of the COW counter must be atomic. Otherwise you get use-after-free when one thread "drops the last reference" while another reads.
- Filesystem fragmentation. COW filesystems write out-of-place; long-lived databases on ZFS/Btrfs fragment heavily. Schedule
scrub/balance. - Snapshot retention exhaustion. Each retained ZFS snapshot pins its blocks. Forgetting to prune fills the pool while live data appears small.
- Detach race.
shared_ptr::use_count() == 1is only a hint without external synchronization — between check and write, another reader could acquire a reference. Usecompare_exchangeor a lock during detach.
When CoW is the right tool
Reach for CoW when reads vastly outnumber writes (snapshots, fork-then-exec, shared immutable structures), and the deferred copy is bounded by something natural — a page, an extent, a tree node. Avoid it when nearly every "copy" will be modified, when refcount overhead would be hot in a tight loop, or when worst-case memory pressure must stay below 1× working set.
Frequently asked questions
What is copy-on-write?
Copy-on-write is the strategy of letting two logical 'copies' share the same underlying physical storage until one of them is mutated. The actual copy happens lazily, at the moment of the first write. It's used in fork() (kernel), reference-counted strings (language runtimes), and snapshot filesystems (ZFS, Btrfs).
How does fork() use copy-on-write?
When fork() creates a child, the kernel does not duplicate the parent's pages. Instead it copies the page tables and marks every writable page read-only in both parent and child. The first write to a page in either process triggers a page fault; the kernel allocates a fresh page, copies the contents, and updates the offending page table. Pages never written are never duplicated.
Why is fork() expensive on a process with a big heap?
Two reasons. First, the kernel must walk and duplicate every page-table entry — proportional to virtual memory size, not RSS. Second, garbage collectors and memory allocators that touch every page (mark phases, defragmentation) trigger COW faults on each, eventually copying nearly the whole heap. Redis famously hit this when an active BGSAVE plus heavy writes doubled memory consumption.
Are C++ std::strings still copy-on-write?
No. C++11 explicitly forbade COW std::string because it can't be made thread-safe without atomics on every refcount touch — slower than just copying for short strings. GCC's libstdc++ kept COW until 5.1 (2015) for ABI reasons; libc++ never had it. Most languages with immutable strings (Python, Java) use sharing-by-reference for substrings instead.
How does ZFS use copy-on-write?
Every block written to ZFS is written to a fresh location, never overwritten in place. The block's parent pointer is then updated, also out-of-place, all the way up to the root of the tree. A snapshot is just a saved copy of the old root — instant, near-zero cost. The price is write amplification (more I/O per logical write) and fragmentation.
How long does a COW page fault take compared with an immediate copy?
A COW page fault costs ~1 µs on Linux: trap into the kernel, allocate a page, copy 4 KB, update the page table, return. Immediate copy at fork time avoids the fault but costs ~1 µs per page anyway, multiplied by RSS. COW wins big when most pages are never written; loses when the child writes most of the inherited heap.