Operating Systems
Page Fault & Demand Paging
Memory you never paid for until you touched it
A page fault is a CPU trap into the kernel when a process touches a virtual page with no valid physical backing. Demand paging exploits this: pages are loaded from disk only on first access, so memory is allocated lazily instead of up front.
- Minor fault: ~0.1–1 µs (no I/O)
- Major fault (SSD): ~50–150 µs
- Major fault (HDD): ~5–10 ms
- Typical page size: 4 KB
- Allocation strategy: lazy / on-demand
How a page fault works
When a program reads or writes memory, it uses a virtual address. The memory management unit (MMU) walks the page table to translate that virtual address into a physical RAM address. If the page table entry is missing or marked not-present, the MMU can't complete the translation — so it raises a hardware exception called a page fault and hands control to the kernel.
This is not an error. It is a control-flow mechanism. The faulting instruction is aborted before it commits any result, the CPU records the faulting address (in CR2 on x86) and the access type (read/write/execute, user/kernel; on x86 this is the error code pushed onto the kernel stack), and jumps to the kernel's page-fault handler. The handler's job is to decide one of three things:
- The access is legal and the page is recoverable — allocate or fetch the page, fix the page table entry, flush the stale TLB entry, return, and let the CPU re-run the faulting instruction. The program never knows it happened.
- The access is legal but expensive — the page lives on disk or in swap. The kernel issues I/O, may put the process to sleep, and resumes it once the page arrives. This is a major fault.
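- The access is illegal: the address is outside every mapped region, or the mapping's permissions forbid the access (for example, a write to a genuinely read-only mapping, as opposed to a copy-on-write page). The kernel delivers SIGSEGV. This is the segmentation fault you've cursed at.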
Demand paging is the policy built on top of this trap: don't load anything until the program actually touches it. When you start a process, the kernel maps the executable and libraries into the address space but loads almost nothing into RAM. Each page faults in the first time it runs, so a 200 MB binary that only exercises 12 MB of code paths only ever brings 12 MB off disk. The fault is the loader.
Minor vs major faults — the 10,000× cliff
The single most important distinction in this whole topic is minor versus major, because the costs differ by two to three orders of magnitude on an SSD and four to five on a spinning disk.
A minor (soft) fault resolves without any disk I/O. The data is already sitting in physical RAM — the kernel just has to wire up a page table entry. Common causes:
- The page is a shared library already loaded for another process (the page cache has it).
- A freshly allocated heap page that gets a fresh zeroed frame on first write (or the shared zero page on first read).
- A copy-on-write page that needs a private copy made.
A minor fault costs on the order of 0.1 to 1 microsecond: a few hundred to a few thousand cycles of kernel bookkeeping.
A major (hard) fault requires reading the page from disk or swap. On an NVMe SSD that's roughly 50–150 microseconds; on a spinning disk, a seek plus rotational latency runs 5–10 milliseconds. That HDD figure is around 10,000× slower than a minor fault and millions of times slower than the roughly 1 ns L1-cache hit the program expected. One stray major fault inside a hot loop can dominate the runtime of an entire request.
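To see how quickly rare faults come to dominate, here is a minimal sketch of the usual effective-access-time arithmetic. The 100 ns cost of an ordinary access is an assumption made for illustration, and the fault costs are mid-range picks from the figures quoted above; the function names are hypothetical.

```python
# Illustrative costs in nanoseconds. RAM_NS is an assumption; the fault
# costs are mid-range picks from the ranges quoted in the text.
RAM_NS = 100
FAULT_NS = {
    "minor": 500,
    "major (SSD)": 80_000,
    "major (HDD)": 6_000_000,
}


def effective_access_ns(fault_rate, fault_ns, hit_ns=RAM_NS):
    """Average cost per access when a fraction `fault_rate` of accesses fault.

    A faulting access pays the fault cost and then the normal access cost,
    because the instruction is re-executed after the handler returns.
    """
    return hit_ns + fault_rate * fault_ns


def rate_that_doubles_cost(fault_ns, hit_ns=RAM_NS):
    """Fault rate at which the average access costs twice the no-fault case."""
    return hit_ns / fault_ns


if __name__ == "__main__":
    for kind, cost in FAULT_NS.items():
        avg = effective_access_ns(1e-4, cost)
        print(f"{kind:12s} 1 fault per 10,000 accesses -> {avg:10.1f} ns average")
```

Under these assumed numbers, one HDD-backed fault per 10,000 accesses raises the average from 100 ns to about 700 ns, and one fault per 60,000 accesses is already enough to double it.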
On Linux you can watch the split with /usr/bin/time -v ./program, which prints "Major (requiring I/O) page faults" and "Minor (reclaiming a frame) page faults" separately, or with ps -o min_flt,maj_flt.
The page-fault handler, step by step
Here is the canonical path for a fault on Linux, end to end:
1. CPU: instruction touches addr → MMU page-walk fails
2. CPU: trap to kernel; save faulting addr (CR2), error code, regs
3. Handler: find the VMA covering addr (find_vma)
   └─ none found, or access violates VMA perms → SIGSEGV ✗
4. Legal access. What kind of page?
   ├─ anonymous, first touch → zero page on read, fresh zeroed frame on write (minor)
   ├─ copy-on-write, write hit → alloc frame, copy 4 KB (minor)
   ├─ file-backed, in page cache → just map it (minor)
   └─ file-backed but not cached, or swapped out → issue disk read (MAJOR)
      ├─ sleep process, schedule someone else
      └─ I/O completes → wake, install PTE
5. Install page table entry, flush stale TLB entry for addr
6. Return from trap; CPU RE-EXECUTES the faulting instruction
7. This time the page-walk succeeds. Program runs on.
The detail that surprises people: the faulting instruction is not skipped or emulated — it is re-executed from scratch. The hardware makes the fault precise, meaning architectural state is exactly as if the instruction never started. That's why a single load can fault, sleep for 10 ms, and then complete as though nothing happened.
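The decision in steps 3 and 4 can be sketched as a small classifier. This is a toy model, not kernel code: the VMA class, the handle_fault function, and the in_page_cache and swapped_out flags are hypothetical names standing in for state the real handler reads from the kernel's own data structures.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    MINOR = "minor fault"
    MAJOR = "major fault"
    SIGSEGV = "SIGSEGV"


@dataclass
class VMA:
    """A mapped region of the address space (hypothetical, simplified)."""
    start: int
    end: int                  # exclusive
    perms: str                # subset of "rwx"
    file_backed: bool = False


def handle_fault(vmas, addr, access, in_page_cache=False, swapped_out=False):
    """Classify a fault at `addr` for an access of type 'r', 'w' or 'x'."""
    # Step 3: find the region covering the address and check its permissions.
    vma = next((v for v in vmas if v.start <= addr < v.end), None)
    if vma is None or access not in vma.perms:
        return Outcome.SIGSEGV
    # Step 4: the access is legal. Does resolving it need disk I/O?
    if swapped_out:
        return Outcome.MAJOR
    if vma.file_backed and not in_page_cache:
        return Outcome.MAJOR
    # Zero-fill, copy-on-write, or a page already sitting in the page cache.
    return Outcome.MINOR


if __name__ == "__main__":
    vmas = [
        VMA(0x1000, 0x5000, "rx", file_backed=True),   # program text
        VMA(0x10000, 0x20000, "rw"),                   # heap
    ]
    print(handle_fault(vmas, 0x1000, "x"))             # code page not yet cached
    print(handle_fault(vmas, 0x10000, "w"))            # first write to a heap page
    print(handle_fault(vmas, 0x1000, "w"))             # write to read-only text
```

Note that a copy-on-write write passes the permission check in this model because the region itself is writable; only the individual page table entry was read-only.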
Fault types at a glance
| Trigger | Classification | Disk I/O? | Typical cost | Outcome |
|---|---|---|---|---|
| First write to malloc'd page | Minor | No | ~0.5 µs | Fresh zeroed frame mapped |
| Shared .so already in page cache | Minor | No | ~0.3 µs | PTE wired to cached frame |
| Write to a COW page after fork() | Minor | No | ~0.8 µs | Private 4 KB copy made |
| Read code page not yet loaded | Major | Yes (read) | ~80 µs SSD | Page read from binary on disk |
| Access a page swapped out earlier | Major | Yes (read) | ~80 µs–8 ms | Page read back from swap |
| Write to read-only mapping | Invalid (protection) | No | — | SIGSEGV delivered |
| Dereference unmapped address | Invalid | No | — | SIGSEGV delivered |
The first three rows are why a healthy program faults thousands of times per second and you never notice. The major and invalid rows are the only ones that surface to the developer, and they masquerade as totally different problems (a slow request vs. a crash) despite sharing the exact same hardware mechanism.
What the numbers actually say
- Lazy allocation is near-free. malloc(1<<30) returns in microseconds (it is a single mmap call) and consumes essentially no physical pages. Memory is billed one 4 KB minor fault at a time, only as you write. This is why top shows VIRT far larger than RES.
- Page-in granularity is one page, not one byte. Touch a single byte of a fresh page and you fault in all 4 KB. With 2 MB transparent huge pages, one fault brings in 2 MB: fewer faults and TLB entries, but more wasted RAM if the access is sparse.
- Readahead amortizes major faults. Linux prefetches adjacent file pages (typically up to 128 KB) on a fault, so sequential scans pay one major fault per ~32 pages instead of per page. Random access defeats this entirely.
- A 4 GB fork() copies page tables, not gigabytes of data. Copy-on-write means the child shares all pages until written, so the up-front cost is a few megabytes of page-table entries. A child that execs immediately copies almost no data, which is why fork()+exec() on a huge process is far cheaper than its size suggests.
- Thrashing is a cliff, not a slope. While the working set fits in RAM, faults are minor and throughput is flat. Cross the line and major faults dominate; observed throughput can fall by 100–1000× over a few percent more memory pressure.
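The thrashing cliff described above can be reproduced with a few lines of simulation. The sketch below assumes LRU replacement and a program that scans its working set cyclically, which is LRU's worst case and not how a real kernel's reclaim behaves in detail; the fault_rate function and its parameters are hypothetical.

```python
from collections import OrderedDict


def fault_rate(frames, working_set, passes=50):
    """Fraction of accesses that fault under LRU with `frames` physical frames.

    The simulated program scans pages 0..working_set-1 in order, `passes` times.
    """
    resident = OrderedDict()             # page -> True, oldest first
    faults = accesses = 0
    for _ in range(passes):
        for page in range(working_set):
            accesses += 1
            if page in resident:
                resident.move_to_end(page)       # mark as most recently used
                continue
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)     # evict least recently used
            resident[page] = True
    return faults / accesses


if __name__ == "__main__":
    for frames in (120, 100, 99, 90, 50):
        rate = fault_rate(frames, working_set=100)
        print(f"{frames:3d} frames for a 100-page working set -> {rate:.0%} of accesses fault")
```

With 100 frames only the first pass faults; with 99 frames every access faults, because each page is evicted just before it is needed again. Losing one percent of memory takes the fault rate from 2% to 100% in this model.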
JavaScript: modeling the fault path
You can't trigger a real hardware fault from JS, but you can model the handler's decision tree and the dramatic cost asymmetry — useful for reasoning about a working set:
// Costs in microseconds: the whole point is the 10,000x gap.
const COST = { hit: 0.0, minor: 0.5, majorSSD: 80, majorHDD: 6000 };

class DemandPager {
  constructor(ramFrames) {
    this.capacity = ramFrames;   // physical frames available
    this.resident = new Set();   // pages currently in RAM
    this.lru = [];               // eviction order, oldest first
    this.stats = { hit: 0, minor: 0, major: 0, micros: 0 };
  }

  // Returns the cost in microseconds of touching `page`.
  access(page, { onDisk = true } = {}) {
    if (this.resident.has(page)) {           // translation succeeds
      this.touch(page);
      this.stats.hit++;
      return COST.hit;
    }
    // Page fault. Evict if RAM is full (demand paging never preallocates).
    if (this.resident.size >= this.capacity) this.evict();
    this.resident.add(page);
    this.touch(page);
    if (onDisk) {
      this.stats.major++;
      this.stats.micros += COST.majorSSD;
      return COST.majorSSD;
    }
    this.stats.minor++;                      // zero page / COW
    this.stats.micros += COST.minor;
    return COST.minor;
  }

  evict() {
    const victim = this.lru.shift();         // least-recently used
    this.resident.delete(victim);
  }

  touch(page) {
    const i = this.lru.indexOf(page);
    if (i !== -1) this.lru.splice(i, 1);
    this.lru.push(page);
  }
}

const pager = new DemandPager(4);            // only 4 frames of RAM
[1, 2, 3, 4, 1, 5, 1, 3].forEach(p => pager.access(p));
console.log(pager.stats);
// { hit: 3, minor: 0, major: 5, micros: 400 }
// page 5 evicted page 2 (LRU); a later touch of 2 would fault again.
The model makes the working-set point concrete: shrink ramFrames below the number of distinct hot pages and reuses start turning into major faults; with a cyclic access pattern, every single one does. That is simulated thrashing.
Python: measuring real faults
Python can observe the real thing through the OS. The script below (Linux-specific) counts the minor faults from writing to fresh anonymous memory, then tries to provoke a major fault by dropping a file page and reading it back:
import ctypes
import ctypes.util
import mmap
import os
import resource

PAGE = resource.getpagesize()            # 4096 on most systems
MADV_DONTNEED = 4                        # value on Linux

def faults():
    r = resource.getrusage(resource.RUSAGE_SELF)
    return r.ru_minflt, r.ru_majflt

# 1) Lazy allocation: reserve 256 MB, touch nothing -> ~no faults yet.
buf = mmap.mmap(-1, 256 * 1024 * 1024)   # anonymous, demand-zero
before = faults()

# 2) Write one byte in every page -> one MINOR fault per page (zeroed frame).
for off in range(0, len(buf), PAGE):
    buf[off] = 1
after = faults()
print("minor faults from touching 256 MB:",
      after[0] - before[0])              # ~65536 (256MB / 4KB)
print("major faults (no disk I/O needed):",
      after[1] - before[1])              # 0

# 3) Try to provoke a MAJOR fault: map a file, drop its page, read it again.
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
with open("/etc/hostname", "rb") as f:
    # ACCESS_COPY = copy-on-write: never writes the file, but gives a
    # *writable* buffer so ctypes.from_buffer() will accept it below.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
    addr = ctypes.addressof(ctypes.c_char.from_buffer(mm))
    # MADV_DONTNEED only removes the page from this process's page table.
    libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(len(mm)), MADV_DONTNEED)
    # Best-effort hint: ask the kernel to drop the file's clean pages from
    # the page cache as well. It is advisory and may be ignored.
    os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
b0 = faults()
_ = mm[0]                                # touch -> minor or major fault
b1 = faults()
print("re-touching dropped file page: minor +%d, major +%d"
      % (b1[0] - b0[0], b1[1] - b0[1]))
Run it under /usr/bin/time -v and the kernel's own major/minor tallies should roughly match what getrusage reports (time also counts interpreter startup). The roughly 65,536 minor faults from step 2 are the literal cost of demand-zeroing a quarter-gigabyte: one fault per 4 KB page, each installing a freshly zeroed frame. Step 3 is best-effort: MADV_DONTNEED only unmaps the page from this process and the posix_fadvise hint is advisory, so if the page is still in the page cache the re-touch shows up as a minor fault rather than a major one.
Variants and policies worth knowing
Pure demand paging vs prepaging. Pure demand paging loads exactly the pages touched. Prepaging (readahead) loads neighbors too, betting on locality. It wins on sequential access, wastes I/O and RAM on random access. Linux tunes its readahead window dynamically based on observed access patterns.
Copy-on-write (COW). The most important application of protection faults. fork() shares all pages read-only; the first write faults and copies just that page. Same trick powers efficient snapshots and memory deduplication (KSM).
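Copy-on-write can be observed from user space on a Unix-like system. The sketch below is a minimal experiment, assuming os.fork is available: the parent makes a 16 MB buffer resident, forks, and the child counts the minor faults it takes while writing to each page. The function names are hypothetical, and the exact count depends on the kernel and on whether transparent huge pages are in use.

```python
import os
import resource

PAGE = resource.getpagesize()
SIZE = 16 * 1024 * 1024


def minor_faults():
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def cow_faults_in_child(size=SIZE):
    """Fork, let the child write to every page, return (child faults, pages)."""
    buf = bytearray(size)
    for off in range(0, size, PAGE):
        buf[off] = 1                      # make every page resident in the parent

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                          # child: pages are shared until written
        os.close(read_fd)
        before = minor_faults()
        for off in range(0, size, PAGE):
            buf[off] = 2                  # each first write should trigger a COW fault
        delta = minor_faults() - before
        os.write(write_fd, str(delta).encode())
        os._exit(0)

    os.close(write_fd)                    # parent: collect the child's count
    data = os.read(read_fd, 64)
    os.close(read_fd)
    os.waitpid(pid, 0)
    assert buf[0] == 1                    # the parent's copy is untouched
    return int(data), size // PAGE


if __name__ == "__main__":
    faults, pages = cow_faults_in_child()
    print(f"child wrote to {pages} pages and took {faults} minor faults")
```

On a system using 4 KB pages the fault count is typically close to the page count, which is the deferred cost of the copy that fork() itself did not perform.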
Demand-zero pages. Anonymous memory (heap, stack, BSS) starts with no frames behind it. The first read of a page maps a single shared, read-only page of zeros; the first write faults and substitutes a private, writable, zeroed frame. This is why calloc of a huge buffer is nearly instant.
Swapping vs demand paging. Classic swapping moves whole processes between RAM and disk. Demand paging is the finer-grained successor: individual pages move, and only when the fault path or the reclaim path forces it. Modern systems also compress pages in RAM (zswap/zram) to avoid disk entirely.
Huge pages. 2 MB or 1 GB pages mean one fault covers far more memory and consumes fewer TLB entries — great for databases with large, dense working sets, but they inflate memory use when access is sparse and complicate the fault handler's allocation.
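The trade-off is easy to put numbers on. The arithmetic below is a rough sketch: the 1536-entry TLB is a hypothetical figure chosen for illustration, and the helper names are made up for this example.

```python
KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3


def faults_to_touch(region_bytes, page_bytes):
    """Faults needed to make a whole region resident, one fault per page."""
    return -(-region_bytes // page_bytes)        # ceiling division


def tlb_reach(entries, page_bytes):
    """Memory addressable without a TLB miss."""
    return entries * page_bytes


def sparse_resident_bytes(touches, page_bytes):
    """RAM consumed when each touch lands on a different page."""
    return touches * page_bytes


if __name__ == "__main__":
    TLB_ENTRIES = 1536                           # hypothetical TLB size
    for name, size in (("4 KB", 4 * KB), ("2 MB", 2 * MB), ("1 GB", GB)):
        print(f"{name} pages: "
              f"{faults_to_touch(GB, size):>7} faults to touch 1 GB, "
              f"TLB reach {tlb_reach(TLB_ENTRIES, size) / MB:>10.0f} MB, "
              f"1000 scattered touches pin {sparse_resident_bytes(1000, size) / MB:>9.1f} MB")
```

Going from 4 KB to 2 MB pages cuts the faults for a dense 1 GB region from 262,144 to 512, but 1,000 scattered single-byte touches go from pinning about 4 MB to pinning about 2 GB.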
Common bugs and edge cases
- Confusing VIRT with real usage. A process can show 50 GB of virtual size while using 200 MB of RAM. Demand paging means virtual reservations are free; only resident set size (RSS) is real memory. Alarms keyed on VIRT fire constantly and mean nothing.
- Latency spikes from cold pages. A request that's normally 200 µs occasionally takes 10 ms because it touched a swapped-out or never-loaded page. The fix for latency-critical services is mlock/mlockall to pin pages, plus pre-touching the working set at startup.
- Forgetting that fork() is lazy. A forked child that then writes heavily can trigger a storm of COW faults, briefly doubling memory and stalling, which surprises people who thought the copy already happened at fork time.
- Major faults hidden as "slow code." A profiler shows time inside a memcpy, but the real cost is the major fault servicing the destination page. Always check maj_flt before optimizing the code itself.
- Overcommit and the OOM killer. Because allocation is lazy, the kernel hands out more virtual memory than it has RAM+swap (overcommit). If everyone writes their pages at once, there's nowhere to put them and the OOM killer reaps a process, far from the malloc that "succeeded."
- Mistaking every fault for a problem. Minor faults are the normal, designed behavior of a running program. Only the major-fault rate and the segfault are signals worth acting on.
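The VIRT versus RSS point at the top of this list can be checked directly on Linux. The sketch below assumes /proc/self/status is available and reads its VmSize and VmRSS fields; the helper names are hypothetical, and the kernel's RSS accounting is approximate, so the numbers will be close to, not exactly, the mapping size.

```python
import mmap
import resource

PAGE = resource.getpagesize()


def vm_kb(field):
    """Read a field such as 'VmSize' or 'VmRSS' (in kB) from /proc/self/status."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)


def reserve_then_touch(size=64 * 1024 * 1024):
    """Map `size` bytes, then write to every page, recording growth in kB."""
    virt0, rss0 = vm_kb("VmSize"), vm_kb("VmRSS")

    # Private anonymous mapping: address space is reserved, no frames yet.
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    virt1, rss1 = vm_kb("VmSize"), vm_kb("VmRSS")

    for off in range(0, size, PAGE):
        buf[off] = 1                     # one minor fault per page
    rss2 = vm_kb("VmRSS")

    buf.close()
    return {
        "virt_growth_after_map_kb": virt1 - virt0,
        "rss_growth_after_map_kb": rss1 - rss0,
        "rss_growth_after_touch_kb": rss2 - rss0,
    }


if __name__ == "__main__":
    for name, kb in reserve_then_touch().items():
        print(f"{name:28s} {kb:>8d} kB")
```

Virtual size jumps by the full 64 MB as soon as the mapping exists, while resident size only catches up after the write loop. An alarm keyed on the first number would have fired before the process used any of that memory.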
Frequently asked questions
What is the difference between a minor and a major page fault?
A minor (soft) fault is resolved without disk I/O: the page is already in RAM or needs only a fresh zeroed frame, and the kernel just fixes the page table entry (a shared library already cached, a copy-on-write copy, a zero page). A major (hard) fault must read the page from disk or swap, which costs roughly 100 microseconds on an SSD or 5–10 milliseconds on a spinning disk. That is hundreds of times slower than a minor fault on an SSD and tens of thousands of times slower on a spinning disk.
What exactly happens when a page fault fires?
The MMU finds no valid translation, raises a fault, and the CPU traps into the kernel's page-fault handler with the faulting address and access type. The handler checks the address against the process's virtual memory areas. If the access is legal it allocates or fetches the page, updates the page table, flushes the relevant TLB entry, and returns; the faulting instruction is then re-executed. If illegal, it delivers SIGSEGV.
Why does malloc of a gigabyte return instantly?
Because demand paging allocates lazily. malloc reserves virtual address space and the kernel marks the pages as not-yet-present. No physical RAM is touched until you actually write to a page, at which point a minor fault maps a freshly zeroed frame. A program that mallocs 1 GB but writes only 4 KB consumes about 4 KB of physical memory for that buffer, not 1 GB.
What is thrashing and how does demand paging cause it?
Thrashing is when the active working set exceeds physical RAM, so every fault evicts a page that is about to be needed again. The system spends nearly all its time servicing major faults — paging to and from disk — instead of running code. Throughput collapses by orders of magnitude. The fix is more RAM, a smaller working set, or the OOM killer reclaiming a process.
How does copy-on-write use page faults?
After fork(), parent and child share every page read-only. The first write by either process triggers a protection fault; the handler copies just that one 4 KB page, maps the private copy writable for the writer, and resumes. So forking a 4 GB process duplicates only the pages that are actually written (often a few hundred KB) instead of the whole address space.
Is a page fault the same as a segmentation fault?
No. A page fault is a normal, expected mechanism that the kernel handles silently most of the time. A segmentation fault is what you see when the page-fault handler decides the access is illegal — outside any mapped region, or a write to a read-only page — and delivers SIGSEGV. Almost every segfault is a page fault the handler rejected, but the vast majority of page faults are not segfaults.