Systems
Page Replacement Algorithms
When memory is full, which page do you throw away?
Page replacement algorithms decide which memory page to evict on a page fault when RAM is full — FIFO, LRU, the Clock approximation, and Belady's theoretical optimum that no real OS can implement because it requires seeing the future.
- Runs onPage fault, no free frame
- GoalMinimize future faults
- Optimal (offline)Belady's MIN / OPT
- Practical defaultClock (LRU approximation)
- Major-fault cost~10 µs SSD / ~5 ms disk
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
When memory fills up, something has to go
Virtual memory lets a process believe it has its own vast, contiguous address space. Physical RAM is finite, so the operating system keeps only the pages a process is actively using resident, and the rest live on disk in a swap area or backing file. When the program touches a page that isn't resident, the hardware raises a page fault and the OS loads that page from disk. But if every physical frame is already occupied, there's no room — the OS must evict one resident page to make a free frame. The rule that picks the victim is a page replacement algorithm.
The whole game is prediction. The page you evict should be the one you're least likely to need soon, because if you guess wrong you'll fault on it again immediately and pay another disk round-trip. A page fault that's satisfied without disk I/O (a minor fault) costs perhaps a few hundred nanoseconds. A fault that has to read the page back from disk (a major fault) costs tens of microseconds on an SSD or several milliseconds on a spinning disk — a 10,000× difference. Across a memory-hungry workload doing millions of accesses, the difference between a smart policy and a dumb one is the difference between a snappy machine and one that grinds to a halt thrashing the swap device.
Every practical algorithm is trying to approximate one ideal — and that ideal is provably the best possible, yet provably impossible to run.
Belady's optimal: the algorithm you can't have
In 1966 László Bélády proved that one strategy minimizes the number of page faults for any reference string: evict the page whose next use lies farthest in the future. This is variously called OPT, MIN, or the clairvoyant algorithm. It's optimal because by definition the page you keep is needed sooner than the one you discard, so no other choice can fault less often.
The catch is in the words "farthest in the future." To run OPT you'd need the complete sequence of every page the program will ever touch — its future. A live program doesn't hand you its future; the next memory access might depend on user input, a network packet, or a random number. So OPT is unimplementable as an online algorithm. Its only real use is as a yardstick: replay a recorded reference trace through OPT offline, count the faults, and you have the theoretical floor against which to judge LRU, Clock, and FIFO.
That comparison is sobering and reassuring at once. On typical workloads, LRU lands within roughly 10–40% of OPT's fault count, and Clock lands close behind LRU. The expensive perfect algorithm and the cheap approximate one are usually within a small constant of each other — which is exactly why nobody loses sleep over not having a crystal ball.
LRU, FIFO, and the locality bet
FIFO (First-In, First-Out) evicts the page that has been resident longest, regardless of how recently it was used. It's trivial — a queue of frames — but it ignores usage entirely, so it happily evicts a hot page just because it was loaded early. FIFO is also the classic algorithm that exhibits Belady's anomaly — any non-stack policy can, but FIFO is the textbook example.
LRU (Least Recently Used) evicts the page that hasn't been touched for the longest time. This is a bet on temporal locality: a page used recently is likely to be used again soon. LRU is the workhorse approximation of OPT — instead of looking forward (impossible), it looks backward and assumes the recent past predicts the near future. On programs that loop over working sets, this bet pays off constantly.
The trouble is implementing true LRU cheaply. To know the exact least-recently-used page you must track an ordering that changes on every single memory access — billions per second. Doing that in software per access is hopeless, and hardware that timestamps or reorders a list on every load would be prohibitively expensive. So real systems don't run true LRU on page frames; they approximate it. (Note: an LRU cache at the application level can be exact and O(1) using a hash map plus a doubly linked list — but that's because software controls every access. The OS page table is updated by the MMU in hardware, which only affords a single reference bit.)
Clock: LRU on a budget
The Clock algorithm (also called second-chance) is how LRU is actually approximated in production kernels. The hardware gives each page one reference bit, set automatically whenever the page is accessed and clearable by the OS. Frames are arranged in a circular buffer, and a single pointer — the "clock hand" — sweeps around it looking for a victim:
- Look at the page under the hand.
- If its reference bit is 1, the page was used recently. Give it a second chance: clear the bit to 0 and advance the hand.
- If its reference bit is 0, it hasn't been touched since the last sweep past it. Evict it.
The hand keeps moving until it finds a 0. A page that keeps getting referenced keeps getting its bit re-set before the hand returns, so it survives; a page nobody touches eventually meets the hand with a clear bit and gets evicted. The result behaves like a coarse LRU — old, cold pages die — but it costs only one bit per page and O(1) amortized work, because the hand resumes where it left off rather than scanning from the start.
The enhanced (two-bit) Clock adds a dirty (modified) bit and prefers to evict pages that are both unreferenced and clean, since a clean page can be dropped instantly while a dirty one must first be written back to disk. It classifies pages into four buckets — (referenced, dirty) — and sweeps for the cheapest victim first.
Belady's anomaly: more memory, more faults
Here's the result that breaks every newcomer's intuition. You'd expect that adding page frames can only help — more memory should mean fewer faults. For FIFO, that's false. Bélády found reference strings where giving FIFO more frames causes more faults. The textbook example is the string 1 2 3 4 1 2 5 1 2 3 4 5:
- With 3 frames, FIFO takes 9 faults.
- With 4 frames, FIFO takes 10 faults.
The extra frame changes the eviction order in a way that throws out pages right before they're needed again. LRU and OPT are immune to this because they're stack algorithms: the set of pages resident with k frames is always a subset of the set resident with k+1 frames (the inclusion property). FIFO has no such guarantee, so adding a frame can reorder which pages survive. Belady's anomaly is the single sharpest argument for preferring LRU-like policies over plain FIFO.
Algorithm comparison
| OPT / Belady | LRU (true) | Clock / Second-chance | FIFO | Random | LFU | |
|---|---|---|---|---|---|---|
| Faults vs optimal | 1.0× (the floor) | ~1.1–1.4× | ~1.1–1.5× | ~1.5–2×+ | ~1.5–2× | Varies, can be poor |
| Implementable online? | No (needs future) | Costly (per-access) | Yes (cheap) | Yes (trivial) | Yes (trivial) | Yes (counters) |
| Per-page state | none (offline trace) | full recency order | 1 ref bit (+1 dirty) | none | none | access counter |
| Work per eviction | O(n) scan of future | O(1) with list, but O(1) per access too | O(1) amortized | O(1) | O(1) | O(log n) heap |
| Stack algorithm? | Yes | Yes | Approximate | No | No | No |
| Belady's anomaly? | No | No | Rare/approx | Yes | Possible | Possible |
| Real-world use | Offline benchmark only | App caches, theory | Linux, BSD, Windows page reclaim | Simple buffers | Some hardware caches | Some object caches |
The headline: nobody runs OPT, almost nobody runs true LRU on page frames, and the practical winner is Clock because it captures most of LRU's benefit at a fraction of the cost. FIFO survives only where simplicity outweighs the anomaly risk.
What the numbers actually say
- A major page fault costs ~10,000× a minor one. A minor fault (page already in RAM, just remap) is a few hundred nanoseconds; a major fault reads ~4 KB from storage — about 10–100 µs on NVMe SSD, or 3–10 ms on a 7,200-RPM disk after seek and rotational latency.
- The Belady-anomaly example: 9 vs 10 faults. On the 12-reference string
1 2 3 4 1 2 5 1 2 3 4 5, FIFO faults 9 times with 3 frames and 10 times with 4 frames — a 4-frame configuration that's measurably worse despite 33% more memory. - LRU sits within ~10–40% of OPT on typical SPEC-style traces; Clock is usually within a few percent of LRU. The gap between "perfect but impossible" and "cheap and shippable" is small.
- Clock state is 1 bit per page (2 bits for the enhanced version), versus the full recency ordering true LRU would demand — which is why every major kernel ships a Clock variant, not exact LRU.
- Thrashing is the failure mode. If the combined working sets exceed RAM, any policy faults on nearly every access and the machine spends ~100% of its time in disk I/O. No replacement algorithm fixes thrashing — only adding RAM or reducing the working set does.
JavaScript implementation
Here are the three online policies plus offline OPT, each as a fault-counting simulator over a reference string with a fixed number of frames.
// FIFO — evict the oldest-loaded page.
function fifo(refs, frames) {
const set = new Set();
const queue = [];
let faults = 0;
for (const p of refs) {
if (set.has(p)) continue; // hit
faults++; // miss → fault
if (set.size === frames) {
const victim = queue.shift(); // oldest in
set.delete(victim);
}
set.add(p); queue.push(p);
}
return faults;
}
// LRU — evict the page unused for the longest time.
function lru(refs, frames) {
const recent = new Map(); // page → last-use index
let faults = 0;
refs.forEach((p, i) => {
if (!recent.has(p)) {
faults++;
if (recent.size === frames) {
// find the least-recently-used page
let lruPage, oldest = Infinity;
for (const [page, t] of recent)
if (t < oldest) { oldest = t; lruPage = page; }
recent.delete(lruPage);
}
}
recent.set(p, i); // touch updates recency
});
return faults;
}
// CLOCK — second-chance approximation of LRU with one ref bit per frame.
function clock(refs, frames) {
const slots = new Array(frames).fill(null); // page in each frame
const ref = new Array(frames).fill(0); // reference bit
const index = new Map(); // page → frame index
let hand = 0, used = 0, faults = 0;
for (const p of refs) {
if (index.has(p)) { ref[index.get(p)] = 1; continue; } // hit → set bit
faults++;
if (used < frames) { // empty frame available
slots[used] = p; ref[used] = 1; index.set(p, used); used++;
continue;
}
while (ref[hand] === 1) { // sweep, granting second chances
ref[hand] = 0;
hand = (hand + 1) % frames;
}
index.delete(slots[hand]); // victim found (ref bit 0)
slots[hand] = p; ref[hand] = 1; index.set(p, hand);
hand = (hand + 1) % frames;
}
return faults;
}
// OPT (Belady) — offline: evict the page whose next use is farthest away.
function opt(refs, frames) {
const set = new Set();
let faults = 0;
for (let i = 0; i < refs.length; i++) {
const p = refs[i];
if (set.has(p)) continue;
faults++;
if (set.size === frames) {
let victim = null, farthest = -1;
for (const page of set) {
// distance to next use of `page` after position i
let next = refs.indexOf(page, i + 1);
if (next === -1) { victim = page; break; } // never used again → evict
if (next > farthest) { farthest = next; victim = page; }
}
set.delete(victim);
}
set.add(p);
}
return faults;
}
const string = [1,2,3,4,1,2,5,1,2,3,4,5];
console.log(fifo(string, 3), fifo(string, 4)); // 9 10 ← Belady's anomaly
console.log(lru(string, 3), opt(string, 3)); // 10 7
Two things to notice. In clock, the hand resumes from its last position across calls — that's what makes the sweep amortized O(1) rather than a full scan each time. In opt, the indexOf(page, i + 1) lookahead is exactly the "see the future" step that no online algorithm can perform; it only works because we're replaying a recorded trace.
Python implementation
The same four policies in Python, plus a tiny driver that prints faults for each.
from collections import OrderedDict, deque
def fifo(refs, frames):
resident, q, faults = set(), deque(), 0
for p in refs:
if p in resident:
continue
faults += 1
if len(resident) == frames:
resident.discard(q.popleft())
resident.add(p); q.append(p)
return faults
def lru(refs, frames):
cache, faults = OrderedDict(), 0 # ordered by recency, oldest first
for p in refs:
if p in cache:
cache.move_to_end(p) # mark most-recently-used
continue
faults += 1
if len(cache) == frames:
cache.popitem(last=False) # evict least-recently-used
cache[p] = True
return faults
def clock(refs, frames):
slots = [None] * frames
ref = [0] * frames
pos = {} # page -> frame index
hand = used = faults = 0
for p in refs:
if p in pos:
ref[pos[p]] = 1 # hit -> set reference bit
continue
faults += 1
if used < frames:
slots[used] = p; ref[used] = 1; pos[p] = used; used += 1
continue
while ref[hand] == 1: # second-chance sweep
ref[hand] = 0
hand = (hand + 1) % frames
del pos[slots[hand]]
slots[hand] = p; ref[hand] = 1; pos[p] = hand
hand = (hand + 1) % frames
return faults
def opt(refs, frames):
resident, faults = set(), 0
for i, p in enumerate(refs):
if p in resident:
continue
faults += 1
if len(resident) == frames:
def next_use(page):
try: return refs.index(page, i + 1)
except ValueError: return float('inf') # never again
victim = max(resident, key=next_use)
resident.discard(victim)
resident.add(p)
return faults
s = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo(s, 3), fifo(s, 4)) # 9 10 -> Belady's anomaly
print(lru(s, 3), opt(s, 3)) # 10 7
The Python LRU leans on OrderedDict.move_to_end and popitem(last=False) — the same data structure behind functools.lru_cache. That gives O(1) per access in software, which is precisely the move the hardware MMU can't make on a page table, forcing kernels to fall back to Clock.
Variants worth knowing
NRU (Not Recently Used). Coarsens Clock further: classify pages into four classes by (referenced, dirty) bits and pick a random victim from the lowest non-empty class. Cheap, used in some early Unix systems.
Aging / NFU. Approximate LRU with a multi-bit counter per page. Periodically shift each counter right and OR the reference bit into the top — pages used recently keep high counters, idle pages decay toward zero. Evict the lowest counter. Captures more history than a single ref bit.
WSClock (Working-Set Clock). Combines Clock with the working-set model: the hand evicts a page only if it's both unreferenced and older than a time threshold τ relative to the process's virtual time. Avoids evicting pages still in the active working set.
ARC (Adaptive Replacement Cache). Patented by IBM, used in ZFS. Keeps two LRU lists — one for pages seen once (recency) and one for pages seen multiple times (frequency) — plus ghost lists of recently-evicted page identities, and self-tunes the split between them. Beats plain LRU on scan-heavy and looping workloads.
2Q and LIRS. Two more scan-resistant designs. 2Q uses a FIFO probationary queue plus a main LRU queue so a one-shot sequential scan doesn't flush the hot set — MySQL's InnoDB buffer pool uses a closely related midpoint-insertion LRU with "young" and "old" sublists. LIRS tracks inter-reference recency instead of plain recency and influenced later scan-resistant designs, including ideas in Linux's page reclaim.
Common bugs and edge cases
- Forgetting to clear the reference bit during the Clock sweep. If the hand sets nothing to 0, every page keeps its second chance forever and the hand loops endlessly. The sweep must clear bits as it passes.
- Counting a hit as a fault. Replacement runs only on a miss with no free frame. Re-referencing a resident page should update recency / set the ref bit, never increment the fault counter.
- Assuming more frames always helps. With FIFO it doesn't — that's Belady's anomaly. Benchmark across frame counts; don't extrapolate from one.
- Implementing "true LRU" on page frames. It's correct but unaffordable at hardware access rates. On the OS page table, approximate with Clock; reserve exact LRU for software-level caches where you control every access.
- Ignoring the dirty bit. Evicting a clean page is free; evicting a dirty one forces a write-back. A policy that ignores this can double its I/O on write-heavy workloads. Use enhanced Clock.
- Treating replacement as a thrashing fix. When working sets exceed RAM, no policy helps — every choice faults. The fix is more memory, fewer concurrent processes, or working-set-aware admission, not a cleverer victim rule.
Frequently asked questions
Why can't an operating system just use Belady's optimal algorithm?
Belady's algorithm evicts the page whose next use is farthest in the future, which requires knowing the entire future reference string. A running program's future accesses are unknown, so OPT is unimplementable online. It's used only as an offline yardstick to measure how close practical algorithms get.
What is Belady's anomaly?
Belady's anomaly is the counter-intuitive result that giving FIFO more page frames can cause MORE page faults, not fewer. The classic reference string 1,2,3,4,1,2,5,1,2,3,4,5 produces 9 faults with 3 frames but 10 faults with 4 frames. Stack algorithms like LRU and OPT never suffer this because their resident set with k frames is always a subset of the resident set with k+1 frames.
How does the Clock algorithm approximate LRU?
Clock keeps frames in a circular list with a one-bit reference flag per page, set by hardware on every access. A hand sweeps the circle: if the page under the hand has its reference bit set, Clock clears it and skips ahead (a second chance); if the bit is already clear, that page is evicted. This costs O(1) amortized and one bit per frame, versus LRU's expensive full ordering.
Is LRU the same as the Clock algorithm?
No. True LRU evicts the exact least-recently-used page and needs a full recency ordering, which is too costly to maintain in hardware on every memory access. Clock (also called second-chance) is a cheap approximation using a single reference bit per page; it evicts an old, un-referenced page but not necessarily THE oldest one.
What's the difference between a minor and a major page fault here?
Page replacement only runs when a fault requires a free frame and none exist. A minor fault is satisfied from a page already in memory (e.g., the free list or page cache) with no disk I/O. A major fault must read the page from disk or swap — tens of microseconds on SSD, several milliseconds on a hard disk — which is exactly why a good replacement policy matters.
How does Linux choose which page to evict?
Linux uses a variant of Clock with two LRU lists — active and inactive — and per-page reference bits (the kswapd / Multi-Generational LRU machinery). Pages age from active to inactive; clean inactive pages are dropped for free, dirty ones are written back first. It's an LRU approximation tuned to avoid scanning every page on every fault.