Virtual Memory
Each process gets its own pretend address space — paged on demand
Virtual memory gives each process its own private address space, with the OS translating virtual addresses to physical RAM via page tables. Pages of memory can be on disk (swap), shared between processes, copy-on-write, or memory-mapped files. The illusion that every program has its own dedicated multi-gigabyte memory is one of the most successful abstractions in computing.
- Default page size: 4 KB on most systems; 2 MB / 1 GB huge pages
- Address space (64-bit): 256 TB virtual; usable typically up to 128 TB
- Translation cost: TLB hit ~1 cycle; TLB miss ~50 cycles + page walk
- Swap penalty: ~1,000× (SSD) to ~100,000× (spinning disk) slower than RAM
- Common features: Demand paging, COW, mmap, swap, transparent huge pages
- Implemented by: MMU (hardware) + OS (kernel)
Why virtual memory exists
Without virtual memory, every program would need to know exactly where in physical RAM it's running, manage allocations within a fixed pool, and avoid stepping on other programs' memory. Three problems virtual memory solves:
- Isolation. Each process gets its own address space. Process A can't read or corrupt process B's memory — even at the same virtual address, they refer to different physical RAM.
- Abstraction over physical layout. Programs use virtual addresses; the OS handles where the data physically lives. Programs don't care if memory is fragmented — virtual addresses are contiguous.
- Larger-than-RAM workloads. Pages can be on disk (swap) or in memory-mapped files. The address space can exceed physical RAM; the OS swaps pages in and out as needed.
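The cost is extra translation work on every load/store: a TLB lookup, plus a page-table walk (up to four extra memory accesses on x86-64) whenever the TLB misses, along with the complexity of the OS managing it all. The benefits dwarf the cost: every mainstream desktop, server, and mobile OS uses virtual memory, as do many embedded systems.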
How address translation works
A virtual address (48 significant bits on typical x86-64, stored in a 64-bit pointer) is split into:
- Page offset (bottom 12 bits) — byte position within a 4 KB page.
- Virtual page number (top bits) — used to index a multi-level page table.
x86-64 uses 4-level page tables (each level is 9 bits, 512 entries × 8 bytes = 4 KB per table). Translation walks 4 levels, reading 4 page-table entries from memory in the worst case. A successful walk produces a physical page number; combined with the offset, that's the physical address.
The TLB (Translation Lookaside Buffer) caches recent virtual-to-physical translations — typically 64-512 entries on modern CPUs. TLB hits are ~1 cycle; misses force a page-table walk. This is why TLB miss rate is a critical performance metric for memory-heavy workloads.
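The bit slicing described above can be sketched in a few lines of C. This is only an illustration of the 9/9/9/9/12 split for 4 KB pages; the names `va_parts`, `split_va`, and `phys_addr` are made up for this example, and a real MMU does this in hardware.

```c
#include <stdint.h>

/* Fields of a 48-bit x86-64 virtual address when 4 KB pages are used. */
struct va_parts {
    unsigned l4;      /* bits 47..39: index into the top-level table */
    unsigned l3;      /* bits 38..30 */
    unsigned l2;      /* bits 29..21 */
    unsigned l1;      /* bits 20..12: index into the last-level table */
    unsigned offset;  /* bits 11..0: byte position within the 4 KB page */
};

/* Split a virtual address into four 9-bit table indices and a 12-bit offset. */
struct va_parts split_va(uint64_t va)
{
    struct va_parts p;
    p.l4     = (unsigned)((va >> 39) & 0x1FF);
    p.l3     = (unsigned)((va >> 30) & 0x1FF);
    p.l2     = (unsigned)((va >> 21) & 0x1FF);
    p.l1     = (unsigned)((va >> 12) & 0x1FF);
    p.offset = (unsigned)(va & 0xFFF);
    return p;
}

/* The walk yields a physical frame number; the offset is carried over unchanged. */
uint64_t phys_addr(uint64_t frame_number, unsigned offset)
{
    return (frame_number << 12) | offset;
}
```

Each 9-bit index selects one of the 512 entries in its table, which is why a table is exactly 4 KB.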
Page faults — three flavors
| Type | What happened | Latency | OS action |
|---|---|---|---|
| Minor (soft) | Page is valid and needs no disk I/O (already in RAM, or zero-filled on first touch) but is not yet mapped in this process's page table | ~1 µs | Find or allocate a frame, update page table |
| Major (hard) | Page must be read from disk (swap or mmap'd file) | ~10 ms (spinning disk) / ~100 µs (SSD) | Issue I/O, suspend process, schedule page-in |
| Invalid (segfault) | Address was never mapped — bug in the program | Process dies | Send SIGSEGV; usually kills the process |
Programs notice page faults only as added latency. In production, the metric to watch is the major-fault rate: a high rate means the working set exceeds RAM and the system is paging. At that point, performance falls off a cliff.
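One way to observe fault counts from inside a process is getrusage(), which reports minor faults in ru_minflt and major faults in ru_majflt. The sketch below assumes Linux or another POSIX system with MAP_ANONYMOUS; the function name `minor_faults_for_touch` is invented for illustration.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Touch `npages` fresh anonymous pages and return how many minor faults
 * the process took while doing so, or -1 on error. */
long minor_faults_for_touch(size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = npages * (size_t)page;
    struct rusage before, after;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (getrusage(RUSAGE_SELF, &before) != 0) { munmap(p, len); return -1; }
    for (size_t i = 0; i < npages; i++)
        p[i * (size_t)page] = 1;      /* first write to each page faults it in */
    if (getrusage(RUSAGE_SELF, &after) != 0) { munmap(p, len); return -1; }

    munmap(p, len);
    return after.ru_minflt - before.ru_minflt;
}
```

On a typical Linux system the result is close to `npages`, but the exact count depends on the kernel (transparent huge pages, for instance, can satisfy many pages with a single fault), so treat it as a rough signal rather than an exact figure.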
What virtual memory enables
- Demand paging. When a process starts, no pages are loaded; pages get faulted in only when accessed. A 100 MB executable might run with only ~10 MB resident — the rest never gets touched.
- Memory-mapped files (mmap). Map a file into address space; OS pages it in lazily. Used by databases (avoids buffered-read copy), JVMs (mmap'd JIT code), and large-file processing.
- Copy-on-write (COW). Shared pages are marked read-only; the first write triggers a fault and the writer gets its own private copy. Makes fork() fast on huge processes, since only modified pages are duplicated.
- Shared memory. Multiple processes can map the same physical pages. Used by shared libraries (libc loaded once, mapped into every process), inter-process communication (POSIX shm, mmap'd files), and database shared buffer pools.
- Memory protection. Pages have permissions (R/W/X). Stack pages typically not executable (defense against buffer-overflow code injection). Code pages are read-only + executable.
- Address space layout randomization (ASLR). Stack, heap, libraries placed at randomized addresses. Defeats some attacks that need to know a specific address.
- Swap. Pages can be evicted to disk and re-fetched on demand. Lets processes use more memory than physical RAM, at huge latency cost.
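As an illustration of the shared-memory item in the list above, here is a minimal sketch in which a parent and child map the same physical page. It assumes a POSIX system with MAP_ANONYMOUS; `read_child_write_shared` is a hypothetical name chosen for this example.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Have a child process write `value` into a MAP_SHARED page and return
 * what the parent then reads from it, or -1 on error. */
int read_child_write_shared(int value)
{
    int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid_t pid = fork();
    if (pid < 0) { munmap(shared, sizeof *shared); return -1; }
    if (pid == 0) {            /* child: same physical page, so the store is visible */
        *shared = value;
        _exit(0);
    }
    waitpid(pid, NULL, 0);     /* parent: wait until the child has written */

    int seen = *shared;
    munmap(shared, sizeof *shared);
    return seen;
}
```

With MAP_SHARED the parent sees the child's write. Swapping in MAP_PRIVATE would give each process its own copy-on-write view instead, and the parent would still read 0.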
Page sizes and the TLB tradeoff
| Page size | TLB coverage (512 entries) | Best for | Concern |
|---|---|---|---|
| 4 KB (default) | 2 MB | General-purpose, fine-grained allocation | TLB misses on memory-heavy workloads |
| 2 MB (huge page) | 1 GB | Databases, large arrays, JVM heaps | Coarse allocation; harder to free fragments |
| 1 GB (gigantic page) | 512 GB | HPC, multi-TB databases | Reservation up front; almost no flexibility |
Linux's Transparent Huge Pages (THP) automatically promotes runs of 4 KB pages to 2 MB pages. Often a win, but it sometimes hurts latency-sensitive workloads, because promotion and the memory compaction behind it can stall the faulting thread. Many database vendors recommend disabling it in production.
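The coverage column in the table is just entries × page size. A small sketch of that arithmetic follows, assuming a single TLB in which every entry maps a page of the same size (real CPUs have several TLB levels with different capacities); the helper names are invented for this example.

```c
#include <stdint.h>

/* Bytes of address space reachable without a TLB miss,
 * assuming every entry maps a page of the same size. */
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_size)
{
    return entries * page_size;
}

/* Number of pages (and so distinct translations) needed to cover a working set. */
uint64_t pages_needed(uint64_t working_set, uint64_t page_size)
{
    return (working_set + page_size - 1) / page_size;  /* round up */
}
```

For an 8 GB working set, `pages_needed` gives 2,097,152 translations with 4 KB pages but only 4,096 with 2 MB pages, which is why huge pages relieve TLB pressure so effectively.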
Memory-mapped file example
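// C: read a large file via mmap (no read() call, no copy into a user buffer)
// Needs <fcntl.h>, <stdio.h>, <sys/mman.h>, <sys/stat.h>, <unistd.h>; process() is the caller's function.
int fd = open("largefile.dat", O_RDONLY);
if (fd < 0) { perror("open"); return 1; }
struct stat st;
if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (data == MAP_FAILED) { perror("mmap"); return 1; }  // failure is MAP_FAILED, not NULL
// Now data[0 .. st.st_size - 1] reads the file contents lazily.
// Only the pages we touch get loaded.
process(data, st.st_size);
munmap(data, st.st_size);
close(fd);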
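Compared with traditional read():
// Traditional: copies file contents from the page cache into a user buffer
char *buf = malloc(st.st_size);
size_t got = 0;
while (buf != NULL && got < (size_t)st.st_size) {  // read() may return fewer bytes than asked
    ssize_t n = read(fd, buf + got, st.st_size - got);
    if (n <= 0) break;                             // 0 = EOF, -1 = error (check errno)
    got += (size_t)n;
}
process(buf, got);
free(buf);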
mmap is often faster on large files because:
- No memcpy from kernel buffer to user buffer.
- Only the pages that are touched get loaded; untouched pages are never read.
- Multiple processes mapping the same file share the same page-cache pages.
Copy-on-write with fork()
Linux's fork() creates a child process by duplicating the parent. With virtual memory and COW, this is cheap:
- Child gets a copy of the parent's page table, pointing at the same physical pages.
- Writable private pages are marked read-only in both parent and child.
- The first write to such a page by either process triggers a page fault; the OS copies the page for the writer and updates that process's page table, while the other process keeps the original.
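fork() of a 4 GB process copies only page tables and other metadata, not the data itself, so it typically completes in milliseconds rather than the time a full 4 GB copy would take. Pages are copied lazily as either side writes. This is why Redis can fork() to take a snapshot: the child sees the parent's memory frozen at the moment of the fork, while the parent keeps serving writes and only the pages it modifies get copied.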
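The isolation that COW preserves can be checked with a short sketch: the child writes to memory it inherited, and the parent's view stays unchanged. The function name `parent_value_after_child_write` is hypothetical, and the example assumes a POSIX system.

```c
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, let the child overwrite a heap value, and return what the parent
 * still sees afterwards (or -1 on error). */
int parent_value_after_child_write(int initial, int child_value)
{
    int *cell = malloc(sizeof *cell);
    if (cell == NULL)
        return -1;
    *cell = initial;

    pid_t pid = fork();
    if (pid < 0) { free(cell); return -1; }
    if (pid == 0) {
        *cell = child_value;                 /* write fault: child gets a private copy */
        _exit(*cell == child_value ? 0 : 1);
    }

    int status = 0;
    waitpid(pid, &status, 0);
    int seen = *cell;                        /* parent's page was never modified */
    free(cell);
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return seen;
}
```

Calling it with `(7, 99)` returns 7: the child's store landed in its own copy of the page, which is the property snapshotting via fork() relies on.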
When virtual memory behavior bites you
- Working set exceeds RAM. Major faults dominate; performance can drop by orders of magnitude as pages thrash to disk. Detect via vmstat, sar, or the kernel's PSI (pressure stall information). Solutions: more RAM, a smaller working set, page-frame reclaim tuning.
- TLB misses dominate. Typical of memory-heavy workloads with poor spatial locality. Solutions: huge pages (a 2 MB page covers 512× more memory per TLB entry than a 4 KB page), better data locality (sort access patterns), software prefetching.
- NUMA effects. On multi-socket systems, each socket has its own memory; cross-socket access is noticeably slower (often around 1.5-2× the latency, more across multiple hops). Pin processes and memory to the same NUMA node; use libnuma or numactl.
- OOM killer striking. When physical RAM and swap are both exhausted, Linux's OOM killer picks a process to kill (highest "badness" score). Servers typically tune swap and per-process OOM scores (oom_score_adj) so that the least important processes are targeted first.
- Memory-mapped file performance. mmap works well when mapped data is reused or scanned in bulk, but poorly for one-off, tiny, scattered touches of cold data (each touch is a page fault that pulls in a whole page). For small files or write-heavy patterns, read/write may be faster.
Common virtual-memory pitfalls
- Allocating huge contiguous virtual ranges that never fault. A program can mmap 100GB even on a 4GB machine — RSS stays small until pages are touched. Looking at virtual size (VSZ) without RSS gives misleading "memory hog" signals.
- Confusing virtual size with physical usage. Tools like ps/top show both. Production monitoring should focus on RSS (resident set) and major-fault rate, not VSZ.
- Forgetting page-alignment requirements. mmap requires the file offset to be a multiple of the page size, and many low-level APIs (DMA, hardware buffers) require specific alignment. Get the alignment wrong and the call typically fails with EINVAL.
- Writing to a COW page in a fork()ed child. Fine, but each first write costs a page fault plus a page copy. If parent and child end up rewriting most of their memory after the fork, COW saves little and the per-page faults can make it slower than one bulk copy. Profile.
- Disabling swap entirely. Common advice for databases, but extreme. With no swap, OOM is more likely on memory pressure. A small swap (1-2 GB) provides emergency room without penalizing performance.
- Assuming free memory is "wasted." Linux uses otherwise-free RAM for the page cache (recently read file data). High "used" memory in free output that is actually cached is fine: clean cache pages can be dropped almost immediately when the memory is needed.
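For the page-alignment pitfall above, a common workaround is to round the requested file offset down to a page boundary, map from there, and skip the extra leading bytes. The helper below is a sketch under that assumption; `aligned_window` and `align_for_mmap` are names invented for this example, and `page_size` would normally come from sysconf(_SC_PAGESIZE).

```c
#include <stdint.h>

struct aligned_window {
    uint64_t map_offset;  /* page-aligned offset to pass to mmap */
    uint64_t delta;       /* bytes to skip from the mapping start to reach `offset` */
    uint64_t map_length;  /* length to map so [offset, offset + length) is covered */
};

/* page_size must be a power of two. */
struct aligned_window align_for_mmap(uint64_t offset, uint64_t length,
                                     uint64_t page_size)
{
    struct aligned_window w;
    w.map_offset = offset & ~(page_size - 1);  /* round down to a page boundary */
    w.delta      = offset - w.map_offset;
    w.map_length = length + w.delta;
    return w;
}
```

A caller would pass `map_offset` and `map_length` to mmap and then read the data starting at the returned pointer plus `delta`.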
Frequently asked questions
How does virtual-to-physical translation work?
Every memory access goes through the MMU (Memory Management Unit). The CPU presents a virtual address; the MMU walks the page table (in hardware) to find the corresponding physical address. If the page table entry is valid and the page is in RAM, translation is fast. The TLB (Translation Lookaside Buffer) caches recent translations: TLB hits are ~1 cycle, while a miss costs up to 4 memory accesses to walk a 4-level table.
What's a page fault?
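An exception raised by the MMU when it can't complete a translation: the page-table entry is missing or marked not present, or the access violates the page's permissions. Three flavors. (1) Minor: the page is valid and needs no disk I/O (it is already in RAM, or is zero-filled on first touch), so the OS just maps a frame. (2) Major: the page must be read from disk (swap or a memory-mapped file), which is slow (~10 ms on a spinning disk, ~100 µs on an SSD). (3) Invalid: the address was never mapped or the access isn't permitted; the process gets SIGSEGV and is usually killed. For minor and major faults the handler is invisible to the program (except for the latency); pages get loaded on demand.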
What's copy-on-write (COW)?
When two processes share a page, the OS marks it read-only. Either process can read it; either trying to write triggers a page fault. The handler copies the page so each process gets a private copy. fork() uses COW for memory — child and parent share pages until one writes. Cheap fork even for large processes; only modified pages duplicate.
What's mmap?
A system call that maps a file or anonymous memory into a process's address space. The kernel turns reads and writes to the mapped range into file I/O lazily, page by page. Often faster than read()/write() for large files because there is no copy into a user buffer. Used by databases (LMDB, SQLite's optional memory-mapped I/O), language runtimes (mmap'd JIT code), and applications doing zero-copy I/O.
Why is swap so slow?
A swapped page must be read from disk before it can be accessed. A spinning-disk seek is ~10 ms (vs ~100 ns for RAM), which is 100,000× slower. An SSD read is ~100 µs (vs ~100 ns for RAM), which is 1,000× slower. Either way, swap dramatically affects performance. When the working set exceeds RAM, "thrashing" can dominate: the system spends most of its time waiting on page I/O instead of running the program.
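The effect of even a tiny fault rate can be estimated with the usual weighted-average formula for effective access time. The sketch below uses the round latency figures from the answer above; the function name is made up for illustration.

```c
/* Average cost of one memory access, in nanoseconds, when a fraction
 * `fault_rate` of accesses take a major fault costing `fault_ns`. */
double effective_access_ns(double ram_ns, double fault_ns, double fault_rate)
{
    return (1.0 - fault_rate) * ram_ns + fault_rate * fault_ns;
}
```

With 100 ns RAM, a 10 ms disk, and one access in 10,000 faulting, this gives about 1,100 ns per access, roughly 11× slower than RAM alone. The same fault rate against a 100 µs SSD gives about 110 ns, a roughly 10% slowdown.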
What are huge pages and when should I use them?
2 MB or 1 GB pages instead of 4 KB. They reduce TLB pressure (one TLB entry covers more memory) and shorten the page table walk (a 2 MB page skips one level, a 1 GB page skips two). Used by databases (Postgres, MySQL), HPC, and JVMs with large heaps. The trade-off: fine-grained memory management gets harder, and allocation may waste memory. Transparent Huge Pages (THP) automatically promote 4 KB pages to 2 MB, which sometimes hurts latency-sensitive workloads.
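On Linux, a program can opt a specific range into THP with madvise(MADV_HUGEPAGE). The sketch below reserves a 2 MB-aligned anonymous region and issues that hint. It assumes Linux with THP available in "madvise" or "always" mode; `alloc_thp_hint` and `HUGE_2MB` are names invented for this example. The hint is advisory, so the kernel may still back the range with 4 KB pages.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HUGE_2MB ((size_t)2 << 20)

/* Reserve `len` bytes (a multiple of 2 MB) aligned to 2 MB and ask the
 * kernel to back the range with huge pages. Returns NULL on failure. */
void *alloc_thp_hint(size_t len)
{
    /* Over-allocate by one huge page so an aligned start always exists. */
    size_t padded = len + HUGE_2MB;
    unsigned char *raw = mmap(NULL, padded, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    uintptr_t start = ((uintptr_t)raw + HUGE_2MB - 1) & ~(uintptr_t)(HUGE_2MB - 1);
    unsigned char *aligned = (unsigned char *)start;

    /* Give back the unaligned head and the unused tail. */
    if (aligned > raw)
        munmap(raw, (size_t)(aligned - raw));
    size_t tail = (size_t)((raw + padded) - (aligned + len));
    if (tail > 0)
        munmap(aligned + len, tail);

#ifdef MADV_HUGEPAGE
    (void)madvise(aligned, len, MADV_HUGEPAGE);  /* Linux-specific; failure is non-fatal */
#endif
    return aligned;
}
```

The caller releases the region with `munmap(ptr, len)`. Whether huge pages were actually used can be checked in the AnonHugePages field of /proc/self/smaps.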
How does virtual memory enable multitasking?
Each process has its own page table; the OS switches page tables on context switch. Processes can use the same virtual addresses (two processes may both have data at the same address, say 0x400000), but those addresses map to different physical RAM. Processes are isolated from each other: one can't accidentally touch another's memory. The TLB is also flushed (or its entries are tagged per process) on context switch, so one process never uses another's cached translations.