Memory-Mapped I/O
Make a file act like an array
Memory-mapped I/O uses mmap to attach a file's pages directly into a process address space. Reads and writes become ordinary loads and stores; the kernel pages data in on demand. It eliminates explicit read/write syscalls and is the foundation of databases, shared-memory IPC, and binary loaders.
- Page size (default): 4 KB
- Hugepage size: 2 MB / 1 GB
- Minor page fault: ~1 µs
- Major page fault (SSD): ~100 µs
- Read syscall overhead: ~100 ns + memcpy
How memory mapping works
Every modern OS already presents process memory as a translated view of physical RAM via the page table. mmap just lets you ask for a chunk of that virtual address space to be backed by something specific: a file, an anonymous slab, a device. The kernel records the mapping but does not populate the page-table entries for it. The physical pages are only fetched when you actually touch them.
The lifecycle of a single page in a file mapping:
- mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, off) returns a virtual address. No data is read.
- You dereference an address, say *(int*)(addr + 8192). The MMU finds no valid PTE for that virtual page and traps into the kernel: a page fault.
- The fault handler looks up the file, reads the corresponding page from disk into the page cache (or finds it already there), and installs a PTE pointing at it.
- The original load instruction is retried and now succeeds. From the program's perspective it was just a memory access.
- Subsequent stores dirty the page. The kernel's writeback thread will eventually flush it back to the file (for MAP_SHARED) on its own schedule, or immediately on msync(MS_SYNC).
This lazy-fault model is the whole point. A 100 GB file mapped on a 16 GB box is fine — only the actively touched pages live in RAM. The kernel evicts cold pages back to disk under memory pressure, just like its normal page cache. From the application's point of view, the file is a byte array that happens to be larger than memory.
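The lazy-fault behaviour can be observed from Python with a small sketch. It assumes a Linux-like system where ftruncate produces a sparse file; the helper name touch_one_page is made up for illustration. Creating the mapping reads nothing, and the slice touches a single page.

```python
import mmap
import os
import tempfile


def touch_one_page(size: int, offset: int) -> bytes:
    """Map a sparse file of `size` bytes and read 4 bytes at `offset`."""
    with tempfile.TemporaryFile() as f:
        # ftruncate extends the file without writing data; on most Linux
        # filesystems the result is a sparse file that uses almost no disk.
        os.ftruncate(f.fileno(), size)
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as mm:
            # Creating the mapping read nothing. This slice touches one page,
            # so only that page is faulted in. Holes read back as zero bytes.
            return mm[offset:offset + 4]


if __name__ == "__main__":
    one_gib = 1 << 30
    print(touch_one_page(one_gib, 512 * mmap.PAGESIZE))
```

The mapping length here is far larger than anything the process actually reads, which is the same situation as the 100 GB file on a 16 GB machine, only scaled down.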
When to use mmap
- Random access into large files — databases, search indexes, columnar stores.
- Shared-memory IPC between cooperating processes (file-backed or MAP_ANONYMOUS|MAP_SHARED).
- Loading executables and shared libraries: every ld.so on Linux uses mmap.
- Writing to a file that you want to manipulate as a struct or array, with the kernel handling persistence.
Avoid mmap for streaming sequential reads of files larger than RAM. Explicit read with a large buffer plus a posix_fadvise(POSIX_FADV_SEQUENTIAL) hint often beats it, because read-ahead is more aggressive and you don't pay a fault per page. Avoid it for files on networked filesystems too: NFS mmap semantics are subtle and bug-prone.
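For the streaming case, a minimal sketch of the read-based alternative might look like this. The function name stream_checksum and the 1 MiB buffer size are illustrative choices, and os.posix_fadvise is only present on some platforms, so the call is guarded.

```python
import os
import tempfile


def stream_checksum(path: str, buf_size: int = 1 << 20) -> int:
    """Sum every byte of a file using sequential reads into one reusable buffer."""
    total = 0
    view = memoryview(bytearray(buf_size))
    with open(path, "rb", buffering=0) as f:
        # posix_fadvise is not available everywhere (for example macOS).
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            n = f.readinto(view)  # one syscall per buffer, not per page
            if not n:
                break
            total += sum(view[:n])
    return total


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(bytes(range(256)) * 4096)  # 1 MiB of sample data
        tmp.flush()
        print(stream_checksum(tmp.name))
```

Reusing one buffer keeps memory use flat regardless of file size, which is the property that matters when the file is larger than RAM.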
mmap vs read/write vs O_DIRECT
| | mmap (file-backed) | read/write | O_DIRECT | mmap (anonymous, MAP_SHARED) | tmpfs / shm_open | io_uring + registered buffers |
|---|---|---|---|---|---|---|
| User-kernel copies | 0 (page table magic) | 1 per syscall | 0 (DMA into user buffer) | 0 | 0 (RAM-only file) | 0 (pinned buffer) |
| Syscall per access | 0 after fault | 1 per call | 1 per call | 0 after fault | 0 after fault | 0 (sqe ring) |
| Random access | Excellent | Painful — seek + read | Painful — seek + read, alignment-required | Excellent | Excellent | Good |
| Sequential bulk | Good (with MADV_SEQUENTIAL) | Best (read-ahead friendly) | Excellent if aligned | N/A | Excellent | Excellent |
| Page-cache use | Yes — shared with read/write of same file | Yes | No (bypassed) | Yes | Yes (but pages == storage) | Yes by default |
| SIGBUS risk on truncate | Yes | No | No | No | Yes (if the shm object is shrunk) | No |
| Best for | Databases, parsers, BLOB stores | Sequential streaming | Databases with their own buffer cache (Oracle, MySQL/InnoDB) | IPC between related (forked) processes | POSIX shm regions, fast caches | Async high-fanout I/O |
The single most important row is "Random access". Reading 8 KB at offset 50 GB into a file with pread takes ~100 µs of disk plus a syscall. Doing it via a mapped pointer is the same disk read, reached through a page fault instead of a syscall, and on subsequent touches of the same page it's a memory-speed hit (~30 ns) with no kernel entry at all. Databases live or die on this.
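The two random-access paths can be put side by side in a short sketch. The helper names read_with_pread and read_with_mmap are hypothetical, and the sample file is random data created only for the demonstration. Both return the same bytes; the difference is how the kernel is involved.

```python
import mmap
import os
import tempfile


def read_with_pread(fd: int, offset: int, length: int) -> bytes:
    # One syscall per access; the kernel copies from the page cache
    # into a fresh bytes object.
    return os.pread(fd, length, offset)


def read_with_mmap(mm: mmap.mmap, offset: int, length: int) -> bytes:
    # No syscall: the slice is a plain memory read, plus a page fault the
    # first time a page is touched. Python still copies into a bytes object.
    return mm[offset:offset + length]


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.write(os.urandom(64 * mmap.PAGESIZE))
        f.flush()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            off = 17 * mmap.PAGESIZE + 100
            same = read_with_pread(f.fileno(), off, 8) == read_with_mmap(mm, off, 8)
            print("identical bytes:", same)
```

In Python the slice still allocates a bytes object, so the copy savings are smaller than in C, where the mapped pointer can be used in place.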
Python: mmap a file as a slice-able buffer
import mmap
import os

path = "/var/data/index.bin"

with open(path, "r+b") as f:
    size = os.fstat(f.fileno()).st_size
    with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_WRITE) as mm:
        # Treat the file as a bytearray
        header = mm[:16]
        mm[1024:1028] = (42).to_bytes(4, "little")
        # Tell the kernel we'll touch sequential pages
        mm.madvise(mmap.MADV_SEQUENTIAL)
        mm.flush()  # msync: push dirty pages back to disk
Python's mmap module exposes a bytearray-like object you can slice, search, and even re.search over without ever pulling the whole file into memory. madvise lets you tell the kernel about your access pattern: MADV_RANDOM disables read-ahead, MADV_DONTNEED drops resident pages, MADV_HUGEPAGE requests transparent hugepages.
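As an illustration of searching a mapping directly, here is a small sketch that runs a bytes regex over a mapped log file. The function name find_first_error and the ERROR marker are assumptions made for the example; note that mapping an empty file raises ValueError, which the sketch does not handle.

```python
import mmap
import re
import tempfile


def find_first_error(path: str):
    """Return (offset, line) for the first line containing b'ERROR', or None."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # A regex scan walks the file front to back, so hint sequential
            # access where the platform exposes the constant.
            if hasattr(mmap, "MADV_SEQUENTIAL"):
                mm.madvise(mmap.MADV_SEQUENTIAL)
            m = re.search(rb"^.*ERROR.*$", mm, re.MULTILINE)
            if m is None:
                return None
            # group() copies only the matched line out of the mapping.
            return m.start(), m.group(0)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"ok line\nWARN x\nERROR disk full\nlast\n")
        tmp.flush()
        print(find_first_error(tmp.name))
```

Only the pages the regex engine actually walks are faulted in, and only the matching line is copied into a Python object.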
C: shared anonymous mapping for IPC
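#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct { atomic_int counter; char payload[4080]; } region_t;

int main(void) {
    // Anonymous shared pages are zero-filled, so counter starts at 0.
    region_t *r = mmap(NULL, sizeof(region_t),
                       PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (r == MAP_FAILED) return 1;

    pid_t pid = fork();
    if (pid < 0) return 1;
    if (pid == 0) {  // child
        atomic_fetch_add(&r->counter, 1);
        _exit(0);
    }

    // parent: wait for the child instead of sleeping
    waitpid(pid, NULL, 0);
    // The child's increment is visible in the parent's address space too
    printf("counter = %d\n", atomic_load(&r->counter));
    munmap(r, sizeof(region_t));
    return 0;
}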
MAP_ANONYMOUS | MAP_SHARED creates a region that survives fork and is shared between parent and child without a backing file. Cross-process locks (process-shared pthread mutexes, futexes) have to live in memory that is shared like this, whether anonymous or file-backed. Place an atomic in the first cache line and you have lock-free IPC.
Node.js: shared array buffers and N-API mmap
// Within a single process, SharedArrayBuffer + Worker threads is the JS equivalent
import { Worker, isMainThread, workerData } from 'node:worker_threads';

if (isMainThread) {
  const sab = new SharedArrayBuffer(4096);
  const view = new Int32Array(sab);
  new Worker(new URL(import.meta.url), { workerData: sab });
  // Blocks until the worker stores a non-zero value and notifies
  Atomics.wait(view, 0, 0);
  console.log('worker wrote', view[0]);
} else {
  const view = new Int32Array(workerData);
  Atomics.store(view, 0, 42);
  Atomics.notify(view, 0);
}
Node's standard library has no direct mmap binding. SharedArrayBuffer is mmap-flavored shared memory between Worker threads. For true file-backed mmap, native addons like mmap-io wrap the syscall — useful when you want zero-copy reads of multi-gigabyte log files without buffering them through V8.
Mapping variants
- File-backed, MAP_SHARED. The default for "treat a file as memory". Stores hit the file. Other mappers see your writes. Used by databases.
- File-backed, MAP_PRIVATE (copy-on-write). First write to a page makes a private copy; the file stays untouched. Used by binary loaders to map executable text without ever modifying it on disk.
- Anonymous, MAP_PRIVATE. What malloc uses for big allocations. The kernel zero-fills pages on first touch.
- Anonymous, MAP_SHARED. Cross-fork shared memory without a file. Cheaper than POSIX shm for parent/child only.
- Hugepage mappings (MAP_HUGETLB, optionally with MAP_HUGE_2MB or MAP_HUGE_1GB). With 2 MB pages each TLB entry covers 512× more memory than with 4 KB pages, which reduces TLB pressure. DBs and JVMs use them for heap regions.
- Transparent hugepages. The kernel opportunistically promotes 4 KB pages to 2 MB. Faster on average, but introduces pause-time spikes from background defragmentation.
- POSIX shm_open + mmap. A tmpfs-backed file in /dev/shm. It persists until it is unlinked or the system reboots, not just for the lifetime of a fork tree.
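The difference between the shared and private file-backed variants can be sketched with Python's access modes, where ACCESS_WRITE corresponds to a MAP_SHARED mapping and ACCESS_COPY to a MAP_PRIVATE one. The helper name write_through_mapping and the sample contents are made up for the example.

```python
import mmap
import tempfile


def write_through_mapping(access: int) -> bytes:
    """Write b'XXXX' through a mapping and return what the file holds afterwards."""
    with tempfile.TemporaryFile() as f:
        f.write(b"original data...")
        f.flush()
        with mmap.mmap(f.fileno(), 0, access=access) as mm:
            mm[0:4] = b"XXXX"
            if access == mmap.ACCESS_WRITE:
                mm.flush()  # shared mapping: push the dirty page to the file
        f.seek(0)
        return f.read()


if __name__ == "__main__":
    print(write_through_mapping(mmap.ACCESS_WRITE))  # shared: file changes
    print(write_through_mapping(mmap.ACCESS_COPY))   # private: file untouched
```

With the private mapping the store lands in a copy-on-write page that is discarded when the mapping is closed, so the file keeps its original bytes.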
Costed claims
- Page-fault cost: minor fault ~1 µs (page already in cache, just install PTE), major fault ~100 µs on NVMe SSD, ~10 ms on spinning disk. A read syscall to a hot page costs ~100 ns + memcpy.
- TLB reach with 4 KB pages: ~2 MB on a typical x86 with 512 TLB entries. With 2 MB hugepages, the same TLB covers 1 GB, a 512× increase in reach that sharply cuts TLB misses for large working sets.
- Address space cost: on 64-bit Linux you have 47 bits of usable virtual range (~128 TB). Mapping 1 TB of files is routine; the kernel doesn't materialize physical pages until touched.
- Writeback throughput: the kernel's flusher threads coalesce dirty pages into multi-MB writes, sustaining roughly the device's sequential write speed (~3 GB/s on NVMe Gen4). msync(MS_SYNC) blocks until those writes complete.
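The TLB-reach and fault-cost figures above are simple arithmetic, worked through here as a sketch. The 512-entry TLB and the ~1 µs minor-fault cost are the article's round numbers, not measurements, and the first-touch estimate assumes one fault per page, which ignores kernel optimizations that map several pages per fault.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_size


def first_touch_seconds(region_bytes: int, page_size: int, fault_us: float) -> float:
    """Time spent in faults if each page of the region faults exactly once."""
    return (region_bytes // page_size) * fault_us / 1e6


reach_4k = tlb_reach(512, 4 * KIB)  # 2 MiB
reach_2m = tlb_reach(512, 2 * MIB)  # 1 GiB
# Touching every 4 KiB page of a 1 GiB mapping: 262,144 minor faults.
warmup = first_touch_seconds(1 * GIB, 4 * KIB, fault_us=1.0)

if __name__ == "__main__":
    print(reach_4k // MIB, "MiB reach with 4 KiB pages")
    print(reach_2m // GIB, "GiB reach with 2 MiB pages")
    print(f"{warmup:.2f} s of minor faults to first-touch 1 GiB")
```

Under these assumptions, a quarter of a second of fault overhead per gigabyte is one reason prewarming or hugepages can matter for large mappings.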
Common bugs and edge cases
- SIGBUS on truncated file. If the file shrinks below the mapping length, touching the truncated tail kills the process with SIGBUS. Grow the file with ftruncate before extending the mapping, or install a SIGBUS handler.
- Forgetting to msync before unmap. Unmapping a dirty MAP_SHARED region leaves the flush to the kernel's lazy writeback. The dirty pages survive a process crash because they live in the page cache, but a kernel crash or power loss can lose them. Databases call msync(MS_SYNC) before commit fences.
- Holding a write lock across page faults. A page fault inside a critical section can take milliseconds (major fault) and stall every waiter. This is why JVMs prefault their heap and databases prewarm indexes.
- Address space exhaustion on 32-bit. 2–3 GB of usable virtual range means that mapping a 4 GB file directly fails. Sliding-window mmap (re-map small ranges as you scan) is the workaround.
- NFS mmap surprises. Cache coherence is per-client; another machine writing the same file may not be visible until the next attribute revalidation. Avoid mmap for shared data on NFS unless you understand the close-to-open semantics.
- Forking after mmap. MAP_PRIVATE CoW pages duplicate per-process on first write, so innocent-looking memory writes can blow up RSS in forked workers.
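The msync-before-commit pattern from the list above might be sketched like this in Python, where mmap.flush performs a synchronous msync on Unix. The function name commit_record is hypothetical, and the page-alignment arithmetic reflects the requirement that the flush offset be a multiple of the allocation granularity.

```python
import mmap
import os
import tempfile


def commit_record(mm: mmap.mmap, offset: int, payload: bytes) -> None:
    """Store payload at offset, then synchronously flush the pages it touched."""
    end = offset + len(payload)
    mm[offset:end] = payload
    # flush() needs a page-aligned start, so round the offset down and
    # extend the length to still cover the end of the payload.
    gran = mmap.ALLOCATIONGRANULARITY
    start = (offset // gran) * gran
    mm.flush(start, end - start)


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        os.ftruncate(f.fileno(), 4 * mmap.PAGESIZE)
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
            commit_record(mm, mmap.PAGESIZE + 10, b"hello")
        print(os.pread(f.fileno(), 5, mmap.PAGESIZE + 10))
```

Flushing only the touched range keeps the commit cost proportional to the record rather than to the whole mapping.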
Frequently asked questions
Is mmap faster than read?
Sometimes. Sequential bulk reads with a large buffer often beat mmap because the syscall overhead amortizes and read-ahead is aggressive. Mmap wins for random access, partial reads of huge files, and zero-copy IPC. LMDB and modern Lucene use mmap for exactly that reason, and SQLite can be configured to.
What's a page fault and is it expensive?
Accessing a mapped page that isn't yet resident traps into the kernel — a minor fault if the page is in cache (~1 µs) or major if it has to come from disk (~100 µs on SSD, ~10 ms on spinning rust). The first touch of every mapped page costs at least a minor fault.
MAP_SHARED vs MAP_PRIVATE — what's the difference?
MAP_SHARED writes go back to the underlying file and are visible to other mappers. MAP_PRIVATE uses copy-on-write: the first store on a page makes a private copy and the file is never modified. Loaders use PRIVATE for code, databases use SHARED for data files.
Can mmap fail because the file is too big?
On 32-bit systems, yes — your address space tops out at 2–3 GB of usable virtual range. On 64-bit, files routinely exceed RAM and mmap still works because pages fault in lazily, but the kernel can refuse if you hit RLIMIT_AS or system-wide overcommit limits.
What happens if the underlying file is truncated while mapped?
Touching a page beyond the new size raises SIGBUS. The kernel won't extend the mapping or zero-fill — it's the application's responsibility to keep the mapping and file size in sync, which is why databases use ftruncate before extending mappings.
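One way to keep the file and the mapping in sync when growing is to extend the file first and only then create the larger mapping. The sketch below assumes a file descriptor opened for reading and writing; grow_mapping is an illustrative helper, not a standard API.

```python
import mmap
import os
import tempfile


def grow_mapping(fd: int, old: mmap.mmap, new_size: int) -> mmap.mmap:
    """Grow the file first, then return a mapping covering the new size."""
    if new_size < os.fstat(fd).st_size:
        raise ValueError("this helper only grows; shrinking risks SIGBUS elsewhere")
    old.flush()  # push pending stores before dropping the old mapping
    old.close()
    os.ftruncate(fd, new_size)  # the file is now as large as the new mapping
    return mmap.mmap(fd, new_size, access=mmap.ACCESS_WRITE)


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        os.ftruncate(f.fileno(), mmap.PAGESIZE)
        mm = mmap.mmap(f.fileno(), mmap.PAGESIZE, access=mmap.ACCESS_WRITE)
        mm[0:4] = b"head"
        mm = grow_mapping(f.fileno(), mm, 4 * mmap.PAGESIZE)
        print(len(mm), mm[0:4])
        mm.close()
```

Because the file is extended before the new mapping exists, every address in the mapping is always backed by the file, so no access can land beyond end-of-file.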