The Out-of-Memory Killer
When the RAM runs out, the kernel picks a process and pulls the trigger
The Linux out-of-memory (OOM) killer is a kernel last resort that, when physical memory and swap are exhausted, scores every process by a badness heuristic and kills the one whose death frees the most memory while hurting the system least.
- Trigger: allocation that can't be satisfied
- Badness score range: 0 – 1000
- oom_score_adj range: −1000 … +1000
- Victim selection: O(n) scan of tasks
- Unkillable: oom_score_adj = −1000
How the OOM killer works
Linux is an optimistic landlord. When a program calls malloc() or fork(), the kernel hands back a valid pointer or a new address space without checking whether the physical RAM to back it actually exists. This is memory overcommit: the kernel bets that most allocated pages will never all be written at once, the same way an airline overbooks seats. Most of the time the bet pays off, because a freshly allocated page is just a promise — it only consumes real memory the moment your code writes to it and triggers a copy-on-write or demand-zero page fault.
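The gap between promised and paid-for memory can be observed directly. The following is a minimal sketch, assuming a Linux system with /proc mounted; the helper names `rss_bytes` and `demand_paging_demo` are illustrative and not part of any library. It maps anonymous memory, then compares the process's resident set size before and after writing to each page.

```python
import mmap

def rss_bytes() -> int:
    """Resident set size of this process, read from /proc/self/statm."""
    with open('/proc/self/statm') as f:
        resident_pages = int(f.read().split()[1])  # second field: resident pages
    return resident_pages * mmap.PAGESIZE

def demand_paging_demo(size: int = 32 * 1024 * 1024):
    """Return RSS before mapping, after mapping, and after touching every page."""
    before = rss_bytes()
    # Private anonymous mapping: address space is reserved, frames are not.
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE)
    mapped = rss_bytes()
    # Write one byte per page so each page faults in a real physical frame.
    for offset in range(0, size, mmap.PAGESIZE):
        buf[offset] = 1
    touched = rss_bytes()
    buf.close()
    return before, mapped, touched

if __name__ == '__main__':
    before, mapped, touched = demand_paging_demo()
    mib = 1024 * 1024
    print(f'after mmap:  +{(mapped - before) / mib:.1f} MiB resident')
    print(f'after touch: +{(touched - mapped) / mib:.1f} MiB resident')
```

On a typical system the mapping itself adds almost nothing to RSS, while touching the pages adds roughly the full mapping size. Exact figures vary with kernel version and transparent huge page settings.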
The reckoning comes when the bet fails. Some process touches one more page, the page-fault handler needs a free physical frame, and the kernel's reclaim machinery — evicting clean file cache, writing dirty pages to disk, pushing anonymous pages to swap — comes back empty-handed. There is no free memory, no reclaimable memory, and no swap left. The kernel cannot fail a memory write the way malloc() can return NULL; by the time the page is being written, the C code has no error path. So instead of corrupting state or deadlocking forever, the kernel invokes out_of_memory() — the OOM killer.
Its job is brutal and simple: free memory by killing a process. But which process? Killing the tiny allocator that happened to trip the wire would free almost nothing and the system would be back in OOM milliseconds later. So the killer ignores who pulled the trigger and instead scans every task, scores each by a badness heuristic, and sends SIGKILL to the single worst offender — the one whose death buys the most breathing room.
The badness heuristic
The modern badness function (rewritten by David Rientjes in Linux 2.6.36, 2010, replacing the baroque heuristic-soup of the old killer) is deliberately almost-linear in memory footprint. For each task it computes:
points = get_mm_rss(mm)
       + get_mm_counter(mm, MM_SWAPENTS)
       + mm_pgtables_bytes(mm) / PAGE_SIZE   // resident + swapped + page tables, in pages
adj = oom_score_adj                          // -1000 .. +1000, per task
normalized = points * 1000 / totalpages      // scale to 0..1000 of RAM+swap
score = normalized + adj ...                 // adj biases on the same 0..1000 scale
                                             // clamped so score >= 1 if not skipped
The intuition: badness ≈ fraction of system memory the process is using, on a 0–1000 scale, plus a tunable bias. A process pinning 40% of RAM+swap scores around 400 before adjustment. The oom_score_adj term is added in proportion to total memory, so an adjustment of +500 shoves a process roughly halfway up the scale regardless of its real size, and −1000 floors any process to a score the killer treats as "skip entirely."
You can read the live result. /proc/<pid>/oom_score shows the current normalized badness (0–1000), and /proc/<pid>/oom_score_adj shows the bias you can write to it. The selection itself is a single linear scan over the task list — O(n) in the number of processes, no fancy data structure, because by the time you're OOM you have bigger problems than a few thousand iterations.
Two important refinements. First, when running under cgroups, the killer can be scoped to a single memory cgroup that hit its memory.max limit — only tasks inside that group are candidates, so one container's leak kills one container. Second, cgroup v2's memory.oom.group flag turns the kill into a group operation: rather than killing one process and leaving an orphaned, half-working service, the kernel kills every task in the cgroup atomically.
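To check whether a particular cgroup has been hitting its limit, the cgroup v2 interface files can be read directly. This is a sketch under assumptions: the cgroup directory passed to `cgroup_oom_report` is whatever path your service lives under in /sys/fs/cgroup, the function names are illustrative, and the root cgroup does not carry memory.max or memory.oom.group, so it should be pointed at a child group.

```python
from pathlib import Path

def parse_memory_events(text: str) -> dict[str, int]:
    """Parse the flat 'key value' lines of a cgroup v2 memory.events file."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(' ')
        if value.strip().isdigit():
            events[key] = int(value)
    return events

def cgroup_oom_report(cgroup_dir: str) -> dict:
    """Summarize the limit, the group-kill flag, and OOM counters for one cgroup."""
    base = Path(cgroup_dir)
    return {
        # Limit in bytes, or the literal string "max" when unlimited.
        'memory.max': (base / 'memory.max').read_text().strip(),
        # "1" means an OOM kill takes the whole group, "0" a single task.
        'memory.oom.group': (base / 'memory.oom.group').read_text().strip(),
        'events': parse_memory_events((base / 'memory.events').read_text()),
    }

# Sample file content for illustration only; real counters come from the kernel.
sample = "low 0\nhigh 12\nmax 3\noom 1\noom_kill 1\n"
print(parse_memory_events(sample)['oom_kill'])
```

A nonzero `oom_kill` count in a cgroup's memory.events indicates that tasks in that group were killed by the OOM killer, which is useful when the host as a whole never ran short.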
When the killer fires (and when it shouldn't)
- System-wide OOM. Global free memory and swap are exhausted and reclaim fails. The whole machine's tasks are scored.
- cgroup OOM. A cgroup hits its hard memory.max. Only that cgroup's tasks are candidates; the rest of the box is untouched.
- Overcommit set to "never." If you set vm.overcommit_memory = 2, the kernel refuses allocations beyond a strict limit, so malloc() returns NULL and well-written programs fail gracefully instead of being killed, at the cost of wasting memory you'd otherwise overcommit safely.
The killer is a last resort, not a strategy. By the time it fires, the machine has usually spent seconds to minutes thrashing — every page access faulting to disk — and feels frozen. That's why production systems increasingly kill earlier, in userspace, using pressure-stall information instead of waiting for total exhaustion. More on that in the variants below.
Kernel OOM killer vs userspace OOM daemons
| | Kernel OOM killer | systemd-oomd | earlyoom | cgroup v2 memory.max | overcommit = never |
|---|---|---|---|---|---|
| Where it runs | Kernel, in fault path | Userspace daemon | Userspace daemon | Kernel, per-cgroup | Kernel, at alloc time |
| Trigger signal | Allocation cannot be satisfied | PSI memory pressure % | Free RAM + swap thresholds | cgroup exceeds limit | Commit ceiling exceeded |
| When it acts | Too late — after thrashing | Proactive, before exhaustion | Proactive, before exhaustion | At the limit boundary | Before any RAM is touched |
| Kill granularity | One process (or cgroup w/ oom.group) | Whole cgroup / slice | One process by score | Tasks in the cgroup | No kill — alloc just fails |
| Latency to relief | Seconds–minutes (livelock) | Sub-second | Sub-second | Immediate | Immediate |
| Risk | System unresponsive first | May kill too eagerly | Coarse thresholds | Needs correct limits set | Wastes usable memory |
| Typical use | Always-on safety net | Modern desktops & Fedora servers | Lightweight servers, containers | Per-container isolation | Latency-critical / RT systems |
The kernel killer can never be fully disabled — it's the floor that keeps the box from deadlocking. The userspace daemons sit above it, killing on softer signals so the kernel's brutal last resort almost never has to run.
What the numbers actually cost
- Thrashing is up to a 100,000× slowdown. A page resident in RAM is read in ~100 ns. A page that must be faulted back from a SATA SSD costs ~100 µs (about 1,000× slower), and from a spinning disk ~10 ms (about 100,000× slower). When the working set exceeds RAM, the machine spends nearly all its cycles servicing faults and effective throughput collapses by three to five orders of magnitude, which is why a near-OOM box looks hung.
- The OOM scan is cheap. Scoring every task is O(n): even 10,000 processes is ~10,000 arithmetic evaluations, well under a millisecond. The expensive part is the minutes of reclaim and thrashing that preceded it — not the decision itself.
- One bias value flips the verdict. On a 16 GB box, a 12 GB Postgres process scores near 750. Setting oom_score_adj = -900 on Postgres and +500 on a batch job means the 200 MB batch job (raw score ~12) now scores ~512 and dies first, sparing the database that was 60× larger.
- The infamous DOOM line. The kernel log line "Out of memory: Killed process 1234 (chrome)" plus an oom_kill counter in /proc/vmstat are the only forensic trail: the killed process gets SIGKILL, so it runs no cleanup handler and writes no log of its own.
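The thrashing figures in the list above can be turned into a small effective-access-time calculation. This is a back-of-the-envelope sketch, not a measurement: the latencies are the round numbers quoted in the text, and the function names are illustrative.

```python
RAM_NS = 100                # access to a resident page (~100 ns)
SSD_FAULT_NS = 100_000      # page faulted back from a SATA SSD (~100 µs)
HDD_FAULT_NS = 10_000_000   # page faulted back from a spinning disk (~10 ms)

def effective_access_ns(fault_rate: float, fault_ns: int, hit_ns: int = RAM_NS) -> float:
    """Average cost per access when a fraction fault_rate of accesses fault."""
    return (1.0 - fault_rate) * hit_ns + fault_rate * fault_ns

def slowdown(fault_rate: float, fault_ns: int, hit_ns: int = RAM_NS) -> float:
    """How many times slower than an all-in-RAM workload."""
    return effective_access_ns(fault_rate, fault_ns, hit_ns) / hit_ns

for rate in (0.0001, 0.01, 1.0):
    print(f'fault rate {rate:>8.2%}: '
          f'SSD {slowdown(rate, SSD_FAULT_NS):>9.1f}x, '
          f'HDD {slowdown(rate, HDD_FAULT_NS):>10.1f}x')
```

The useful takeaway is how little faulting it takes: under these assumptions, a workload where only 1% of accesses go to a spinning disk is already about 1,000× slower than one that fits in RAM.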
A model of victim selection in JavaScript
The core decision — score every process, pick the max — is small enough to model directly. This reproduces the modern badness math, including the −1000 "skip" floor and the proportional oom_score_adj bias:
const TOTAL_PAGES = 4_194_304; // 16 GiB / 4 KiB pages

// Each process: pages of RSS + swap + page-table, and an adjustment.
function badness(proc, totalPages = TOTAL_PAGES) {
  // oom_score_adj = -1000 means "never kill": report 0 so the scan skips it.
  if (proc.oomScoreAdj <= -1000) return 0;
  const points = proc.rssPages + proc.swapPages + proc.pgtablePages;
  // Normalize footprint to 0..1000 of system memory.
  let score = Math.floor((points * 1000) / totalPages);
  // The bias is already on the same 0..1000 scale, so it is simply added.
  score += proc.oomScoreAdj;
  // Clamp into the reported range; a live candidate is at least 1.
  return Math.max(1, Math.min(1000, score));
}

function selectVictim(processes) {
  // O(n) linear scan. Starting at 0 means a score of 0 (exempt) is never chosen.
  let victim = null, worst = 0;
  for (const p of processes) {
    if (p.pid === 1 || p.kernelThread) continue; // never kill init or kthreads
    const s = badness(p);
    if (s > worst) { worst = s; victim = p; }
  }
  return victim; // gets SIGKILL; null if every task is exempt
}

const procs = [
  { pid: 800, name: 'postgres', rssPages: 3_145_728, swapPages: 0, pgtablePages: 6_000, oomScoreAdj: -900 },
  { pid: 900, name: 'java', rssPages: 786_432, swapPages: 65_536, pgtablePages: 2_000, oomScoreAdj: 0 },
  { pid: 1010, name: 'batch.py', rssPages: 52_428, swapPages: 0, pgtablePages: 200, oomScoreAdj: 500 },
];

console.log(selectVictim(procs).name); // -> "batch.py": small, but +500 bias makes it the chosen victim
Note the structural facts the model captures: PID 1 (init/systemd) and kernel threads are unconditionally exempt — killing them would take down the system the killer is trying to save. And the victim is chosen by score, not by who allocated; here a 200 MB Python job dies while a 12 GB database survives, purely because of the adjustment bias.
Inspecting and steering the real killer in Python
On a live Linux box you don't reimplement the math — you read /proc, which exposes the kernel's own live scores. This script ranks the actual OOM candidates and lets you protect or sacrifice one:
import os

def read_int(path):
    """Read a single integer from a /proc file, or None if it is gone or malformed."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def oom_candidates():
    rows = []
    for pid in filter(str.isdigit, os.listdir('/proc')):
        score = read_int(f'/proc/{pid}/oom_score')    # kernel's live badness, 0..1000
        adj = read_int(f'/proc/{pid}/oom_score_adj')  # the bias, -1000..1000
        if score is None or adj is None:
            continue  # process exited mid-scan
        try:
            with open(f'/proc/{pid}/comm') as f:
                name = f.read().strip()
        except OSError:
            continue
        rows.append((score, adj, int(pid), name))
    return sorted(rows, reverse=True)  # highest score = first to die

def protect(pid):
    # -1000 makes the process effectively unkillable by the OOM killer.
    # Lowering the value normally requires root privileges.
    with open(f'/proc/{pid}/oom_score_adj', 'w') as f:
        f.write('-1000')

def sacrifice(pid):
    # +1000 makes the process the preferred victim.
    with open(f'/proc/{pid}/oom_score_adj', 'w') as f:
        f.write('1000')

if __name__ == '__main__':
    for score, adj, pid, name in oom_candidates()[:5]:
        print(f'{score:4d} adj={adj:+5d} pid={pid:<7d} {name}')
The same protection is available declaratively: a systemd unit can carry OOMScoreAdjust=-900 so the daemon is reborn protected on every restart, and ManagedOOMPreference=avoid tells systemd-oomd to prefer killing other cgroups first.
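As a configuration sketch, the two directives named above could live in a drop-in file for the unit. The path and the unit name `postgresql.service` are hypothetical; only the directive names come from the text.

```ini
# /etc/systemd/system/postgresql.service.d/oom.conf  (hypothetical drop-in path)
[Service]
# Bias the kernel OOM killer away from this service on every start.
OOMScoreAdjust=-900
# Ask systemd-oomd to prefer other cgroups when it has to pick a victim.
ManagedOOMPreference=avoid
```

After adding a drop-in like this, `systemctl daemon-reload` followed by a restart of the unit applies it.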
Variants and the move to userspace
systemd-oomd. A userspace daemon that watches pressure-stall information (PSI) — the fraction of wall-clock time tasks stall waiting on memory, exported via /proc/pressure/memory. When a cgroup's pressure or swap usage crosses a configured threshold, oomd kills the worst cgroup before the kernel killer would ever wake up. Default on Fedora since 34 and on modern Ubuntu desktops.
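The PSI file is plain text, so a pressure check is easy to prototype. The sketch below parses the `some` and `full` lines of /proc/pressure/memory and applies a threshold. It is not systemd-oomd's actual logic: the function names, the 60% limit, and the choice of the `avg10` window are assumptions made for illustration, and the sample values are invented.

```python
def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI text into {'some': {'avg10': ..., 'total': ...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split('=', 1) for f in fields)}
    return result

def over_threshold(psi, limit_pct: float = 60.0, window: str = 'avg10') -> bool:
    # Hypothetical policy: act when all tasks were stalled on memory
    # for more than limit_pct of the chosen averaging window.
    return psi['full'][window] > limit_pct

# Invented sample in the format of /proc/pressure/memory.
sample = (
    "some avg10=72.31 avg60=40.10 avg300=12.05 total=123456789\n"
    "full avg10=65.00 avg60=33.20 avg300=9.87 total=98765432\n"
)
psi = parse_psi(sample)
print(psi['full']['avg10'], over_threshold(psi))
```

On a live system the same parser can be fed `open('/proc/pressure/memory').read()`, provided the kernel was built with PSI support.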
earlyoom. A tiny, dependency-free daemon that polls free RAM and swap; when both drop below percentage thresholds (e.g. 10% RAM, 10% swap) it picks the highest-oom_score process and kills it. Popular on lightweight servers and Raspberry Pis where the kernel's livelock is especially painful.
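The threshold check described here can be sketched in a few lines. This is not earlyoom's source code, only a model of the "both below threshold" rule under assumptions: it uses the MemAvailable, MemTotal, SwapFree and SwapTotal fields of /proc/meminfo, treats a system with no swap as having 0% swap free, and the function names and sample values are illustrative.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Map /proc/meminfo keys to their numeric values (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(':')
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key] = int(parts[0])
    return info

def should_kill(info, mem_min_pct: float = 10.0, swap_min_pct: float = 10.0) -> bool:
    """True when available RAM and free swap are both at or below their thresholds."""
    mem_pct = 100.0 * info['MemAvailable'] / info['MemTotal']
    swap_total = info.get('SwapTotal', 0)
    # With no swap configured, treat swap as already exhausted.
    swap_pct = 100.0 * info.get('SwapFree', 0) / swap_total if swap_total else 0.0
    return mem_pct <= mem_min_pct and swap_pct <= swap_min_pct

# Invented sample: 5% of RAM available, 5% of swap free.
sample = (
    "MemTotal:       16384000 kB\n"
    "MemAvailable:     819200 kB\n"
    "SwapTotal:       4096000 kB\n"
    "SwapFree:         204800 kB\n"
)
print(should_kill(parse_meminfo(sample)))
```

Requiring both conditions matters: a box with plenty of free swap is slow but not yet out of options, so a daemon following this rule holds off until swap is nearly gone as well.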
cgroup v2 memory.oom.group. Set this flag on a cgroup and an OOM event kills the whole group as one unit, not a single process. Essential for services where one orphaned worker is worse than a clean restart.
Overcommit policies. vm.overcommit_memory takes 0 (heuristic, the default), 1 (always allow — never refuse, used by Redis to avoid fork failures), or 2 (never overcommit — refuse beyond swap + overcommit_ratio·RAM, so allocations fail instead of triggering kills).
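The mode 2 ceiling is simple arithmetic, sketched below. The function names are illustrative, the huge-page term follows the kernel's documented formula for CommitLimit, and the sketch ignores the small admin and user reserves the kernel also holds back.

```python
def commit_limit_kib(ram_kib: int, swap_kib: int,
                     overcommit_ratio: int = 50, hugetlb_kib: int = 0) -> int:
    """Commit ceiling under vm.overcommit_memory = 2, in KiB.

    swap + overcommit_ratio% of RAM (excluding memory set aside for huge pages).
    """
    return swap_kib + (ram_kib - hugetlb_kib) * overcommit_ratio // 100

def would_refuse(committed_kib: int, request_kib: int, limit_kib: int) -> bool:
    """True if a new allocation would push committed memory past the ceiling."""
    return committed_kib + request_kib > limit_kib

GIB = 1024 * 1024  # KiB per GiB
limit = commit_limit_kib(ram_kib=16 * GIB, swap_kib=4 * GIB)
print(limit // GIB, 'GiB commit limit')
print(would_refuse(committed_kib=11 * GIB, request_kib=2 * GIB, limit_kib=limit))
```

With the default ratio of 50, a 16 GiB machine with 4 GiB of swap refuses allocations once 12 GiB has been committed, even though more physical memory exists. That is the "wasted memory" cost of this mode. The kernel's own figures appear as CommitLimit and Committed_AS in /proc/meminfo.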
The OOM reaper. Added in Linux 4.6 (2016), a kernel thread (oom_reaper) that asynchronously reclaims the dying victim's anonymous memory without waiting for it to finish dying. Before it existed, a victim stuck in an uninterruptible state could leave the system OOM-deadlocked even after the kill decision was made.
Common bugs and edge cases
- Blaming the victim. The process in the kill log is rarely the leak. It's just the biggest target. To find the real culprit, look at memory growth over time, not the single name in the OOM message.
- Protecting a process by accident. Setting oom_score_adj = -1000 on a leaking service doesn't fix the leak; it forces the killer to murder innocent processes instead, and can deadlock the whole machine if the protected process is the one consuming everything.
- Forgetting children inherit the adj. A child fork()ed from a process inherits its oom_score_adj. Protect a shell with −1000 and every command you run from it is protected too, which is usually not what you want.
- Overcommit "never" surprises. Switching to vm.overcommit_memory = 2 makes fork() of a large process fail even when the child will immediately exec() a tiny program, the classic reason Redis recommends overcommit = 1.
- The silent SIGKILL. Victims get signal 9, so they run no atexit handlers, flush no buffers, and write no crash log. A database killed mid-write relies entirely on its own write-ahead log to recover; the OOM killer offers no grace period.
- Container limits vs node limits. A container can be OOM-killed by its cgroup memory.max while the host has gigabytes free. The kill log appears, but free -m on the node looks fine because the limit was local to the cgroup.
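The inheritance pitfall in the list above is easy to confirm on a live system. The sketch below raises the current process's own oom_score_adj (raising needs no privileges, lowering usually does), spawns a child interpreter, and compares the two values. The function names are illustrative, and the write is skipped silently if /proc is read-only, as it is in some sandboxes.

```python
import subprocess
import sys

ADJ_PATH = '/proc/self/oom_score_adj'

def read_adj() -> int:
    with open(ADJ_PATH) as f:
        return int(f.read())

def child_adj() -> int:
    """Spawn a child interpreter and return the oom_score_adj it sees for itself."""
    code = f"print(open({ADJ_PATH!r}).read().strip())"
    out = subprocess.run([sys.executable, '-c', code],
                         capture_output=True, text=True, check=True)
    return int(out.stdout)

def inheritance_demo(bump: int = 100):
    """Raise our own adj by `bump` (capped at 1000) and compare with a child."""
    target = min(1000, read_adj() + bump)
    try:
        with open(ADJ_PATH, 'w') as f:
            f.write(str(target))
    except OSError:
        pass  # read-only /proc: the unchanged value is still inherited
    return read_adj(), child_adj()

if __name__ == '__main__':
    parent, child = inheritance_demo()
    print(f'parent adj={parent:+d}, child adj={child:+d}')
```

The child reports the same value as the parent because the adjustment is copied on fork() and survives exec(). Note that the raised value stays in effect for the rest of the calling process's life.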
Frequently asked questions
Why does the OOM killer kill a process I never asked it to touch?
The killer doesn't target the process that requested the failing allocation — it targets whichever process has the highest badness score, which is roughly proportional to total resident memory. A small allocator can trigger the kill, but a huge, long-running memory hog like a database or a JVM is what actually dies, because killing it frees the most memory.
How is the OOM badness score calculated?
In modern kernels, badness = (RSS + swap + page-table pages) measured in pages, normalized to a 0–1000 scale against total RAM+swap; then oom_score_adj (−1000 to +1000) is added. The process with the highest final score is killed. The score is recomputed live during the OOM event, not cached.
What is oom_score_adj and how do I protect a process?
/proc/<pid>/oom_score_adj is a tunable bias from −1000 to +1000 added to the badness score. Set it to −1000 to make a process effectively unkillable (its score floors at 0 and the killer skips it); set it to +1000 to make it the preferred victim. systemd exposes this as the OOMScoreAdjust= unit directive.
What is memory overcommit and why does it cause OOM kills?
Linux lets processes allocate more virtual memory than physically exists, betting that most pages won't all be touched at once. malloc and fork succeed optimistically. The reckoning comes when code actually writes to those pages — if real memory is gone by then, the kernel can't fail the write, so it invokes the OOM killer instead.
Why does my server freeze before the OOM killer fires?
Before declaring OOM, the kernel reclaims pages by evicting clean file-backed cache and thrashing swap. When free RAM is nearly gone, almost every page touched is a fault that must be read back from disk, so the machine spends nearly all its time in I/O and reclaim. This 'livelock' can last minutes; userspace daemons like earlyoom or systemd-oomd kill earlier to avoid it.
What is the difference between the kernel OOM killer and systemd-oomd?
The kernel OOM killer is a reactive last resort that only fires when an allocation literally cannot be satisfied — by which point the system may already be thrashing. systemd-oomd and earlyoom run in userspace and kill proactively while there is still slack: systemd-oomd watches pressure-stall (PSI) metrics, while earlyoom polls free RAM and swap thresholds. Both trade a slightly earlier kill for a responsive machine.