The Out-of-Memory Killer

When the RAM runs out, the kernel picks a process and pulls the trigger

The Linux out-of-memory (OOM) killer is a kernel last resort that, when physical memory and swap are exhausted, scores every process by a badness heuristic and kills the one whose death frees the most memory while hurting the system least.

  • Trigger: Allocation that can't be satisfied
  • Badness score range: 0 – 1000
  • oom_score_adj range: −1000 … +1000
  • Victim selection: O(n) scan of tasks
  • Unkillable: oom_score_adj = −1000

How the OOM killer works

Linux is an optimistic landlord. When a program calls malloc() or fork(), the kernel hands back a valid pointer or a new address space without checking whether the physical RAM to back it actually exists. This is memory overcommit: the kernel bets that most allocated pages will never all be written at once, the same way an airline overbooks seats. Most of the time the bet pays off, because a freshly allocated page is just a promise — it only consumes real memory the moment your code writes to it and triggers a copy-on-write or demand-zero page fault.

The reckoning comes when the bet fails. Some process touches one more page, the page-fault handler needs a free physical frame, and the kernel's reclaim machinery — evicting clean file cache, writing dirty pages to disk, pushing anonymous pages to swap — comes back empty-handed. There is no free memory, no reclaimable memory, and no swap left. The kernel cannot fail a memory write the way malloc() can return NULL; by the time the page is being written, the C code has no error path. So instead of corrupting state or deadlocking forever, the kernel invokes out_of_memory() — the OOM killer.

Its job is brutal and simple: free memory by killing a process. But which process? Killing the tiny allocator that happened to trip the wire would free almost nothing and the system would be back in OOM milliseconds later. So the killer ignores who pulled the trigger and instead scans every task, scores each by a badness heuristic, and sends SIGKILL to the single worst offender — the one whose death buys the most breathing room.

The badness heuristic

The modern badness function (rewritten by David Rientjes in Linux 2.6.36, 2010, replacing the baroque heuristic-soup of the old killer) is deliberately almost-linear in memory footprint. For each task it computes:

points = get_mm_rss(mm)
       + get_mm_counter(mm, MM_SWAPENTS)
       + mm_pgtables_bytes(mm) / PAGE_SIZE        // resident + swapped + page tables, in pages

adj = oom_score_adj                                // -1000 .. +1000, per task
normalized = points * 1000 / (totalpages)          // scale to 0..1000 of RAM+swap
score = normalized + adj ...                        // adj biases on the same 0..1000 scale
                                                    // clamped so score >= 1 if not skipped

The intuition: badness ≈ fraction of system memory the process is using, on a 0–1000 scale, plus a tunable bias. A process pinning 40% of RAM+swap scores around 400 before adjustment. The oom_score_adj term is added in proportion to total memory, so an adjustment of +500 shoves a process roughly halfway up the scale regardless of its real size, and −1000 floors any process to a score the killer treats as "skip entirely."

You can read the live result. /proc/<pid>/oom_score shows the current normalized badness (0–1000), and /proc/<pid>/oom_score_adj shows the bias you can write to it. The selection itself is a single linear scan over the task list — O(n) in the number of processes, no fancy data structure, because by the time you're OOM you have bigger problems than a few thousand iterations.

Two important refinements. First, when running under cgroups, the killer can be scoped to a single memory cgroup that hit its memory.max limit — only tasks inside that group are candidates, so one container's leak kills one container. Second, cgroup v2's memory.oom.group flag turns the kill into a group operation: rather than killing one process and leaving an orphaned, half-working service, the kernel kills every task in the cgroup atomically.
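To check whether a particular cgroup has seen OOM events, cgroup v2 keeps counters in the group's memory.events file. The following is a minimal sketch, assuming the flat "key value" layout of that file; the helper names and the sample text are illustrative and were not captured from a real system.

```python
def parse_memory_events(text):
    """Parse cgroup v2 memory.events ("key value" per line) into a dict of ints."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def had_oom_kill(events):
    """True if at least one task in the cgroup was killed by the OOM killer."""
    return events.get("oom_kill", 0) > 0


# Illustrative sample only; real counters depend on the workload.
SAMPLE = """low 0
high 12
max 340
oom 2
oom_kill 1
"""

events = parse_memory_events(SAMPLE)
print(events["max"], had_oom_kill(events))
```

On a live machine the text would come from the memory.events file inside the cgroup's directory under the cgroup v2 mount point, which varies by distribution and container runtime.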

When the killer fires (and when it shouldn't)

  • System-wide OOM. Global free memory and swap are exhausted and reclaim fails. The whole machine's tasks are scored.
  • cgroup OOM. A cgroup hits its hard memory.max. Only that cgroup's tasks are candidates — the rest of the box is untouched.
  • Overcommit set to "never." If you set vm.overcommit_memory = 2, the kernel refuses allocations beyond a strict limit, so malloc() returns NULL and well-written programs fail gracefully instead of being killed — at the cost of wasting memory you'd otherwise overcommit safely.

The killer is a last resort, not a strategy. By the time it fires, the machine has usually spent seconds to minutes thrashing — every page access faulting to disk — and feels frozen. That's why production systems increasingly kill earlier, in userspace, using pressure-stall information instead of waiting for total exhaustion. More on that in the variants below.

Kernel OOM killer vs userspace OOM daemons

Aspect | Kernel OOM killer | systemd-oomd | earlyoom | cgroup v2 memory.max | overcommit = never
Where it runs | Kernel, in fault path | Userspace daemon | Userspace daemon | Kernel, per-cgroup | Kernel, at alloc time
Trigger signal | Allocation cannot be satisfied | PSI memory pressure % | Free RAM + swap thresholds | cgroup exceeds limit | Commit ceiling exceeded
When it acts | Too late, after thrashing | Proactive, before exhaustion | Proactive, before exhaustion | At the limit boundary | Before any RAM is touched
Kill granularity | One process (or cgroup w/ oom.group) | Whole cgroup / slice | One process by score | Tasks in the cgroup | No kill; alloc just fails
Latency to relief | Seconds–minutes (livelock) | Sub-second | Sub-second | Immediate | Immediate
Risk | System unresponsive first | May kill too eagerly | Coarse thresholds | Needs correct limits set | Wastes usable memory
Typical use | Always-on safety net | Modern desktops & Fedora servers | Lightweight servers, containers | Per-container isolation | Latency-critical / RT systems

The kernel killer can never be fully disabled — it's the floor that keeps the box from deadlocking. The userspace daemons sit above it, killing on softer signals so the kernel's brutal last resort almost never has to run.

What the numbers actually cost

  • Thrashing is up to a 100,000× slowdown. A page resident in RAM is read in ~100 ns. A page that must be faulted back from a SATA SSD costs ~100 µs (about 1,000× slower), and from a spinning disk ~10 ms (about 100,000× slower). When the working set exceeds RAM, the machine spends nearly all its cycles servicing faults, and effective throughput collapses by three to five orders of magnitude, which is why a near-OOM box looks hung.
  • The OOM scan is cheap. Scoring every task is O(n): even 10,000 processes is ~10,000 arithmetic evaluations, well under a millisecond. The expensive part is the minutes of reclaim and thrashing that preceded it — not the decision itself.
  • One bias value flips the verdict. A 16 GB box with a 12 GB Postgres process scores it near 750. Setting oom_score_adj = -900 on Postgres and +500 on a batch job means the 200 MB batch job (raw score ~12) now scores ~512 and dies first, sparing the database that was 60× larger.
  • The infamous OOM line. The kernel log line "Out of memory: Killed process 1234 (chrome)" plus the oom_kill counter in /proc/vmstat are the only forensic trail: the killed process gets SIGKILL, so it runs no cleanup handler and writes no log of its own.
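The thrashing arithmetic can be made concrete with a weighted average of access times. This is a rough sketch using the latency figures quoted above (~100 ns for RAM, ~100 µs for a SATA SSD fault, ~10 ms for a spinning-disk fault); the function names are illustrative, and the model deliberately ignores queueing, readahead, and CPU time spent in reclaim.

```python
RAM_NS = 100                # ~100 ns for a page already resident in RAM
SSD_FAULT_NS = 100_000      # ~100 µs to fault a page back from a SATA SSD
HDD_FAULT_NS = 10_000_000   # ~10 ms to fault a page back from a spinning disk


def effective_access_ns(fault_rate, fault_ns, ram_ns=RAM_NS):
    """Average cost of one page access when a fraction fault_rate must hit disk."""
    return (1 - fault_rate) * ram_ns + fault_rate * fault_ns


def slowdown(fault_rate, fault_ns, ram_ns=RAM_NS):
    """How many times slower the average access is than a pure-RAM access."""
    return effective_access_ns(fault_rate, fault_ns, ram_ns) / ram_ns


for rate in (0.0, 0.01, 0.5, 1.0):
    ssd = slowdown(rate, SSD_FAULT_NS)
    hdd = slowdown(rate, HDD_FAULT_NS)
    print(f"fault rate {rate:>4.0%}: SSD {ssd:>9.1f}x  HDD {hdd:>9.1f}x")
```

Even a 1% fault rate against a spinning disk makes the average access roughly a thousand times slower than RAM under this model, which is why a box can feel frozen well before memory is fully exhausted.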

A model of victim selection in JavaScript

The core decision — score every process, pick the max — is small enough to model directly. This reproduces the modern badness math, including the −1000 "skip" floor and the proportional oom_score_adj bias:

const TOTAL_PAGES = 4_194_304;          // 16 GiB / 4 KiB pages

// Each process: pages of RSS + swap + page-table, and an adjustment.
function badness(proc, totalPages = TOTAL_PAGES) {
  // oom_score_adj = -1000 means "never kill": score floors at 0, skip.
  if (proc.oomScoreAdj <= -1000) return 0;

  const points = proc.rssPages + proc.swapPages + proc.pgtablePages;
  // Normalize footprint to 0..1000 of system memory.
  let score = Math.floor((points * 1000) / totalPages);
  // Bias is added directly, on the same 0..1000 scale as the footprint.
  score += proc.oomScoreAdj;
  // Clamp into the reported range; a live candidate is at least 1.
  return Math.max(1, Math.min(1000, score));
}

function selectVictim(processes) {
  let victim = null, worst = 0;             // O(n) linear scan; a score of 0 means "skip" and never wins
  for (const p of processes) {
    if (p.pid === 1 || p.kernelThread) continue;   // never kill init or kthreads
    const s = badness(p);
    if (s > worst) { worst = s; victim = p; }
  }
  return victim;                            // gets SIGKILL
}

const procs = [
  { pid: 800,  name: 'postgres', rssPages: 3_145_728, swapPages: 0, pgtablePages: 6_000, oomScoreAdj: -900 },
  { pid: 900,  name: 'java',     rssPages:   786_432, swapPages: 65_536, pgtablePages: 2_000, oomScoreAdj: 0 },
  { pid: 1010, name: 'batch.py', rssPages:    52_428, swapPages: 0, pgtablePages:   200, oomScoreAdj: 500 },
];
console.log(selectVictim(procs).name);     // -> "batch.py": small, but +500 bias makes it the chosen victim

Note the structural facts the model captures: PID 1 (init/systemd) and kernel threads are unconditionally exempt — killing them would take down the system the killer is trying to save. And the victim is chosen by score, not by who allocated; here a 200 MB Python job dies while a 12 GB database survives, purely because of the adjustment bias.

Inspecting and steering the real killer in Python

On a live Linux box you don't reimplement the math — you read /proc, which exposes the kernel's own live scores. This script ranks the actual OOM candidates and lets you protect or sacrifice one:

import os

def read_int(path):
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def oom_candidates():
    rows = []
    for pid in filter(str.isdigit, os.listdir('/proc')):
        score = read_int(f'/proc/{pid}/oom_score')        # kernel's live badness, 0..1000
        adj   = read_int(f'/proc/{pid}/oom_score_adj')     # the bias, -1000..1000
        if score is None or adj is None:
            continue                                       # process exited mid-scan
        try:
            with open(f'/proc/{pid}/comm') as f:
                name = f.read().strip()
        except OSError:
            continue
        rows.append((score, adj, int(pid), name))
    return sorted(rows, reverse=True)                      # highest score = first to die

def protect(pid):
    # -1000 makes the process effectively unkillable by the OOM killer.
    with open(f'/proc/{pid}/oom_score_adj', 'w') as f:
        f.write('-1000')

def sacrifice(pid):
    # +1000 makes the process the preferred victim.
    with open(f'/proc/{pid}/oom_score_adj', 'w') as f:
        f.write('1000')

if __name__ == '__main__':
    for score, adj, pid, name in oom_candidates()[:5]:
        print(f'{score:4d}  adj={adj:+5d}  pid={pid:<7d} {name}')

The same protection is available declaratively: a systemd unit can carry OOMScoreAdjust=-900 so the daemon is reborn protected on every restart, and ManagedOOMPreference=avoid tells systemd-oomd to prefer killing other cgroups first.

Variants and the move to userspace

systemd-oomd. A userspace daemon that watches pressure-stall information (PSI) — the fraction of wall-clock time tasks stall waiting on memory, exported via /proc/pressure/memory. When a cgroup's pressure or swap usage crosses a configured threshold, oomd kills the worst cgroup before the kernel killer would ever wake up. Default on Fedora since 34 and on modern Ubuntu desktops.
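The PSI file is plain text and easy to inspect yourself. Below is a minimal sketch of parsing it, assuming the usual two-line "some"/"full" layout with avg10, avg60, avg300 and total fields. The 60% limit, the function names, and the sample values are illustrative assumptions, not systemd-oomd's actual decision logic.

```python
def parse_psi(text):
    """Parse /proc/pressure/memory-style text into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


def under_pressure(psi, limit_pct=60.0, window="avg10"):
    """True if the 'full' stall percentage for the chosen window meets the limit."""
    return psi.get("full", {}).get(window, 0.0) >= limit_pct


# Illustrative sample only; not captured from a real machine.
SAMPLE = (
    "some avg10=72.50 avg60=40.10 avg300=12.00 total=123456789\n"
    "full avg10=65.25 avg60=31.00 avg300=9.50 total=98765432\n"
)

psi = parse_psi(SAMPLE)
print(psi["full"]["avg10"], under_pressure(psi))
```

The "full" line counts time when all non-idle tasks were stalled at once, so it is the stricter signal; "some" counts time when at least one task was stalled.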

earlyoom. A tiny, dependency-free daemon that polls free RAM and swap; when both drop below percentage thresholds (e.g. 10% RAM, 10% swap) it picks the highest-oom_score process and kills it. Popular on lightweight servers and Raspberry Pis where the kernel's livelock is especially painful.
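The threshold check described here can be sketched in a few lines. This is an illustration of the idea rather than earlyoom's own code: it assumes the MemAvailable, MemTotal, SwapFree and SwapTotal fields of /proc/meminfo as the inputs, treats a machine with no swap as having 0% swap free, and uses hypothetical helper names and sample values.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def should_kill(info, mem_min_pct=10.0, swap_min_pct=10.0):
    """True when BOTH available RAM and free swap are at or below their thresholds."""
    mem_pct = 100.0 * info["MemAvailable"] / info["MemTotal"]
    swap_total = info.get("SwapTotal", 0)
    # With no swap configured, swap can never help, so count it as 0% free.
    swap_pct = 100.0 * info.get("SwapFree", 0) / swap_total if swap_total else 0.0
    return mem_pct <= mem_min_pct and swap_pct <= swap_min_pct


# Illustrative sample only: 5% RAM available, 5% swap free.
SAMPLE = """MemTotal:       16000000 kB
MemAvailable:     800000 kB
SwapTotal:       4000000 kB
SwapFree:         200000 kB
"""

print(should_kill(parse_meminfo(SAMPLE)))
```

Requiring both conditions means a machine with plenty of free swap is left alone even when RAM is tight, since the kernel can still page out.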

cgroup v2 memory.oom.group. Set this flag on a cgroup and an OOM event kills the whole group as one unit, not a single process. Essential for services where one orphaned worker is worse than a clean restart.

Overcommit policies. vm.overcommit_memory takes 0 (heuristic, the default), 1 (always allow — never refuse, used by Redis to avoid fork failures), or 2 (never overcommit — refuse beyond swap + overcommit_ratio·RAM, so allocations fail instead of triggering kills).
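The mode 2 ceiling is simple arithmetic. The sketch below applies the formula from the paragraph above (swap plus overcommit_ratio percent of RAM) with the kernel's default ratio of 50. It is a simplification that ignores huge pages, the overcommit_kbytes alternative, and reserved memory, and the function names are illustrative.

```python
def commit_limit_kib(ram_kib, swap_kib, overcommit_ratio=50):
    """Approximate commit ceiling under vm.overcommit_memory = 2, in KiB."""
    return swap_kib + ram_kib * overcommit_ratio // 100


def would_refuse(committed_kib, request_kib, limit_kib):
    """True if granting the request would push committed memory past the ceiling."""
    return committed_kib + request_kib > limit_kib


GIB = 1024 * 1024  # KiB per GiB

# Hypothetical machine: 16 GiB RAM, 4 GiB swap, default ratio of 50.
limit = commit_limit_kib(16 * GIB, 4 * GIB)
print(limit // GIB)                            # ceiling in GiB
print(would_refuse(11 * GIB, 2 * GIB, limit))  # 11 GiB committed, asking for 2 more
```

With the default ratio, half of physical RAM is never promised to anyone, which is the "wasted memory" cost of strict accounting.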

The OOM reaper. Added in Linux 4.6 (2016), a kernel thread (oom_reaper) that asynchronously reclaims the dying victim's anonymous memory without waiting for it to finish dying. Before it existed, a victim stuck in an uninterruptible state could leave the system OOM-deadlocked even after the kill decision was made.

Common bugs and edge cases

  • Blaming the victim. The process in the kill log is rarely the leak. It's just the biggest target. To find the real culprit, look at memory growth over time, not the single name in the OOM message.
  • Protecting a process by accident. Setting oom_score_adj = -1000 on a leaking service doesn't fix the leak — it forces the killer to murder innocent processes instead, and can deadlock the whole machine if the protected process is the one consuming everything.
  • Forgetting children inherit the adj. A child fork()ed from a process inherits its oom_score_adj. Protect a shell with −1000 and every command you run from it is protected too — usually not what you want.
  • Overcommit "never" surprises. Switching to vm.overcommit_memory = 2 makes fork() of a large process fail even when the child will immediately exec() a tiny program — the classic reason Redis recommends overcommit = 1.
  • The silent SIGKILL. Victims get signal 9, so they run no atexit handlers, flush no buffers, and write no crash log. A database killed mid-write relies entirely on its own write-ahead log to recover — the OOM killer offers no grace period.
  • Container limits vs node limits. A container can be OOM-killed by its cgroup memory.max while the host has gigabytes free. The kill log appears, but free -m on the node looks fine — the limit was local to the cgroup.
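When triaging kill logs, it helps to separate system-wide kills from cgroup-scoped ones. The sketch below parses the kill line quoted earlier and assumes that cgroup-scoped kills carry a "Memory cgroup out of memory" prefix; the exact wording differs between kernel versions, so treat the pattern, the helper name, and the second sample line as assumptions to adapt.

```python
import re

# Assumed message shapes; verify against your kernel's actual log output.
KILL_RE = re.compile(
    r"(?P<scope>Memory cgroup out of memory|Out of memory): "
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]*)\)"
)


def parse_oom_kill(line):
    """Return pid, command name and scope for an OOM kill line, or None."""
    m = KILL_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m.group("pid")),
        "comm": m.group("comm"),
        "cgroup_scoped": m.group("scope").startswith("Memory cgroup"),
    }


lines = [
    "Out of memory: Killed process 1234 (chrome)",
    "Memory cgroup out of memory: Killed process 4321 (python3)",  # hypothetical
    "usb 1-1: new high-speed USB device",
]
for line in lines:
    print(parse_oom_kill(line))
```

A cgroup-scoped hit is the cue to check that container's memory.max rather than the node's free memory.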

Frequently asked questions

Why does the OOM killer kill a process I never asked it to touch?

The killer doesn't target the process that requested the failing allocation — it targets whichever process has the highest badness score, which is roughly proportional to total resident memory. A small allocator can trigger the kill, but a huge, long-running memory hog like a database or a JVM is what actually dies, because killing it frees the most memory.

How is the OOM badness score calculated?

In modern kernels, badness = (RSS + swap + page-table pages) measured in pages, normalized to a 0–1000 scale against total RAM+swap, then oom_score_adj (−1000 to +1000) is added. The process with the highest final score is killed. The score is recomputed live during the OOM event, not cached.

What is oom_score_adj and how do I protect a process?

/proc/<pid>/oom_score_adj is a tunable bias from −1000 to +1000 added to the badness score. Set it to −1000 to make a process effectively unkillable (its score floors at 0 and the killer skips it); set it to +1000 to make it the preferred victim. systemd exposes this as the OOMScoreAdjust= unit directive.

What is memory overcommit and why does it cause OOM kills?

Linux lets processes allocate more virtual memory than physically exists, betting that most pages won't all be touched at once. malloc and fork succeed optimistically. The reckoning comes when code actually writes to those pages — if real memory is gone by then, the kernel can't fail the write, so it invokes the OOM killer instead.

Why does my server freeze before the OOM killer fires?

Before declaring OOM, the kernel reclaims pages by evicting clean file-backed cache and thrashing swap. When free RAM is nearly gone, almost every page touched is a fault that must be read back from disk, so the machine spends nearly all its time in I/O and reclaim. This 'livelock' can last minutes; userspace daemons like earlyoom or systemd-oomd kill earlier to avoid it.

What is the difference between the kernel OOM killer and systemd-oomd?

The kernel OOM killer is a reactive last resort that only fires when an allocation literally cannot be satisfied — by which point the system may already be thrashing. systemd-oomd and earlyoom run in userspace and kill proactively while there is still slack: systemd-oomd watches pressure-stall (PSI) metrics, while earlyoom polls free RAM and swap thresholds. Both trade a slightly earlier kill for a responsive machine.