seccomp Sandboxing

Shrink the kernel's attack surface to a syscall allow-list

seccomp is a Linux kernel feature that confines a process to an allow-list of system calls, so a compromised program can't open files, spawn shells, or touch the network unless the filter permits it. With a kill action as the default, the kernel terminates the process the moment it attempts a syscall outside its allow-list.

  • Introduced: Linux 2.6.12 (2005); BPF mode in 3.5 (2012)
  • Filter language: classic BPF (cBPF)
  • Per-syscall cost: ≈ 50–100 ns
  • Inspectable args: syscall # + 6 registers
  • Prerequisite: no_new_privs or CAP_SYS_ADMIN

The intuition: a doorman for system calls

A user-space program is mostly harmless on its own. It can do arithmetic, move bytes around its own memory, and loop forever — but it cannot delete a file, open a socket, or kill another process. Every one of those powers lives in the kernel and is reached through a system call: the program loads a number into a register (say 2 for open on x86-64), puts the arguments in other registers, and traps into the kernel.
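To make the "number in a register" idea concrete, here is a minimal sketch that invokes getpid by number through libc's generic syscall() wrapper. The helper name raw_getpid is illustrative, the two syscall numbers are taken from the kernel's per-architecture tables, and the sketch simply returns None on platforms it does not recognize.

```python
import ctypes
import os
import platform

# getpid's syscall number per architecture (from the kernel syscall tables).
GETPID_NR = {"x86_64": 39, "aarch64": 172}

def raw_getpid():
    """Invoke getpid by number via libc's syscall() wrapper; None if unsupported."""
    is_64bit = ctypes.sizeof(ctypes.c_void_p) == 8
    nr = GETPID_NR.get(platform.machine())
    if platform.system() != "Linux" or nr is None or not is_64bit:
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # The number goes in first; libc moves it into the syscall register and traps.
    return libc.syscall(ctypes.c_long(nr))

pid = raw_getpid()
print("raw syscall:", pid, "os.getpid():", os.getpid())
```

The same number would mean a different call on another architecture, which is why the table is keyed by machine type.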

That single chokepoint is the whole game. If an attacker exploits a bug in your image parser or your PDF renderer and gets code execution, the only way that code can hurt the rest of the machine is by making system calls. seccomp — short for secure computing mode — installs a doorman at that chokepoint. You hand the kernel a small program: "for syscall number N with arguments A, B, C, return ALLOW, or KILL, or fake an error." From then on, every syscall the process makes is checked by the doorman before the kernel does any work. A renderer that only needs read, write, mmap, and exit can be locked to exactly those four, and a hijacked renderer that tries execve("/bin/sh") is killed mid-syscall.

The mental model is a firewall, but for the kernel's API instead of the network. The point is not to make exploitation impossible — it's to shrink the attack surface. Linux exposes around 350 syscalls; a typical sandboxed worker needs 40–60. Forbidding the other ~300 removes ~300 doors an attacker could have rattled looking for a kernel bug to escalate through.

How it works: seccomp-BPF

The original 2005 mode (now called SECCOMP_MODE_STRICT) allowed exactly four syscalls — read, write, _exit, sigreturn — and nothing else. Too blunt to be useful for real programs. The modern, flexible mode arrived in Linux 3.5 (2012): seccomp-BPF, contributed largely by Will Drewry at Google for the Chrome sandbox.

The filter is a program written in classic BPF (cBPF) — the same tiny virtual machine that originally powered tcpdump packet filters, repurposed to filter syscalls instead of packets. On each syscall, the kernel populates a read-only struct and runs your BPF program against it:

struct seccomp_data {
    int   nr;                  /* syscall number          */
    __u32 arch;                /* AUDIT_ARCH_* of the ABI */
    __u64 instruction_pointer; /* where the call came from*/
    __u64 args[6];             /* the six register args   */
};
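As a quick check of the layout a filter indexes into, the struct can be mirrored with ctypes. This is only a sketch for inspecting offsets (the class name SeccompData is illustrative); a real filter reads these fields by byte offset with BPF load instructions.

```python
import ctypes

class SeccompData(ctypes.Structure):
    """Mirror of struct seccomp_data, used here only to inspect offsets."""
    _fields_ = [
        ("nr", ctypes.c_int32),
        ("arch", ctypes.c_uint32),
        ("instruction_pointer", ctypes.c_uint64),
        ("args", ctypes.c_uint64 * 6),
    ]

offsets = {name: getattr(SeccompData, name).offset
           for name, _ in SeccompData._fields_}
size = ctypes.sizeof(SeccompData)
print(offsets, size)
```

The offsets matter in practice: a filter loads the word at offset 4 to read arch and the word at offset 0 to read nr.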

The BPF program loads fields out of this struct, compares them, and ends by returning a 32-bit action. The kernel reads the top bits of that return value to decide what to do. The actions, from most to least severe:

  • SECCOMP_RET_KILL_PROCESS — kill the whole process immediately (since 4.14; KILL_THREAD kills just the thread).
  • SECCOMP_RET_TRAP — send SIGSYS; a handler can inspect and even emulate the call.
  • SECCOMP_RET_ERRNO — don't run the syscall; return the chosen errno (e.g. EPERM) as if it had failed.
  • SECCOMP_RET_USER_NOTIF — hand the call to a userspace supervisor over a notify fd (since 5.0).
  • SECCOMP_RET_TRACE — defer to an attached ptrace tracer.
  • SECCOMP_RET_LOG — allow but record it (audit-friendly).
  • SECCOMP_RET_ALLOW — let it through untouched.
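The split between action bits and data bits can be sketched as a small decoder. The constants below are the values from the kernel's seccomp UAPI header; the decode helper itself is illustrative and not part of any library.

```python
import errno

SECCOMP_RET_ACTION_FULL = 0xFFFF0000  # upper 16 bits: the action
SECCOMP_RET_DATA = 0x0000FFFF         # lower 16 bits: action-specific data

ACTIONS = {
    0x80000000: "KILL_PROCESS",
    0x00000000: "KILL_THREAD",
    0x00030000: "TRAP",
    0x00050000: "ERRNO",
    0x7FC00000: "USER_NOTIF",
    0x7FF00000: "TRACE",
    0x7FFC0000: "LOG",
    0x7FFF0000: "ALLOW",
}

def decode(ret):
    """Split a 32-bit filter return value into (action name, data)."""
    action = ACTIONS.get(ret & SECCOMP_RET_ACTION_FULL, "UNKNOWN")
    return action, ret & SECCOMP_RET_DATA

print(decode(0x00050000 | errno.EPERM))  # ERRNO carries the errno in the data bits
print(decode(0x7FFF0000))
```

For ERRNO the data bits carry the errno to return; for TRAP and TRACE they are passed along to the signal handler or tracer.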

Filters are installed with prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) or the newer seccomp(2) syscall. Critically, filters are stackable and one-way: you can add more filters, never remove one, and they're inherited across fork and execve. When multiple filters are installed, all of them run and the most restrictive action wins (KILL beats ERRNO beats ALLOW). That monotonicity is a security property: a child can never loosen what its parent locked down.
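One way to observe that inheritance is to read the Seccomp and NoNewPrivs fields the kernel reports in /proc/self/status. The helper below is a small sketch (its name is illustrative); it returns an empty dict where /proc is unavailable, and older kernels may not report both fields.

```python
from pathlib import Path

def seccomp_state(status_path="/proc/self/status"):
    """Return the Seccomp and NoNewPrivs fields of a process, if reported."""
    wanted = {"Seccomp", "NoNewPrivs"}
    state = {}
    try:
        text = Path(status_path).read_text()
    except OSError:
        return state  # not Linux, or /proc is not mounted
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            state[key] = int(value.strip())
    return state

# Seccomp: 0 = disabled, 1 = strict mode, 2 = filter mode.
print(seccomp_state())
```

Run inside a container started with a seccomp profile, this typically reports Seccomp as 2 even though the program itself never installed a filter.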

The decision rule, precisely

Because a classic BPF filter is just a linear sequence of instructions with forward-only jumps, it always halts, and its evaluation cost is bounded by its length: there are no loops. For a filter of m instructions the per-syscall check is O(m), and a hand-written allow-list that compares against k syscall numbers one at a time is O(k). (A smarter filter can binary-search a sorted syscall table in O(log k); libseccomp can generate such a tree when its optional binary-tree optimization is enabled.)
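The forward-only, bounded-cost behavior can be sketched with a toy interpreter for the three cBPF instructions a simple allow-list needs. This is a teaching model, not the kernel's implementation: the opcode values and AUDIT_ARCH constants are the real ones, while build_allow_list, pack_data, and run are hypothetical helper names.

```python
import struct

LD_W_ABS = 0x00 | 0x00 | 0x20   # BPF_LD | BPF_W | BPF_ABS
JEQ_K = 0x05 | 0x10 | 0x00      # BPF_JMP | BPF_JEQ | BPF_K
RET_K = 0x06 | 0x00             # BPF_RET | BPF_K

AUDIT_ARCH_X86_64 = 0xC000003E
AUDIT_ARCH_I386 = 0x40000003
RET_KILL_PROCESS = 0x80000000
RET_ALLOW = 0x7FFF0000

def build_allow_list(allowed_nrs):
    """Return (code, jt, jf, k) tuples: arch check, then one JEQ per syscall."""
    n = len(allowed_nrs)
    prog = [
        (LD_W_ABS, 0, 0, 4),                # A = data.arch (offset 4)
        (JEQ_K, 1, 0, AUDIT_ARCH_X86_64),   # on match, skip the kill below
        (RET_K, 0, 0, RET_KILL_PROCESS),
        (LD_W_ABS, 0, 0, 0),                # A = data.nr (offset 0)
    ]
    for i, nr in enumerate(allowed_nrs):
        prog.append((JEQ_K, n - i, 0, nr))  # on match, jump forward to RET ALLOW
    prog.append((RET_K, 0, 0, RET_KILL_PROCESS))
    prog.append((RET_K, 0, 0, RET_ALLOW))
    return prog

def pack_data(nr, arch, ip=0, args=(0,) * 6):
    """Pack a little-endian seccomp_data: nr, arch, instruction pointer, 6 args."""
    return struct.pack("<iIQ6Q", nr, arch, ip, *args)

def run(prog, data):
    """Execute the program; return (action, instructions executed)."""
    acc, pc, steps = 0, 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        steps += 1
        if code == LD_W_ABS:
            acc = struct.unpack_from("<I", data, k)[0]
        elif code == JEQ_K:
            pc += jt if acc == k else jf    # offsets are unsigned: forward only
        elif code == RET_K:
            return k, steps
        else:
            raise ValueError(f"unsupported opcode {code:#x}")

prog = build_allow_list([0, 1, 231])        # read, write, exit_group on x86-64
for nr, arch in [(0, AUDIT_ARCH_X86_64), (59, AUDIT_ARCH_X86_64), (0, AUDIT_ARCH_I386)]:
    action, steps = run(prog, pack_data(nr, arch))
    print(f"nr={nr} arch={arch:#x} -> action={action:#010x} in {steps} steps")
```

Because jump offsets can only add to the program counter, the step count can never exceed the program length, which is the O(m) bound described above.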

When several filters are layered, with filter i of length m_i, every syscall pays the sum:

action = max_severity( f_1(data), f_2(data), …, f_n(data) )
cost   = O(m_1 + m_2 + … + m_n)   per syscall

where max_severity picks the return value whose action bits are lowest when compared as a signed 32-bit integer: KILL_PROCESS = 0x80000000 (negative as a signed value, so strictest), KILL_THREAD = 0x00000000, ERRNO = 0x00050000, ALLOW = 0x7fff0000. Lower is stricter. The key invariant: the action is a pure function of the seccomp_data fields (nr, arch, instruction_pointer, args) only. It can't depend on global state, time, or, crucially, the contents of memory that the arguments point to. That single restriction is what makes seccomp fast and race-free, and it's also its biggest limitation, which we'll return to under bugs.
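The combination rule can be sketched in a few lines, assuming the kernel's convention of comparing the action bits as a signed 32-bit value (which is what makes KILL_PROCESS, 0x80000000, rank below everything else). The three example filters are hypothetical stand-ins for compiled BPF programs.

```python
import ctypes
import errno

KILL_PROCESS = 0x80000000
KILL_THREAD = 0x00000000
ERRNO = 0x00050000
ALLOW = 0x7FFF0000
ACTION_FULL = 0xFFFF0000

def severity_key(ret):
    """Action bits as a signed 32-bit integer: lower means stricter."""
    return ctypes.c_int32(ret & ACTION_FULL).value

def evaluate_stack(filters, data):
    """Run every filter and keep the strictest result."""
    return min((f(data) for f in filters), key=severity_key)

# Three layered filters (illustrative): a permissive base, one that soft-fails
# open (nr 2 on x86-64), and one that kills execve (nr 59).
allow_all = lambda d: ALLOW
deny_open = lambda d: (ERRNO | errno.EPERM) if d["nr"] == 2 else ALLOW
kill_exec = lambda d: KILL_PROCESS if d["nr"] == 59 else ALLOW
stack = [allow_all, deny_open, kill_exec]

for nr in (0, 2, 59):
    print(nr, hex(evaluate_stack(stack, {"nr": nr})))
```

Adding another filter to the stack can only lower the result for a given syscall, never raise it, which is the monotonicity property in code form.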

When to use seccomp (and when not to)

  • Untrusted input parsers. Image, font, video, and document decoders are the classic case — Chrome's renderer and many media pipelines run their parsers under tight seccomp filters.
  • Container and microVM hardening. Docker and Podman apply a default profile that blocks ~44 dangerous syscalls, and Kubernetes can apply the container runtime's default through the RuntimeDefault seccomp profile. gVisor and Firecracker wrap their entire monitor in seccomp.
  • systemd services. A one-line SystemCallFilter=@system-service in a unit file confines a daemon with no code changes.
  • Plugin / WASM-host isolation. Any process running third-party code benefits from a syscall ceiling.

seccomp is the wrong tool when you need to filter on which file or which host — that's path- and resource-level policy, the domain of cgroups, mount namespaces, capabilities, Landlock, or an LSM (AppArmor/SELinux). It's also wrong as a sole defense: it shrinks the kernel API but doesn't isolate the filesystem, network, or PID space. Use it as one layer in a defense-in-depth stack, never the only one.

seccomp vs other Linux sandboxing primitives

| | seccomp-BPF | Capabilities | Namespaces | cgroups | Landlock | SELinux/AppArmor |
|---|---|---|---|---|---|---|
| What it restricts | which syscalls run | which privileged ops a syscall may do | what the process can see (PIDs, mounts, net) | how many resources it may use | which files/dirs it may access | full mandatory access control |
| Granularity | syscall # + scalar args | 41 fixed capability bits | 8 namespace types | CPU, mem, I/O, pids… | filesystem paths | arbitrary labels & rules |
| Inspect pointer contents? | no (TOCTOU-safe) | n/a | n/a | n/a | yes (path-based) | yes |
| Per-call overhead | ~50–100 ns | negligible | setup-time only | accounting only | path-walk cost | policy-lookup cost |
| Unprivileged use | yes (with no_new_privs) | partly | user namespaces | delegated cgroup | yes | no (admin policy) |
| Primary goal | shrink kernel attack surface | drop ambient root powers | isolate the view | cap resource usage | confine file access | system-wide MAC |

The headline: these are complementary, not competing. A real container uses all of them at once — namespaces for isolation, cgroups for resource limits, dropped capabilities, a seccomp profile to cut the syscall surface, and often an LSM on top. seccomp's unique value is being the only one that filters at the granularity of individual syscalls.

What the numbers actually say

  • ~350 syscalls exist; a sandboxed worker needs 40–60. Docker's default profile allows roughly 300 and blocks ~44 dangerous ones (e.g. mount, reboot, ptrace, kexec_load, bpf). Aggressive profiles cut far deeper.
  • ~50–100 ns added per syscall for a short filter of a few BPF instructions. On a syscall-heavy workload (millions/sec) this is measurable; on a typical service it's noise. Since Linux 5.11, a constant-action bitmap fast path skips the BPF run for syscalls the filter always allows regardless of arguments, making those checks effectively free.
  • One mistake = 100% of the protection lost. A 2014 study of real-world filters and the recurring "32-bit ABI bypass" show that a single forgotten arch check or an allowed ptrace/socket can hand an attacker a clean path out. seccomp's value is binary per-syscall but brittle per-filter.
  • Filters are inherited and irrevocable. Zero runtime cost to keep them after fork/exec, but you must install everything before running untrusted code — there is no "undo."
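A back-of-the-envelope calculation shows where "measurable" turns into "noise". This sketch just multiplies the syscall rate by the 50–100 ns range quoted above; the function name is illustrative and the rates are example inputs, not measurements.

```python
def filter_overhead_fraction(syscalls_per_sec, ns_per_check):
    """Fraction of one CPU core spent evaluating the filter."""
    return syscalls_per_sec * ns_per_check * 1e-9

for rate in (10_000, 1_000_000):
    low = filter_overhead_fraction(rate, 50)
    high = filter_overhead_fraction(rate, 100)
    print(f"{rate:>9,} syscalls/s: {low:.2%} to {high:.2%} of one core")
```

At a million syscalls per second the filter costs on the order of 5 to 10 percent of a core; at ten thousand per second it is a small fraction of one percent.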

A filter evaluator in JavaScript

You don't write seccomp filters in JavaScript, but modeling the BPF evaluator makes the mechanism concrete. Here we represent the policy as a list of rules and reproduce the "most-restrictive-wins, syscall + arg only" semantics of the kernel:

// Actions ordered by severity (lower index = stricter, like the kernel's return-value ordering)
const KILL = 0, TRAP = 1, ERRNO = 2, ALLOW = 3;

const EXPECTED_ARCH = "x86_64";

// A filter is an ordered list of rules. First match wins, like cBPF's forward jumps.
function makeFilter(rules, defaultAction = KILL) {
  return function evaluate({ nr, arch, args }) {
    // RULE ZERO: pin the architecture, or the syscall number is meaningless.
    if (arch !== EXPECTED_ARCH) return KILL;
    for (const r of rules) {
      if (r.nr !== nr) continue;
      // Optional scalar-argument predicate (can't read pointed-to memory!).
      if (r.argCheck && !r.argCheck(args)) continue;
      return r.action;
    }
    return defaultAction;            // closed by default: deny everything unlisted
  };
}

// Layered filters: every filter runs, the STRICTEST action wins.
function evaluateStack(filters, data) {
  return filters.reduce((acc, f) => Math.min(acc, f(data)), ALLOW);
}

// Example: a renderer that may read/write/mmap, may exit, and may call
// fcntl ONLY with F_GETFL (cmd === 3). Everything else is killed.
const SYS = { read: 0, write: 1, mmap: 9, fcntl: 72, exit_group: 231, execve: 59 };
const renderer = makeFilter([
  { nr: SYS.read,       action: ALLOW },
  { nr: SYS.write,      action: ALLOW },
  { nr: SYS.mmap,       action: ALLOW },
  { nr: SYS.fcntl,      action: ALLOW, argCheck: (a) => a[1] === 3 },
  { nr: SYS.exit_group, action: ALLOW },
]);

console.log(renderer({ nr: SYS.read,   arch: "x86_64", args: [3, 0, 4096] })); // 3 ALLOW
console.log(renderer({ nr: SYS.execve, arch: "x86_64", args: [] }));           // 0 KILL
console.log(renderer({ nr: SYS.fcntl,  arch: "x86_64", args: [3, 4] }));       // 0 KILL (cmd 4 = F_SETFL)
console.log(renderer({ nr: SYS.read,   arch: "x86",    args: [] }));           // 0 KILL (wrong ABI!)

Two details mirror the real kernel exactly. First, the arch check is rule zero — drop it and the whole filter is bypassable by switching ABIs. Second, argCheck only ever inspects scalar register values (a[1]), never the bytes a pointer points at — that's the TOCTOU-safety boundary that classic seccomp cannot cross.

A real, working filter in Python

This runs on Linux using pyseccomp (a ctypes-based binding whose interface mirrors libseccomp's official Python bindings; pip install pyseccomp, with the libseccomp shared library installed). It builds a closed-by-default allow-list, demonstrates an argument-level rule, then proves the sandbox by attempting a forbidden call:

import errno
import os
import socket  # imported BEFORE the sandbox: a late import needs stat/openat calls
import pyseccomp as seccomp

def install_sandbox():
    # Default action for anything NOT explicitly allowed: kill the process.
    f = seccomp.SyscallFilter(defaction=seccomp.KILL_PROCESS)

    # The bare minimum to print and exit cleanly.
    for name in ("write", "read", "exit", "exit_group",
                 "rt_sigreturn", "brk", "mmap", "munmap", "fstat"):
        f.add_rule(seccomp.ALLOW, name)

    # Argument-level rule: allow fcntl ONLY when cmd == F_GETFL (3).
    # Note we compare a SCALAR register, never a pointer's contents.
    f.add_rule(seccomp.ALLOW, "fcntl",
               seccomp.Arg(1, seccomp.EQ, 3))

    # Let openat fail softly with EPERM instead of killing: handy when a
    # library probes for an optional file and can cope with failure.
    f.add_rule(seccomp.ERRNO(errno.EPERM), "openat")

    f.load()   # one-way: from here on the kernel enforces the filter

if __name__ == "__main__":
    # libseccomp sets no_new_privs before loading by default, so this works
    # for a process without CAP_SYS_ADMIN.
    install_sandbox()
    # flush=True so buffered output is not lost when the process is killed.
    print("inside the sandbox: write() is allowed", flush=True)

    # Soft-denied: returns -1/EPERM rather than crashing.
    try:
        os.open("/etc/passwd", os.O_RDONLY)
    except PermissionError:
        print("openat blocked with EPERM, as configured", flush=True)

    # Hard-denied: socket() is not on the allow-list -> SIGSYS, process dies here.
    socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # never returns
    print("this line is unreachable")

Run it and you'll see the two allowed lines print, then the process is killed by SIGSYS at the socket() call — the final print never executes. libseccomp quietly handles the multi-arch and BPF-codegen pitfalls for you, which is why hand-rolling raw cBPF is discouraged outside of teaching.
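From the outside, a parent process can tell a seccomp kill apart from an ordinary exit by checking for death by SIGSYS. The sketch below assumes nothing about pyseccomp: as a stand-in for the sandboxed script, the child simply sends SIGSYS to itself, and run_and_classify is an illustrative helper name.

```python
import signal
import subprocess
import sys

# Stand-in for the sandboxed script: the child raises SIGSYS on itself, the
# same signal a process is terminated with when a KILL filter rejects a call.
CHILD = "import os, signal; os.kill(os.getpid(), signal.SIGSYS)"

def run_and_classify(argv):
    """Run a command and report whether it died from SIGSYS."""
    proc = subprocess.run(argv)
    if proc.returncode == -signal.SIGSYS:
        return "killed by SIGSYS (possible seccomp violation)"
    return f"exited with status {proc.returncode}"

verdict = run_and_classify([sys.executable, "-c", CHILD])
print(verdict)
```

subprocess reports death by signal as a negative return code, so a test harness can assert on it rather than scraping output.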

Variants and the wider ecosystem

seccomp strict mode. The 2005 original: prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT), four syscalls, no configuration. Still used by some compute-only sandboxes that genuinely need nothing else.

User notification (seccomp_unotify). Since Linux 5.0, SECCOMP_RET_USER_NOTIF parks the target on a forbidden call and hands a supervisor process the chance to inspect, emulate, or fulfill it — even reading the target's memory via /proc/pid/mem to safely check a path. This is how some runtimes emulate syscalls (e.g. brokered file opens) without granting them.

libseccomp. The portable C library (with bindings for Python, Go, Rust) that compiles a high-level allow/deny list into multi-arch cBPF. By default it emits a priority-ordered linear chain of comparisons; since version 2.5 it can optionally emit a binary tree sorted by syscall number for O(log k) dispatch. Almost everyone uses this rather than raw BPF.
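The difference between linear and logarithmic dispatch is easy to see by counting comparisons. This is a plain-Python model of the two lookup strategies, not libseccomp's generated BPF; the table contents and function names are made up for illustration.

```python
def linear_lookup(table, nr):
    """Compare against each entry in turn; return (found, comparisons)."""
    comparisons = 0
    for candidate in table:
        comparisons += 1
        if candidate == nr:
            return True, comparisons
    return False, comparisons

def binary_lookup(sorted_table, nr):
    """Lower-bound binary search; return (found, ordering comparisons)."""
    lo, hi, comparisons = 0, len(sorted_table), 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_table[mid] < nr:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_table) and sorted_table[lo] == nr
    return found, comparisons

table = list(range(0, 120, 2))   # 60 hypothetical allowed syscall numbers
miss = 999                       # worst case for the linear scan
print("linear:", linear_lookup(table, miss))
print("binary:", binary_lookup(table, miss))
```

For a 60-entry allow-list the linear scan needs up to 60 comparisons on a miss, while the binary search needs about six.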

Landlock. A newer (Linux 5.13, 2021) unprivileged LSM that does what seccomp cannot — restrict access by filesystem path — and is meant to sit alongside seccomp, not replace it.

BPF LSM and eBPF. seccomp uses classic BPF; the modern extended BPF powers richer, programmable security hooks via the BPF LSM, but seccomp deliberately sticks with the simpler, easier-to-audit cBPF on the hot syscall path.

Common bugs and edge cases

  • Forgetting the architecture check. The number-to-syscall mapping differs between x86-64 and x86-32. A filter that validates only 64-bit numbers is bypassed by issuing the 32-bit ABI variant. Always reject any arch that isn't your expected AUDIT_ARCH value as rule zero.
  • Trying to filter on pointer contents. You cannot safely read the path passed to open in a cBPF filter — between your check and the syscall another thread can overwrite that memory (TOCTOU). Use seccomp_unotify or Landlock for path policy.
  • Allowing the wrong "harmless" syscall. ptrace lets a process commandeer a sibling; socket reopens the network; clone with CLONE_NEWUSER creates a user namespace in which the process holds a full set of capabilities, exposing far more kernel code. Audit the allow-list for escalation primitives.
  • Multiplexed syscalls. On 32-bit x86, socketcall and ipc bundle many operations behind one number, so a coarse rule lets all of them through. Newer kernels expose the individual calls; prefer those.
  • Installing the filter too late. Filters apply only to syscalls made after load(). Initialize everything — open log files, dlopen libraries, resolve DNS — first, then drop into the sandbox before touching untrusted input.
  • Killing on a syscall the C library needs. glibc may call rt_sigreturn, futex, brk, or newer variants like clock_gettime64 you didn't anticipate; an over-tight KILL filter crashes on innocuous internal calls. Test under the real libc and consider SECCOMP_RET_LOG first to discover the true set.
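The architecture bug at the top of this list can be reproduced in a toy model. The syscall numbers below are real excerpts of the x86-64 and i386 tables; the two filter functions are hypothetical and stand in for compiled BPF.

```python
AUDIT_ARCH_X86_64 = 0xC000003E
AUDIT_ARCH_I386 = 0x40000003

# Excerpts of the per-architecture syscall tables.
X86_64_TABLE = {0: "read", 1: "write", 9: "mmap", 11: "munmap", 59: "execve"}
I386_TABLE = {3: "read", 4: "write", 11: "execve", 90: "mmap", 91: "munmap"}

ALLOWED_X86_64 = {0, 1, 9, 11}   # read, write, mmap, munmap

def naive_filter(nr, arch):
    """Buggy: trusts the number without checking which ABI it belongs to."""
    return "ALLOW" if nr in ALLOWED_X86_64 else "KILL"

def pinned_filter(nr, arch):
    """Rule zero first: reject every architecture except the expected one."""
    if arch != AUDIT_ARCH_X86_64:
        return "KILL"
    return "ALLOW" if nr in ALLOWED_X86_64 else "KILL"

# An attacker switches to the 32-bit ABI and issues syscall 11.
nr, arch = 11, AUDIT_ARCH_I386
print(I386_TABLE[nr], "via naive filter: ", naive_filter(nr, arch))
print(I386_TABLE[nr], "via pinned filter:", pinned_filter(nr, arch))
```

Number 11 is munmap on x86-64 but execve on i386, so an allow-list that permits munmap without an arch check permits 32-bit execve as well.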

Frequently asked questions

What's the difference between SECCOMP_RET_KILL and SECCOMP_RET_ERRNO?

SECCOMP_RET_KILL_PROCESS terminates the whole process, as though by SIGSYS, the instant it makes a forbidden call, which is the safest default. The older name SECCOMP_RET_KILL is an alias for SECCOMP_RET_KILL_THREAD and stops only the calling thread. ERRNO lets the call return as if it failed with a chosen error like EPERM, so the program keeps running, which is useful when a library does an optional syscall you'd rather see fail gracefully than crash on. There's also TRAP (raise SIGSYS so a handler can fake a result), USER_NOTIF (hand off to a userspace supervisor), TRACE (hand off to a ptrace supervisor), LOG (record and allow), and ALLOW.

Can a seccomp filter inspect the contents of a pointer argument, like a file path?

No. A classic seccomp-bpf filter can only read the syscall number and the six register arguments. It cannot dereference a pointer, because the value could be changed by another thread between the check and the call (a TOCTOU race). To filter on a path you need seccomp_unotify (SECCOMP_RET_USER_NOTIF), where a supervisor process reads the target's memory through /proc/pid/mem, confirms the notification is still valid, and performs the operation on the target's behalf rather than letting the original call continue, or a fuller mechanism like Landlock or an LSM.

How much does a seccomp filter slow a program down?

A short filter adds roughly 50–100 nanoseconds per syscall, the cost of running a handful of BPF instructions on entry. The classic filter is a linear list, so cost grows with the number of comparisons; long filters that compare hundreds of syscall numbers can dominate. Since Linux 5.11 a constant-action bitmap fast path lets the kernel skip BPF entirely for syscalls that the filter always allows regardless of arguments, making the common case nearly free.

Why does adding a seccomp filter require either root or no_new_privs?

Without it, an unprivileged process could install a filter that makes setuid return 0 without running (ERRNO with a value of 0), then exec a setuid-root binary that carries on with root privileges because it believes dropping them succeeded. Setting PR_SET_NO_NEW_PRIVS first guarantees no exec can ever gain privileges, closing that escalation path, so the kernel then lets any process install a filter.

Is seccomp a complete sandbox on its own?

No — it's one layer. seccomp restricts which syscalls are reachable but says nothing about which files, networks, or users a permitted syscall can touch. Real sandboxes (Docker, gVisor, Chrome, systemd units) stack seccomp with namespaces, cgroups, capabilities, mount restrictions, and an LSM like AppArmor or SELinux. seccomp's job is to shrink the kernel attack surface — the number of syscalls an attacker can use to find a kernel bug.

What's the most common bug when writing a seccomp filter by hand?

Forgetting that the syscall number alone is meaningless without checking the architecture. The same number means different syscalls on x86-64 vs x86-32, so an attacker on a multi-arch kernel can flip to the 32-bit ABI and slip past a filter that only validated 64-bit numbers. Every robust filter starts by checking data.arch and killing anything that isn't the expected AUDIT_ARCH value.