seccomp Sandboxing
Shrink the kernel's attack surface to a syscall allow-list
seccomp is a Linux kernel feature that confines a process to a whitelist of system calls, so a compromised program can't open files, spawn shells, or touch the network — the kernel kills it the moment it tries a syscall outside its allow-list.
- Introduced: Linux 2.6.12 (2005); BPF mode in 3.5 (2012)
- Filter language: classic BPF (cBPF)
- Per-syscall cost: ≈ 50–100 ns
- Inspectable args: syscall # + 6 registers
- Prerequisite: no_new_privs or CAP_SYS_ADMIN
The intuition: a doorman for system calls
A user-space program is mostly harmless on its own. It can do arithmetic, move bytes around its own memory, and loop forever — but it cannot delete a file, open a socket, or kill another process. Every one of those powers lives in the kernel and is reached through a system call: the program loads a number into a register (say 2 for open on x86-64), puts the arguments in other registers, and traps into the kernel.
That single chokepoint is the whole game. If an attacker exploits a bug in your image parser or your PDF renderer and gets code execution, the only way that code can hurt the rest of the machine is by making system calls. seccomp — short for secure computing mode — installs a doorman at that chokepoint. You hand the kernel a small program: "for syscall number N with arguments A, B, C, return ALLOW, or KILL, or fake an error." From then on, every syscall the process makes is checked by the doorman before the kernel does any work. A renderer that only needs read, write, mmap, and exit can be locked to exactly those four, and a hijacked renderer that tries execve("/bin/sh") is killed mid-syscall.
The mental model is a firewall, but for the kernel's API instead of the network. The point is not to make exploitation impossible — it's to shrink the attack surface. Linux exposes around 350 syscalls; a typical sandboxed worker needs 40–60. Forbidding the other ~300 removes ~300 doors an attacker could have rattled looking for a kernel bug to escalate through.
How it works: seccomp-BPF
The original 2005 mode (now called SECCOMP_MODE_STRICT) allowed exactly four syscalls — read, write, _exit, sigreturn — and nothing else. Too blunt to be useful for real programs. The modern, flexible mode arrived in Linux 3.5 (2012): seccomp-BPF, contributed largely by Will Drewry at Google for the Chrome sandbox.
The filter is a program written in classic BPF (cBPF) — the same tiny virtual machine that originally powered tcpdump packet filters, repurposed to filter syscalls instead of packets. On each syscall, the kernel populates a read-only struct and runs your BPF program against it:
struct seccomp_data {
    int   nr;                    /* syscall number */
    __u32 arch;                  /* AUDIT_ARCH_* of the ABI */
    __u64 instruction_pointer;   /* where the call came from */
    __u64 args[6];               /* the six register args */
};
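To make the layout concrete, here is a minimal Python sketch that mirrors the struct with ctypes and prints the byte offsets a filter's load instructions would use. The class name `SeccompData` and the `OFFSETS` dictionary are illustrative names, not part of any kernel or library API.

```python
import ctypes


class SeccompData(ctypes.Structure):
    """Illustrative mirror of the kernel's struct seccomp_data."""
    _fields_ = [
        ("nr", ctypes.c_int),                      # syscall number
        ("arch", ctypes.c_uint32),                 # AUDIT_ARCH_* value
        ("instruction_pointer", ctypes.c_uint64),  # where the call came from
        ("args", ctypes.c_uint64 * 6),             # six register arguments
    ]


# Byte offset of each field, as a cBPF load instruction would address it.
OFFSETS = {name: getattr(SeccompData, name).offset
           for name, _ in SeccompData._fields_}

print(OFFSETS)                     # {'nr': 0, 'arch': 4, 'instruction_pointer': 8, 'args': 16}
print(ctypes.sizeof(SeccompData))  # 64
```

Because cBPF loads 32-bit words, a filter that checks a 64-bit argument has to load and compare its two halves separately.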
The BPF program loads fields out of this struct, compares them, and ends by returning a 32-bit action. The kernel reads the top bits of that return value to decide what to do. The actions, from most to least severe:
- SECCOMP_RET_KILL_PROCESS: kill the whole process immediately (since 4.14; KILL_THREAD kills just the thread).
- SECCOMP_RET_TRAP: send SIGSYS; a handler can inspect and even emulate the call.
- SECCOMP_RET_ERRNO: don't run the syscall; return the chosen errno (e.g. EPERM) as if it had failed.
- SECCOMP_RET_USER_NOTIF: hand the call to a userspace supervisor over a notify fd (since 5.0).
- SECCOMP_RET_TRACE: defer to an attached ptrace tracer.
- SECCOMP_RET_LOG: allow but record it (audit-friendly).
- SECCOMP_RET_ALLOW: let it through untouched.
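As a rough illustration of how a 32-bit return value splits into an action and its data, the sketch below decodes a few values. The numeric constants are taken from `<linux/seccomp.h>`; the `decode` helper and the `ACTIONS` table are hypothetical names used only for this example.

```python
import errno

# Masks and action values as defined in <linux/seccomp.h>.
SECCOMP_RET_ACTION_FULL = 0xFFFF0000  # top 16 bits: the action
SECCOMP_RET_DATA = 0x0000FFFF         # low 16 bits: action-specific data

ACTIONS = {
    0x80000000: "KILL_PROCESS",
    0x00000000: "KILL_THREAD",
    0x00030000: "TRAP",
    0x00050000: "ERRNO",
    0x7FC00000: "USER_NOTIF",
    0x7FF00000: "TRACE",
    0x7FFC0000: "LOG",
    0x7FFF0000: "ALLOW",
}


def decode(ret):
    """Split a filter return value into (action name, data)."""
    action = ret & SECCOMP_RET_ACTION_FULL
    data = ret & SECCOMP_RET_DATA
    return ACTIONS.get(action, "UNKNOWN"), data


print(decode(0x00050000 | errno.EPERM))  # ('ERRNO', 1)
print(decode(0x7FFF0000))                # ('ALLOW', 0)
```

For ERRNO the data field carries the errno to return, which is why a single return value can express "fail this call with EPERM".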
Filters are installed with prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) or the newer seccomp(2) syscall. Critically, filters are stackable and one-way: you can add more filters, never remove one, and they're inherited across fork and execve. When multiple filters are installed, all of them run and the most restrictive action wins (KILL beats ERRNO beats ALLOW). That monotonicity is a security property: a child can never loosen what its parent locked down.
The decision rule, precisely
Because a classic BPF filter is just a linear sequence of instructions with forward-only jumps, it always halts, and its evaluation cost is bounded by its length: there are no loops. For a filter of m instructions the per-syscall check is O(m), and a hand-written allow-list that compares against k syscall numbers one at a time is O(k). (A smarter filter can binary-search a sorted syscall table in O(log k); libseccomp can generate such a tree when its binary-tree optimization is enabled.)
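The halting and O(m) claims are easy to see in a toy interpreter. The sketch below is a teaching model, not the kernel's implementation: it supports only three cBPF opcodes (values from the kernel headers), assumes little-endian packing as on x86-64, and the `FILTER` program, `pack_seccomp_data`, and `run` are names invented for this example.

```python
import struct

# Opcode values from the kernel's BPF headers.
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS: load the 32-bit word at offset k
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K: skip jt (match) or jf (no match)
BPF_RET_K = 0x06     # BPF_RET | BPF_K: return the constant k

AUDIT_ARCH_X86_64 = 0xC000003E
AUDIT_ARCH_I386 = 0x40000003
RET_KILL_PROCESS = 0x80000000
RET_ALLOW = 0x7FFF0000

# Each instruction is (code, jt, jf, k). Allows read, write, exit_group on x86-64.
FILTER = [
    (BPF_LD_W_ABS, 0, 0, 4),                  # 0: load arch
    (BPF_JEQ_K, 1, 0, AUDIT_ARCH_X86_64),     # 1: expected arch? skip the kill
    (BPF_RET_K, 0, 0, RET_KILL_PROCESS),      # 2: wrong ABI
    (BPF_LD_W_ABS, 0, 0, 0),                  # 3: load syscall number
    (BPF_JEQ_K, 3, 0, 0),                     # 4: read       -> 8
    (BPF_JEQ_K, 2, 0, 1),                     # 5: write      -> 8
    (BPF_JEQ_K, 1, 0, 231),                   # 6: exit_group -> 8
    (BPF_RET_K, 0, 0, RET_KILL_PROCESS),      # 7: default deny
    (BPF_RET_K, 0, 0, RET_ALLOW),             # 8: allow
]


def pack_seccomp_data(nr, arch, ip=0, args=()):
    """Pack the fields the way struct seccomp_data is laid out (64 bytes)."""
    padded = tuple(args) + (0,) * (6 - len(args))
    return struct.pack("<iIQ6Q", nr, arch, ip, *padded)


def run(prog, data):
    """Execute prog against data; return (return value, instructions executed)."""
    acc, pc, steps = 0, 0, 0
    while pc < len(prog):
        code, jt, jf, k = prog[pc]
        steps += 1
        pc += 1                      # jumps are relative to the next instruction
        if code == BPF_LD_W_ABS:
            acc = struct.unpack_from("<I", data, k)[0]
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf   # offsets are unsigned: forward only
        elif code == BPF_RET_K:
            return k, steps
        else:
            raise ValueError(f"unsupported opcode {code:#x}")
    raise ValueError("program ended without a return")


for nr, arch in [(0, AUDIT_ARCH_X86_64), (59, AUDIT_ARCH_X86_64), (3, AUDIT_ARCH_I386)]:
    action, steps = run(FILTER, pack_seccomp_data(nr, arch))
    print(hex(action), steps)
# 0x7fff0000 5
# 0x80000000 7
# 0x80000000 3
```

Since the program counter only ever increases, no run can execute more instructions than the program contains, which is the bound the paragraph above relies on.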
When several filters are layered, with filter i of length m_i, every syscall pays the sum:
action = max_severity( f_1(data), f_2(data), …, f_n(data) )
cost = O(m_1 + m_2 + … + m_n) per syscall
where max_severity takes the numerically smallest action value, compared as a signed 32-bit integer (KILL_PROCESS = 0x80000000, which is negative when signed; KILL_THREAD = 0x00000000; ERRNO = 0x0005…; ALLOW = 0x7fff0000; lower is stricter). The key invariant: the action is a pure function of the seccomp_data fields (nr, arch, instruction_pointer, args) only. It can't depend on global state, time, or, crucially, the contents of memory that the arguments point to. That single restriction is what makes seccomp fast and race-free, and it's also its biggest limitation, which we'll return to under bugs.
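The combination rule can be sketched in a few lines of Python. This is a model under stated assumptions: the list is ordered as the kernel runs filters (most recently installed first), ties keep the first value seen, and `as_signed32` and `combine` are illustrative helper names.

```python
SECCOMP_RET_ACTION_FULL = 0xFFFF0000

RET_KILL_PROCESS = 0x80000000
RET_KILL_THREAD = 0x00000000
RET_ERRNO = 0x00050000
RET_ALLOW = 0x7FFF0000


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed one."""
    return value - (1 << 32) if value & 0x80000000 else value


def combine(returns):
    """Return the winning value: smallest action when compared as signed 32-bit.

    min() keeps the first of several equal actions, so the data field of the
    first-seen filter survives a tie.
    """
    return min(returns, key=lambda r: as_signed32(r & SECCOMP_RET_ACTION_FULL))


print(hex(combine([RET_ALLOW, RET_ERRNO | 1, RET_ALLOW])))          # 0x50001
print(hex(combine([RET_ALLOW, RET_KILL_THREAD, RET_KILL_PROCESS]))) # 0x80000000
```

The signed comparison is what lets KILL_PROCESS, whose raw value has the top bit set, still rank as the strictest action.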
When to use seccomp (and when not to)
- Untrusted input parsers. Image, font, video, and document decoders are the classic case — Chrome's renderer and many media pipelines run their parsers under tight seccomp filters.
- Container and microVM hardening. Docker and Podman apply a default profile that blocks ~44 dangerous syscalls, and Kubernetes can apply the container runtime's default through its RuntimeDefault seccomp profile. gVisor and Firecracker wrap their entire monitor in seccomp.
- systemd services. A one-line SystemCallFilter=@system-service in a unit file confines a daemon with no code changes.
- Plugin / WASM-host isolation. Any process running third-party code benefits from a syscall ceiling.
seccomp is the wrong tool when you need to filter on which file or which host — that's path- and resource-level policy, the domain of cgroups, mount namespaces, capabilities, Landlock, or an LSM (AppArmor/SELinux). It's also wrong as a sole defense: it shrinks the kernel API but doesn't isolate the filesystem, network, or PID space. Use it as one layer in a defense-in-depth stack, never the only one.
seccomp vs other Linux sandboxing primitives
| | seccomp-BPF | Capabilities | Namespaces | cgroups | Landlock | SELinux/AppArmor |
|---|---|---|---|---|---|---|
| What it restricts | which syscalls run | which privileged ops a syscall may do | what the process can see (PIDs, mounts, net) | how many resources it may use | which files/dirs it may access | full mandatory access control |
| Granularity | syscall # + scalar args | 41 fixed capability bits | 8 namespace types | CPU, mem, I/O, pids… | filesystem paths | arbitrary labels & rules |
| Inspect pointer contents? | no (TOCTOU-safe) | n/a | n/a | n/a | yes (path-based) | yes |
| Per-call overhead | ~50–100 ns | negligible | setup-time only | accounting only | path-walk cost | policy-lookup cost |
| Unprivileged use | yes (with no_new_privs) | partly | user namespaces | delegated cgroup | yes | no (admin policy) |
| Primary goal | shrink kernel attack surface | drop ambient root powers | isolate the view | cap resource usage | confine file access | system-wide MAC |
The headline: these are complementary, not competing. A real container uses all of them at once — namespaces for isolation, cgroups for resource limits, dropped capabilities, a seccomp profile to cut the syscall surface, and often an LSM on top. seccomp's unique value is being the only one that filters at the granularity of individual syscalls.
What the numbers actually say
- ~350 syscalls exist; a sandboxed worker needs 40–60. Docker's default profile allows roughly 300 and blocks ~44 dangerous ones (e.g. mount, reboot, ptrace, kexec_load, bpf). Aggressive profiles cut far deeper.
- ~50–100 ns added per syscall for a short filter of a few BPF instructions. On a syscall-heavy workload (millions/sec) this is measurable; on a typical service it's noise. The Linux 5.11 constant-action bitmap fast-path makes syscalls that a filter always allows, regardless of arguments, effectively free.
- One mistake = 100% of the protection lost. A 2014 study of real-world filters and the recurring "32-bit ABI bypass" show that a single forgotten arch check or an allowed ptrace/socket can hand an attacker a clean path out. seccomp's value is binary per-syscall but brittle per-filter.
- Filters are inherited and irrevocable. Zero runtime cost to keep them after fork/exec, but you must install everything before running untrusted code; there is no "undo."
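To put the per-syscall figure in perspective, here is a back-of-the-envelope calculation. The two workload rates are hypothetical round numbers chosen for illustration, and 100 ns is the upper end of the range quoted above.

```python
def filter_overhead_fraction(syscalls_per_second, ns_per_check):
    """Fraction of one CPU core spent evaluating the filter."""
    return syscalls_per_second * ns_per_check * 1e-9


# Hypothetical workloads: a syscall-heavy proxy and an ordinary service.
heavy = filter_overhead_fraction(2_000_000, 100)
typical = filter_overhead_fraction(5_000, 100)

print(f"{heavy:.2%}")    # 20.00%
print(f"{typical:.2%}")  # 0.05%
```

At two million syscalls per second the filter alone would consume about a fifth of a core, while at a few thousand per second it disappears into the noise.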
A filter evaluator in JavaScript
You don't write seccomp filters in JavaScript, but modeling the BPF evaluator makes the mechanism concrete. Here we represent the policy as a list of rules and reproduce the "most-restrictive-wins, syscall + arg only" semantics of the kernel:
// Actions ordered by severity (lower index = stricter, like the kernel's return-value ordering)
const KILL = 0, TRAP = 1, ERRNO = 2, ALLOW = 3;
const EXPECTED_ARCH = "x86_64";

// A filter is an ordered list of rules. First match wins, like cBPF's forward jumps.
function makeFilter(rules, defaultAction = KILL) {
  return function evaluate({ nr, arch, args }) {
    // RULE ZERO: pin the architecture, or the syscall number is meaningless.
    if (arch !== EXPECTED_ARCH) return KILL;
    for (const r of rules) {
      if (r.nr !== nr) continue;
      // Optional scalar-argument predicate (can't read pointed-to memory!).
      if (r.argCheck && !r.argCheck(args)) continue;
      return r.action;
    }
    return defaultAction; // closed by default: deny everything unlisted
  };
}

// Layered filters: every filter runs, the STRICTEST action wins.
function evaluateStack(filters, data) {
  return filters.reduce((acc, f) => Math.min(acc, f(data)), ALLOW);
}

// Example: a renderer that may read/write/mmap, may exit, and may call
// fcntl ONLY with F_GETFL (cmd === 3). Everything else is killed.
const SYS = { read: 0, write: 1, mmap: 9, fcntl: 72, exit_group: 231, execve: 59 };
const renderer = makeFilter([
  { nr: SYS.read, action: ALLOW },
  { nr: SYS.write, action: ALLOW },
  { nr: SYS.mmap, action: ALLOW },
  { nr: SYS.fcntl, action: ALLOW, argCheck: (a) => a[1] === 3 },
  { nr: SYS.exit_group, action: ALLOW },
]);

console.log(renderer({ nr: SYS.read, arch: "x86_64", args: [3, 0, 4096] })); // 3 ALLOW
console.log(renderer({ nr: SYS.execve, arch: "x86_64", args: [] }));         // 0 KILL
console.log(renderer({ nr: SYS.fcntl, arch: "x86_64", args: [3, 4] }));      // 0 KILL (cmd 4 = F_SETFL)
console.log(renderer({ nr: SYS.read, arch: "x86", args: [] }));              // 0 KILL (wrong ABI!)
Two details mirror the real kernel exactly. First, the arch check is rule zero — drop it and the whole filter is bypassable by switching ABIs. Second, argCheck only ever inspects scalar register values (a[1]), never the bytes a pointer points at — that's the TOCTOU-safety boundary that classic seccomp cannot cross.
A real, working filter in Python
This runs on Linux using pyseccomp (a ctypes-based interface to libseccomp whose API follows libseccomp's own Python bindings; pip install pyseccomp). It builds a closed-by-default allow-list, demonstrates an argument-level rule, then proves the sandbox by attempting a forbidden call:
import errno
import os
import socket  # imported BEFORE load(): importing a module needs file syscalls the filter forbids
import pyseccomp as seccomp

def install_sandbox():
    # Default action for anything NOT explicitly allowed: kill the process.
    f = seccomp.SyscallFilter(defaction=seccomp.KILL_PROCESS)
    # The bare minimum to print and exit cleanly. The exact set can vary with
    # the Python and libc versions, so treat this list as a starting point.
    for name in ("write", "read", "exit", "exit_group",
                 "rt_sigreturn", "brk", "mmap", "munmap", "fstat"):
        f.add_rule(seccomp.ALLOW, name)
    # Argument-level rule: allow fcntl ONLY when cmd == F_GETFL (3).
    # Note we compare a SCALAR register, never a pointer's contents.
    f.add_rule(seccomp.ALLOW, "fcntl",
               seccomp.Arg(1, seccomp.EQ, 3))
    # Let openat fail softly with EPERM instead of killing; handy when a
    # library probes for an optional file and can cope with failure.
    f.add_rule(seccomp.ERRNO(errno.EPERM), "openat")
    f.load()  # one-way: from here on the kernel enforces the filter

if __name__ == "__main__":
    # libseccomp sets no_new_privs by default before loading the filter,
    # so the process does not need CAP_SYS_ADMIN.
    install_sandbox()
    # flush=True: a killed process never flushes buffered output.
    print("inside the sandbox: write() is allowed", flush=True)
    # Soft-denied: returns -1/EPERM rather than crashing.
    try:
        os.open("/etc/passwd", os.O_RDONLY)
    except PermissionError:
        print("openat blocked with EPERM, as configured", flush=True)
    # Hard-denied: socket() is not on the allow-list -> SIGSYS, process dies here.
    socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # never returns
    print("this line is unreachable")
Run it and you'll see the two allowed lines print, then the process is killed by SIGSYS at the socket() call — the final print never executes. libseccomp quietly handles the multi-arch and BPF-codegen pitfalls for you, which is why hand-rolling raw cBPF is discouraged outside of teaching.
Variants and the wider ecosystem
seccomp strict mode. The 2005 original: prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT), four syscalls, no configuration. Still used by some compute-only sandboxes that genuinely need nothing else.
User notification (seccomp_unotify). Since Linux 5.0, SECCOMP_RET_USER_NOTIF parks the target on a forbidden call and hands a supervisor process the chance to inspect, emulate, or fulfill it, even reading the target's memory via /proc/pid/mem to examine a path. The check is race-free only if the supervisor then performs the operation itself on the target's behalf; letting the original call continue reopens the TOCTOU window. This is how some runtimes emulate syscalls (e.g. brokered file opens) without granting them.
libseccomp. The portable C library (with bindings for Python, Go, Rust) that compiles a high-level allow/deny list into multi-arch cBPF, and can optionally arrange the syscall checks as a binary tree for O(log k) dispatch. Almost everyone uses this rather than raw BPF.
Landlock. A newer (Linux 5.13, 2021) unprivileged LSM that does what seccomp cannot — restrict access by filesystem path — and is meant to sit alongside seccomp, not replace it.
BPF LSM and eBPF. seccomp uses classic BPF; the modern extended BPF powers richer, programmable security hooks via the BPF LSM, but seccomp deliberately sticks with the simpler, easier-to-audit cBPF on the hot syscall path.
Common bugs and edge cases
- Forgetting the architecture check. The number-to-syscall mapping differs between x86-64 and x86-32. A filter that validates only 64-bit numbers is bypassed by issuing the 32-bit ABI variant. Always reject any arch that isn't your expected AUDIT_ARCH value as rule zero.
- Trying to filter on pointer contents. A cBPF filter cannot read the path passed to open, and even if it could, another thread could overwrite that memory between the check and the syscall (TOCTOU). Use seccomp_unotify or Landlock for path policy.
- Allowing the wrong "harmless" syscall. ptrace lets a process commandeer a sibling; socket reopens the network; clone with CLONE_NEWUSER creates a user namespace in which the caller holds full capabilities. Audit the allow-list for escalation primitives.
- Multiplexed syscalls. On 32-bit x86, socketcall and ipc bundle many operations behind one number, so a coarse rule lets all of them through. Newer kernels expose the individual calls; prefer those.
- Installing the filter too late. Filters apply only to syscalls made after load(). Initialize everything first (open log files, dlopen libraries, resolve DNS), then drop into the sandbox before touching untrusted input.
- Killing on a syscall the C library needs. glibc may call rt_sigreturn, futex, brk, or newer variants like clock_gettime64 you didn't anticipate; an over-tight KILL filter crashes on innocuous internal calls. Test under the real libc and consider SECCOMP_RET_LOG first to discover the true set.
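The architecture bug is easiest to appreciate with real numbers. The sketch below uses a handful of syscall numbers from the x86-64 and i386 tables; the two filter functions are toy models written for this example, not generated cBPF.

```python
AUDIT_ARCH_X86_64 = 0xC000003E
AUDIT_ARCH_I386 = 0x40000003

# A few entries from each ABI's syscall table.
X86_64 = {"read": 0, "write": 1, "munmap": 11, "execve": 59, "exit_group": 231}
I386 = {"exit": 1, "read": 3, "write": 4, "execve": 11, "munmap": 91, "exit_group": 252}

# The allow-list was written with only x86-64 numbers in mind.
ALLOWED = {X86_64[name] for name in ("read", "write", "munmap", "exit_group")}


def naive_filter(nr, arch):
    """Toy filter that forgets to look at arch."""
    return "ALLOW" if nr in ALLOWED else "KILL"


def pinned_filter(nr, arch):
    """Same allow-list, but rule zero rejects any unexpected ABI."""
    if arch != AUDIT_ARCH_X86_64:
        return "KILL"
    return naive_filter(nr, arch)


# Number 11 is munmap on x86-64 but execve on i386.
print(naive_filter(I386["execve"], AUDIT_ARCH_I386))     # ALLOW
print(pinned_filter(I386["execve"], AUDIT_ARCH_I386))    # KILL
print(pinned_filter(X86_64["munmap"], AUDIT_ARCH_X86_64))  # ALLOW
```

Allowing something as innocuous as munmap is enough to let a 32-bit execve through when the filter never checks which ABI made the call.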
Frequently asked questions
What's the difference between SECCOMP_RET_KILL and SECCOMP_RET_ERRNO?
SECCOMP_RET_KILL (today an alias for SECCOMP_RET_KILL_THREAD) terminates the calling thread the instant it makes a forbidden call, and SECCOMP_RET_KILL_PROCESS (since 4.14) takes down the whole process; both act as though the target were killed by SIGSYS, and KILL_PROCESS is the safest default. ERRNO lets the call return as if it failed with a chosen error like EPERM, so the program keeps running, which is useful when a library does an optional syscall you'd rather see fail gracefully than crash on. There's also TRAP (raise SIGSYS so a handler can fake a result), TRACE (hand off to a ptrace supervisor), LOG (record and allow), and ALLOW.
Can a seccomp filter inspect the contents of a pointer argument, like a file path?
No. A classic seccomp-bpf filter can only read the syscall number and the six register arguments — it cannot dereference a pointer, because the value could be changed by another thread between the check and the call (a TOCTOU race). To filter on a path you need seccomp_unotify (SECCOMP_RET_USER_NOTIF), where a supervisor process reads the target's memory through /proc/pid/mem after pinning it, or a fuller mechanism like Landlock or an LSM.
How much does a seccomp filter slow a program down?
A short filter adds roughly 50–100 nanoseconds per syscall, the cost of running a handful of BPF instructions on entry. The classic filter is a linear list, so cost grows with the number of comparisons; long filters that compare hundreds of syscall numbers can dominate. Since Linux 5.11 a constant-action bitmap fast-path lets the kernel skip BPF entirely for syscalls that a filter always allows regardless of their arguments, making the common case nearly free.
Why does adding a seccomp filter require either root or no_new_privs?
Without it, an unprivileged process could install a filter that fakes the result of setuid (for example, skipping the call while reporting success), then exec a setuid-root binary that silently misbehaves because it believes it dropped privileges. Setting PR_SET_NO_NEW_PRIVS first guarantees no exec can ever gain privileges, closing that escalation path, so the kernel then lets any process install a filter.
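For readers who want to see the flag itself, here is a minimal, Linux-only sketch that sets and reads no_new_privs through prctl(2) via ctypes. The option numbers 38 and 39 come from `<linux/prctl.h>`; the helper names are made up for this example, and setting the flag is irreversible for the calling process and its children.

```python
import ctypes
import os
import sys

PR_SET_NO_NEW_PRIVS = 38  # from <linux/prctl.h>
PR_GET_NO_NEW_PRIVS = 39


def _prctl(option, arg2=0):
    """Call prctl(2) with the unused arguments zeroed, raising OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    ul = ctypes.c_ulong
    result = libc.prctl(option, ul(arg2), ul(0), ul(0), ul(0))
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


def get_no_new_privs():
    """Return 1 if no_new_privs is set for this thread, else 0."""
    return _prctl(PR_GET_NO_NEW_PRIVS)


def set_no_new_privs():
    """Set no_new_privs; it cannot be cleared afterwards."""
    _prctl(PR_SET_NO_NEW_PRIVS, 1)


if __name__ == "__main__" and sys.platform.startswith("linux"):
    print("before:", get_no_new_privs())
    set_no_new_privs()
    print("after:", get_no_new_privs())
```

Once the flag is set, an unprivileged process may install a filter, and any setuid binary it later executes runs without elevated privileges.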
Is seccomp a complete sandbox on its own?
No — it's one layer. seccomp restricts which syscalls are reachable but says nothing about which files, networks, or users a permitted syscall can touch. Real sandboxes (Docker, gVisor, Chrome, systemd units) stack seccomp with namespaces, cgroups, capabilities, mount restrictions, and an LSM like AppArmor or SELinux. seccomp's job is to shrink the kernel attack surface — the number of syscalls an attacker can use to find a kernel bug.
What's the most common bug when writing a seccomp filter by hand?
Forgetting that the syscall number alone is meaningless without checking the architecture. The same number means different syscalls on x86-64 vs x86-32, so an attacker on a multi-arch kernel can flip to the 32-bit ABI and slip past a filter that only validated 64-bit numbers. Every robust filter starts by checking data.arch and killing anything that isn't the expected AUDIT_ARCH value.