System Call
The 50-nanosecond gate between your code and the kernel
A system call is a controlled jump from user-space code into the kernel. It is the only way an unprivileged process can do I/O, allocate memory, fork, or talk to the network. The boundary itself is fast (~50 ns), but the work that happens across it (context switches, copies, scheduling decisions) compounds quickly at high call rates.
- Syscall boundary: ~50 ns
- Function call: ~1 ns
- vDSO clock_gettime: ~10 ns
- Linux x86_64 syscalls: ~440
- errno scope: per-thread
How a syscall happens
User-space code runs with the CPU at ring 3 — the lowest privilege level on x86. It can't access I/O ports, can't change page tables, can't speak directly to the disk controller. To do anything that matters, it asks the kernel via a syscall, which is a tightly controlled doorway into ring 0.
The flow on Linux x86_64:
- Userland loads the syscall number into rax and arguments into rdi, rsi, rdx, r10, r8, r9.
- It executes the syscall instruction. The CPU saves rip in rcx and rflags in r11, raises CPL to 0, and jumps to the address in MSR_LSTAR, the kernel's syscall entry point. The instruction itself does not switch stacks: the entry stub runs swapgs (which loads the kernel's per-CPU base from MSR_KERNEL_GS_BASE) and then loads the kernel stack pointer from per-CPU data.
- The kernel entry stub saves the user registers, dispatches via the syscall table, and runs the handler, which validates any user pointers as it copies data in or out.
- On return, sysretq restores ring 3 and resumes user code with the result in rax (negative = error code, with libc translating to -1 + errno).
Three things make this safe. First, only the entry stub is reachable from user code; you can't jmp into the middle of a syscall handler. Second, the kernel never trusts user-space pointers: copy_from_user range-checks the address and relies on exception-table fixups, so a bad pointer gives EFAULT instead of crashing the kernel. Third, KPTI (Kernel Page Table Isolation) keeps most of the kernel unmapped from user-space page tables to mitigate Meltdown, which is why pre-2018 syscalls were noticeably faster.
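The second point can be observed from user space. The sketch below is a minimal illustration, assuming Linux and a libc reachable through ctypes; the helper name write_from_bad_pointer and the choice of address 1 as an unmapped location are assumptions made for the example. It asks the kernel to copy bytes from a bogus address into a pipe, and the call should fail with EFAULT rather than crash anything.

```python
import ctypes
import ctypes.util
import errno
import os

# Load libc with use_errno=True so ctypes keeps a private copy of errno per call.
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.write.argtypes = (ctypes.c_int, ctypes.c_void_p, ctypes.c_size_t)
libc.write.restype = ctypes.c_ssize_t


def write_from_bad_pointer():
    """Ask the kernel to copy 16 bytes from an unmapped user address into a pipe."""
    read_end, write_end = os.pipe()
    try:
        bad_address = ctypes.c_void_p(1)  # assumption: address 1 is never mapped
        ret = libc.write(write_end, bad_address, 16)
        err = ctypes.get_errno()  # capture right after the failing call
        return ret, err
    finally:
        os.close(read_end)
        os.close(write_end)


ret, err = write_from_bad_pointer()
print(ret, errno.errorcode.get(err))
```

A pipe is used because the kernel has to read the source buffer to fill it; a target such as /dev/null never touches the buffer and would not exercise the check.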
When syscall cost actually matters
- Million-IOPS storage: NVMe latency is ~10 µs; a 1 µs syscall is 10% overhead per op.
- 10/40/100 GbE networking: per-packet syscalls saturate a CPU long before the link.
- High-frequency timestamps: a clock_gettime in a tight loop is ~10× slower as a syscall than via vDSO.
- Profiling: strace can slow a process by 10–100×.
- Containers and sandboxing: seccomp filters add a small per-call cost (~30 ns), usually fine but measurable.
For most application code, syscall cost is invisible. It becomes visible the moment you build a server doing more than a few hundred thousand operations per second per core.
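The arithmetic behind these thresholds is short enough to write down. The sketch below uses the figures quoted in the list above (a 1 µs syscall against 10 µs of NVMe latency); the function names are illustrative, not from any library.

```python
def overhead_fraction(overhead_us, device_latency_us):
    """Syscall overhead relative to the device time of one operation."""
    return overhead_us / device_latency_us


def cores_consumed(ops_per_second, overhead_ns_per_op):
    """CPU cores spent purely on per-operation overhead (1.0 = one full core)."""
    return ops_per_second * overhead_ns_per_op / 1e9


print(overhead_fraction(1, 10))          # 1 µs syscall on a 10 µs NVMe read
print(cores_consumed(1_000_000, 1_000))  # one million ops/s at 1 µs each
print(cores_consumed(300_000, 1_000))    # a few hundred thousand ops/s at 1 µs each
```

At one million operations per second and 1 µs each, the overhead alone occupies a full core, which is the point at which the cost stops being invisible.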
Syscall mechanisms compared
| Mechanism | Cost | Used by | Notes |
|---|---|---|---|
| int 0x80 | ~250 ns | Linux pre-2.6 i386 | Software interrupt via IDT; slow, serializes pipeline |
| sysenter / sysexit | ~100 ns | Linux i386 ≥ 2.6, Windows x86 | Intel fast-call; bypasses IDT but doesn't preserve all flags |
| syscall / sysret | ~50 ns | Linux x86_64, FreeBSD, macOS | MSR-driven branch; KPTI roughly doubles cost on older CPUs |
| svc (Arm), ecall (RISC-V) | ~50–100 ns | Linux Arm64, RISC-V | Architecturally cleaner; same overhead profile |
| vDSO call | ~10 ns | clock_gettime, gettimeofday, getcpu, time | User-space stub on shared kernel page; no mode switch |
| io_uring submit | ~0 amortized | Linux 5.1+ async I/O | Lock-free SQE/CQE rings; SQPOLL = zero syscalls |
| DPDK / SPDK / RDMA | 0 (no kernel) | HFT, HPC, NVMe-oF | User-space device drivers; burns a polling CPU |
Each step in this progression cuts the overhead per crossing. int 0x80 went through full interrupt machinery; syscall is an MSR-driven branch; vDSO removes the crossing entirely for read-only operations; io_uring batches many operations per crossing via shared rings; kernel-bypass frameworks remove the kernel from the path completely.
C / x86_64: a syscall by hand
// write(1, "hi\n", 3) without libc.
#include <sys/syscall.h>   // SYS_write

int main(void) {
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                                             // rax ← return value
        : "0"((long)SYS_write), "D"(1L), "S"("hi\n"), "d"(3L)   // rax, rdi, rsi, rdx (full 64-bit values)
        : "rcx", "r11", "memory"                                // syscall clobbers rcx, r11
    );
    return ret == 3 ? 0 : 1;
}
The Linux x86_64 ABI passes the syscall number in rax and up to six arguments in rdi, rsi, rdx, r10, r8, r9. syscall clobbers rcx (it stashes the return address there) and r11 (it saves rflags there). On return, rax is the result; values in [-4095, -1] are errors that libc converts to -1 with errno = -rax. Direct syscalls bypass libc entirely, which is useful for static binaries, freestanding runtimes that ship without libc, and seccomp-confined code.
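The return-value convention is easy to model. The sketch below mimics the translation a libc wrapper performs on the raw value in rax; libc_style_result and MAX_ERRNO are names invented for this illustration.

```python
import errno

MAX_ERRNO = 4095  # raw returns in [-4095, -1] encode -errno


def libc_style_result(raw_rax):
    """Map a raw kernel return value to the (result, errno) pair a libc wrapper reports."""
    if -MAX_ERRNO <= raw_rax <= -1:
        return -1, -raw_rax
    return raw_rax, 0


print(libc_style_result(3))             # success: three bytes written
print(libc_style_result(-errno.EBADF))  # failure: bad file descriptor
```

The narrow error window matters because some syscalls return values that look negative when read as signed integers, such as high addresses from mmap; anything outside [-4095, -1] is passed through as a result.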
strace output, decoded
openat(AT_FDCWD, "/etc/hostname", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=12, ...}) = 0
read(3, "darkstar\n\0\0\0", 4096) = 12
close(3) = 0
write(1, "darkstar\n", 9) = 9
exit_group(0) = ?
Each line is one syscall, return value on the right. strace -c produces a count and time profile — the fastest way to see where a sluggish process is spending its kernel time. Beware: strace uses ptrace and is itself extremely expensive (often 10–100× slowdown). Use perf trace for low-overhead profiling.
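Because the format is regular, a trace like the one above can be tallied with a few lines of code. This is a rough sketch that handles only simple single-line entries like the sample; the regular expression and the parse_strace name are assumptions for the example, and real traces with unfinished or resumed calls need more care.

```python
import re
from collections import Counter

# name(args) = return-value, as in the sample trace above
LINE = re.compile(r"^(?P<name>\w+)\((?P<args>.*)\)\s*=\s*(?P<ret>\S+)")

SAMPLE = r'''openat(AT_FDCWD, "/etc/hostname", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=12, ...}) = 0
read(3, "darkstar\n\0\0\0", 4096) = 12
close(3) = 0
write(1, "darkstar\n", 9) = 9
exit_group(0) = ?'''


def parse_strace(text):
    """Return (syscall name, return value string) for each line that matches."""
    calls = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            calls.append((match.group("name"), match.group("ret")))
    return calls


calls = parse_strace(SAMPLE)
counts = Counter(name for name, _ in calls)
print(calls)
print(counts.most_common(3))
```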
Python: syscall via ctypes and the cost of vDSO
import ctypes, ctypes.util, time

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
SYS_getpid = 39  # x86_64 only; other architectures use different numbers

# 1) Direct syscall via libc syscall(2).
syscall = libc.syscall
syscall.restype = ctypes.c_long
syscall.argtypes = (ctypes.c_long,)
t0 = time.perf_counter_ns()
for _ in range(1_000_000):
    syscall(SYS_getpid)  # cheap in the kernel, but still crosses the boundary
print("syscall: ", (time.perf_counter_ns() - t0) / 1e6, "ms")

# 2) Same call through the libc wrapper. glibc before 2.25 cached the PID here;
#    newer glibc makes a real syscall on every call.
t0 = time.perf_counter_ns()
for _ in range(1_000_000):
    libc.getpid()
print("libc:    ", (time.perf_counter_ns() - t0) / 1e6, "ms")

# 3) clock_gettime: vDSO-backed, no kernel crossing on typical Linux setups.
t0 = time.perf_counter_ns()
for _ in range(1_000_000):
    time.monotonic_ns()
print("vDSO:    ", (time.perf_counter_ns() - t0) / 1e6, "ms")
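On a modern Skylake, the three loops are reported to print around 60 ms, 5 ms, and 25 ms, but treat those figures as illustrative: interpreter and ctypes call overhead is a large share of every iteration, so absolute numbers vary widely between machines and Python versions. Two caveats apply. glibc 2.25 and later no longer cache the PID, so on current systems the second loop also enters the kernel on every call. And the third loop goes through time.monotonic_ns(), which adds interpreter work on top of the vDSO read. The point is the ordering: a path that crosses into the kernel costs measurably more than the same operation served entirely in user space.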
Node.js: syscall-heavy patterns and how to fix them
// BAD: one syscall per chunk on a hot loop.
import { readSync, openSync, closeSync } from 'node:fs';
const fd = openSync('/var/log/big.log', 'r');
const buf = Buffer.alloc(64);   // tiny buffer = one read syscall per 64 bytes
let n;
while ((n = readSync(fd, buf, 0, 64)) > 0) { /* process buf.subarray(0, n) */ }
closeSync(fd);

// GOOD: 64 KiB chunks through a stream pipeline, which also handles backpressure.
import { createReadStream } from 'node:fs';
import { createServer } from 'node:http';
import { pipeline } from 'node:stream/promises';
createServer(async (_, res) => {
  res.writeHead(200);
  try {
    await pipeline(createReadStream('/var/log/big.log', { highWaterMark: 64 * 1024 }), res);
  } catch (err) {
    res.destroy(err);   // client disconnected or the read failed
  }
}).listen(8080);
Node's syscall surface is largely hidden behind libuv. The quickest rule of thumb: use a large highWaterMark on streams when you're moving bulk data, and prefer pipeline() over manual read/write loops so chunking, backpressure, and cleanup are handled for you and each read syscall moves a full chunk. Profile with strace -c -p <pid> and look for million-call counts on read, write, or recvfrom.
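The effect of buffer size on call count is language-independent. The sketch below (in Python, to keep one language for the added examples) reads a 1 MiB temporary file through os.read, which issues one read per call, and counts the calls for 64-byte and 64 KiB buffers. The file size and the count_reads helper are arbitrary choices for the illustration.

```python
import os
import tempfile


def count_reads(path, buffer_size):
    """Read the whole file with os.read and return how many calls it took."""
    calls = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, buffer_size)
            calls += 1
            if not chunk:  # empty result marks end of file
                return calls
    finally:
        os.close(fd)


with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (1024 * 1024))
    path = f.name
try:
    small = count_reads(path, 64)
    large = count_reads(path, 64 * 1024)
finally:
    os.unlink(path)

print(small, large)
```

On a regular local file, where reads return the full amount requested, this comes out to 16385 calls against 17: the same bytes moved with roughly a thousand times fewer boundary crossings.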
Variants and modern bypasses
- vDSO functions. clock_gettime, gettimeofday, getcpu, time: user-space stubs on shared kernel data, no mode switch.
- io_uring. Lock-free queues mapped into user space; IORING_SETUP_SQPOLL lets a kernel thread poll submissions, so the user process can submit without making a syscall while that thread is awake.
- seccomp / BPF filters. Per-process syscall allowlists. Docker's default profile blocks ~50 risky calls at ~30 ns per filter.
- syscall user dispatch. Linux 5.11+ in-process syscall interception — used by Wine and CRIU.
- vsyscall. Pre-vDSO fixed kernel page; deprecated due to ASLR concerns.
- Kernel bypass. DPDK/SPDK/RDMA/XDP move I/O into user space or into JITed BPF, avoiding per-packet syscalls entirely.
Costed claims
- syscall instruction: ~50 ns on uncontested hardware (Skylake). KPTI adds 30–100 ns on older CPUs, less on Ice Lake+.
- Function call: ~1 ns when not branch-mispredicted; effectively 0 with inlining.
- vDSO clock_gettime: ~10 ns — basically a TSC read plus shared-page math.
- strace overhead: ptrace doubles each syscall by stopping the tracee and waking the tracer. Easily 10–100× slowdown on syscall-heavy code.
- io_uring batching: 1 syscall to submit 64 ops = ~1 ns per op overhead, or 0 with SQPOLL.
- Linux x86_64 syscall table: ~440 numbered entries as of 6.x; new ones are appended with fresh __NR_* numbers.
- seccomp BPF filter cost: ~30 ns per call for typical Docker default profiles.
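The io_uring batching figure above is just the boundary cost divided by the batch size. A minimal sketch, using the ~50 ns boundary cost from this list (amortized_ns_per_op is a name chosen for the example):

```python
def amortized_ns_per_op(boundary_ns, ops_per_syscall):
    """Boundary cost spread evenly over every operation submitted in one syscall."""
    return boundary_ns / ops_per_syscall


for batch in (1, 8, 64):
    print(batch, round(amortized_ns_per_op(50, batch), 2))
```

This counts only the crossing itself; the kernel-side work per operation is unchanged by batching.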
Common bugs and edge cases
- errno is per-thread, but fragile. Reading errno after any intervening library call (even printf) is unreliable, because anything in between can overwrite it. Capture it immediately after the failing call.
- EINTR retry loops. Many syscalls return -1/EINTR when interrupted by a signal. Naive code that doesn't retry treats the interruption as a real failure. Use SA_RESTART or wrap the call in a retry helper.
- Unchecked short reads/writes. read()/write() can return less than requested without error. Loops that assume the full size silently corrupt data on slow pipes or sockets.
- strace-induced heisenbug. A reproducible failure disappears under strace because the slowdown changes timing windows. Use perf trace or ftrace.
- TOCTOU. access() then open() is a race the attacker can exploit between the two syscalls. Open first (with O_NOFOLLOW if symlinks are a concern), then fstat the open fd.
- seccomp-blocked calls can return EPERM instead of crashing. When the filter's action is to return an error, as in Docker's default profile, the process keeps running; this is hard to debug if you don't know seccomp is in play. Unexpected EPERM in strace is the tell.
- rseq state leaks. Recent glibc registers a restartable-sequence (rseq) area for each thread, and per-CPU caches in allocators such as tcmalloc build on restartable sequences; code that makes direct syscalls to create threads or register rseq itself can conflict with that state and break the allocator.
Frequently asked questions
Why does crossing into the kernel cost more than a function call?
A function call is one branch — a few cycles. A syscall switches CPU privilege level (ring 3 → ring 0), swaps the stack, may swap part of the page table for KPTI, validates arguments, and runs an entirely separate code path. Even with the dedicated syscall instruction it's ~50 ns vs ~1 ns for a function call. KPTI mitigations roughly doubled this cost on older CPUs.
What's the vDSO?
The virtual Dynamic Shared Object is a tiny kernel-supplied library mapped into every process. It implements a few high-frequency calls (gettimeofday, clock_gettime, getcpu) entirely in user space using shared kernel data — saving the syscall round trip. A clock_gettime via vDSO is ~10 ns; via syscall it's ~100 ns.
Is errno actually a single global?
No, and that would be a disaster in multithreaded code. On modern libc, errno is a macro that expands to a dereferenced function call returning the address of a thread-local variable: (*__errno_location()) on both glibc and musl. Each thread has its own errno, so a syscall on one thread doesn't clobber another's error code.
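The per-thread storage can be observed directly. The sketch below assumes a Linux libc that exports __errno_location (glibc and musl both do; other platforms name it differently) and compares the address it returns on the main thread with the address seen by a worker thread. The errno_address and record helpers are illustrative names.

```python
import ctypes
import ctypes.util
import threading

libc = ctypes.CDLL(ctypes.util.find_library("c"))
errno_location = getattr(libc, "__errno_location")  # assumption: Linux glibc or musl
errno_location.restype = ctypes.c_void_p
errno_location.argtypes = ()


def errno_address():
    """Address of the calling thread's errno variable."""
    return errno_location()


addresses = {}


def record(name):
    addresses[name] = errno_address()


record("main")
worker = threading.Thread(target=record, args=("worker",))
worker.start()
worker.join()

print(addresses["main"] != addresses["worker"])
```

Two threads that are alive at the same time get distinct addresses, which is why one thread's failing call cannot overwrite another thread's error code.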
Why does my server bottleneck on syscalls before disk?
On NVMe SSDs a 4 KB read takes ~10 µs of disk; the syscall plus context switch overhead can be 1–2 µs. At a million IOPS, syscall overhead alone burns whole CPU cores. That's why io_uring exists — submit and complete batched I/O without crossing the boundary on every operation.
Can a process bypass the kernel for I/O?
Yes, with hardware help. DPDK (network) and SPDK (storage) map device queues into user space and poll them directly. RDMA exposes a user-space verbs interface backed by the NIC. These bypass the kernel entirely once set up — common in HFT, HPC, and high-end databases.