Why does crossing into the kernel cost more than a function call?

A function call is one branch, a few cycles. A syscall switches CPU privilege level (ring 3 → ring 0), swaps the stack, switches page tables when KPTI is active, validates arguments, and runs an entirely separate code path. Even with the dedicated syscall instruction it's ~50 ns vs ~1 ns for a function call. KPTI mitigations roughly doubled this cost on older CPUs.

What is the vDSO?

The virtual Dynamic Shared Object (vDSO) is a tiny kernel-supplied library mapped into every process. It implements a few high-frequency calls (gettimeofday, clock_gettime, getcpu) entirely in user space using shared kernel data, saving the syscall round trip. A clock_gettime via vDSO is ~10 ns; via syscall it's ~100 ns.
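
One way to see that mapping, as a minimal sketch assuming a Linux /proc filesystem: the kernel lists a [vdso] region in /proc/self/maps. The helper name find_vdso is illustrative, not part of any library.

```python
from pathlib import Path
from typing import Optional


def find_vdso(maps_text: str) -> Optional[str]:
    """Return the address range of the [vdso] mapping, or None if absent."""
    for line in maps_text.splitlines():
        fields = line.split()
        # Lines in /proc/<pid>/maps end with the mapping name when there is one.
        if fields and fields[-1] == "[vdso]":
            return fields[0]
    return None


maps = Path("/proc/self/maps")
if maps.exists():  # Linux only; other systems have no /proc/self/maps
    print("vDSO mapped at:", find_vdso(maps.read_text()))
```

The address changes from run to run because the vDSO is placed by ASLR like any other mapping.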

Is errno actually a single global?

No, and a single global would be a disaster in multithreaded code. In modern libcs errno is a macro that expands to a dereferenced function call, (*__errno_location()) in both glibc and musl, which returns the address of a thread-local int. Each thread has its own errno, so a failed syscall on one thread doesn't clobber another's error code.
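
The per-thread behaviour can be observed from Python, as a rough sketch assuming a POSIX libc reachable through ctypes. Note that ctypes keeps its own per-thread copy of errno (read with ctypes.get_errno()), so this mirrors the C behaviour rather than reading the C variable directly. The helper fail_with and the path are made up for the demo.

```python
import ctypes
import ctypes.util
import errno
import threading

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
results = {}


def fail_with(name, call):
    """Run a libc call that fails, then record the errno seen by this thread."""
    ctypes.set_errno(0)
    call()
    results[name] = ctypes.get_errno()


# close(-1) fails with EBADF; rmdir on a missing path fails with ENOENT.
threads = [
    threading.Thread(target=fail_with, args=("close", lambda: libc.close(-1))),
    threading.Thread(
        target=fail_with,
        args=("rmdir", lambda: libc.rmdir(b"/nonexistent-dir-for-errno-demo")),
    ),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

for name, code in results.items():
    print(name, errno.errorcode.get(code))
```

Each thread reports the error from its own call, regardless of how the two threads interleave.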

Why does my server bottleneck on syscalls before disk?

On NVMe SSDs a 4 KB read takes ~10 µs of disk; the syscall plus context switch overhead can be 1–2 µs. At a million IOPS, syscall overhead alone burns whole CPU cores. That's why io_uring exists — submit and complete batched I/O without crossing the boundary on every operation.

Can a process bypass the kernel for I/O?

Yes, with hardware help. DPDK (network) and SPDK (storage) map device queues into user space and poll them directly. RDMA exposes a user-space verbs interface backed by the NIC. These bypass the kernel entirely once set up — common in HFT, HPC, and high-end databases.

System Call — How User Code Crosses Into the Kernel

How a syscall happens

User-space code runs with the CPU at ring 3 — the lowest privilege level on x86. It can't access I/O ports, can't change page tables, can't speak directly to the disk controller. To do anything that matters, it asks the kernel via a syscall, which is a tightly controlled doorway into ring 0.

The flow on Linux x86_64:

Userland loads the syscall number into rax and arguments into rdi, rsi, rdx, r10, r8, r9.
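It executes the syscall instruction. The CPU copies rip into rcx and rflags into r11, masks rflags with MSR_SYSCALL_MASK, raises CPL to 0, and jumps to the address in MSR_LSTAR, the kernel's syscall entry point. The instruction itself does not switch stacks.
The kernel entry stub runs swapgs (exchanging the GS base with MSR_KERNEL_GS_BASE to reach per-CPU data), loads the kernel stack pointer from there, saves the user registers, and dispatches through the syscall table to the handler. The handler validates user pointers when it copies data in or out.
On return, sysretq restores ring 3 and resumes user code with the result in rax (a negative value is an error code, which libc translates to -1 plus errno).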

Three things make this safe. First, only the entry stub is reachable from user code: you can't jmp into the middle of a syscall handler. Second, the kernel never trusts user-space pointers; copy_from_user checks that the address range lies in user space and relies on exception-table fixups, so a bad pointer gives EFAULT instead of crashing the kernel. Third, KPTI (Kernel Page Table Isolation) keeps most of the kernel unmapped from user-space page tables to mitigate Meltdown, which is why pre-2018 syscalls were noticeably faster.
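
The second point can be demonstrated from user space. This is a sketch, assuming Linux and a libc loadable through ctypes: write() is handed a deliberately bad buffer address, and the call is expected to fail with EFAULT rather than fault the process.

```python
import ctypes
import ctypes.util
import errno
import os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.write.argtypes = (ctypes.c_int, ctypes.c_void_p, ctypes.c_size_t)
libc.write.restype = ctypes.c_ssize_t

read_fd, write_fd = os.pipe()
try:
    ctypes.set_errno(0)
    # Address 8 is not mapped in a normal process, so the kernel's copy
    # from user memory faults and the fixup path turns that into an error.
    ret = libc.write(write_fd, ctypes.c_void_p(8), 16)
    err = ctypes.get_errno()
finally:
    os.close(read_fd)
    os.close(write_fd)

print(ret, errno.errorcode.get(err))
```

The process keeps running: the bad pointer was never dereferenced in user space, only inside the kernel's guarded copy routine.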

When syscall cost actually matters

Million-IOPS storage: NVMe latency is ~10 µs; a 1 µs syscall is 10% overhead per op.
10/40/100 GbE networking: per-packet syscalls saturate a CPU long before the link.
High-frequency timestamps: a clock_gettime in a tight loop is ~10× slower as a syscall than via vDSO.
Profiling: strace can slow a process by 10–100×.
Containers and sandboxing: seccomp filters add a small per-call cost (~30 ns) — usually fine, but it's measurable.

For most application code, syscall cost is invisible. It becomes visible the moment you build a server doing more than a few hundred thousand operations per second per core.
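
The arithmetic behind those thresholds is simple enough to sketch. The function names are illustrative; the inputs are the figures quoted in this article (1 to 2 µs per syscall, ~10 µs NVMe latency).

```python
def cores_for_syscalls(ops_per_second: int, syscall_ns: int) -> float:
    """CPU cores' worth of time spent purely on kernel crossings."""
    return ops_per_second * syscall_ns / 1e9


def overhead_fraction(device_latency_ns: int, syscall_ns: int) -> float:
    """Syscall cost as a fraction of the device latency it accompanies."""
    return syscall_ns / device_latency_ns


print(cores_for_syscalls(1_000_000, 1_000))  # 1.0 core at 1 µs per call
print(cores_for_syscalls(1_000_000, 2_000))  # 2.0 cores at 2 µs per call
print(overhead_fraction(10_000, 1_000))      # 0.1, the 10% NVMe figure
```

At one syscall per operation, a million operations per second costs one to two full cores before any useful work is done.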

Syscall mechanisms compared

Mechanism	Cost	Used by	Notes
int 0x80	~250 ns	Linux pre-2.6 i386	Software interrupt via IDT; slow, serializes pipeline
sysenter / sysexit	~100 ns	Linux i386 ≥ 2.6, Windows x86	Intel fast-call; bypasses IDT but doesn't preserve all flags
syscall / sysret	~50 ns	Linux x86_64, FreeBSD, macOS	MSR-driven branch; KPTI roughly doubles cost on older CPUs
svc (Arm), ecall (RISC-V)	~50–100 ns	Linux Arm64, RISC-V	Architecturally cleaner; same overhead profile
vDSO call	~10 ns	clock_gettime, gettimeofday, getcpu, time	User-space stub on shared kernel page; no mode switch
io_uring submit	~0 amortized	Linux 5.1+ async I/O	Lock-free SQE/CQE rings; SQPOLL = zero syscalls
DPDK / SPDK / RDMA	0 (no kernel)	HFT, HPC, NVMe-oF	User-space device drivers; burns a polling CPU

Each step in the progression cuts the overhead per crossing. int 0x80 went through full interrupt machinery; syscall is an MSR-driven branch; vDSO removes the crossing entirely for read-only operations; io_uring batches many ops per crossing via shared rings; kernel-bypass frameworks remove the kernel from the data path completely.

C / x86_64: a syscall by hand

// write(1, "hi\n", 3) via a raw syscall instruction, skipping libc's write() wrapper.
#include <sys/syscall.h>   // SYS_write
int main(void) {
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                                    // rax  ← return
        // rax, rdi, rsi, rdx; casts keep every operand 64-bit (clang rejects int tied to "=a" long)
        : "0"((long)SYS_write), "D"(1L), "S"("hi\n"), "d"(3L)
        : "rcx", "r11", "memory"                       // syscall clobbers rcx, r11
    );
    return ret == 3 ? 0 : 1;
}

The Linux x86_64 ABI passes the syscall number in rax and the first six args in rdi, rsi, rdx, r10, r8, r9. syscall clobbers rcx (it stashes the return address there) and r11 (it saves rflags there). On return, rax is the result; values in [-4095, -1] are errors that libc converts to -1 with errno = -rax. Direct syscalls bypass libc entirely, which is useful for static binaries, runtimes that don't link libc (the Go runtime on Linux issues its syscalls directly), and seccomp-confined code.
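
The return-value convention can be written out as a small function. This is a sketch of what a libc wrapper does with the raw rax value; decode_raw_return and MAX_ERRNO are illustrative names, and None stands in for "errno left untouched".

```python
import errno
from typing import Optional, Tuple

MAX_ERRNO = 4095  # raw results in [-4095, -1] encode an error


def decode_raw_return(rax: int) -> Tuple[int, Optional[int]]:
    """Map a raw syscall result to (return value, errno) as libc reports it."""
    if -MAX_ERRNO <= rax <= -1:
        return -1, -rax
    # Success: the value passes through and errno is not modified.
    return rax, None


print(decode_raw_return(3))             # (3, None)
print(decode_raw_return(-errno.EBADF))  # (-1, errno.EBADF)
```

Values below -4095 are ordinary results, which is how calls such as mmap can return addresses in the upper half of the address space without being mistaken for errors.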

strace output, decoded

openat(AT_FDCWD, "/etc/hostname", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=12, ...}) = 0
read(3, "darkstar\n\0\0\0", 4096)        = 12
close(3)                                 = 0
write(1, "darkstar\n", 9)                = 9
exit_group(0)                            = ?

Each line is one syscall, return value on the right. strace -c produces a count and time profile — the fastest way to see where a sluggish process is spending its kernel time. Beware: strace uses ptrace and is itself extremely expensive (often 10–100× slowdown). Use perf trace for low-overhead profiling.
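
For saved traces, a per-call tally similar in spirit to strace -c can be sketched in a few lines. This assumes the simple one-line-per-call format shown above; unfinished or resumed calls from multi-threaded traces are skipped. The names LINE and tally are illustrative.

```python
import re
from collections import Counter

# name(args) = return-value, as in the sample output above.
LINE = re.compile(r"^(?P<name>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>\S+)")


def tally(trace_text: str):
    """Count calls and failed calls per syscall name."""
    counts, failures = Counter(), Counter()
    for line in trace_text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # not a complete syscall line
        counts[match["name"]] += 1
        if match["ret"].startswith("-"):  # e.g. "-1 ENOENT (...)"
            failures[match["name"]] += 1
    return counts, failures


sample = r'''
openat(AT_FDCWD, "/etc/hostname", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=12, ...}) = 0
read(3, "darkstar\n\0\0\0", 4096)        = 12
close(3)                                 = 0
write(1, "darkstar\n", 9)                = 9
exit_group(0)                            = ?
'''
counts, failures = tally(sample)
print(counts.most_common())
```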

Python: syscall via ctypes and the cost of vDSO

import ctypes, ctypes.util, time

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
SYS_getpid = 39   # x86_64

# 1) Direct syscall via libc syscall(2).
syscall = libc.syscall
syscall.restype = ctypes.c_long
syscall.argtypes = (ctypes.c_long,)

t0 = time.perf_counter_ns()
for _ in range(1_000_000):
    syscall(SYS_getpid)            # cached in task_struct, but still crosses
print("syscall: ", (time.perf_counter_ns() - t0) / 1e6, "ms")

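# 2) libc's getpid() wrapper. glibc cached the PID until 2.25;
#    newer versions enter the kernel here too.
t0 = time.perf_counter_ns()
for _ in range(1_000_000):
    libc.getpid()
print("wrapper: ", (time.perf_counter_ns() - t0) / 1e6, "ms")

# 3) clock_gettime: vDSO-backed, no kernel cross at all.
t0 = time.perf_counter_ns()
for _ in range(1_000_000):
    time.monotonic_ns()
print("vDSO:    ", (time.perf_counter_ns() - t0) / 1e6, "ms")

Absolute timings depend on the machine and on ctypes call overhead, which is often larger than the kernel crossing itself, so compare the loops with each other rather than reading the numbers as syscall latencies. On glibc 2.25 and later, loops 1 and 2 both enter the kernel and should land close together; on older glibc the cached getpid() is noticeably cheaper. The time.monotonic_ns() loop goes through the vDSO and a built-in function rather than ctypes, so it is usually the cheapest of the three. The underlying point holds at the C level: a path that stays in user space costs roughly a tenth of one that crosses into the kernel (about 10 ns against about 100 ns for clock_gettime).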

Node.js: syscall-heavy patterns and how to fix them

// BAD: one syscall per chunk on a hot loop.
import { readSync, openSync } from 'node:fs';
const fd = openSync('/var/log/big.log', 'r');
const buf = Buffer.alloc(64); // tiny buffer = one syscall per 64 bytes
let n; while ((n = readSync(fd, buf, 0, 64)) > 0) { /* process */ }

// GOOD: 64 KiB chunks through a stream pipeline; far fewer read calls, backpressure handled for you.
import { createReadStream } from 'node:fs';
import { createServer } from 'node:http';
import { pipeline } from 'node:stream/promises';
createServer(async (_, res) => {
  res.writeHead(200);
  await pipeline(createReadStream('/var/log/big.log', { highWaterMark: 64 * 1024 }), res);
}).listen(8080);

Node's syscall surface is largely hidden behind libuv. The simplest rule: raise highWaterMark on streams when you're moving bulk data, and prefer pipeline() over manual read/write loops so backpressure and cleanup are handled for you. Data still passes through user-space buffers, so chunk size is what controls the number of read and write calls. Profile with strace -c -p <pid> and look for million-call counts on read, write, or recvfrom.
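
The effect of chunk size on call count is easy to measure in any language. Here is a small sketch in Python, where os.read maps to one read(2) call; count_reads and the 1 MiB temporary file are made up for the demo.

```python
import os
import tempfile


def count_reads(path: str, chunk_size: int) -> int:
    """Read a file to EOF with os.read and return the number of calls made."""
    fd = os.open(path, os.O_RDONLY)
    calls = 0
    try:
        while True:
            data = os.read(fd, chunk_size)  # one read(2) per iteration
            calls += 1
            if not data:  # empty result marks end of file
                break
    finally:
        os.close(fd)
    return calls


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * (1 << 20))  # 1 MiB of filler
    path = tmp.name
try:
    small = count_reads(path, 64)
    large = count_reads(path, 64 * 1024)
finally:
    os.unlink(path)

print("64 B chunks:  ", small, "reads")
print("64 KiB chunks:", large, "reads")
```

The same file costs roughly a thousand times more kernel crossings with the 64-byte buffer.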

Variants and modern bypasses

vDSO functions. clock_gettime, gettimeofday, getcpu, time — user-space stubs on shared kernel data, no mode switch.
io_uring. Lock-free queues mapped into user space; IORING_SETUP_SQPOLL lets a kernel thread poll submissions, so a busy process can submit without making any syscalls (an io_uring_enter call is still needed to wake the poller after it goes idle).
seccomp / BPF filters. Per-process syscall allowlists. Docker default profile blocks ~50 risky calls at ~30 ns per filter.
Syscall User Dispatch. Linux 5.11+ in-process syscall interception, built for Wine-style emulation of foreign syscall ABIs.
vsyscall. Pre-vDSO fixed kernel page; deprecated due to ASLR concerns.
Kernel bypass. DPDK/SPDK/RDMA/XDP move I/O into user space or into JITed BPF, avoiding per-packet syscalls entirely.

Costed claims

syscall instruction: ~50 ns on uncontested hardware (Skylake). KPTI adds 30–100 ns on older CPUs, less on Ice Lake+.
Function call: ~1 ns when not branch-mispredicted; effectively 0 with inlining.
vDSO clock_gettime: ~10 ns — basically a TSC read plus shared-page math.
strace overhead: ptrace stops the tracee at every syscall entry and exit, with a context switch to the tracer each time. Easily 10–100× slowdown on syscall-heavy code.
io_uring batching: 1 syscall to submit 64 ops = ~1 ns per op overhead, or 0 with SQPOLL.
Linux x86_64 syscall table: ~440 numbered entries as of 6.x; new syscalls are appended with fresh __NR_* numbers rather than reusing old ones.
seccomp BPF filter cost: ~30 ns per call for typical Docker default profiles.

Common bugs and edge cases

errno is per-thread and easily clobbered. It is only meaningful immediately after a call that reported failure. Any library call in between (even printf) can overwrite it, and successful calls may leave a stale value behind. Capture it right after the failing call.
EINTR retry loops. Many blocking syscalls fail with -1/EINTR when a signal handler interrupts them. Code that treats EINTR as a hard error drops the operation or aborts spuriously. Install handlers with SA_RESTART or wrap the call in a retry helper.
Unchecked short reads/writes. read()/write() can return less than requested without error. Loops that assume the full size silently corrupt data on slow pipes or sockets.
strace-induced heisenbug. A reproducible failure disappears under strace because the slowdown changes timing windows. Use perf trace or ftrace.
TOCTOU. access() followed by open() leaves a window in which an attacker can swap the path between the two syscalls. Open first (with O_NOFOLLOW where symlinks are a concern), then fstat the open fd.
seccomp-blocked calls can fail instead of crashing. Filters that use SECCOMP_RET_ERRNO, such as Docker's default profile, make a blocked call return an error (typically EPERM) rather than killing the process. Hard to debug if you don't know seccomp is in play; an unexpected EPERM in strace is the tell.
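rseq and raw clone(). glibc 2.35+ registers a restartable-sequences (rseq) area for each thread it creates. Threads made with a raw clone() syscall skip that setup along with TLS initialization, so libc code that assumes it, including per-thread allocator caches, can misbehave.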
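
The short-write and EINTR items usually end up in one helper. A minimal sketch, assuming a blocking file descriptor; write_all is an illustrative name. In Python 3.5+ os.write already retries on EINTR (PEP 475), so the loop only has to handle partial writes, whereas a C version would also need to retry when errno is EINTR.

```python
import os


def write_all(fd: int, data: bytes) -> None:
    """Write every byte of data to fd, looping over short writes."""
    view = memoryview(data)
    while len(view) > 0:
        # os.write may accept only part of the buffer; it returns the count.
        written = os.write(fd, view)
        view = view[written:]


read_fd, write_fd = os.pipe()
try:
    write_all(write_fd, b"hello, kernel\n")
    echoed = os.read(read_fd, 64)
finally:
    os.close(read_fd)
    os.close(write_fd)

print(echoed)
```

Any OSError raised by os.write carries the errno captured at the point of failure, which sidesteps the stale-errno problem from the first item.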
