eBPF
A safe little VM living inside Linux
eBPF is a sandboxed bytecode VM in the Linux kernel. Programs are statically verified, JIT-compiled to native code, and attached to syscall, packet, or scheduler hooks.
- Origin: BPF 1992 · eBPF 2014
- Verifier instruction limit: 1,000,000 (since 5.2)
- Registers: 11 × 64-bit
- Typical event overhead: ~100 ns
- XDP throughput: 10+ Gbps per core
- Loaded via: bpf() syscall
How eBPF works
For thirty years, the way you extended Linux at the kernel level was a loadable kernel module — a chunk of C code, compiled against the running kernel's headers, dynamically inserted into ring 0 with full pointer freedom. One null dereference and the whole machine panicked. eBPF replaces that loaded gun with a tightly governed bytecode VM that lives inside the kernel and runs programs written by ordinary userspace processes.
The lifecycle of an eBPF program has four stages:
- Compile. You write restricted C: no unbounded loops, no global variables in the classic sense, no library calls. Clang's -target bpf emits eBPF bytecode: a 64-bit RISC-like instruction set with eleven registers, a 512-byte stack, and explicit map-access helper functions.
- Load and verify. Userspace calls the bpf() syscall with the bytecode. The in-kernel verifier walks every possible execution path symbolically. It tracks the type and range of every register at every instruction. If the program can read uninitialized memory, dereference a pointer the verifier hasn't bounded, exceed the 1-million-instruction ceiling, or fail to terminate, the load is rejected. Userspace gets a detailed error log.
- JIT-compile. Once verified, the bytecode is compiled to native machine code by the kernel's JIT. On x86_64 this happens through a fast template-based translator. The program now runs at native speed, with no interpreter overhead.
- Attach to a hook. The program is registered with a kernel attach point: a kprobe (any kernel function), uprobe (userspace function), tracepoint, XDP (network driver receive), tc (traffic control), socket filter, cgroup hook, scheduler hook, or LSM hook. Every time that point fires, the program runs.
The whole pipeline takes milliseconds. The program runs at kernel-native speed but cannot crash the kernel.
The verifier is the magic
Everything that makes eBPF useful (letting tools peer into kernel state without loading a kernel module, running them on production boxes, shipping them into containers) rests on one component: the verifier. It's the reason an operator can let a packet-filtering program they didn't write run inside their kernel.
The verifier does abstract interpretation. For every register at every instruction, it tracks a type (number, pointer to map value, pointer to packet, pointer to stack) and a range (this u32 is between 0 and 1023). When you dereference a pointer, the verifier checks that the register holds a pointer type it knows how to bound, never a bare number, and that the offset stays inside that base's known bounds.
That's how you can write memcpy(dst, packet_data, len) and have it work safely: before the copy, your code must contain an explicit comparison of len against the packet length, and the verifier observes that comparison and narrows the range of len so the subsequent read can be proven in-bounds.
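The narrowing step can be illustrated with a toy model in plain userspace C. This is only a sketch of the idea, not the kernel's verifier code: `struct scalar_range`, `narrow_le`, and `access_ok` are hypothetical names invented for this example, and the model tracks a single unsigned range and ignores dead paths.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model: what the verifier knows about one scalar register. */
struct scalar_range {
    uint64_t min;
    uint64_t max;
};

/* Models the fall-through path of `if (len > limit) return ...;`.
 * On that path the value is known to be <= limit, so the upper bound shrinks.
 * (The dead-path case, min > limit, is not modeled here.) */
static struct scalar_range narrow_le(struct scalar_range r, uint64_t limit)
{
    if (r.max > limit)
        r.max = limit;
    return r;
}

/* A read of `len` bytes is provably safe only if even the largest
 * possible value of len stays inside the buffer. */
static bool access_ok(struct scalar_range len, uint64_t buf_size)
{
    return len.max <= buf_size;
}
```

Starting from an unknown 32-bit value (range 0 to UINT32_MAX), `access_ok` fails for a 1500-byte buffer. After `narrow_le(len, 1024)` the same check passes, which is the effect an explicit comparison in the program has on the real verifier's state.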
If you've ever fought the verifier, and everyone who writes eBPF has, you were fighting this abstract-interpretation engine's ability to prove things about your branches. Loops are the hardest case. Before 5.3 the verifier rejected any backward jump, so every loop had to be fully unrolled at compile time (#pragma unroll). Kernel 5.3 added bounded loops, where the verifier itself checks that the loop terminates within its instruction budget, which made writing nontrivial programs vastly easier.
Hook points
An eBPF program is useless unless it's attached to a hook — a place in the kernel where it gets invoked. The hooks fall into a few families:
- Tracing. kprobes (any kernel function entry/exit), uprobes (any userspace function), tracepoints (stable kernel probe points). bpftrace and bcc tools are 95% kprobes and tracepoints under the hood.
- Networking (fast path). XDP runs in the NIC driver before the kernel allocates an sk_buff for the packet. tc filters run on egress and ingress in the standard stack. Both can drop, redirect, or modify packets.
- Networking (sockets). Socket filters (the classic BPF use case), sk_msg programs that intercept socket I/O, sockmap for L7 load balancing.
- Control groups. cgroup_skb, cgroup_sock for per-container network policy and observability.
- Security. LSM (Linux Security Module) eBPF hooks let you enforce policy on file opens, capabilities, and other security-sensitive operations. This is the foundation that enforcement tools such as KubeArmor build on.
- Scheduler. sched_ext (since 6.12) lets you write entire scheduling policies as eBPF programs. Game-changer for specialized workloads.
eBPF vs kernel modules vs userspace tracing
| | eBPF | Kernel module | Userspace ptrace/perf |
|---|---|---|---|
| Safety | Verified, can't crash kernel | Full ring-0 access | Safe but limited |
| Per-event overhead | ~100 ns | Native call cost | 50–100 µs |
| Reloadable without reboot | Yes | insmod/rmmod, risky | Always |
| Portable across kernels | CO-RE (BTF) since 5.4 | No, must rebuild | Yes |
| Privilege required | CAP_BPF (5.8+), usually with CAP_PERFMON or CAP_NET_ADMIN | root (CAP_SYS_MODULE) | ptrace cap or root |
| Production-safe at high event rate | Yes | If correct | No, drops events |
| Can modify packets in driver | Yes (XDP) | Yes | No |
The middle column is what eBPF replaced. The right column is what eBPF beats on volume. Together that's most of the production observability and networking stack at hyperscalers.
When to reach for eBPF
- Observability without overhead. Trace every syscall, every page fault, every TCP retransmit on a production server, without a noticeable performance hit. bpftrace one-liners replace what used to be invasive instrumentation.
- High-performance networking. XDP for load balancing, DDoS mitigation, or L4/L7 routing at 10+ Gbps per core. Cilium has built a whole CNI plugin around this — pod networking with no iptables.
- Runtime security. Falco watches syscall patterns to flag exploits in real time. Tracee instruments suspicious behavior. Both rely on eBPF hooks for the data feed.
- Application-specific tracing. uprobes let you observe a function in your own binary — including Go and Rust — without recompiling or attaching a debugger. Pixie does this for service-mesh introspection.
- Custom kernel behavior. sched_ext lets you ship scheduling policies tailored to your workload. Storage, networking, and memory subsystems are gradually opening up to programmable extension.
Pseudo-code lifecycle
// User writes restricted C, then:
//
//   1. clang -target bpf -O2 prog.c -o prog.o
//   2. ./loader prog.o
//
// Where the loader does roughly:
load_program(obj):
    bytecode = parse_elf(obj)
    fd = syscall(BPF_PROG_LOAD, bytecode, license="GPL")
    if fd < 0:
        print(kernel_verifier_log)  // explains exactly which insn failed
        abort()
    return fd

attach_to_hook(fd, hook):
    if hook == "xdp":        syscall(BPF_LINK_CREATE, fd, ifindex)
    if hook == "kprobe":     syscall(BPF_LINK_CREATE, fd, "tcp_sendmsg")
    if hook == "tracepoint": syscall(BPF_LINK_CREATE, fd, "syscalls/sys_enter_openat")

// Inside the kernel, on every hit:
kernel_event_fires():
    if jit_compiled:
        run_native_machine_code(program)
    else:
        interpret_bytecode(program)
    // program may write to maps, send perf events, or modify the packet
A real eBPF program
A tiny bpftrace one-liner that counts every openat() syscall by process:
$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @[comm] = count(); }'
^C
@[bash]: 17
@[node]: 423
@[python3]: 1208
@[systemd]: 6
Behind that one-liner: bpftrace generated eBPF bytecode that increments a hash-map entry keyed by the calling process's name, attached it to the openat tracepoint, and on Ctrl-C dumped the map. Total verifier roundtrip: a few milliseconds. Per-event cost in the kernel: ~150 ns. You could leave it running on a busy server for hours.
For something more involved, a restricted-C XDP program that drops UDP packets to port 53 (DNS firewall):
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>            /* IPPROTO_UDP */
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int dns_drop(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Ethernet header: bounds-check before reading h_proto. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return XDP_PASS;
    if (eth->h_proto != __constant_htons(ETH_P_IP)) return XDP_PASS;

    /* Fixed 20-byte part of the IPv4 header. */
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end) return XDP_PASS;
    if (ip->protocol != IPPROTO_UDP) return XDP_PASS;

    /* UDP header follows the variable-length IP header (ihl counts 32-bit words). */
    struct udphdr *udp = (void *)ip + ip->ihl * 4;
    if ((void *)(udp + 1) > data_end) return XDP_PASS;
    if (udp->dest == __constant_htons(53)) return XDP_DROP;

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
Notice every memory dereference is preceded by a bounds check against data_end. Skip one and the verifier rejects the program. Once accepted, this code runs at line rate — tens of millions of packets per second on a single NIC queue.
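One way to exercise this kind of parsing logic without loading anything into a kernel is to mirror it in ordinary userspace C and feed it hand-built packets. The sketch below is an assumption-laden stand-in, not the XDP program itself: `classify`, `be16`, and the `PASS`/`DROP` enum are hypothetical names, header fields are read by byte offset instead of through the kernel structs, and it adds a check for a header length below 20 bytes that the XDP version above does not have.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for XDP_PASS / XDP_DROP. */
enum verdict { PASS = 0, DROP = 1 };

/* Read a big-endian 16-bit field without alignment assumptions. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Same decision as the XDP program: drop IPv4/UDP packets to port 53.
 * Every read is preceded by a length check, mirroring the data_end checks. */
static enum verdict classify(const uint8_t *pkt, size_t len)
{
    const size_t ip_off = 14;               /* Ethernet header length */
    size_t ihl, udp_off;

    if (len < ip_off) return PASS;
    if (be16(pkt + 12) != 0x0800) return PASS;       /* EtherType: IPv4 */

    if (len < ip_off + 20) return PASS;     /* fixed part of IPv4 header */
    if (pkt[ip_off + 9] != 17) return PASS;          /* protocol: UDP */

    ihl = (size_t)(pkt[ip_off] & 0x0f) * 4; /* header length in bytes */
    if (ihl < 20) return PASS;              /* malformed header */

    udp_off = ip_off + ihl;
    if (len < udp_off + 8) return PASS;     /* UDP header */
    if (be16(pkt + udp_off + 2) == 53) return DROP;  /* destination port */

    return PASS;
}
```

A 42-byte buffer with EtherType 0x0800, a version/IHL byte of 0x45, protocol 17, and destination port 53 yields `DROP`. Truncating the same buffer by one byte yields `PASS`, because the UDP header no longer fits.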
Performance and cost
Concrete numbers from production deployments:
- kprobe entry/exit overhead: ~100 ns per event after JIT. A naive program with no map writes is essentially free.
- Map lookup (BPF_MAP_TYPE_HASH): ~50 ns for a hit. Array maps are ~10 ns.
- XDP per-packet processing budget: at 14.88 Mpps (line rate for 64-byte packets on 10 GbE), you have ~67 ns per packet. Cloudflare reports their DDoS XDP fits comfortably.
- Verifier complexity ceiling: 1 million instructions explored since 5.2. Earlier kernels capped program size at 4096 instructions and had a much smaller exploration budget. Real-world programs with state, helpers, and unrolled loops run 10k–200k explored insns.
- Program size in memory: dozens of kilobytes typically; large ones (Cilium's datapath) reach hundreds of KB.
The killer feature is that you pay these costs only on events you've subscribed to. You don't pay anything for events you don't hook. Compare that with ptrace-based tracing such as strace, which can multiply syscall cost 100× even for paths you don't care about.
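The 14.88 Mpps and ~67 ns figures for XDP follow from Ethernet framing arithmetic. The sketch below works it out, assuming the standard 20 bytes of per-frame wire overhead (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap); the function names are made up for this example.

```c
/* Packets per second at line rate for a given frame size.
 * Each frame occupies frame_bytes + 20 bytes on the wire:
 * 7 preamble + 1 start-of-frame delimiter + 12 inter-frame gap. */
static double line_rate_pps(double link_bps, unsigned frame_bytes)
{
    return link_bps / ((frame_bytes + 20u) * 8.0);
}

/* Time available to process one packet on a single core, in nanoseconds. */
static double ns_per_packet(double pps)
{
    return 1e9 / pps;
}
```

For a 10 Gbps link and 64-byte frames, `line_rate_pps(10e9, 64)` is 10e9 / 672, about 14.88 million packets per second, and `ns_per_packet` of that is about 67.2 ns.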
Common pitfalls
- Fighting the verifier. A confusing error like "math between pkt_ptr and register with unbounded min value" usually means you read a value from the packet but didn't compare it against a constant before using it as an offset. The fix is almost always a more explicit bounds check.
- Unbounded loops. Pre-5.3 kernels reject any loop that the compiler has not fully unrolled. Use #pragma unroll with a known bound, or upgrade your kernel and use bounded loops with __attribute__((noinline)) on inner helpers.
- Stack overflow. The eBPF stack is exactly 512 bytes. Large structs need to go in a percpu map, not on the stack.
- Tracepoint vs kprobe drift. kprobes attach to function names, which change between kernel versions. Tracepoints are part of the stable ABI. Prefer tracepoints when one exists. CO-RE (Compile Once Run Everywhere) using BTF data, since 5.4, reduces this pain for kprobe-style programs too.
- Forgetting CAP_BPF on newer kernels. Pre-5.8 you needed CAP_SYS_ADMIN to load eBPF. Since 5.8 there's a dedicated CAP_BPF capability — but most container runtimes don't grant it by default. Container observability tools must explicitly add it to the security context.
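The loop advice above comes down to giving every loop a bound that is visible at compile time. The sketch below shows that shape in plain userspace C, using VLAN-tag skipping as the example. It is not a BPF program, and `MAX_VLAN_DEPTH` and `l3_offset` are hypothetical names chosen for illustration; whether a given kernel's verifier accepts the equivalent BPF loop still depends on the kernel version.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_VLAN_DEPTH 2   /* hypothetical compile-time bound on nested tags */

/* Returns the offset of the L3 header, or 0 if the frame is truncated
 * or carries more than MAX_VLAN_DEPTH VLAN tags. The loop runs at most
 * MAX_VLAN_DEPTH + 1 times, and every read is length-checked first. */
static size_t l3_offset(const uint8_t *pkt, size_t len)
{
    size_t off = 12;       /* EtherType position in an untagged frame */
    int i;

    for (i = 0; i <= MAX_VLAN_DEPTH; i++) {
        uint16_t proto;

        if (len < off + 2)
            return 0;
        proto = (uint16_t)((pkt[off] << 8) | pkt[off + 1]);
        if (proto != 0x8100 && proto != 0x88a8)
            return off + 2;   /* not a VLAN tag: L3 starts here */
        off += 4;             /* skip one 4-byte VLAN tag */
    }
    return 0;
}
```

An untagged IPv4 frame gives offset 14 and a single-tagged one gives 18. The fixed iteration count is the property that lets a loop like this be unrolled on old kernels or proven to terminate on newer ones.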
Frequently asked questions
What does eBPF actually stand for?
Extended Berkeley Packet Filter. The original BPF, from 1992, was a tiny in-kernel VM that ran tcpdump filters. In 2014, Alexei Starovoitov rewrote it with a richer 64-bit instruction set, eleven registers, maps, and hooks that reach far beyond packet filtering. The name stuck even though the use cases didn't.
Why is eBPF safe to run in the kernel?
The verifier. Before a program is loaded, the kernel statically proves it terminates, never reads uninitialized memory, never dereferences invalid pointers, and stays within a bounded instruction count (1 million since 5.2). Loops must be bounded. If the verifier can't prove safety, the load fails — the program never runs.
How fast is eBPF compared to userspace tracing?
Several orders of magnitude. A kprobe handler in eBPF costs roughly 100 nanoseconds per event; a comparable strace-style ptrace round-trip costs 50-100 microseconds because every event context-switches to userspace and back. For high-volume tracing — every syscall on a busy box — only eBPF is viable.
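To make the gap concrete, multiply the event rate by the per-event cost. The helper below is a back-of-the-envelope sketch; `cores_used` is a made-up name, and the one-million-events-per-second rate in the note after it is an assumed workload, not a measurement.

```c
/* CPU time spent on tracing per second of wall-clock time, expressed in
 * cores: events per second times cost per event (ns), divided by the
 * 1e9 ns that one core provides each second. */
static double cores_used(double events_per_sec, double ns_per_event)
{
    return events_per_sec * ns_per_event / 1e9;
}
```

At an assumed one million events per second, 100 ns per event costs about a tenth of one core, while 50 µs per event would need about fifty cores' worth of time, which is why ptrace-style tracing cannot keep up at that rate.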
What is XDP and how does it use eBPF?
eXpress Data Path. XDP runs an eBPF program on the network driver's receive path, before the packet enters the kernel's TCP/IP stack. You can drop, redirect, or modify packets in 50-100 ns per packet — fast enough to handle 10 Gbps DDoS scrubbing on a single core. Cloudflare and Facebook use it for DDoS mitigation; Cilium uses it for service-mesh load balancing.
Can eBPF programs share state with userspace?
Yes, through maps. A map is a kernel-resident key/value store (hash, array, LRU, ring buffer, and many more) that both the eBPF program and userspace tools can read and write via syscalls. Maps are how bpftrace builds histograms, how Cilium stores connection-tracking entries, and how Falco exports security events to userspace.
What kernel hook points can I attach eBPF to?
Many. kprobes hook arbitrary kernel functions. uprobes hook userspace functions. tracepoints attach to static probe points. Networking hooks include XDP, tc (traffic control), socket filters, and cgroup egress. Security hooks include LSM (Linux Security Module) attach points. Scheduling hooks expose sched_switch, sched_wakeup, and friends. The list keeps growing with each kernel release.
Do I have to write eBPF in assembly?
No — almost no one does. The typical workflow is to write a restricted-C program, compile it to eBPF bytecode with clang -target bpf, and load it via libbpf or BCC. Higher-level frontends exist too: bpftrace gives you an awk-like DSL, and Cilium ships pre-built programs you configure declaratively.