Context Switch
The work the kernel does when one thread leaves the CPU and another arrives.
A context switch is the operation by which a CPU stops executing one thread, saves its state into memory, loads another thread's saved state, and resumes. It is the mechanism behind multitasking — and one of the most underestimated hidden costs in modern software, ranging from a few hundred nanoseconds for direct register save/restore up to tens of microseconds when caches and TLBs are invalidated along the way.
- Direct cost: ~1–3 µs
- Realistic cost (cache cold): 5–10 µs
- Process vs thread switch: process is ~2× costlier
- Goroutine / fiber: ~100 ns
- Default time slice (CFS): ~1–10 ms
What gets saved, what gets restored
A "context" is everything the CPU needs to resume a thread exactly where it left off. On x86-64 Linux, that's:
- The 16 general-purpose registers (RAX–R15) and RIP, RFLAGS.
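- The floating-point and SIMD state (512 bytes for the legacy x87/SSE area, under 1 KB with AVX, roughly 2.5 KB with AVX-512), saved and restored with XSAVE/XRSTOR. Older kernels restored it lazily when the new thread first used it; current Linux kernels save it at the switch and restore it before returning to user space.
- The kernel stack pointer for that thread.
- The FS/GS base registers, which hold thread-local-storage pointers (Linux uses FS for TLS) and therefore change on every thread switch.
- For a process switch: CR3 (page-table root).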
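The list above can be modeled as a small record that is copied out of the CPU and back in. The sketch below is a toy, not kernel code: the `Context` fields, the `switch` helper, and the thread names `A` and `B` are all illustrative assumptions, and real kernels do this step in assembly.

```python
import copy
from dataclasses import dataclass, field

# Toy model: the 16 general-purpose registers named in the list above.
GPRS = ("rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp",
        "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15")

@dataclass
class Context:
    regs: dict = field(default_factory=lambda: dict.fromkeys(GPRS, 0))
    rip: int = 0
    rflags: int = 0
    fs_base: int = 0   # TLS pointer, differs per thread
    cr3: int = 0       # page-table root, differs per process

def switch(cpu, saved, prev_id, next_id):
    """Save the live state under prev_id, load next_id's saved state.

    Returns True when the page-table root changed (a process switch).
    """
    saved[prev_id] = copy.deepcopy(cpu["live"])
    cr3_changed = saved[next_id].cr3 != cpu["live"].cr3
    cpu["live"] = copy.deepcopy(saved[next_id])
    return cr3_changed

# Hypothetical threads A and B living in different address spaces.
saved = {"A": Context(cr3=0x1000), "B": Context(rip=0x400000, cr3=0x2000)}
cpu = {"live": copy.deepcopy(saved["A"])}
cpu["live"].regs["rax"] = 42       # A makes some progress
cpu["live"].rip = 0x401000

reloaded = switch(cpu, saved, "A", "B")   # A leaves, B arrives
switch(cpu, saved, "B", "A")              # later, A comes back
print(reloaded, hex(cpu["live"].rip), cpu["live"].regs["rax"])
# prints: True 0x401000 42
```

Thread A resumes with the same register values and instruction pointer it had when it left, which is the whole contract of a context switch.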
The direct register save/restore costs roughly 200–500 nanoseconds on modern hardware. What makes context switches expensive in practice is the indirect cost: when the next thread starts running, the L1 and L2 caches still hold the old thread's data, so many of its first memory accesses miss, the CPU stalls, and the branch predictor has to relearn. Studies on Intel Skylake have measured the cache-warmup tax at 10–100 µs after a context switch, depending on the working set.
Voluntary, involuntary, and the why-did-this-happen breakdown
| Trigger | Voluntary? | Typical cause | Common cost |
|---|---|---|---|
| Blocking syscall (read, futex, poll) | Voluntary | Thread waits on I/O or a lock | 1–5 µs |
| sched_yield / sleep(0) | Voluntary | Cooperative yield | 1–3 µs |
| Time-slice expired (CFS tick) | Involuntary | Scheduler preempts to share CPU | 2–10 µs (cache cold) |
| Higher-priority thread woken | Involuntary | Real-time wakeup, signal | 2–10 µs |
| Process switch (different mm) | Either | Different process scheduled | 5–20 µs (TLB/PCID effects) |
| Hardware interrupt path | Forced | NIC IRQ, timer, IPI | 0.3–2 µs (kernel-only) |
| User-space (goroutine, fiber) | Voluntary | chan send/recv, await | ~100 ns |
| Restartable sequences (rseq) | — | Per-CPU fast paths without locks | No switch of its own; a preempted critical section is restarted |
/proc/<pid>/status reports the running totals as voluntary_ctxt_switches and nonvoluntary_ctxt_switches. A high voluntary count usually means lock contention or chatty I/O; a high involuntary count means CPU oversubscription.
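As a minimal sketch of reading those two counters, the snippet below parses the status text and returns both values. The helper names `parse_ctxt_switches` and `read_ctxt_switches` are made up for this example; only the `/proc/<pid>/status` field names come from the text above.

```python
from pathlib import Path

FIELDS = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")

def parse_ctxt_switches(status_text):
    """Extract the two context-switch counters from /proc/<pid>/status text."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in FIELDS:
            counts[key] = int(value)   # int() ignores the surrounding tabs
    return counts

def read_ctxt_switches(pid="self"):
    """Read the counters for a pid (default: the calling process). Linux only."""
    return parse_ctxt_switches(Path(f"/proc/{pid}/status").read_text())

if Path("/proc/self/status").exists():
    print(read_ctxt_switches())
```

Sampling this twice a few seconds apart and subtracting gives a switch rate, which is usually more informative than the running totals.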
A minimal benchmark you can run
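The simplest measurement is a pipe ping-pong between two processes. Each round trip costs two context switches plus four small syscalls (a write and a read on each side). Pin both processes to one CPU, for example with taskset -c 0, if you want them to switch directly with each other rather than wake each other across cores.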
// pipe_pong.c: build with gcc -O2 pipe_pong.c -o pipe_pong
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/resource.h>
#include <time.h>

#define ROUNDS 100000

int main(void) {
    int p2c[2], c2p[2];   // parent->child and child->parent pipes
    if (pipe(p2c) < 0 || pipe(c2p) < 0) { perror("pipe"); return 1; }
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }
    char b = 0;
    if (pid == 0) {
        // Child: echo every byte straight back to the parent.
        for (int i = 0; i < ROUNDS; i++) {
            if (read(p2c[0], &b, 1) != 1) return 1;
            if (write(c2p[1], &b, 1) != 1) return 1;
        }
        return 0;
    }
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        // The blocking read parks the parent until the child replies.
        if (write(p2c[1], &b, 1) != 1) return 1;
        if (read(c2p[0], &b, 1) != 1) return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    long long ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
    printf("Round-trip: %.2f µs (~%.2f µs per ctx switch)\n",
           ns / (double)ROUNDS / 1000.0, ns / (double)ROUNDS / 2000.0);
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("vol=%ld invol=%ld\n", ru.ru_nvcsw, ru.ru_nivcsw);
    wait(NULL);   // reap the child
    return 0;
}
Typical output on a modern Linux box: ~3–6 µs per round trip, which is ~1.5–3 µs per switch. Compare with perf bench sched pipe for a curated equivalent.
The same idea in Python and Node:
import os, time, resource

N = 100_000
r1, w1 = os.pipe()   # parent -> child
r2, w2 = os.pipe()   # child -> parent
pid = os.fork()
if pid == 0:
    # Child: echo N bytes, then exit so it does not outlive the parent.
    for _ in range(N):
        os.read(r1, 1); os.write(w2, b'x')
    os._exit(0)
t0 = time.monotonic_ns()
for _ in range(N):
    os.write(w1, b'x'); os.read(r2, 1)
ns = time.monotonic_ns() - t0
os.waitpid(pid, 0)   # reap the child
print(f"per round-trip: {ns/N/1000:.2f} µs")
ru = resource.getrusage(resource.RUSAGE_SELF)
print(f"voluntary={ru.ru_nvcsw} involuntary={ru.ru_nivcsw}")
// pipe-pong.js: run with node pipe-pong.js
// Note: this goes through Node's IPC channel (serialized messages),
// so it measures more than the bare context switch.
const { fork } = require('child_process');

if (process.argv[2] === 'child') {
  // Child: echo every message back to the parent.
  process.on('message', m => process.send(m));
} else {
  const c = fork(__filename, ['child']);
  let n = 100000, t0 = process.hrtime.bigint();
  c.on('message', () => {
    if (--n > 0) c.send(0);
    else {
      const us = Number(process.hrtime.bigint() - t0) / 100000 / 1000;
      console.log(`per round-trip: ${us.toFixed(2)} µs`);
      c.kill();
    }
  });
  c.send(0);
}
Variants of "switching"
Cooperative vs preemptive
- Cooperative — threads only yield voluntarily (early Mac OS, Windows 3.x, single-threaded Node). Cheap switches, but one bad thread hangs the system.
- Preemptive — kernel timer interrupts force switches at slice boundaries. Required for fairness in any modern OS.
- Hybrid — Go's runtime is cooperative-with-preemption-points; since Go 1.14 it inserts asynchronous preemption via signals so a tight CPU loop can no longer monopolize a P.
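To make the cooperative case concrete, here is a small round-robin scheduler built on Python generators, where each `yield` plays the role of a voluntary switch point. It is only an illustration: `worker` and `run_round_robin` are names invented for this sketch, and no real runtime is this simple.

```python
from collections import deque

def worker(name, steps):
    """A task that does `steps` units of work, yielding after each one."""
    for i in range(steps):
        yield f"{name}{i}"   # the yield is the voluntary switch point

def run_round_robin(tasks):
    """Resume each task in turn until all of them have finished."""
    ready = deque(tasks)
    trace = []
    while ready:
        task = ready.popleft()
        try:
            trace.append(next(task))   # run until the task yields
            ready.append(task)         # still runnable: back of the queue
        except StopIteration:
            pass                       # task finished, drop it
    return trace

trace = run_round_robin([worker("A", 3), worker("B", 2)])
print(trace)   # ['A0', 'B0', 'A1', 'B1', 'A2']
```

A worker that loops without ever reaching a `yield` would stall this scheduler forever, which is the failure mode preemption exists to prevent.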
Hardware support
- Intel CET (Control-flow Enforcement) — adds a shadow stack to detect ROP attacks; on a context switch, the shadow stack pointer is saved/restored alongside RSP.
- Restartable Sequences (rseq) — Linux feature where the kernel notices if a thread is in a critical section at preemption time and restarts it. Lets userspace implement per-CPU data structures (memory allocators, percpu counters) without locks.
- PCID / ASID — process context identifiers tag TLB entries with a process ID so they survive a CR3 swap. Without PCID, every process switch flushes the entire TLB.
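The effect of tagging can be seen in a toy TLB model. Everything here is a simplifying assumption made for illustration: the `ToyTlb` class, its unlimited capacity, and the four-page working set do not describe real hardware, which has a limited number of entries and tags.

```python
class ToyTlb:
    """Toy TLB keyed by (address-space id, virtual page number)."""

    def __init__(self, tagged):
        self.tagged = tagged
        self.entries = set()
        self.asid = 0
        self.hits = 0
        self.misses = 0

    def switch_address_space(self, asid):
        if not self.tagged:
            self.entries.clear()   # untagged: a CR3 reload flushes everything
        self.asid = asid

    def translate(self, vpn):
        key = (self.asid, vpn)
        if key in self.entries:
            self.hits += 1
        else:
            self.misses += 1       # stands in for a page-table walk
            self.entries.add(key)

def replay(tagged):
    """Alternate between two address spaces, touching 4 pages each time."""
    tlb = ToyTlb(tagged)
    for asid in (1, 2, 1, 2):
        tlb.switch_address_space(asid)
        for vpn in range(4):
            tlb.translate(vpn)
    return tlb.hits, tlb.misses

print("untagged:", replay(False))   # untagged: (0, 16)
print("tagged:  ", replay(True))    # tagged:   (8, 8)
```

With tags, the second visit to each address space finds its translations still cached; without them, every switch starts from an empty TLB.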
Process vs thread vs userland
- Process switch: change of mm_struct → CR3 reload → TLB invalidation (mitigated by PCID).
- Thread switch (same process): same mm_struct, same CR3, TLB intact. Cheaper.
- Userland switch (goroutine, fiber, async/await): no kernel involvement; just a few register stores. ~100 ns, no TLB flush, and little cache disruption beyond the new task's own working set.
Common pitfalls
- TLB shootdowns from unmaps. Calling munmap or mprotect in one thread sends an IPI to every CPU running a thread of the same process. A server doing heavy short-lived mmaps can lose 5–15% of throughput to shootdowns; prefer pooled allocators that reuse mappings, and keep in mind that MADV_DONTNEED keeps the mapping but still has to invalidate stale translations.
- Thundering herd. N threads waiting on the same event are all woken on every signal; each switches in, finds no work, and switches out. Use EPOLLEXCLUSIVE / SO_REUSEPORT.
- Oversubscription. Running 4× more threads than cores on a CPU-bound workload can double context-switch counts and halve throughput. Match thread count to core count for compute, or use async I/O for I/O-bound work.
- False idleness. A thread that "sleeps" via nanosleep(0) still incurs two switches. Tight polling loops should yield rarely or use lock-free queues.
- NUMA migration. The scheduler may move a thread between sockets; suddenly its working set sits in the other node's caches and memory. Pin with taskset / sched_setaffinity for hot paths.
- Signal-driven switches. Sending SIGUSR1 to wake a thread is one switch in, one out, and it interrupts whatever signal-unsafe code was running. Prefer eventfd or condition variables.
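The pinning advice above can be tried from Python, which exposes the same Linux call as `os.sched_setaffinity`. This is a minimal sketch: the helper `pin_to_one_cpu` is a made-up name, and choosing the lowest allowed CPU is an arbitrary choice for the example, not a recommendation.

```python
import os

def pin_to_one_cpu(pid=0):
    """Pin pid (0 = the caller) to the lowest CPU it may already use.

    Returns the previous affinity set so the caller can restore it.
    """
    previous = os.sched_getaffinity(pid)
    os.sched_setaffinity(pid, {min(previous)})
    return previous

if hasattr(os, "sched_setaffinity"):      # these calls are Linux-only
    before = pin_to_one_cpu()
    print("pinned to", os.sched_getaffinity(0))
    os.sched_setaffinity(0, before)       # undo the pin
```

In a real service the CPU would normally be chosen per worker, so that each hot thread keeps its caches to itself.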
Designing for fewer switches
The cheapest switch is the one you don't do. Batch I/O so a single syscall handles many events. Use event-driven I/O (epoll, io_uring) and the async runtimes built on it (Tokio, libuv) to keep N connections on M ≪ N kernel threads. Pin worker threads to CPUs with sched_setaffinity. Avoid pipelines that hand a value across threads on every record; lock-free SPSC queues with batching can beat per-record hand-offs by roughly 10× because they avoid the switch entirely.
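As a small illustration of the batching idea, the sketch below waits once for readiness and then services every socket that became ready. The eight socket pairs and the `drain_ready` helper are assumptions made for the demo; a real server would register client connections instead.

```python
import selectors
import socket

def drain_ready(sel):
    """Block once for readiness, then read from every ready socket."""
    events = sel.select(timeout=1.0)   # one wait covers all registered sockets
    return [key.fileobj.recv(64) for key, _mask in events]

sel = selectors.DefaultSelector()      # epoll on Linux
pairs = [socket.socketpair() for _ in range(8)]
for i, (rx, tx) in enumerate(pairs):
    rx.setblocking(False)
    sel.register(rx, selectors.EVENT_READ)
    tx.send(bytes([i]))                # make every receiver readable

got = drain_ready(sel)
print(f"{len(got)} messages handled after a single wait")

sel.close()
for rx, tx in pairs:
    rx.close()
    tx.close()
```

The thread sleeps and wakes once for eight messages, whereas eight threads each blocked on their own socket would need eight wakeups.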
Frequently asked questions
How long does a context switch take?
On modern Linux x86-64, the direct cost of saving and restoring registers is a few hundred nanoseconds, but the realistic cost, including cache and TLB pollution, is 1 to 10 microseconds. Cross-process switches that flush the TLB (for example on hardware without PCID) can be substantially more expensive than thread switches within one process.
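Those per-switch figures turn into a CPU budget once multiplied by a switch rate. The back-of-the-envelope sketch below uses the 1 to 10 µs range from the answer above; the rate of 50,000 switches per second is an assumed, illustrative value, and `switch_overhead_fraction` is a name invented here.

```python
def switch_overhead_fraction(switches_per_sec, cost_us):
    """Fraction of one core's time spent on context switches."""
    return switches_per_sec * cost_us * 1e-6

RATE = 50_000   # assumed switches per second on one core, for illustration
for cost_us in (1, 5, 10):
    frac = switch_overhead_fraction(RATE, cost_us)
    print(f"{cost_us:>2} µs per switch -> {frac:.0%} of a core")
```

At the assumed rate, the difference between a warm 1 µs switch and a cold 10 µs switch is the difference between 5% and 50% of a core.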
What's the difference between a voluntary and involuntary context switch?
Voluntary means the thread blocked or yielded — it called sleep, read on an empty pipe, or waited on a futex. Involuntary means the scheduler preempted it because its time slice ran out or a higher-priority thread became runnable. /proc/<pid>/status reports both as voluntary_ctxt_switches and nonvoluntary_ctxt_switches.
Why is a process context switch more expensive than a thread switch?
Threads in the same process share an address space, so the page tables stay valid and the TLB does not need flushing. A process switch loads CR3 with a new page-table root, invalidating most TLB entries (PCID tagging mitigates but doesn't eliminate the cost). The new process also touches different memory, so it evicts the old process's cache lines and starts with cold caches of its own.
What is a TLB shootdown?
When one CPU modifies a page-table entry, every other CPU that may have cached that translation in its TLB must invalidate it. The kernel sends an inter-processor interrupt (IPI) to all relevant CPUs; each one interrupts whatever it was running, invalidates the entry, and acknowledges. Heavy mmap/munmap traffic in a multithreaded server can produce hundreds of these per second.
How can I measure context-switch cost?
Use perf bench sched pipe — it ping-pongs a single byte through a pipe between two processes and reports the round-trip latency, which is roughly 2 context switches plus the kernel work. perf stat -e context-switches and getrusage(RUSAGE_SELF) report counts that you can correlate with wall-clock time.
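One way to correlate counts with wall-clock time from inside a program is to snapshot `getrusage` around the code of interest. The context manager below is a sketch under that assumption; `switch_counter` is a hypothetical helper, and the sleep loop merely stands in for real blocking work.

```python
import resource
import time
from contextlib import contextmanager

@contextmanager
def switch_counter():
    """Record elapsed time and context-switch deltas for the enclosed block."""
    result = {}
    before = resource.getrusage(resource.RUSAGE_SELF)
    t0 = time.perf_counter()
    try:
        yield result
    finally:
        after = resource.getrusage(resource.RUSAGE_SELF)
        result["seconds"] = time.perf_counter() - t0
        result["voluntary"] = after.ru_nvcsw - before.ru_nvcsw
        result["involuntary"] = after.ru_nivcsw - before.ru_nivcsw

with switch_counter() as stats:
    for _ in range(50):
        time.sleep(0.001)   # each sleep blocks, so it should yield the CPU

print(stats)
```

On Linux the voluntary delta is expected to track the number of sleeps, while the involuntary delta depends on what else is competing for the CPU.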
Do user-space threads (goroutines, fibers) avoid context-switch cost?
Mostly. Switching between goroutines, Erlang processes, or coroutines costs ~100 nanoseconds because no kernel transition or TLB invalidation occurs; the runtime only saves and restores a handful of registers in user space. The catch is that a blocking syscall made by a user-space thread blocks the kernel thread underneath it, which is why Go and Erlang both run extra OS threads to absorb that.