Context Switch

The work the kernel does when one thread leaves the CPU and another arrives.

A context switch is the operation by which a CPU stops executing one thread, saves its state into memory, loads another thread's saved state, and resumes. It is the mechanism behind multitasking — and one of the most underestimated hidden costs in modern software, ranging from a few hundred nanoseconds for direct register save/restore up to tens of microseconds when caches and TLBs are invalidated along the way.

  • Direct cost: ~1–3 µs
  • Realistic cost (cache cold): 5–10 µs
  • Process vs thread switch: a process switch is ~2× costlier
  • Goroutine / fiber: ~100 ns
  • Default time slice (CFS): ~1–10 ms

What gets saved, what gets restored

A "context" is everything the CPU needs to resume a thread mid-instruction. On x86-64 Linux, that's:

  • The 16 general-purpose registers (RAX–R15) and RIP, RFLAGS.
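  • The floating-point and SIMD state: 512 bytes for the legacy x87/SSE area, and roughly 2.5 KB once AVX-512 state is included. It is saved and restored with the XSAVE/XRSTOR family of instructions. Older kernels restored it lazily, on the new thread's first FPU use; current Linux kernels save it eagerly and restore it before returning to user space.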
  • The kernel stack pointer for that thread.
  • For a process switch: CR3 (the page-table root). The FS and GS base registers, which hold thread-local-storage pointers (Linux user space uses FS for TLS), are switched on every thread switch, because each thread has its own TLS block.

The direct register save/restore costs maybe 200–500 nanoseconds on modern hardware. What makes context switches actually expensive is the indirect cost: when the next thread starts running, the L1 and L2 caches are full of the old thread's data. Many of its first memory accesses miss, the CPU stalls, and the branch predictor has to relearn. Studies on Intel Skylake have reported a cache-warmup tax of 10–100 µs after a context switch, depending on the working set.
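
To get a feel for how the indirect cost scales with working set, here is a back-of-envelope estimate. The 64-byte line size matches x86-64, but the per-miss latency and the overlap factor are illustrative assumptions rather than measurements, and the helper name `refill_cost_us` is hypothetical.

```python
def refill_cost_us(working_set_bytes, line_bytes=64, miss_ns=10.0, overlap=4):
    """Rough time to re-fetch a working set after a switch, in microseconds.

    miss_ns : assumed latency of one miss served from the next cache level
    overlap : assumed number of misses the core keeps in flight at once
    """
    lines = -(-working_set_bytes // line_bytes)  # ceiling division
    return lines * miss_ns / overlap / 1000.0


if __name__ == "__main__":
    for kib in (32, 256, 1024):
        print(f"{kib:5d} KiB working set: ~{refill_cost_us(kib * 1024):.1f} µs")
```

With these assumed numbers a 32 KiB working set costs about 1.3 µs to refill and a 1 MiB one about 41 µs. Hardware prefetching and the real miss latency will move the result in either direction, so treat it as an order-of-magnitude guide only.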

Voluntary, involuntary, and the why-did-this-happen breakdown

| Trigger | Voluntary? | Typical cause | Common cost |
|---|---|---|---|
| Blocking syscall (read, futex, poll) | Voluntary | Thread waits on I/O or a lock | 1–5 µs |
| sched_yield / sleep(0) | Voluntary | Cooperative yield | 1–3 µs |
| Time-slice expired (CFS tick) | Involuntary | Scheduler preempts to share CPU | 2–10 µs (cache cold) |
| Higher-priority thread woken | Involuntary | Real-time wakeup, signal | 2–10 µs |
| Process switch (different mm) | Either | Different process scheduled | 5–20 µs (TLB/PCID effects) |
| Hardware interrupt path | Forced | NIC IRQ, timer, IPI | 0.3–2 µs (kernel-only) |
| User-space (goroutine, fiber) | Voluntary | chan send/recv, await | ~100 ns |
| Kernel-assisted fast path (Restartable Sequences) | n/a | Lock-free fast paths, signal-safe critical sections | 0 ns (avoids the switch) |

/proc/<pid>/status reports the running totals as voluntary_ctxt_switches and nonvoluntary_ctxt_switches. A high voluntary count usually means lock contention or chatty I/O; a high involuntary count means CPU oversubscription.
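
A minimal sketch of reading those two counters from Python, assuming a Linux system with /proc mounted. The function names `parse_ctxt_switches` and `read_ctxt_switches` are hypothetical helpers, not part of any library.

```python
import os


def parse_ctxt_switches(status_text):
    """Extract the two context-switch counters from /proc/<pid>/status text."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches"):
            counts[key] = int(value.strip())
    return counts


def read_ctxt_switches(pid="self"):
    """Read the counters for a PID (Linux only; defaults to the current process)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_ctxt_switches(f.read())


if __name__ == "__main__":
    if os.path.exists("/proc/self/status"):
        print(read_ctxt_switches())
```

Sampling the counters twice, a few seconds apart, and comparing the deltas is usually more informative than the lifetime totals.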

A minimal benchmark you can run

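The simplest measurement is a pipe ping-pong between two processes. Each round trip costs two context switches plus four small syscalls (a write and a read on each side), provided both processes share one CPU. Pin them to the same core, for example with taskset -c 0; otherwise the two processes run on separate cores and the figure mostly reflects cross-CPU wakeups.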

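// pipe_pong.c: build with gcc -O2 pipe_pong.c -o pipe_pong
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/resource.h>
#include <time.h>

#define ROUNDS 100000

int main(void) {
  int p2c[2], c2p[2];  // parent-to-child and child-to-parent pipes
  if (pipe(p2c) != 0 || pipe(c2p) != 0) { perror("pipe"); return 1; }
  pid_t pid = fork();
  if (pid < 0) { perror("fork"); return 1; }
  char b = 0;
  if (pid == 0) {
    // Child: echo each byte back. Every read blocks until the parent writes.
    for (int i = 0; i < ROUNDS; i++) {
      if (read(p2c[0], &b, 1) != 1) _exit(1);
      if (write(c2p[1], &b, 1) != 1) _exit(1);
    }
    _exit(0);
  }
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < ROUNDS; i++) {
    // Parent: send one byte, then block until the child echoes it.
    if (write(p2c[1], &b, 1) != 1 || read(c2p[0], &b, 1) != 1) {
      perror("pipe io");
      return 1;
    }
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);
  long long ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
  double rt_us = ns / (double)ROUNDS / 1000.0;
  // Halving the round trip attributes the syscall overhead to the switches too.
  printf("Round-trip: %.2f µs (~%.2f µs per ctx switch)\n", rt_us, rt_us / 2.0);
  struct rusage ru;
  getrusage(RUSAGE_SELF, &ru);
  printf("vol=%ld invol=%ld\n", ru.ru_nvcsw, ru.ru_nivcsw);
  wait(NULL);  // reap the child
  return 0;
}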

Typical output on a modern Linux box: ~3–6 µs per round trip, which is ~1.5–3 µs per switch. Compare with perf bench sched pipe for a curated equivalent.

The same idea in Python and Node:

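# pipe_pong.py: run with python3 pipe_pong.py (Unix only, uses os.fork)
import os, time, resource

N = 100_000
r1, w1 = os.pipe()  # parent -> child
r2, w2 = os.pipe()  # child -> parent
pid = os.fork()
if pid == 0:
    # Child: echo N bytes, then exit so it does not outlive the benchmark.
    for _ in range(N):
        os.read(r1, 1); os.write(w2, b'x')
    os._exit(0)
t0 = time.monotonic_ns()
for _ in range(N):
    os.write(w1, b'x'); os.read(r2, 1)
ns = time.monotonic_ns() - t0
os.waitpid(pid, 0)  # reap the child
# The figure includes interpreter overhead on both sides.
print(f"per round-trip: {ns/N/1000:.2f} µs")
ru = resource.getrusage(resource.RUSAGE_SELF)
print(f"voluntary={ru.ru_nvcsw} involuntary={ru.ru_nivcsw}")

// pipe-pong.js: run with node pipe-pong.js
// This measures Node's IPC channel (serialized messages through the event loop),
// so expect a much higher figure than the raw pipe benchmark in C.
const { fork } = require('child_process');
const N = 100000;
if (process.argv[2] === 'child') {
  // Child: echo every message straight back to the parent.
  process.on('message', m => process.send(m));
} else {
  const c = fork(__filename, ['child']);
  let n = N;
  const t0 = process.hrtime.bigint();
  c.on('message', () => {
    if (--n > 0) c.send(0);  // start the next round trip
    else {
      const us = Number(process.hrtime.bigint() - t0) / N / 1000;
      console.log(`per round-trip: ${us.toFixed(2)} µs`);
      c.kill();
    }
  });
  c.send(0);
}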

Variants of "switching"

Cooperative vs preemptive

  • Cooperative — threads only yield voluntarily (early Mac OS, Windows 3.x, single-threaded Node). Cheap switches, but one bad thread hangs the system.
  • Preemptive — kernel timer interrupts force switches at slice boundaries. Required for fairness in any modern OS.
  • Hybrid — Go's runtime is cooperative-with-preemption-points; since Go 1.14 it inserts asynchronous preemption via signals so a tight CPU loop can no longer monopolize a P.

Hardware support

  • Intel CET (Control-flow Enforcement Technology): adds a shadow stack to detect ROP attacks; on a context switch, the shadow stack pointer is saved/restored alongside RSP.
  • Restartable Sequences (rseq): a Linux kernel feature rather than a hardware one. The kernel notices when a thread is preempted, migrated, or interrupted by a signal inside a registered critical section, and restarts it at an abort handler. This lets userspace implement per-CPU data structures (memory allocators, per-CPU counters) without locks.
  • PCID / ASID: process-context identifiers tag TLB entries with an address-space ID so they survive a CR3 swap. Without PCID, every process switch flushes all non-global TLB entries.

Process vs thread vs userland

  • Process switch: change of mm_struct → CR3 reload → TLB invalidation (mitigated by PCID).
  • Thread switch (same process): same mm_struct, same CR3, TLB intact. Cheaper.
  • Userland switch (goroutine, fiber, async/await): no kernel involvement; just a handful of register saves and loads. ~100 ns and no TLB flush, although the incoming task's data may still be cold in cache.
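
The userland case can be sketched with Python generators, where each `yield` hands control back to a scheduler loop without entering the kernel. This is an illustration of the idea, not how Go or a fiber library implements it; `worker` and `run_round_robin` are hypothetical names, and the timing includes interpreter overhead, so it will not match the ~100 ns figure.

```python
import time


def worker(steps):
    """A cooperative task: each `yield` hands control back to the scheduler."""
    for _ in range(steps):
        yield


def run_round_robin(tasks):
    """Resume each task in turn until all finish; return the number of switches."""
    switches = 0
    ready = list(tasks)
    while ready:
        still_running = []
        for task in ready:
            try:
                next(task)  # switch into the task until its next yield
                switches += 1
                still_running.append(task)
            except StopIteration:
                pass  # task finished; drop it
        ready = still_running
    return switches


if __name__ == "__main__":
    steps, n_tasks = 100_000, 4
    t0 = time.perf_counter_ns()
    switches = run_round_robin(worker(steps) for _ in range(n_tasks))
    elapsed = time.perf_counter_ns() - t0
    print(f"{switches} switches, ~{elapsed / switches:.0f} ns each")
```

Because the scheduler is an ordinary loop, a task that never yields would starve the others, which is the cooperative trade-off described earlier.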

Common pitfalls

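  • TLB shootdowns from unmaps. Calling munmap or mprotect in one thread sends an IPI to every CPU running a thread of the same process. A server doing heavy short-lived mmaps can lose 5–15% of throughput to shootdowns. Reuse mappings or use pooled allocators instead of mapping and unmapping per request; note that madvise(MADV_DONTNEED) also invalidates TLB entries, so it does not avoid the problem.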
  • Thundering herd. N threads waiting on the same event, all woken on every signal, switch in, find no work, switch out. Use EPOLLEXCLUSIVE / SO_REUSEPORT.
  • Oversubscription. Running several times more threads than cores on a CPU-bound workload multiplies involuntary context switches and can cut throughput noticeably, because each thread keeps evicting the others' cache contents. Match thread count to core count for compute, or use async I/O for I/O-bound work.
  • False idleness. A thread that "sleeps" via nanosleep(0) still incurs two switches. Tight polling loops should yield rarely or use lock-free queues.
  • NUMA migration. The scheduler may move a thread between sockets; suddenly its working set sits in the other node's memory and caches. Pin with taskset / sched_setaffinity for hot paths.
  • Signal-driven switches. Sending SIGUSR1 to wake a thread is one switch in, one out — and it interrupts whatever signal-unsafe code was running. Prefer eventfd or condition variables.
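
For the pinning advice in the NUMA bullet, a minimal Python sketch follows. It assumes Linux, where `os.sched_getaffinity` and `os.sched_setaffinity` are available; `pin_to_cpu` is a hypothetical helper name, and choosing the lowest allowed CPU is an arbitrary default.

```python
import os


def pin_to_cpu(cpu=None):
    """Pin the caller to one CPU and return the CPU chosen (Linux only)."""
    allowed = os.sched_getaffinity(0)  # CPUs the caller may currently use
    if cpu is None:
        cpu = min(allowed)  # arbitrary but deterministic choice
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})  # pid 0 means "the caller"
    return cpu


if __name__ == "__main__":
    if hasattr(os, "sched_setaffinity"):
        print("pinned to CPU", pin_to_cpu())
```

Pinning removes migrations but also removes the scheduler's ability to balance load, so it tends to pay off only for a small number of latency-critical threads.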

Designing for fewer switches

The cheapest switch is the one you don't do. Batch I/O so a single syscall handles many events. Use async I/O interfaces and runtimes (epoll, io_uring, Tokio, libuv) to keep N connections on M ≪ N kernel threads. Pin worker threads to CPUs with sched_setaffinity. Avoid pipelines that hand a value across threads on every record; lock-free SPSC queues with batching can beat per-record hand-offs by an order of magnitude, because the consumer rarely has to block and be woken.
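
The effect of batching on hand-offs can be sketched as follows. This uses Python's lock-based `queue.Queue` as a stand-in, not a lock-free SPSC queue, and the names `producer`, `consumer`, and `run` are hypothetical; the point is only that the number of cross-thread hand-offs drops by the batch size.

```python
import queue
import threading


def producer(records, q, batch_size):
    """Send records in batches; return how many queue hand-offs were needed."""
    handoffs = 0
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            q.put(batch)
            handoffs += 1
            batch = []
    if batch:  # flush the final partial batch
        q.put(batch)
        handoffs += 1
    q.put(None)  # sentinel: no more data
    return handoffs


def consumer(q, out):
    """Drain batches until the sentinel arrives."""
    while (batch := q.get()) is not None:
        out.extend(batch)


def run(n_records, batch_size):
    """Return (hand-offs, records received) for one producer/consumer pair."""
    q, out = queue.Queue(), []
    t = threading.Thread(target=consumer, args=(q, out))
    t.start()
    handoffs = producer(range(n_records), q, batch_size)
    t.join()
    return handoffs, len(out)


if __name__ == "__main__":
    for size in (1, 100):
        print(f"batch_size={size}: {run(1000, size)[0]} hand-offs")
```

Each hand-off is a point where the consumer may have to be woken, so fewer hand-offs means fewer opportunities for a context switch. The cost is added latency for records that wait in a partially filled batch.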

Frequently asked questions

How long does a context switch take?

On modern Linux x86-64, the direct cost of saving and restoring registers is a few hundred nanoseconds, but the realistic cost, including cache and TLB pollution, is 1 to 10 microseconds. Cross-process switches that flush the TLB (for example, on hardware without PCID) can be substantially more expensive than switches between threads of one process.

What's the difference between a voluntary and involuntary context switch?

Voluntary means the thread blocked or yielded — it called sleep, read on an empty pipe, or waited on a futex. Involuntary means the scheduler preempted it because its time slice ran out or a higher-priority thread became runnable. /proc/<pid>/status reports both as voluntary_ctxt_switches and nonvoluntary_ctxt_switches.

Why is a process context switch more expensive than a thread switch?

Threads in the same process share an address space, so the page tables stay valid and the TLB does not need flushing. A process switch loads CR3 with a new page-table root, invalidating the non-global TLB entries (PCID tagging mitigates but doesn't eliminate the cost). The incoming process also touches different memory, so it evicts the previous process's cache lines and starts with a cold cache.

What is a TLB shootdown?

When one CPU modifies a page-table entry, every other CPU that may have cached that translation in its TLB must invalidate it. The kernel sends an inter-processor interrupt (IPI) to all relevant CPUs; each one interrupts whatever it was running, invalidates the entry, and acknowledges. Heavy mmap/munmap traffic in a multithreaded server can produce hundreds of these per second.

How can I measure context-switch cost?

Use perf bench sched pipe: it ping-pongs a small message through a pair of pipes between two tasks and reports the round-trip latency, which is roughly two context switches plus the kernel work. perf stat -e context-switches and getrusage(RUSAGE_SELF) report counts that you can correlate with wall-clock time.
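
One way to correlate counts with wall-clock time from inside a program is to take a getrusage snapshot before and after the code of interest. The sketch below assumes a Unix system; `count_switches` and the `sleepy` workload are hypothetical names used for illustration.

```python
import resource
import time


def count_switches(fn, *args):
    """Run fn and return (result, voluntary, involuntary, seconds).

    The counters cover the whole process, so other threads contribute too.
    """
    before = resource.getrusage(resource.RUSAGE_SELF)
    t0 = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (result,
            after.ru_nvcsw - before.ru_nvcsw,
            after.ru_nivcsw - before.ru_nivcsw,
            elapsed)


def sleepy(n):
    """Hypothetical workload: each short sleep blocks the thread."""
    for _ in range(n):
        time.sleep(0.001)
    return n


if __name__ == "__main__":
    _, vol, invol, secs = count_switches(sleepy, 100)
    print(f"voluntary={vol} involuntary={invol} in {secs:.3f} s")
```

A workload that blocks often, like this one, should show the voluntary counter rising roughly in step with the number of blocking calls.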

Do user-space threads (goroutines, fibers) avoid context-switch cost?

Mostly. Switching between goroutines, Erlang processes, or coroutines costs ~100 nanoseconds because no kernel transition or TLB invalidation occurs; the runtime only swaps registers and the stack pointer in user space. The catch is that a blocking syscall made by any user-space thread blocks its underlying kernel thread, which is why Go and Erlang both run extra OS threads to absorb such calls.
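
The blocking-call catch can be demonstrated with Python's asyncio as an analogy; Go and Erlang use their own mechanisms, which this does not reproduce. The sketch assumes Python 3.9 or later for `asyncio.to_thread`; `blocking_call`, `ticker`, and `run` are hypothetical names, and `time.sleep` stands in for a blocking syscall.

```python
import asyncio
import time


def blocking_call(seconds):
    """Stand-in for a blocking syscall such as a synchronous read."""
    time.sleep(seconds)
    return "done"


async def ticker(stop, period=0.01):
    """Count how often the event loop gets to run while other work is pending."""
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(period)
        ticks += 1
    return ticks


async def run(offload, seconds=0.2):
    """Return the ticker's count while one blocking call is in progress."""
    stop = asyncio.Event()
    tick_task = asyncio.create_task(ticker(stop))
    await asyncio.sleep(0)  # let the ticker start
    if offload:
        await asyncio.to_thread(blocking_call, seconds)  # runs on a worker thread
    else:
        blocking_call(seconds)  # blocks the loop's only thread
    stop.set()
    return await tick_task


if __name__ == "__main__":
    print("inline:   ", asyncio.run(run(offload=False)), "ticks")
    print("offloaded:", asyncio.run(run(offload=True)), "ticks")
```

Called inline, the blocking call freezes every other task on that thread, so the ticker barely advances. Offloaded to a worker thread, the loop keeps running, at the price of an extra kernel thread and the context switches that come with it.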