Systems
Process Scheduling
Hundreds of threads, eight cores, microsecond decisions
Process scheduling decides which runnable thread gets the CPU next. Modern kernels juggle hundreds of threads on a handful of cores using fairness algorithms like CFS, priority queues for real-time tasks, and load-balancing across NUMA nodes — all in microseconds per decision.
- Direct context switch~1–3 µs
- Indirect (cache refill)5–50 µs
- CFS pick (red-black tree)O(log n)
- Default timer tick (HZ=1000)1 ms
- SCHED_FIFO preemptionImmediate, by priority
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How a scheduler picks the next thread
A modern OS has way more runnable work than CPUs. The scheduler's job is to decide, every few milliseconds, who gets to keep running, who gets preempted, and which idle thread gets woken next. It runs in the timer interrupt, on every system call return, and on every event that changes a task's state — wake, sleep, exit.
The Linux Completely Fair Scheduler (CFS) is the production case study. Every SCHED_OTHER task carries a vruntime — virtual runtime — that increases as it consumes CPU. CFS picks the task with the smallest vruntime. Highly weighted (low-nice) tasks accumulate vruntime slower, so they run more often. The ready queue is a red-black tree keyed on vruntime; pick is O(log n) and update is O(log n).
The mental model: every runnable task is a runner on a treadmill. The treadmill speeds vary by priority. CFS hands the CPU to whoever's behind, briefly, then puts them back on. There are no ad-hoc heuristics for "interactive" vs "batch" — the math just works out: a task that spends most of its time blocked has a low vruntime and snaps to the front whenever it wakes, which is exactly what you want for keystrokes and audio frames.
Real-time classes (SCHED_FIFO, SCHED_RR, SCHED_DEADLINE) sit above CFS. Any runnable real-time task preempts every fair task, period. Within a real-time class, priorities are strict: higher priority always wins. SCHED_FIFO never time-slices among equal priorities; SCHED_RR does.
When you actually touch the scheduler
- Latency-sensitive servers: pin worker threads to cores, raise their priority, or use
SCHED_FIFOfor critical paths. - Real-time control loops (robotics, audio, motion control):
SCHED_FIFOorSCHED_DEADLINEwith a watchdog. - Background work: lower nice (
nice +19,SCHED_IDLE,ionice) so it yields under load. - Containers: cgroup CPU shares and quotas to bound noisy neighbors.
- Custom workloads on Linux 6.12+: a BPF
sched_extpolicy tuned to your access pattern.
Most application code never touches the scheduler directly — and that's correct. Misusing real-time classes locks up systems; SCHED_FIFO with no upper bound on CPU is how you discover priority inversion in production at 3 AM.
Scheduling algorithms compared
| Pick complexity | Fairness | Preemption | Strengths | Weaknesses | |
|---|---|---|---|---|---|
| FCFS / FIFO | O(1) | None | None (cooperative) | Trivially simple | Convoy effect — one long task blocks everything |
| Round-Robin | O(1) | Equal | Time slice (e.g. 10 ms) | Bounded wait, easy to implement | No priority; quantum tuning is fragile |
| O(N) Linux (pre-2.6) | O(N) per pick | Priority + heuristics | Yes | Simple goodness function | Scaled poorly past ~100 tasks |
| O(1) Linux (2.6.0–2.6.22) | O(1) via per-priority arrays | Heuristic interactivity bonus | Yes | Constant-time pick | Heuristics misclassified workloads |
| CFS (Linux 2.6.23+) | O(log n) red-black tree | Weighted virtual runtime | Yes (vruntime gap) | No heuristics; provably fair | Tree bookkeeping; tail latency on huge runqueues |
| Lottery scheduling | O(N) draw, O(log N) tree | Probabilistic by ticket count | Yes | Easy proportional share, easy to compose | Variance — short-term fairness is statistical |
| SCHED_FIFO (real-time) | O(1) per priority | None — priority strict | Higher priority preempts | Bounded latency for top-priority task | Starves lower priorities; no time slicing |
| SCHED_DEADLINE (EDF) | O(log n) | By deadline urgency | Earliest-deadline-first | Hard real-time guarantees with admission control | Requires period/deadline parameters |
| EAS (big.LITTLE energy-aware) | O(cores) | Energy-aware placement | Yes | Saves battery on heterogeneous CPUs | Complex; mobile-tuned heuristics |
CFS dominates Linux because its invariant — equal share of CPU per weight — needs no per-workload tuning. Real-time classes layer on top for tasks that need timing guarantees CFS can't promise. sched_ext is the pressure valve: ship a custom BPF policy for the rare workload where the defaults fall down.
C: change scheduling class with sched_setscheduler
#include <sched.h>
#include <sys/mman.h>
#include <errno.h>
#include <stdio.h>
int make_realtime(int prio) {
// Lock all current and future memory to avoid page faults during RT work.
if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0) return -1;
struct sched_param p = { .sched_priority = prio }; // 1..99 for FIFO
if (sched_setscheduler(0, SCHED_FIFO, &p) < 0) {
if (errno == EPERM) fprintf(stderr, "Need CAP_SYS_NICE\n");
return -1;
}
return 0;
}
SCHED_FIFO with priority 50 runs ahead of every CFS task on the system. The mlockall call is mandatory in practice — a major page fault during a real-time path can blow your latency budget by 10× or more. Use chrt -f 50 ./your-program from the shell for the same effect without code changes.
Python: nice and OS-level priority
import os, psutil
def be_polite():
"""Lower priority so we yield to interactive work."""
os.nice(10) # niceness 0..19, higher = lower priority
def be_aggressive():
"""Raise priority within unprivileged limits (RLIMIT_NICE)."""
p = psutil.Process()
p.nice(-5) # negative niceness, requires CAP_SYS_NICE on Linux
p.cpu_affinity([0, 1]) # pin to cores 0 and 1
p.ionice(psutil.IOPRIO_CLASS_BE, value=0) # raise I/O priority too
Niceness 0 is the default. +10 halves your effective CPU share; +19 means you only run when nothing else wants the CPU. Negative nice raises share but requires capabilities — and rarely improves things on a busy system because CFS is already fair. For batch workloads, SCHED_BATCH (or Python's os.SCHED_BATCH) tells CFS the task is not latency-sensitive, so it gets longer slices and fewer preemptions.
Node.js: worker threads and OS priority
import { Worker } from 'node:worker_threads';
import { exec } from 'node:child_process';
const w = new Worker('./crunch.js');
// Lower the worker's priority on Linux. Node has no portable nice() API yet.
exec(`renice -n 10 -p ${w.threadId}`);
// Pin a hot path to a single core to reduce migration jitter.
exec(`taskset -p -c 3 ${process.pid}`);
Node has no first-class scheduler API — most adjustments come from the OS side: nice, chrt, taskset, cgroup limits. The right pattern for CPU-heavy work is to spawn a worker thread, let CFS spread it, and reach for affinity only when you measure migration jitter hurting tail latency.
Scheduler families and their levers
- CFS heuristics worth knowing:
sched_min_granularity_ns(smallest slice ~750 µs),sched_latency_ns(target period ~6 ms),sched_wakeup_granularity_ns(preemption hysteresis), all tunable in/sys/kernel/debug/sched/. - EAS for big.LITTLE. ARM phones run an Energy-Aware Scheduler that places tasks on small/big cores by predicted energy cost. It's why your phone stays cool while a single foreground app runs at 2.8 GHz.
- SCHED_DEADLINE. Set
(runtime, deadline, period)and the kernel admits or rejects you up front. Used by automotive control systems and high-end audio servers. - BPF sched_ext (Linux 6.12+). Load a BPF program that implements
pick_next_taskandbalance. Production users include scx_rusty (Rust-implemented topology-aware) and scx_lavd (latency-aware). - Lottery and stride scheduling. Academic ideas now showing up as quota mechanisms in cgroup v2 (
cpu.max) — bandwidth, not lottery, but the proportional-share spirit is the same. - Gang scheduling. HPC and VM systems schedule whole groups of related threads simultaneously to avoid synchronization stalls.
Costed claims
- Context switch: 1–3 µs direct cost on modern x86 (register save/restore + address-space switch). 5–50 µs of indirect cost from L1/L2/TLB misses afterward, dominant on memory-heavy workloads.
- Wake-up latency: CFS wakes a thread within ~10 µs from a cooperating signal. SCHED_FIFO sub-microsecond once the IPI fires.
- Pick cost: CFS pick = single rb-tree leftmost lookup, ~50 ns. Real-time pick = O(1) array index, sub-10 ns.
- Quantum: default 100 ms historically, now 6 ms target latency in CFS divided by runnable count, floored at 0.75 ms. Round-Robin
SCHED_RRdefaults to 100 ms. - Per-CPU runqueue: Linux maintains one rq per CPU; load balancing across rqs runs on a slower cadence (~ms) to avoid cross-CPU cache thrash.
Common bugs and edge cases
- Priority inversion in real-time. A high-priority FIFO task waits on a mutex held by a lower-priority task, which never gets to run. Fix: priority inheritance mutexes (
PTHREAD_PRIO_INHERIT) orSCHED_DEADLINEbandwidth reservations. - Runaway SCHED_FIFO. A SCHED_FIFO task in an infinite loop never yields. Linux's RT throttling caps real-time CPU at 95% by default (
/proc/sys/kernel/sched_rt_runtime_us) — disabling it for "performance" is how you brick a server. - Page faults in RT paths. A single major fault busts a microsecond budget. Always
mlockallbefore going real-time. - cgroup v2
cpu.maxthrottling. A bursty container hits its quota and gets throttled mid-request, looking like a stall. Watchnr_throttledandthrottled_usecincpu.stat. - Affinity vs migration. Pinning to a single core kills load balancing. The right scope is usually a NUMA node, not one core.
- Hyperthread contention. Two threads on sibling SMT lanes share L1; benchmarks regress when CFS happens to colocate them. Tools like
perf c2cdiagnose it; isolation domains (isolcpus,cpuset) fix it.
Frequently asked questions
What does the scheduler actually do on each tick?
It decides whether the currently running thread should keep the CPU or be preempted, picks the next thread to run from the ready queue if so, and optionally rebalances tasks across cores. On Linux this happens roughly every 1–10 ms via the timer tick plus on every wake-up.
How expensive is a context switch?
A direct context switch costs about 1–3 µs on modern x86 — register save/restore, address space change, kernel bookkeeping. Indirect cost from cache and TLB misses afterward can add 5–50 µs depending on working set, often the dominant factor.
Why was the O(1) scheduler replaced with CFS?
The O(1) scheduler used heuristics to detect interactive vs batch tasks and could starve workloads that didn't fit either pattern. CFS dropped the heuristics in favor of a simple invariant — every runnable task gets its fair share of CPU time, tracked as virtual runtime — which behaves predictably on every workload.
What's the difference between SCHED_FIFO and SCHED_RR?
SCHED_FIFO runs the highest-priority task to completion or until it blocks — no time slicing among equal-priority tasks, so a runaway thread starves everything below it. SCHED_RR adds round-robin slicing among equal-priority threads. Both preempt all SCHED_OTHER (CFS) tasks.
Can I plug in a custom scheduler?
On Linux 6.12+ yes, via sched_ext — BPF-based scheduler programs that the kernel loads at runtime. Several production schedulers ship out of tree (scx_rusty, scx_lavd) and Meta runs them on game servers. Before sched_ext you had to patch the kernel.