Systems

Process Scheduling

Hundreds of threads, eight cores, microsecond decisions

Process scheduling decides which runnable thread gets the CPU next. Modern kernels juggle hundreds of threads on a handful of cores using fairness algorithms like CFS, priority queues for real-time tasks, and load-balancing across NUMA nodes — all in microseconds per decision.

  • Direct context switch~1–3 µs
  • Indirect (cache refill)5–50 µs
  • CFS pick (red-black tree)O(log n)
  • Default timer tick (HZ=1000)1 ms
  • SCHED_FIFO preemptionImmediate, by priority

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How a scheduler picks the next thread

A modern OS has way more runnable work than CPUs. The scheduler's job is to decide, every few milliseconds, who gets to keep running, who gets preempted, and which idle thread gets woken next. It runs in the timer interrupt, on every system call return, and on every event that changes a task's state — wake, sleep, exit.

The Linux Completely Fair Scheduler (CFS) is the production case study. Every SCHED_OTHER task carries a vruntime — virtual runtime — that increases as it consumes CPU. CFS picks the task with the smallest vruntime. Highly weighted (low-nice) tasks accumulate vruntime slower, so they run more often. The ready queue is a red-black tree keyed on vruntime; pick is O(log n) and update is O(log n).

The mental model: every runnable task is a runner on a treadmill. The treadmill speeds vary by priority. CFS hands the CPU to whoever's behind, briefly, then puts them back on. There are no ad-hoc heuristics for "interactive" vs "batch" — the math just works out: a task that spends most of its time blocked has a low vruntime and snaps to the front whenever it wakes, which is exactly what you want for keystrokes and audio frames.

Real-time classes (SCHED_FIFO, SCHED_RR, SCHED_DEADLINE) sit above CFS. Any runnable real-time task preempts every fair task, period. Within a real-time class, priorities are strict: higher priority always wins. SCHED_FIFO never time-slices among equal priorities; SCHED_RR does.

When you actually touch the scheduler

  • Latency-sensitive servers: pin worker threads to cores, raise their priority, or use SCHED_FIFO for critical paths.
  • Real-time control loops (robotics, audio, motion control): SCHED_FIFO or SCHED_DEADLINE with a watchdog.
  • Background work: lower nice (nice +19, SCHED_IDLE, ionice) so it yields under load.
  • Containers: cgroup CPU shares and quotas to bound noisy neighbors.
  • Custom workloads on Linux 6.12+: a BPF sched_ext policy tuned to your access pattern.

Most application code never touches the scheduler directly — and that's correct. Misusing real-time classes locks up systems; SCHED_FIFO with no upper bound on CPU is how you discover priority inversion in production at 3 AM.

Scheduling algorithms compared

Pick complexityFairnessPreemptionStrengthsWeaknesses
FCFS / FIFOO(1)NoneNone (cooperative)Trivially simpleConvoy effect — one long task blocks everything
Round-RobinO(1)EqualTime slice (e.g. 10 ms)Bounded wait, easy to implementNo priority; quantum tuning is fragile
O(N) Linux (pre-2.6)O(N) per pickPriority + heuristicsYesSimple goodness functionScaled poorly past ~100 tasks
O(1) Linux (2.6.0–2.6.22)O(1) via per-priority arraysHeuristic interactivity bonusYesConstant-time pickHeuristics misclassified workloads
CFS (Linux 2.6.23+)O(log n) red-black treeWeighted virtual runtimeYes (vruntime gap)No heuristics; provably fairTree bookkeeping; tail latency on huge runqueues
Lottery schedulingO(N) draw, O(log N) treeProbabilistic by ticket countYesEasy proportional share, easy to composeVariance — short-term fairness is statistical
SCHED_FIFO (real-time)O(1) per priorityNone — priority strictHigher priority preemptsBounded latency for top-priority taskStarves lower priorities; no time slicing
SCHED_DEADLINE (EDF)O(log n)By deadline urgencyEarliest-deadline-firstHard real-time guarantees with admission controlRequires period/deadline parameters
EAS (big.LITTLE energy-aware)O(cores)Energy-aware placementYesSaves battery on heterogeneous CPUsComplex; mobile-tuned heuristics

CFS dominates Linux because its invariant — equal share of CPU per weight — needs no per-workload tuning. Real-time classes layer on top for tasks that need timing guarantees CFS can't promise. sched_ext is the pressure valve: ship a custom BPF policy for the rare workload where the defaults fall down.

C: change scheduling class with sched_setscheduler

#include <sched.h>
#include <sys/mman.h>
#include <errno.h>
#include <stdio.h>

int make_realtime(int prio) {
    // Lock all current and future memory to avoid page faults during RT work.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0) return -1;

    struct sched_param p = { .sched_priority = prio };  // 1..99 for FIFO
    if (sched_setscheduler(0, SCHED_FIFO, &p) < 0) {
        if (errno == EPERM) fprintf(stderr, "Need CAP_SYS_NICE\n");
        return -1;
    }
    return 0;
}

SCHED_FIFO with priority 50 runs ahead of every CFS task on the system. The mlockall call is mandatory in practice — a major page fault during a real-time path can blow your latency budget by 10× or more. Use chrt -f 50 ./your-program from the shell for the same effect without code changes.

Python: nice and OS-level priority

import os, psutil

def be_polite():
    """Lower priority so we yield to interactive work."""
    os.nice(10)                         # niceness 0..19, higher = lower priority

def be_aggressive():
    """Raise priority within unprivileged limits (RLIMIT_NICE)."""
    p = psutil.Process()
    p.nice(-5)                          # negative niceness, requires CAP_SYS_NICE on Linux
    p.cpu_affinity([0, 1])              # pin to cores 0 and 1
    p.ionice(psutil.IOPRIO_CLASS_BE, value=0)  # raise I/O priority too

Niceness 0 is the default. +10 halves your effective CPU share; +19 means you only run when nothing else wants the CPU. Negative nice raises share but requires capabilities — and rarely improves things on a busy system because CFS is already fair. For batch workloads, SCHED_BATCH (or Python's os.SCHED_BATCH) tells CFS the task is not latency-sensitive, so it gets longer slices and fewer preemptions.

Node.js: worker threads and OS priority

import { Worker } from 'node:worker_threads';
import { exec } from 'node:child_process';

const w = new Worker('./crunch.js');
// Lower the worker's priority on Linux. Node has no portable nice() API yet.
exec(`renice -n 10 -p ${w.threadId}`);

// Pin a hot path to a single core to reduce migration jitter.
exec(`taskset -p -c 3 ${process.pid}`);

Node has no first-class scheduler API — most adjustments come from the OS side: nice, chrt, taskset, cgroup limits. The right pattern for CPU-heavy work is to spawn a worker thread, let CFS spread it, and reach for affinity only when you measure migration jitter hurting tail latency.

Scheduler families and their levers

  • CFS heuristics worth knowing: sched_min_granularity_ns (smallest slice ~750 µs), sched_latency_ns (target period ~6 ms), sched_wakeup_granularity_ns (preemption hysteresis), all tunable in /sys/kernel/debug/sched/.
  • EAS for big.LITTLE. ARM phones run an Energy-Aware Scheduler that places tasks on small/big cores by predicted energy cost. It's why your phone stays cool while a single foreground app runs at 2.8 GHz.
  • SCHED_DEADLINE. Set (runtime, deadline, period) and the kernel admits or rejects you up front. Used by automotive control systems and high-end audio servers.
  • BPF sched_ext (Linux 6.12+). Load a BPF program that implements pick_next_task and balance. Production users include scx_rusty (Rust-implemented topology-aware) and scx_lavd (latency-aware).
  • Lottery and stride scheduling. Academic ideas now showing up as quota mechanisms in cgroup v2 (cpu.max) — bandwidth, not lottery, but the proportional-share spirit is the same.
  • Gang scheduling. HPC and VM systems schedule whole groups of related threads simultaneously to avoid synchronization stalls.

Costed claims

  • Context switch: 1–3 µs direct cost on modern x86 (register save/restore + address-space switch). 5–50 µs of indirect cost from L1/L2/TLB misses afterward, dominant on memory-heavy workloads.
  • Wake-up latency: CFS wakes a thread within ~10 µs from a cooperating signal. SCHED_FIFO sub-microsecond once the IPI fires.
  • Pick cost: CFS pick = single rb-tree leftmost lookup, ~50 ns. Real-time pick = O(1) array index, sub-10 ns.
  • Quantum: default 100 ms historically, now 6 ms target latency in CFS divided by runnable count, floored at 0.75 ms. Round-Robin SCHED_RR defaults to 100 ms.
  • Per-CPU runqueue: Linux maintains one rq per CPU; load balancing across rqs runs on a slower cadence (~ms) to avoid cross-CPU cache thrash.

Common bugs and edge cases

  • Priority inversion in real-time. A high-priority FIFO task waits on a mutex held by a lower-priority task, which never gets to run. Fix: priority inheritance mutexes (PTHREAD_PRIO_INHERIT) or SCHED_DEADLINE bandwidth reservations.
  • Runaway SCHED_FIFO. A SCHED_FIFO task in an infinite loop never yields. Linux's RT throttling caps real-time CPU at 95% by default (/proc/sys/kernel/sched_rt_runtime_us) — disabling it for "performance" is how you brick a server.
  • Page faults in RT paths. A single major fault busts a microsecond budget. Always mlockall before going real-time.
  • cgroup v2 cpu.max throttling. A bursty container hits its quota and gets throttled mid-request, looking like a stall. Watch nr_throttled and throttled_usec in cpu.stat.
  • Affinity vs migration. Pinning to a single core kills load balancing. The right scope is usually a NUMA node, not one core.
  • Hyperthread contention. Two threads on sibling SMT lanes share L1; benchmarks regress when CFS happens to colocate them. Tools like perf c2c diagnose it; isolation domains (isolcpus, cpuset) fix it.

Frequently asked questions

What does the scheduler actually do on each tick?

It decides whether the currently running thread should keep the CPU or be preempted, picks the next thread to run from the ready queue if so, and optionally rebalances tasks across cores. On Linux this happens roughly every 1–10 ms via the timer tick plus on every wake-up.

How expensive is a context switch?

A direct context switch costs about 1–3 µs on modern x86 — register save/restore, address space change, kernel bookkeeping. Indirect cost from cache and TLB misses afterward can add 5–50 µs depending on working set, often the dominant factor.

Why was the O(1) scheduler replaced with CFS?

The O(1) scheduler used heuristics to detect interactive vs batch tasks and could starve workloads that didn't fit either pattern. CFS dropped the heuristics in favor of a simple invariant — every runnable task gets its fair share of CPU time, tracked as virtual runtime — which behaves predictably on every workload.

What's the difference between SCHED_FIFO and SCHED_RR?

SCHED_FIFO runs the highest-priority task to completion or until it blocks — no time slicing among equal-priority tasks, so a runaway thread starves everything below it. SCHED_RR adds round-robin slicing among equal-priority threads. Both preempt all SCHED_OTHER (CFS) tasks.

Can I plug in a custom scheduler?

On Linux 6.12+ yes, via sched_ext — BPF-based scheduler programs that the kernel loads at runtime. Several production schedulers ship out of tree (scx_rusty, scx_lavd) and Meta runs them on game servers. Before sched_ext you had to patch the kernel.