Scheduling
Completely Fair Scheduler
A red-black tree, one virtual clock per task, and a promise
CFS is Linux's default CPU scheduler since 2007. It tracks vruntime per task and runs whichever has the smallest, indexed by a red-black tree.
- IntroducedKernel 2.6.23 (Oct 2007)
- AuthorIngo Molnar
- Data structureRed-black tree on vruntime
- sched_latency target6 ms (for ≤8 tasks)
- sched_min_granularity0.75 ms
- Pick next taskO(1) cached leftmost
Interactive visualization
Watch tasks rotate through the red-black tree, vruntime ticking up while one task runs and re-enters.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How CFS works
Imagine you're running a daycare and twenty kids want the swing. The unfair scheduler gives the loudest kid all the turns. The fair scheduler keeps a stopwatch per child, and whenever the swing comes free, the kid who's used it the least gets next. That's it. That's CFS.
Each runnable task in Linux carries a virtual runtime, or vruntime — a u64 counter that advances while the task is on the CPU. The scheduler keeps every runnable task in a red-black tree keyed by vruntime. The leftmost node — the smallest key — is always next to run. When that task's slice expires, the kernel adds the elapsed time (weighted) to its vruntime and re-inserts it into the tree. Whichever task is leftmost now becomes the next one to run.
That single rule — always run whoever has the lowest virtual clock — produces proportional fairness without any explicit time-slice bookkeeping. There's no "this task has 50 ms left," no aging hacks, no run-queue rotation. Just a sorted set keyed by virtual time.
Vruntime and weights
Vruntime advances faster for low-priority tasks and slower for high-priority ones, so giving the CPU to whoever has the smallest vruntime is proportional fairness. The math:
delta_vruntime = delta_exec × (NICE_0_LOAD / weight_of_task)
NICE_0_LOAD is 1024 (the weight of a nice-0 task). A nice-(+5) task has weight ~335, so its vruntime advances ~3× faster than real time — meaning it claims the leftmost spot less often. A nice-(-5) task has weight ~3121, vruntime advances at ~1/3 real time, and it shows up at leftmost three times as often. Each step of niceness changes the weight by a factor of ~1.25.
This single multiplier produces the entire priority system. There are no priority classes inside CFS — just weights mapped from nice values.
The red-black tree
Why a red-black tree and not a heap? Both can give O(log n) insert and O(1) min. The reason is removal: when a task blocks on I/O, it leaves the run queue. A heap can find the min, but it can't efficiently delete an arbitrary element. A red-black tree can — given a pointer to the node, removal is O(log n). And CFS deletes a lot: every time a task blocks on a syscall, sleeps, or otherwise becomes non-runnable, it's removed from the tree. Every time it wakes, it's re-inserted.
The leftmost node — the smallest vruntime — is cached separately, so the "pick next task" call is O(1). Insertions, deletions, and re-insertions are O(log n), where n is the number of runnable tasks on this CPU (not the entire system — CFS uses per-CPU run queues with load balancing between them).
Time slices
CFS doesn't actually use fixed time slices. Instead, the running task is preempted when its vruntime catches up to the next leftmost task — that is, when continuing to run it would push it past someone with less virtual time. In practice this works out to a target latency, controlled by tunables:
- sched_latency_ns (default 6,000,000 ns = 6 ms): the period within which every runnable task should get to run, when there are few tasks.
- sched_min_granularity_ns (default 750,000 ns = 0.75 ms): the floor on a task's slice, so a thrashing system with 100 tasks doesn't reduce each slice to microseconds.
- sched_nr_latency (default 8): the breakpoint. With ≤ 8 tasks, total period = sched_latency. With > 8, period grows linearly: period = nr × sched_min_granularity.
So on a CPU with 4 runnable tasks: each gets 6 ms / 4 = 1.5 ms. With 12 runnable tasks: each gets 0.75 ms (the floor) and the full cycle takes 9 ms.
CFS vs the O(1) scheduler it replaced
| O(1) scheduler (2.6.0 – 2.6.22) | CFS (2.6.23+) | |
|---|---|---|
| Data structure | 140 priority queues (bitmap + arrays) | Red-black tree on vruntime |
| Pick next task | O(1) (find_first_bit) | O(1) (cached leftmost) |
| Insertion | O(1) (append to list) | O(log n) |
| Fairness model | Heuristic: interactivity bonus | Provable proportional fairness |
| Interactive task heuristics | ~700 LoC of heuristics | None — falls out of vruntime |
| Pathological latency | Yes — high-prio task could starve mid-prio | No — every runnable task makes progress |
| Lines of core code | ~9000 | ~3000 at introduction |
The big shift was philosophical. The O(1) scheduler tried to detect interactive tasks (short bursts of CPU between sleeps) and reward them with a priority boost. The detection misfired in subtle, hard-to-debug ways. CFS sidesteps the whole question — by definition, a task that just woke up has accumulated less vruntime than the busy tasks, so it gets to run immediately. Interactivity falls out of the math for free.
When CFS is the right scheduler
- General-purpose workloads. Desktops, servers, containers, almost everything. The default class (SCHED_NORMAL) on Linux is CFS — you're using it right now to read this page.
- Mixed CPU- and I/O-bound mix. CFS handles "kernel build with two browsers and a music player" without any tuning. The lower-vruntime-wins rule naturally gives I/O wakeups good latency.
- Cgroup hierarchies for containers. CFS supports nested run queues per cgroup, so CPU shares can be allocated to whole groups of tasks — the foundation of Kubernetes CPU limits and Docker's --cpu-shares.
For hard real-time work (audio processing, control systems, network ASIC drivers), use SCHED_FIFO or SCHED_DEADLINE — these preempt CFS entirely. For dedicated game-server style workloads, sched_ext (eBPF schedulers, kernel 6.12+) is increasingly attractive.
Pseudo-code
// CFS pick-next-task, simplified.
pick_next_task_fair(rq):
se = rq.cfs_rq.rb_leftmost // cached pointer
return task_of(se)
// On each scheduler tick, update the running task's vruntime
// and check whether it should be preempted.
entity_tick(curr, delta_exec_ns):
curr.vruntime += delta_exec_ns × NICE_0_LOAD / curr.weight
if (curr.vruntime - rb_leftmost.vruntime) >= sched_slice(curr):
resched_curr(rq) // mark for preemption
sched_slice(task):
n = nr_running
period = (n <= sched_nr_latency)
? sched_latency
: n × sched_min_granularity
return period × task.weight / total_weight
// When a task wakes up
enqueue_task_fair(rq, task):
// bound vruntime so a long-sleeping task doesn't monopolize CPU
task.vruntime = max(task.vruntime, rq.min_vruntime - sched_latency / 2)
rb_insert(rq.cfs_rq.tree, task)
update_rb_leftmost(rq.cfs_rq)
Inside the kernel (real code)
// kernel/sched/fair.c — simplified excerpt.
static struct sched_entity *
__pick_next_entity(struct sched_entity *se)
{
struct rb_node *next = rb_next(&se->run_node);
if (!next) return NULL;
return rb_entry(next, struct sched_entity, run_node);
}
static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
u64 now = rq_clock_task(rq_of(cfs_rq));
u64 delta_exec;
if (unlikely(!curr)) return;
delta_exec = now - curr->exec_start;
if ((s64)delta_exec <= 0) return;
curr->exec_start = now;
curr->sum_exec_runtime += delta_exec;
curr->vruntime += calc_delta_fair(delta_exec, curr);
update_min_vruntime(cfs_rq);
}
The whole CFS picker is a few hundred lines in kernel/sched/fair.c. You can read it. It's surprisingly approachable code for a piece of infrastructure that schedules every process on a billion Linux machines.
Cost analysis
- Pick next task: ~50 ns. Just dereference
rb_leftmost. - Re-insert after a slice: O(log n) — typically < 100 ns even with hundreds of runnable tasks because the tree fits in cache.
- Context switch (CFS overhead alone): sub-microsecond. The bulk of context-switch cost (1–5 µs total) is register save/restore and TLB effects, not the scheduler decision.
- Wakeup-to-run latency: typically < 50 µs under load on idle cores; can spike to milliseconds under heavy contention.
- Maximum runnable tasks: no hard limit; servers with 100k threads work fine, though the per-CPU run queue's vruntime invariants get tested.
Common pitfalls and surprises
- "My CPU-bound task is starving." It's not — it's getting its fair share. If you actually need it to run more, increase its priority with
nice -n -5(or, better, use SCHED_FIFO if it's genuinely real-time). - Cgroup CPU shares don't behave like quotas. CPU shares are relative: a cgroup with shares=2048 gets 2× a cgroup with shares=1024 only if both are CPU-bound. If one is idle, the other gets 100%. For hard caps, use
cpu.cfs_quota_us. - Sleeping tasks accumulate "credit" — but not unbounded. When a task wakes after a long sleep, CFS bounds its vruntime to
min_vruntime - sched_latency/2so it can't dominate the CPU as compensation. Without this, a 10-minute-sleeping task would run uninterrupted for minutes. - NUMA matters. CFS load-balances across CPUs but tries to keep tasks on their home node. If your workload is sensitive, tune the NUMA balancer separately — CFS's per-CPU view doesn't see the full picture.
- EEVDF in 6.6+. If you're on a modern kernel, you're already past CFS — EEVDF replaced it as the SCHED_NORMAL implementation. The vruntime tree is still there, but the picking adds eligibility (a task can run only if its vruntime is below average) and uses virtual deadlines. The fairness story is similar; latency is better.
Frequently asked questions
What does CFS stand for?
Completely Fair Scheduler. Linux's default CPU scheduler since kernel 2.6.23 in October 2007, written by Ingo Molnar. It replaced the O(1) scheduler — which was fast but produced visibly unfair latency under load.
What is vruntime?
Virtual runtime — how long a task has actually run on the CPU, weighted by its niceness. A nice-0 task accumulates vruntime at one nanosecond per nanosecond of real time. A nice-19 task accumulates ~30× as fast (so it gets less CPU). CFS always picks the task with the smallest vruntime.
Why a red-black tree instead of a queue?
CFS needs to find the smallest-vruntime task in O(log n) and re-insert a task with an updated vruntime in O(log n). A red-black tree gives both. A linked-list queue would be O(n) for the lookup. The leftmost node of the tree is always the next task to run — so picking is actually amortized O(1) with a cached pointer.
How long does each task run before being preempted?
It depends on how many tasks are competing. CFS targets a sched_latency of 6 ms — the period in which every runnable task should get a turn. That latency is divided among all runnable tasks weighted by their priority. With one task, it runs 6 ms. With 8 tasks, each gets 0.75 ms. The minimum slice is sched_min_granularity, typically 0.75 ms, to prevent thrashing when there are many tasks.
How does nice value affect scheduling?
Each nice value maps to a weight via a lookup table (the prio_to_weight array). Each step of niceness changes the weight by a factor of ~1.25 — so a nice-(-5) task has ~3× the CPU share of a nice-0 task, and a nice-(+5) task has ~1/3 the share. The total CPU share is your weight divided by the sum of all weights in the run queue.
What's the difference between CFS and a real-time scheduler?
Real-time schedulers (SCHED_FIFO, SCHED_RR, SCHED_DEADLINE) give strict priority — a high-priority task runs to completion or until it yields. CFS is for normal tasks and gives proportional fairness: every runnable task makes progress, weighted by its nice value. Linux runs CFS for the SCHED_NORMAL (a.k.a. SCHED_OTHER) class; real-time classes have higher priority and preempt CFS.
Is CFS still the default Linux scheduler?
Mostly. In kernel 6.6 (October 2023) CFS was effectively replaced by EEVDF — Earliest Eligible Virtual Deadline First — which builds on CFS's vruntime accounting but adds deadlines for better latency guarantees. From an API perspective it behaves like CFS; under the hood, the picking rule changed. The red-black tree is still there.