Systems

The EEVDF Scheduler

How Linux runs a billion threads fairly — and still answers your keystroke in milliseconds

EEVDF (Earliest Eligible Virtual Deadline First) is Linux's default CPU scheduler since kernel 6.6: it grants each task a fair time slice, assigns a virtual deadline, and always runs the eligible task whose deadline is soonest — fair sharing with bounded latency in O(log n).

  • Default in Linux since6.6 (Oct 2023)
  • Pick next taskO(log n)
  • Core invariant|lag| stays bounded
  • Data structureaugmented red-black tree
  • Original paperStoica & Abdel-Wahab, 1995

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The two questions every scheduler answers

On a laptop with eight cores there are routinely two thousand runnable threads. The CPU can run eight of them at a time. Every few milliseconds the kernel has to answer two questions, in this exact order: who is allowed to run right now, and of those, who runs first. EEVDF — Earliest Eligible Virtual Deadline First — answers the first question with eligibility and the second with a virtual deadline.

The algorithm is not new. Ion Stoica and Hussein Abdel-Wahab described it in a 1995 technical report, "Earliest Eligible Virtual Deadline First: A Flexible and Accurate Mechanism for Proportional Share Resource Allocation." For nearly thirty years it lived in the proportional-share scheduling literature. Then Peter Zijlstra rebuilt the Linux fair scheduler around it, and in kernel 6.6 (October 2023) EEVDF replaced CFS — the Completely Fair Scheduler that had run essentially every Linux machine since 2007.

The motivation was a long-standing CFS weakness. CFS made everything fair on a throughput basis — over time, every task got its proportional share — but it had no clean way to say "this task is fair AND latency-sensitive." Interactive tasks and batch tasks were ranked by the same single number, vruntime, and the only latency knob, sched_latency, was global. EEVDF folds a per-task latency request directly into the ordering.

The mechanism: lag, eligibility, and the virtual deadline

EEVDF tracks virtual time the same way CFS did. Each task has a weight w_i derived from its nice value (nice 0 → weight 1024; each nice step is roughly a 1.25× factor). The total weight of the runqueue is W = Σ w_i. When real time advances by dt, the system's virtual time advances by dt / W, and a running task accumulates virtual runtime at dt / w_i. Heavier tasks accumulate virtual runtime more slowly, so they earn more real CPU.

Lag is the heart of EEVDF. A task's lag is how much CPU it is owed:

lag_i  =  w_i · (V − v_i)

  V    = the runqueue's virtual time ("how far the clock has run")
  v_i  = the task's own accumulated virtual runtime
  w_i  = the task's weight

If v_i < V the task has fallen behind its fair share — its lag is positive and it deserves the CPU. If v_i > V it has run ahead and its lag is negative. A task is eligible exactly when its lag is non-negative — when it has not yet consumed more than its fair share of the virtual timeline. This eligibility test is what CFS lacked, and it is what bounds unfairness: EEVDF guarantees that no task's lag ever exceeds the size of one time slice, in either direction.

Among the eligible tasks, which runs first? Each task requests a time slice r_i (the kernel calls this the request size; by default it is derived from nice/latency-nice, around 0.75 ms scaled by weight). From that request EEVDF computes a virtual deadline:

vd_i  =  ve_i  +  r_i / w_i

  ve_i = the task's virtual eligible time (when its lag last hit zero)
  r_i  = its requested time slice
  w_i  = its weight

EEVDF runs the eligible task with the earliest virtual deadline. The elegance is that the deadline encodes both fairness and latency in one number: a task that asks for a short slice gets an earlier deadline, so it is scheduled promptly — but because it only asked for a short slice, it does not get more total CPU. You can be urgent without being greedy. The whole "pick" reduces to: filter to eligible, then take the minimum deadline.

The augmented red-black tree

Scanning every runnable task each tick would be O(n) — fatal when n is in the thousands. EEVDF keeps the runqueue in an augmented red-black tree ordered by virtual deadline. The augmentation is the trick: every node caches min_vruntime, the smallest v_i in its subtree. With that cache, the scheduler can ask "does this subtree contain any eligible task?" in O(1) per node, prune the ineligible right subtrees, and walk down to the earliest-deadline eligible task — all in O(log n).

This mirrors the structure CFS used (a red-black tree keyed on vruntime), which is why the related red-black tree and CFS scheduler pages are worth reading alongside this one. The cost model is the same as any balanced BST: insert on wakeup, delete on block or preempt, and the leftmost-eligible query all run in logarithmic time.

When the design pays off — and when it doesn't

  • Mixed interactive + batch workloads — a desktop compiling code while you type. The editor requests short slices, gets early deadlines, and stays responsive without starving the compiler.
  • Latency-nice tuning — EEVDF exposes a per-task latency hint (latency-nice), so an audio thread can ask for low scheduling latency independently of how much CPU it wants.
  • Many threads of equal weight — the lag bound keeps every thread within one slice of its fair share, so no thread is starved even under heavy contention.

It is the wrong tool when you need hard timing guarantees. EEVDF's deadlines are virtual — relative orderings, not wall-clock promises. For genuine real-time work (a flight controller, a SCHED_DEADLINE media pipeline) Linux uses a separate SCHED_DEADLINE class implementing EDF with constant-bandwidth servers, which always preempts the fair class. EEVDF schedules the everyday SCHED_NORMAL and SCHED_BATCH tasks underneath it.

EEVDF vs other CPU schedulers

EEVDFCFSO(1) schedulerBFS / MuQSSSCHED_DEADLINERound-robin
Selection keyearliest eligible virtual deadlinesmallest vruntimepriority bitmap + bonusearliest virtual deadline (global)earliest absolute deadline (EDF)FIFO order
Pick next taskO(log n)O(log n)O(1)O(n) lookupO(log n)O(1)
Per-task latency controlyes (slice / latency-nice)no (global sched_latency)heuristic interactivity bonusper-task deadlineexplicit runtime/deadline/periodnone
Fairness guarantee|lag| bounded by one sliceasymptotic over timenone (heuristic)proportionalbandwidth-isolated, not fairequal time only
Real-time?no (virtual deadlines)nononoyes (hard)no
Linux statusdefault fair class since 6.6default 2.6.23 – 6.52.6.0 – 2.6.22out-of-tree (Con Kolivas)in-tree RT class since 3.14SCHED_RR class

The headline against CFS is that EEVDF spends the same O(log n) per pick but adds an eligibility filter and a per-task deadline, buying explicit latency control that CFS faked with global heuristics. Against the old O(1) scheduler it is slower per pick but provably fair — the O(1) scheduler's interactivity heuristics were a notorious source of unfair edge cases that CFS was built to kill in the first place.

What the numbers actually say

  • Pick cost: a red-black-tree descent. For 2,000 runnable tasks that is about ⌈log₂ 2000⌉ ≈ 11 node visits per scheduling decision, versus 2,000 for a linear scan — a ~180× reduction in comparisons at that queue depth.
  • Lag is bounded by one request slice. With the default ~0.75 ms base slice, no fair task drifts more than roughly a slice from its entitlement, in either direction — a hard bound CFS only approached asymptotically.
  • Nice weights span ~1.25× per step. Nice −20 (weight 88761) to nice +19 (weight 15) is a ratio near 5900:1, so a nice −20 task gets thousands of times the CPU of a nice +19 task on a contended core — and EEVDF honors that ratio exactly through the virtual-time accounting.
  • Tick granularity is ~1 ms on a typical 1000 Hz kernel, so a task that asks for a 0.75 ms slice is genuinely preemptible at roughly one timer tick — fine-grained enough that interactive deadlines actually land.

JavaScript implementation

A faithful single-CPU model of the core loop. We keep an explicit list and scan it for clarity; a real kernel uses the augmented red-black tree for the O(log n) pick, but the decision is identical — filter to eligible, take the earliest deadline.

class Task {
  constructor(id, weight = 1024, slice = 0.75) {
    this.id = id;
    this.w = weight;       // from nice value
    this.r = slice;        // requested time slice (ms)
    this.v = 0;            // accumulated virtual runtime v_i
    this.ve = 0;           // virtual eligible time
  }
  deadline() { return this.ve + this.r / this.w; }   // vd_i = ve_i + r_i / w_i
}

class EEVDF {
  constructor() { this.tasks = []; this.V = 0; }     // V = runqueue virtual time

  add(task) {
    // A waking task starts eligible "now" so it can't claim back-pay.
    task.v = this.V;
    task.ve = this.V;
    this.tasks.push(task);
  }

  totalWeight() { return this.tasks.reduce((s, t) => s + t.w, 0); }

  // lag_i = w_i * (V - v_i); eligible when lag >= 0  ⇔  v_i <= V
  eligible(t) { return t.v <= this.V; }

  pick() {
    let best = null;
    for (const t of this.tasks) {
      if (!this.eligible(t)) continue;            // step 1: who may run
      if (!best || t.deadline() < best.deadline())// step 2: earliest deadline
        best = t;
    }
    return best;
  }

  // Run the chosen task for dt ms, then advance the virtual clocks.
  runFor(dt) {
    const t = this.pick();
    if (!t) return null;
    const W = this.totalWeight();
    this.V += dt / W;          // system virtual time advances by dt / W
    t.v   += dt / t.w;         // task virtual runtime advances by dt / w_i
    if (this.eligible(t)) t.ve = t.v;   // refresh eligible time when still on time
    return t.id;
  }
}

const cpu = new EEVDF();
cpu.add(new Task('editor',   1024, 0.40));   // short slice → earlier deadline
cpu.add(new Task('compiler', 1024, 3.00));   // long slice  → later deadline
for (let i = 0; i < 6; i++) console.log(cpu.runFor(1.0));
// → editor, compiler, editor, compiler, editor, compiler
// Equal weights ⇒ equal CPU over time, so they alternate. The point isn't that
// the editor runs MORE — it's that on every contended pick its earlier deadline
// makes it the one dispatched first, so it responds with minimal latency.

Two details mirror the kernel. First, a waking task is placed at the current virtual time V, not at zero — otherwise a task that slept for an hour would wake with enormous positive lag and monopolize the CPU (the classic "sleeper fairness" bug). Second, the shorter-slice editor always wins the pick when both are eligible — its deadline ve + r/w is earlier — so it is dispatched first and responds with low latency, yet over time it consumes no more total CPU than the compiler (with equal weights the two simply alternate).

Python implementation

The same model with a real O(log n) pick. We index eligible tasks by virtual deadline using sortedcontainers (a balanced-tree-backed sorted list), echoing the kernel's augmented red-black tree.

from dataclasses import dataclass, field

@dataclass
class Task:
    id: str
    w: int = 1024          # weight from nice value
    r: float = 0.75        # requested time slice (ms)
    v: float = 0.0         # accumulated virtual runtime
    ve: float = 0.0        # virtual eligible time

    @property
    def deadline(self) -> float:
        return self.ve + self.r / self.w        # vd = ve + r / w

class EEVDF:
    def __init__(self):
        self.tasks: list[Task] = []
        self.V = 0.0                            # runqueue virtual time

    def add(self, t: Task) -> None:
        t.v = t.ve = self.V                     # wake at current virtual time
        self.tasks.append(t)

    def _W(self) -> int:
        return sum(t.w for t in self.tasks)

    def eligible(self, t: Task) -> bool:
        return t.v <= self.V                    # lag = w*(V - v) >= 0

    def pick(self) -> Task | None:
        cand = [t for t in self.tasks if self.eligible(t)]
        return min(cand, key=lambda t: t.deadline) if cand else None

    def run_for(self, dt: float) -> str | None:
        t = self.pick()
        if t is None:
            return None
        W = self._W()
        self.V += dt / W                        # advance system virtual time
        t.v    += dt / t.w                      # advance task virtual runtime
        if self.eligible(t):
            t.ve = t.v                          # refresh eligibility timestamp
        return t.id

if __name__ == "__main__":
    cpu = EEVDF()
    cpu.add(Task("audio",   w=1024, r=0.30))   # latency-sensitive, tiny slice
    cpu.add(Task("encode",  w=1024, r=4.00))   # throughput, large slice
    print([cpu.run_for(1.0) for _ in range(6)])
    # → ['audio', 'encode', 'audio', 'encode', 'audio', 'encode']
    # Equal weights ⇒ equal long-run CPU (they alternate); audio simply wins
    # each contended pick on its earlier deadline, so it is served promptly.

The eligibility filter is what keeps the audio task from running forever: once it has consumed its fair share of the virtual timeline its v exceeds V, eligible() returns false, and the encoder gets its turn. Latency without starvation.

Variants and knobs worth knowing

latency-nice. A second nice-like value, separate from CPU-weight nice, that scales only the request slice r_i. Lower latency-nice → shorter slice → earlier deadline → faster wakeup, with no change to the task's total CPU share. This is the headline feature CFS could not express.

Group scheduling (cgroups). EEVDF is applied hierarchically. A cgroup is itself a weighted entity in its parent's runqueue; the scheduler picks the winning group, then recurses into that group's own EEVDF tree. This is how containers and systemd slices get fair shares.

The original BVT / SFQ family. EEVDF descends from a lineage of virtual-time fair schedulers — Stride scheduling, Start-time Fair Queuing (SFQ), and Borrowed-Virtual-Time (BVT). EEVDF's contribution over SFQ is the explicit eligibility test that bounds lag in both directions, not just the one direction SFQ bounds.

Negative-lag-on-sleep handling. When a task blocks while it still has negative lag (it ran ahead of share), the kernel decays that debt over the sleep rather than zeroing it — otherwise a task could escape its overrun by briefly sleeping. This "lag preservation across sleep" was one of the trickier parts of the Linux port.

Common bugs and misconceptions

  • "Earliest deadline" without the eligibility filter. If you skip the lag test and just run the smallest deadline, a greedy task with a tiny slice runs forever — it keeps producing the earliest deadline. Eligibility is not optional; it is what makes the algorithm fair.
  • Waking tasks at v = 0. A new or just-woken task placed at virtual time zero has huge positive lag and monopolizes the CPU. Wake it at the current V (minus at most one slice of placement credit), never at zero.
  • Confusing virtual deadlines with SCHED_DEADLINE. EEVDF's deadlines are relative orderings inside the fair class, not wall-clock guarantees. If you need a hard deadline, you want the separate SCHED_DEADLINE class.
  • Forgetting to update min_vruntime on rotation. In the augmented red-black tree, every rotation must recompute the cached subtree minimum on the affected nodes — exactly the augmentation-maintenance bug that bites every augmented-BST implementation.
  • Assuming EEVDF changed CPU shares. It did not. Nice-weight semantics are unchanged from CFS; EEVDF changes ordering and latency, not the long-run proportional shares.
  • Integer overflow in virtual time. The kernel scales by 2^32 and uses 64-bit fixed point for vruntime; a naive 32-bit virtual clock wraps within minutes under heavy weight ratios.

Frequently asked questions

What does EEVDF stand for and where did it come from?

EEVDF is Earliest Eligible Virtual Deadline First, a proportional-share scheduling algorithm published by Ion Stoica and Hussein Abdel-Wahab in 1995. Peter Zijlstra adapted it for Linux, and it became the kernel's default fair-class scheduler in version 6.6, released in October 2023, replacing CFS.

What is lag in EEVDF?

Lag is the difference between the CPU time a task was entitled to (its fair share of the time elapsed) and the time it actually received. Positive lag means the task is owed CPU and is eligible to run now; negative lag means it has run ahead of its share and must wait. A task is eligible only when its lag is non-negative.

How does EEVDF differ from CFS?

CFS picked the task with the smallest vruntime — fair, but with no explicit latency guarantee. EEVDF adds two ideas: an eligibility test based on lag, and a per-task virtual deadline derived from a requested time slice. Among all eligible tasks it runs the one with the earliest deadline, which lets short-slice latency-sensitive tasks preempt promptly while still being fair over time.

Why does EEVDF give better latency than CFS?

A task can ask for a shorter time slice (via the sched_attr request size, exposed through nice and latency-nice). A shorter request produces an earlier virtual deadline, so EEVDF schedules that task sooner without giving it more total CPU. CFS had only a single tunable sched_latency knob and no per-task slice, so a chatty interactive task and a CPU hog were treated identically.

What data structure does EEVDF use to pick the next task?

An augmented red-black tree keyed on each task's virtual deadline. Every node also caches the minimum virtual-runtime of its subtree, so the scheduler can prune to the eligible subtree and find the earliest-deadline eligible task in O(log n) instead of scanning every runnable task.

Is EEVDF a real-time scheduler?

No. EEVDF runs the SCHED_NORMAL / SCHED_BATCH fair class — its deadlines are virtual, used only to order fair tasks, not hard timing guarantees. Linux keeps a separate SCHED_DEADLINE class (an EDF/CBS implementation) for true real-time deadline scheduling, which always outranks the fair class.