Scheduling

Work Stealing

Idle workers reach into busy workers' queues

Work stealing balances load by letting idle workers steal tasks from busy workers' deques. Provably near-optimal scaling.

  • OriginCilk · Blumofe & Leiserson, 1994
  • Per-worker structureLock-free deque
  • Owner usesHead (LIFO push/pop)
  • Stealers useTail (FIFO steal)
  • Worst-case boundT₁/P + O(T∞)
  • SpeedupNear-linear with parallel slack

Interactive visualization

Watch four workers, each with a deque of tasks. An idle worker reaches across and pops from a busy worker's tail.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How work stealing works

Imagine a kitchen with four chefs, each at their own station, working from a stack of order tickets. The fastest chef tears through their stack and stands idle. The slowest is buried. Without coordination, the fast chef will sit waiting while the slow one falls further behind — terrible throughput. Work stealing is the rule: when you finish your stack, walk over to whichever chef looks busiest and grab the ticket at the bottom of their stack — the one they're least likely to get to soon.

Translated to schedulers: each worker thread owns a deque — a double-ended queue. The owner pushes and pops at one end (the head). When the owner runs out of tasks, it picks another worker at random, looks at their deque, and tries to steal from the other end (the tail). That's the entire algorithm. Three rules:

  1. Owner uses head. Push new tasks to head; pop from head when picking work. This is LIFO from the owner's perspective — depth-first execution, which is good for cache.
  2. Stealer uses tail. Reach across, atomically pop from tail. This is FIFO from the stealer's perspective — the oldest, coarsest-grained tasks are stolen.
  3. Steal only when idle. Don't steal eagerly. Only when your own deque is empty.

This sounds simple. The genius is in why it works.

Why opposite ends matter

Two reasons.

Cache locality. The owner's recent tasks — head of the deque — are typically the deepest part of a recursive computation. Their data is hot in the owner's cache. If a stealer grabbed those, the stolen task would run on a cold cache. By stealing from the tail, stealers take old, coarse-grained tasks whose data isn't in anyone's cache anyway — no cache miss penalty for moving them across cores.

Contention. A naive shared queue would have every worker hitting one head pointer with atomic operations. With per-worker deques accessed only at opposite ends, the owner's head and the stealer's tail are at different cache lines. The owner runs lock-free in its hot path, never touching the tail except via the deque's pop-tail dance. Stealers only contend with each other when fighting over the same target deque.

The classic Chase-Lev work-stealing deque from 2005 has a beautiful property: the owner's push and pop operations involve a single atomic compare-and-swap only in the boundary case where the deque has 0 or 1 elements. The rest of the time, push and pop are wait-free. Steal operations need one CAS to claim the task.

Work stealing vs work sharing vs central queue

Central queueWork sharingWork stealing
Coordination pointOne shared queueProducer-side migrationStealer-side migration
ContentionHigh (all workers hit it)Producer pays costOnly stealers contend
Idle worker behaviorBlock on queueWait to be pushedActively seek out work
Cache localityRandomProducer's data, not consumer'sOwner runs LIFO (good cache)
Cost when balancedHigh (synchronization)LowZero (no steals happen)
Cost when imbalancedResolved automaticallyMigration overheadO(P) steals to rebalance
Provable boundNoneNoneT₁/P + O(T∞)
Used inNaive thread poolsRare in modern runtimesCilk, Go, ForkJoinPool, Rayon, Tokio

Modern parallel runtimes overwhelmingly chose stealing. The reason: when load is balanced, stealing costs nothing. Each worker stays on its own deque, no inter-thread communication, no atomics. Only when one worker runs dry does any synchronization happen. That makes the common case fast and the imbalance case auto-correcting.

When work stealing fits

  • Fork-join parallelism with variable task sizes. Sorting a tree of objects where some subtrees are big and some are tiny. The work-stealer redistributes as cores finish their work.
  • Recursive divide-and-conquer. Parallel mergesort, FFT, matrix multiplication. The recursion explosion creates the parallel slack the algorithm needs to stay near-optimal.
  • Async runtimes with many small tasks. Tokio, Go, Erlang. Hundreds of thousands of tasks landing irregularly; work-stealing keeps every core fed.
  • Parallel stream operations. Java's stream.parallel() runs on ForkJoinPool's common pool, which is work-stealing under the hood.
  • Game/render engines. Unity's job system, Unreal's task graph, Bevy's ECS — all schedule small tasks across worker threads via work stealing.

Work stealing pays less when tasks are uniform and parallelism is shallow (e.g., a fixed parallel for-loop with equal-cost iterations) — in that case a simple parallel-for chunked across workers is fine. The benefit shows up when load is unpredictable.

Pseudo-code: the Chase-Lev deque skeleton

// Each worker has:
//   deque: lock-free array of task pointers
//   top:   atomic index (stealers' end)
//   bottom: atomic index (owner's end)

owner_push(task):
    b = bottom
    deque[b % size] = task
    bottom = b + 1                    // release-store, owner only

owner_pop():
    b = bottom - 1
    bottom = b
    t = top
    if t > b: bottom = t; return EMPTY
    task = deque[b % size]
    if t != b: return task             // common case, no CAS needed
    if CAS(top, t, t+1):
        bottom = t+1; return task
    bottom = t+1; return EMPTY

steal():
    t = top
    b = bottom
    if t >= b: return EMPTY
    task = deque[t % size]
    if CAS(top, t, t+1):
        return task
    return RETRY

// Idle worker's main loop:
worker_loop():
    while not done:
        task = owner_pop()
        if task == EMPTY:
            for attempt in 1..MAX:
                victim = random_worker()
                task = victim.steal()
                if task != EMPTY: break
            if task == EMPTY:
                park_until_signaled()
                continue
        execute(task)

The whole work-stealing scheduler fits in a couple hundred lines of carefully written lock-free code. The real implementations — JDK's ForkJoinPool, Go's runtime, Tokio — add load balancing heuristics, NUMA awareness, and adaptive parking, but the core stays this simple.

A real Rust example with Rayon

use rayon::prelude::*;

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();
    let sum: u64 = data.par_iter()
        .map(|&x| expensive(x))
        .sum();
    println!("{}", sum);
}

fn expensive(x: u64) -> u64 {
    // Variable-cost work — some values take longer than others
    let mut acc: u64 = x;
    for _ in 0..(x % 100 + 1) { acc = acc.wrapping_mul(31).wrapping_add(7); }
    acc
}

Rayon's par_iter() recursively splits the range into smaller and smaller sub-ranges. Each split becomes a task; tasks land on worker deques; idle workers steal from busy ones. The user wrote a sequential-looking iterator chain; Rayon scaled it across cores with near-linear speedup, even though expensive takes wildly different time for different inputs.

Cost analysis

  • Local push: ~10 ns. Increment bottom, write the task pointer.
  • Local pop (common case): ~10 ns. No CAS needed.
  • Steal: ~100 ns. One CAS on the target's top pointer plus the deque read. Successful steal costs ~200 ns including the cache transfer.
  • Idle worker overhead: a few hundred ns per attempt while spinning, then microseconds when parked.
  • Bound: T₁/P + O(T∞) running time. T₁ is sequential work; T∞ is critical-path length; P is workers. For programs with deep parallelism (T₁ ≫ P·T∞), this is within a factor of (1 + ε) of optimal.
  • Number of steal attempts in expectation: O(P·T∞) total across a whole computation. Idle workers steal a few times per work unit they consume.

Common pitfalls

  • Bulky tasks defeat the scheme. If your "task" is one big serial loop, only one worker runs it — the stealers find no slack. Break work into smaller tasks (Rayon's par_chunks, JDK's RecursiveTask with a base case).
  • Excessive task granularity. Too many tiny tasks and the steal overhead dominates. The classic rule is to spawn tasks until they're "small enough to be worth scheduling" — for arithmetic on arrays, somewhere around 1000–10000 elements per task is typical.
  • Pinning tasks to cores. Some workloads need affinity. Work stealing fights against that — a task can land on any core. Use a single-threaded scheduler or pinned task pools when affinity matters.
  • Recursive parking deadlock. Some early ForkJoinPool implementations could deadlock if a task waiting for a sub-task's result caused all workers to park. Modern implementations spawn helper threads or use continuations to avoid this. (Go avoids the problem by having every goroutine yield voluntarily.)
  • Stale references in deques. Lock-free deques have notoriously subtle correctness conditions. Don't write your own; use a vetted library (Crossbeam in Rust, ForkJoinPool in Java, Go's runtime).

Frequently asked questions

What is work stealing?

A scheduling discipline where each worker thread has its own deque (double-ended queue) of tasks. The owner pushes and pops from the head; idle workers steal from the tail of someone else's deque. This means each worker runs without contention when it has work, and idle workers find work without explicit hand-off.

Why steal from the tail and not the head?

Two reasons. First, contention: the owner is constantly touching the head, so the stealer touches the tail to minimize cache contention and atomic-cas conflicts. Second, locality: tasks at the head are the most recently spawned (typically the deepest subtree of a recursive computation, which has hot cache state). Tasks at the tail are older, coarser-grained — better to ship across cores.

Where is work stealing used in production?

Everywhere fork-join parallelism is used. Cilk popularized it in 1995. Intel's Threading Building Blocks (TBB) uses it. Java's ForkJoinPool (since Java 7, 2011) uses it — Stream.parallel() runs on it. Go's goroutine scheduler uses it. Rust's Rayon, Tokio multi-threaded runtime, Tokio's spawn_blocking pool — all work-stealing. Microsoft's TPL and .NET's Task scheduler — same.

Is work stealing provably optimal?

Yes, asymptotically. Blumofe and Leiserson (1994) proved that for fork-join programs with total work T1 and critical-path length T∞, a work-stealing scheduler on P workers takes time at most T1/P + O(T∞). For programs with parallel slack — many more available tasks than workers — this is within a constant factor of optimal. In practice, it gives near-linear speedup until the critical path dominates.

What happens when a worker has no work and no one to steal from?

The worker spins for a short period (waiting for new work to appear), then parks itself — using OS-level synchronization (futex, condition variable). When a busy worker spawns a new task, or another worker finds work, it wakes a parked one. The trade-off is between latency (spin longer to catch new work fast) and energy (park sooner to save CPU).

How does the Go scheduler use work stealing?

Each P (logical processor, one per OS worker) has a local runqueue of goroutines. New goroutines go on the spawner's local queue. When a worker runs out of goroutines, it tries: (1) the global runqueue, (2) stealing half from a randomly chosen P's local queue, (3) checking the network poller for I/O-completed goroutines. The steal target is randomized to avoid starvation.

What's the difference between work stealing and work sharing?

In work sharing, busy workers push new tasks to less-busy workers — the producer pays the migration cost. In work stealing, idle workers pull tasks from busy workers — the consumer pays. Stealing is better in practice because the busy worker shouldn't be interrupted to balance load (it's already productive); the idle worker has nothing better to do anyway. Stealing also avoids unnecessary migration when the load happens to be balanced.