Async Runtime
Coroutines
Functions that pause themselves and pick up where they left off
Coroutines yield control explicitly to a scheduler. A switch costs ~50 ns, versus ~1–5 µs for an OS thread. They come in two flavors: stackless and stackful.
- Stackless switch: ~50 ns
- Stackful (goroutine) switch: ~200 ns
- OS thread switch: ~1–5 µs
- Switch overhead vs thread: 20–100× cheaper
- Memory per coroutine: ~hundreds of bytes
- Scheduling: cooperative (yield)
How coroutines work
A regular function call has one entry point and one exit. You call it, it runs to completion, returns a value, and disappears. A coroutine has multiple entry/exit points. You call it, it runs partway, yields a value (or just an indication that it's pausing), and stays suspended in mid-execution — local variables intact, instruction pointer remembered — until someone resumes it. Then it picks up after the yield as if nothing happened.
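A minimal sketch of that suspend-and-resume behavior, using a Python generator as the coroutine. The name countdown is made up for illustration.

```python
def countdown(start):
    """Yield start, start-1, ..., 1, pausing after each value."""
    remaining = start              # local state survives every suspension
    while remaining > 0:
        yield remaining            # suspend here; resume continues on the next line
        remaining -= 1


gen = countdown(3)                 # creates the generator object; no body code runs yet
first = next(gen)                  # runs until the first yield
second = next(gen)                 # resumes after the yield, decrements, yields again
rest = list(gen)                   # drains whatever is left
print(first, second, rest)         # prints: 3 2 [1]
```

Between the calls to next(), the function is neither running nor finished: its frame, including remaining, simply waits.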
This pattern shows up everywhere modern asynchronous code lives. Python's async def functions, Kotlin's suspend functions, C++20's co_await, Rust's async fn, Go's goroutines, JavaScript's async/await, Lua's coroutine.yield: all are coroutines. The surface syntax, scheduler design, and implementation (stackless or stackful) differ, but the underlying idea of suspending and resuming a function is the same.
The whole point is that you can write straight-line code that looks like it blocks (call read(), do something with the bytes), but the call to read actually yields back to a scheduler, lets other coroutines run, and only resumes your function when the I/O completes. You get the readability of threads with the performance of an event loop.
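As a rough illustration, here is a sketch with asyncio in which asyncio.sleep stands in for real I/O. The names fake_read and handler are hypothetical; nothing touches the network.

```python
import asyncio

log = []  # records the order in which steps actually run


async def fake_read(name, delay):
    """Stand-in for an I/O call: sleeps instead of touching the network."""
    await asyncio.sleep(delay)           # yields to the event loop while "waiting"
    return f"{name}-data"


async def handler(name, delay):
    log.append(f"{name}: start")
    data = await fake_read(name, delay)  # reads like a blocking call
    log.append(f"{name}: got {data}")
    return data


async def main():
    # Both handlers share one thread; the slow one does not hold up the fast one.
    return await asyncio.gather(handler("slow", 0.05), handler("fast", 0.01))


results = asyncio.run(main())
print(results)  # gather keeps argument order: ['slow-data', 'fast-data']
print(log)      # the fast handler finishes before the slow one
```

Each handler is written top to bottom as if read blocked, yet the log shows the two interleaving on a single thread.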
Stackless vs stackful
The fundamental design choice. Both forms behave the same from a user's perspective — call, yield, resume — but the implementation diverges sharply.
Stackful coroutines have their own dedicated stack, allocated when the coroutine is created. When the coroutine yields, the runtime saves the current CPU registers and stack pointer, then switches to another coroutine's stack. Lua coroutines, Go goroutines, and POSIX ucontext-based coroutines work this way. You can yield from anywhere, including from deep inside a regular function called by the coroutine, because that nested call is using the coroutine's own stack.
Stackless coroutines don't get their own stack. The compiler rewrites the coroutine function into a state machine. Local variables that live across a suspension point become fields in a struct, which is usually heap-allocated (Rust futures are ordinary values and can also sit on the stack or inside another future). The body becomes a giant switch over a resume token (the instruction pointer, basically). When you "yield," the function returns; when you "resume," the function is called again and the switch jumps to the right state. Python generators, C++20 coroutines, Rust async, and Kotlin's compiler-transformed suspend functions all use this model.
The stackless approach is cheaper (there's no second stack, just a small coroutine frame), but it has a limitation: you can only yield from inside the coroutine function itself, not from an ordinary nested call. A callee can suspend only if it is itself a coroutine and the caller awaits it. This is why Rust's .await only works inside async fn bodies and async blocks, and Python's await only works inside async def functions. The compiler has to see the yield to rewrite the function.
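Python generators make the limitation easy to see. In this sketch (helper names are hypothetical), an ordinary helper must run to completion, while a helper that wants to suspend has to be a generator itself and be delegated to explicitly.

```python
def helper_plain(items):
    """An ordinary function: it cannot suspend the generator that calls it."""
    return [item * 2 for item in items]    # must run to completion


def helper_gen(items):
    """A generator of its own: each yield suspends this frame and its delegator."""
    for item in items:
        yield item * 2


def outer():
    # A plain helper hands back a finished result; nothing suspended inside it.
    yield helper_plain([1, 2])
    # To suspend "inside" a nested call, the callee must be a generator and the
    # caller must delegate with `yield from`. Every frame in the chain is marked.
    yield from helper_gen([3, 4])


print(list(outer()))  # prints: [[2, 4], 6, 8]
```

A stackful runtime needs no such marking: any function on the coroutine's stack could trigger the switch.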
Coroutines vs threads vs callbacks
| | Coroutines | OS threads | Callbacks |
|---|---|---|---|
| Scheduling | Cooperative (explicit yield) | Preemptive (OS) | Driven by event loop |
| Context switch cost | ~50–200 ns | ~1–5 µs | Function call (~5 ns) per event |
| Memory per unit | Bytes – KB | ~8 MB virtual stack | ~bytes |
| Reads like | Sync code | Sync code | Inverted control flow |
| Race conditions inside one thread | None between yield points (no preemption) | Yes (locks needed) | None within a callback |
| True parallelism | Only if M:N runtime | Yes | No, single-threaded |
| Stack traces | Sometimes synthetic | Native | Lost across callback boundaries |
| Cancellation | Cooperative | Hard, signal-based | Cooperative |
Threads are still the simplest model when you have a small fixed number of concurrent operations and don't mind the per-thread memory cost. Coroutines win when you have thousands or millions of concurrent operations — typical server workloads where most coroutines spend most of their time waiting on I/O.
When coroutines fit
- I/O-bound concurrency at scale. A million HTTP requests in flight. A thousand database queries. A WebSocket server holding open connections. Anywhere the bottleneck is "wait for the network," coroutines absorb the wait at near-zero cost.
- Pipelines and generators. Producer-consumer chains where each stage transforms data and yields it. Python's yield-based generators are pure coroutines, and they're how every iterator-based pipeline works.
- Game-state machines and AI. Long-running logic that needs to pause for ticks. co_await in C++20 lets you write AI behavior as while (true) { co_await delay(2s); attack(); co_await ... }.
- UI event handling that wants linear code. JavaScript async/await, Swift async/await, Kotlin coroutines on Android: all let you write event-driven logic as sequential blocks.
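The pipeline case from the list above can be sketched with plain generators. The stage names here are invented for the example.

```python
def read_lines(text):
    """Producer: yield one line at a time."""
    for line in text.splitlines():
        yield line


def non_empty(lines):
    """Filter stage: drop blank lines."""
    for line in lines:
        if line.strip():
            yield line


def to_upper(lines):
    """Transform stage: upper-case each line."""
    for line in lines:
        yield line.upper()


# Stages are chained lazily: no line is read until the consumer asks for one.
pipeline = to_upper(non_empty(read_lines("alpha\n\nbeta\n")))
print(list(pipeline))  # prints: ['ALPHA', 'BETA']
```

Each stage suspends at its yield until the next stage pulls, so only one item is in flight at a time regardless of input size.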
Don't reach for coroutines when the work is CPU-bound. A coroutine that never yields is just a function with extra overhead. CPU-heavy work belongs on threads or, better, on dedicated worker processes.
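When blocking or CPU-heavy work is unavoidable, one option in Python is to hand it to an executor so the event loop keeps running. The sketch below uses the default thread pool and time.sleep as a stand-in for the blocking call; blocking_work and heartbeat are hypothetical names. For pure-Python number crunching, a concurrent.futures.ProcessPoolExecutor can be passed in place of None, since threads share the interpreter lock.

```python
import asyncio
import time


def blocking_work(seconds):
    """Stand-in for a blocking call; time.sleep never yields to the event loop."""
    time.sleep(seconds)
    return "done"


async def heartbeat(ticks, interval):
    """Counts how many times the loop got a chance to run this coroutine."""
    count = 0
    for _ in range(ticks):
        await asyncio.sleep(interval)
        count += 1
    return count


async def main():
    loop = asyncio.get_running_loop()
    beat = asyncio.create_task(heartbeat(ticks=5, interval=0.01))
    # Offload the blocking call to the default thread pool so the loop stays free.
    result = await loop.run_in_executor(None, blocking_work, 0.1)
    ticks_seen = await beat
    return result, ticks_seen


outcome = asyncio.run(main())
print(outcome)  # prints: ('done', 5)
```

Calling blocking_work(0.1) directly inside main would instead stall the heartbeat for the full duration.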
Pseudo-code: how stackless coroutines work
    # What you write (Python-ish):
    async def fetch_two():
        a = await get('/foo')
        b = await get('/bar')
        return a + b

    # Roughly what the compiler generates (pseudo-code):
    class FetchTwoCoroutine:
        state = 0
        a = None
        awaitable = None

        def resume(self, value=None):
            if self.state == 0:
                self.awaitable = get('/foo')
                self.state = 1
                return SUSPEND(self.awaitable)
            if self.state == 1:
                self.a = value              # result of get('/foo')
                self.awaitable = get('/bar')
                self.state = 2
                return SUSPEND(self.awaitable)
            if self.state == 2:
                b = value                   # result of get('/bar')
                self.state = DONE
                return RETURN(self.a + b)
The whole "function with awaits" is rewritten as a state machine. Local variables become fields. await becomes "save state, return an awaitable, wait to be resumed." The event loop or scheduler handles the resumption.
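For readers who want to run it, here is one way the same state machine could look as working Python. It is a sketch, not what any particular compiler emits: get, SUSPEND, RETURN, and the toy run scheduler are all stand-ins invented for the example, and the "awaitable" is just a dictionary describing the request.

```python
SUSPEND, RETURN = "suspend", "return"


def get(path):
    """Hypothetical awaitable: here just a description of the pending request."""
    return {"path": path}


class FetchTwo:
    """Hand-written form of: a = await get('/foo'); b = await get('/bar'); return a + b"""

    def __init__(self):
        self.state = 0
        self.a = None

    def resume(self, value=None):
        if self.state == 0:
            self.state = 1
            return SUSPEND, get("/foo")
        if self.state == 1:
            self.a = value                 # result of get('/foo')
            self.state = 2
            return SUSPEND, get("/bar")
        if self.state == 2:
            self.state = 3                 # finished
            return RETURN, self.a + value  # value is the result of get('/bar')
        raise RuntimeError("coroutine already finished")


def run(coro, responses):
    """Toy scheduler: completes each awaitable immediately from a lookup table."""
    kind, payload = coro.resume()
    while kind == SUSPEND:
        kind, payload = coro.resume(responses[payload["path"]])
    return payload


print(run(FetchTwo(), {"/foo": "foo-body;", "/bar": "bar-body"}))  # prints: foo-body;bar-body
```

A real event loop would park the coroutine until the I/O finished instead of answering from a table, but the resume protocol is the same.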
A real Python coroutine
    import asyncio
    import aiohttp

    async def fetch(session, url):
        async with session.get(url) as resp:
            return await resp.text()

    async def fetch_many(urls):
        async with aiohttp.ClientSession() as session:
            # Schedule all fetches concurrently; each is a coroutine.
            # gather() runs them on one event loop, one thread.
            return await asyncio.gather(*(fetch(session, u) for u in urls))

    # Fetch 1000 URLs concurrently, all in a single Python thread.
    urls = [f"https://api.example.com/page/{i}" for i in range(1000)]
    results = asyncio.run(fetch_many(urls))
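That program runs 1000 fetch coroutines in one Python thread. Each coroutine is small (on the order of kilobytes in CPython, plus whatever buffers the HTTP client holds), so the whole thing fits in megabytes of RAM. Note that aiohttp's default connector allows 100 simultaneous connections, so the remaining requests queue inside the session rather than all hitting the network at once. The equivalent thread-per-request program would need 1000 threads × 8 MB of virtual stack (a common Linux default) = 8 GB of address space, reserved though not necessarily resident.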
C++20 coroutines
    #include <coroutine>
    #include <exception>
    #include <iostream>
    #include <utility>

    struct Generator {
        struct promise_type {
            int current_value = 0;
            Generator get_return_object() {
                return Generator{std::coroutine_handle<promise_type>::from_promise(*this)};
            }
            std::suspend_always initial_suspend() { return {}; }
            std::suspend_always final_suspend() noexcept { return {}; }
            std::suspend_always yield_value(int v) { current_value = v; return {}; }
            void return_void() {}
            void unhandled_exception() { std::terminate(); }
        };

        std::coroutine_handle<promise_type> handle;

        explicit Generator(std::coroutine_handle<promise_type> h) : handle(h) {}
        Generator(Generator&& other) noexcept : handle(std::exchange(other.handle, {})) {}
        Generator(const Generator&) = delete;
        ~Generator() { if (handle) handle.destroy(); }  // free the coroutine frame

        // Resume until the next co_yield; returns false once the coroutine has finished.
        bool next() { handle.resume(); return !handle.done(); }
        int value() const { return handle.promise().current_value; }
    };

    Generator counter() {
        for (int i = 0; ; i++) co_yield i;
    }

    int main() {
        Generator g = counter();
        for (int n = 0; n < 5; n++) { g.next(); std::cout << g.value() << "\n"; }
    }
C++20 makes you write the plumbing (the promise_type) but the body of counter() reads as straight-line code with co_yield. The compiler does the state-machine rewrite.
Cost analysis
- Stackless coroutine switch (compiled state machine): ~30–80 ns. Just a few register stores and an indirect jump.
- Stackful coroutine switch (Go goroutine): ~200 ns. Includes save/restore of callee-saved registers and stack pointer swap.
- OS thread context switch: ~1–5 µs. Kernel entry, scheduler decision, register file save/restore, then cache misses (and TLB misses when the switch crosses address spaces). 20–100× more expensive.
- Memory per stackless coroutine: a struct as wide as the function's locals. Often ~100 bytes.
- Memory per stackful coroutine (goroutine): starts at 2 KB and grows by copying to a larger stack when it overflows. Most stay tiny.
- Concurrency ceiling: millions of coroutines per process. The bottleneck is your data, not the runtime.
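These figures are for compiled runtimes. To get a feel for the same quantities on your own machine, a rough measurement in CPython could look like the sketch below. The helper names are made up, and interpreter overhead means the numbers will differ from the ones quoted above; treat the output as an order-of-magnitude check only.

```python
import time
import tracemalloc


def ticker():
    """A generator that suspends forever; each next() is one resume/suspend pair."""
    while True:
        yield


def switch_cost_ns(iterations=200_000):
    """Average wall-clock time of one resume/suspend round trip, in nanoseconds."""
    gen = ticker()
    next(gen)                                   # advance to the first yield
    start = time.perf_counter_ns()
    for _ in range(iterations):
        next(gen)
    return (time.perf_counter_ns() - start) / iterations


async def idle():
    """Hypothetical do-nothing coroutine used only for sizing."""
    return None


def bytes_per_coroutine(count=10_000):
    """Approximate heap bytes per not-yet-started coroutine object."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    coros = [idle() for _ in range(count)]
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    for coro in coros:
        coro.close()                            # avoid "never awaited" warnings
    return (after - before) / count


print(f"resume/suspend: ~{switch_cost_ns():.0f} ns")
print(f"coroutine object: ~{bytes_per_coroutine():.0f} bytes")
```

The timing includes the cost of the Python for loop itself, so it overstates the switch cost somewhat.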
Common pitfalls
- Forgetting to await. In Python, calling foo() on an async def foo returns a coroutine object; it doesn't run. You have to await foo() or schedule it via asyncio.create_task. Forgetting produces a "coroutine 'foo' was never awaited" RuntimeWarning but no execution.
- Blocking the event loop. A coroutine that calls a synchronous, blocking function (a regular file read, a CPU-bound loop) freezes the whole event loop. Use loop.run_in_executor or equivalent to offload to a thread pool.
- Lifetime of references. Stackless coroutines' "stack" is a heap struct. If a local variable in a coroutine is a reference to something on the caller's stack, and the caller's stack unwinds before the coroutine resumes, you have a dangling pointer. Rust's borrow checker catches this; C++ relies on you knowing what you're doing.
- Cancellation surprises. Cancelling an awaiting coroutine in Python raises CancelledError at the await point. You're expected to clean up and re-raise. Swallowing the exception breaks shutdown; every async tutorial warns about it, and everyone still hits it.
- Mixing sync and async. "Function color": async functions can call async or sync; sync functions can only call sync. Refactoring a deep sync function to be async means rewriting everything that calls it. The shape of an async API is contagious.
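The cancellation pattern from the list above, as a minimal asyncio sketch. The worker body and the events list are illustrative only.

```python
import asyncio

events = []


async def worker():
    try:
        await asyncio.sleep(10)          # cancellation is delivered at this await
    except asyncio.CancelledError:
        events.append("cleanup")         # release resources here
        raise                            # re-raise so the task ends up cancelled


async def main():
    task = asyncio.create_task(worker())
    await asyncio.sleep(0)               # let the worker start and reach its await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        events.append("cancelled")
    return task.cancelled()


was_cancelled = asyncio.run(main())
print(events, was_cancelled)  # prints: ['cleanup', 'cancelled'] True
```

If worker dropped the raise, the task would finish normally and task.cancelled() would report False, which is exactly the kind of silent swallow that confuses shutdown logic.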
Frequently asked questions
What's the difference between a coroutine and a thread?
A thread is preempted by the OS: at any instruction the kernel can pause it and switch to another thread. A coroutine yields control voluntarily: only at explicit yield/await points does it surrender the CPU. That means coroutines sharing one thread cannot interrupt each other between yield points, so most shared state needs no locks (only updates that span an await need protection), and a context switch costs ~50 ns instead of ~1–5 µs.
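A small asyncio sketch of what the absence of preemption buys, and where it stops. The worker names are hypothetical. An update with no await inside it cannot be interleaved; put an await in the middle of a read-modify-write and lost updates come back.

```python
import asyncio


async def add_atomic(counter, n):
    """No await inside the read-modify-write, so no other coroutine can interleave."""
    for _ in range(n):
        counter["value"] += 1


async def add_with_gap(counter, n):
    """An await between read and write lets other coroutines run in the gap."""
    for _ in range(n):
        current = counter["value"]
        await asyncio.sleep(0)           # yield point in the middle of the update
        counter["value"] = current + 1


async def run(worker, tasks=10, n=100):
    counter = {"value": 0}
    await asyncio.gather(*(worker(counter, n) for _ in range(tasks)))
    return counter["value"]


safe_total = asyncio.run(run(add_atomic))
racy_total = asyncio.run(run(add_with_gap))
print(safe_total)   # prints: 1000
print(racy_total)   # less than 1000: updates were lost across the await
```

When an invariant has to hold across an await, asyncio.Lock is the usual tool.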
What's the difference between stackless and stackful coroutines?
A stackful coroutine has its own dedicated stack (a few KB up to MB). When suspended, the whole stack is kept around. Lua coroutines, Go goroutines, and POSIX ucontext-based coroutines work this way. A stackless coroutine is transformed by the compiler into a state machine: local variables become struct fields, and control flow becomes a switch on a resume token. Python generators, C++20 coroutines, Rust async functions, and Kotlin coroutines all use the stackless model. Stackless is cheaper but can't yield from inside an arbitrary nested function call.
How does Python's async/await work under the hood?
An async function is built on the same machinery as a generator, with extra plumbing. Each await suspends the function and hands an awaitable back to the event loop. The event loop tracks which awaitables are pending (typically socket I/O watched through epoll, kqueue, or the platform's equivalent) and resumes coroutines when their awaitables complete. The coroutine's state (local variables and the current instruction offset) is held in a heap-allocated frame object. Resuming amounts to calling .send() with the result.
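To make the .send() protocol concrete, here is a sketch that drives a coroutine by hand with no asyncio at all. Pending, add_remote, and drive are invented for the example; a real event loop would wait for I/O where drive answers immediately.

```python
class Pending:
    """Hypothetical awaitable: suspends once and hands a request to the driver."""

    def __init__(self, request):
        self.request = request

    def __await__(self):
        result = yield self.request      # suspend; the driver sees `request`
        return result                    # the value passed to send() becomes the await result


async def add_remote(x, y):
    a = await Pending(("double", x))
    b = await Pending(("double", y))
    return a + b


def drive(coro):
    """Toy event loop: answers each request immediately instead of waiting on I/O."""
    reply = None
    try:
        while True:
            op, arg = coro.send(reply)   # resume the coroutine until its next await
            reply = arg * 2 if op == "double" else None
    except StopIteration as stop:
        return stop.value                # the coroutine's return value


print(drive(add_remote(3, 4)))  # prints: 14
```

The first send(None) starts the coroutine; every later send delivers the result of the previous await, and StopIteration carries the final return value.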
Why are coroutines so much cheaper than threads?
Three reasons. (1) Switching between coroutines never goes through the kernel: no syscall, no kernel scheduler. (2) The 'context' is just the few words a coroutine actually uses, not the kernel's TCB plus the whole register file. (3) No kernel code runs, so there is less cache pollution, and no address-space change that could flush the TLB. The net cost: ~50 ns for stackless coroutines, ~200 ns for goroutines, vs ~1–5 µs for an OS thread switch.
Can coroutines run on multiple CPU cores?
Only if the runtime schedules them across worker threads. Python's asyncio is single-threaded (one event loop, one thread). Go's goroutines and Java virtual threads multiplex onto a pool of OS threads (an M:N model), so they run truly in parallel. The runtime's job is moving coroutines between worker threads and balancing load.
What does 'yield' mean in a coroutine context?
It's the explicit suspension point. The coroutine produces a value (or none), gives control back to whoever resumed it, and saves its current state so the next resume picks up exactly where it left off. The most familiar form is Python's yield in generators (return a value, suspend). Async/await is the same idea with a richer protocol: await means 'suspend until this awaitable completes.'
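A yield can also receive a value when the coroutine is resumed. The sketch below, with the made-up name running_average, uses a generator's send() to push numbers in and get the average so far back out.

```python
def running_average():
    """Generator-based coroutine: receives numbers, yields the average so far."""
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average        # hand back the average, wait for the next number
        total += value
        count += 1
        average = total / count


avg = running_average()
next(avg)                            # prime: run to the first yield
print(avg.send(10))                  # prints: 10.0
print(avg.send(20))                  # prints: 15.0
print(avg.send(30))                  # prints: 20.0
```

total and count live in the suspended frame between calls, which is the "saves its current state" part in action.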
Are coroutines the same thing as green threads?
They're related but not identical. A green thread is a user-space scheduling unit that the runtime can preempt at safe points; a coroutine is a function that yields cooperatively. Goroutines are stackful coroutines with preemptive scheduling: they look like green threads from the outside, yielding cooperatively at blocking operations and at compiler-inserted safe points, and since Go 1.14 the runtime can also preempt a long-running goroutine asynchronously. Java virtual threads are similar in that they are user-mode threads multiplexed onto a small pool of OS threads.