Async Runtime
Green Threads
Millions of threads on a handful of CPUs
Green threads are user-space threads scheduled by a runtime. Millions per process possible; switching never enters the kernel.
- Threading model: M:N (M user, N OS)
- Goroutine initial stack: 2 KB (grows on demand)
- Goroutines per process: Millions tested
- Switch cost: ~200 ns (Go)
- Java Loom virtual threads: Stable in JDK 21 (2023)
- Erlang processes: ~300 words / proc
How green threads work
An OS thread is a heavyweight thing. The kernel typically reserves an 8 MB virtual stack (the Linux default), registers the thread in its task table, and schedules it across CPU cores using the same machinery (CFS, or EEVDF on newer Linux kernels) that schedules every other thread on the system. Creating one costs tens of microseconds. Switching between two threads costs 1–5 µs. Ten thousand threads on a 64-core machine is achievable, but at 8 MB each they reserve about 80 GB of virtual address space and add real scheduler overhead.
Green threads turn the problem inside out. Instead of asking the kernel to manage a million threads, the language runtime creates a handful of OS threads ("workers" or "carriers") and runs its own scheduler on top of them. Each green thread is a tiny data structure — Go goroutines start at 2 KB; Erlang processes weigh around 300 words. The runtime can create them, switch between them, and destroy them entirely in user space.
This is called the M:N model: M green threads multiplexed onto N OS threads. M can be in the millions; N is usually equal to the number of CPU cores. When a green thread is ready to run, the scheduler assigns it to a free OS worker. When the green thread blocks on I/O, the runtime parks it and the OS worker picks up the next ready green thread, with no kernel context switch in either direction.
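To make the M:N ratio concrete, here is a minimal Go sketch. The helper name `spawnParked` and the count of 100,000 are illustrative assumptions, not part of any library. It parks many goroutines on a channel and then reports how many goroutines exist next to the GOMAXPROCS value, which caps how many OS threads execute Go code at once.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// spawnParked (hypothetical helper) starts n goroutines that block on a
// channel, then reports how many goroutines exist and the GOMAXPROCS value.
func spawnParked(n int) (live int, workers int) {
	release := make(chan struct{})
	var started, done sync.WaitGroup
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			started.Done()
			<-release // parked here; the worker moves on to other goroutines
			done.Done()
		}()
	}
	started.Wait() // every goroutine has begun running at least once
	live = runtime.NumGoroutine()
	workers = runtime.GOMAXPROCS(0) // an argument of 0 only queries the setting
	close(release)                  // wake all parked goroutines
	done.Wait()
	return live, workers
}

func main() {
	live, workers := spawnParked(100_000)
	fmt.Printf("M = %d goroutines, N = %d threads running Go code at once\n", live, workers)
}
```

The exact numbers depend on the machine, but M should exceed N by several orders of magnitude.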
Threading models
| Model | Mapping | Example | Trade-off |
|---|---|---|---|
| 1:1 | Each user thread = one OS thread | Linux pthreads, Win32 threads, Java pre-Loom | True parallelism, expensive switch, ~MB per thread |
| N:1 | All user threads on one OS thread | Java 1.0 green threads, Python asyncio | Cheap, but can't use multiple cores |
| M:N | Many user threads on N OS threads | Go goroutines, Erlang, Java virtual threads, Rust Tokio | Cheap + parallel; runtime is complex |
| Hybrid | Runtime adjusts N based on workload | Erlang/OTP, Go since 1.14 | Best of M:N; some work-stealing magic |
The history is interesting: Java started with green threads in 1.0 (an N:1 model), abandoned them by 1.2 in favor of native pthreads (1:1), and then re-introduced them as virtual threads in JDK 21 (M:N). It took the industry 25 years to converge on what Erlang had in 1986 and Go shipped in 2009.
When to use green threads
- Massive I/O concurrency. 100k WebSocket connections. A chat server with a million users. An API gateway proxying to dozens of upstream services per request. Green threads turn "concurrency" from a precious resource into something you don't think about.
- Per-request worker patterns. A web server that wants one worker per incoming request, but can't afford one OS thread per request. Java's virtual threads make traditional thread-per-request servers viable at modern scale.
- Actor systems. Erlang/OTP, Akka, Elixir GenServers. Each actor is a green thread (Erlang calls them processes), receiving messages from a mailbox. Cheap process creation lets you model the world as millions of independent actors.
- Pipelined work. Go's `go func()` plus channels lets you express pipelines as a graph of goroutines connected by channels. Each stage is a goroutine.
Green threads don't help with purely CPU-bound work. A million goroutines doing FFTs don't run faster than a pool of goroutines sized to the number of cores. For that, use such a pool or, for tightly coupled compute, a thread pool with explicit task queues.
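A pool sized to the core count might look like the following sketch. The function name `sumSquares` and the workload (summing squares) are placeholders chosen for illustration. Real CPU-heavy work would go where the multiplication is, and would normally be handed out in larger chunks than one integer per channel send.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// sumSquares (hypothetical example workload) adds up x*x for x in [0, n)
// using one worker goroutine per CPU core instead of one goroutine per item.
func sumSquares(n int) int {
	workers := runtime.NumCPU()
	jobs := make(chan int, workers)
	partial := make([]int, workers) // one slot per worker, so no locking is needed
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for x := range jobs {
				partial[id] += x * x
			}
		}(w)
	}
	for i := 0; i < n; i++ {
		jobs <- i
	}
	close(jobs) // lets the workers' range loops terminate
	wg.Wait()
	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	fmt.Println(sumSquares(1000)) // 332833500
}
```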
Preemption — the hard part
The classic green-thread model is purely cooperative: a green thread runs until it yields voluntarily. Every I/O operation, channel send/receive, sleep, or mutex acquisition is a yield point. In practice this is fine for I/O-bound code — which is exactly what green threads are good at — but it breaks horribly for CPU-bound code with no I/O calls.
Imagine 100 goroutines on a purely cooperative runtime, one of which enters a tight numeric loop with no yield points. On a 4-core box, the runtime has 4 OS workers. Once a worker picks up that goroutine, it runs forever: the worker is lost, and goroutines waiting in that worker's queue cannot run until another worker steals them. If anything needs every goroutine to reach a yield point (a stop-the-world GC phase, for example), the whole program stalls. From the user's perspective: the program freezes.
Modern runtimes solve this in three ways:
- Compiler-inserted yield checks (Go < 1.14). Every function prologue checked a stack-grow flag; the runtime could set the flag to force a yield. Worked but missed tight inner loops with no function calls.
- Signal-based preemption (Go ≥ 1.14). The runtime sends a SIGURG to the worker thread, which lands in a handler that suspends the running goroutine at the current instruction (with stack metadata from the compiler making this safe). Works for tight loops too.
- Yielding only at safepoints. Java's JIT inserts safepoint polls into loops, and the JVM uses them to suspend threads safely. As of JDK 21, however, the virtual-thread scheduler does not time-slice: a virtual thread gives up its carrier only when it blocks.
The bottom line: in a runtime with asynchronous preemption, such as Go since 1.14, a CPU-bound goroutine eventually gets preempted (typically within ~10 ms), so it can't completely starve the others. But it still hogs one OS worker. The right move for known CPU-heavy work is still to use a separate worker pool.
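The effect of asynchronous preemption can be observed with a small experiment. The sketch below assumes Go 1.19 or later (for `atomic.Bool`) on a platform where signal-based preemption is available; `survivesTightLoop` is a made-up name. It restricts the program to one worker, starts a goroutine that spins without any blocking operation, and checks that the caller still gets scheduled again. On a purely cooperative runtime the call would never return.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// survivesTightLoop (hypothetical helper) returns true once the calling
// goroutine has been rescheduled while another goroutine spins in a tight loop.
func survivesTightLoop() bool {
	prev := runtime.GOMAXPROCS(1) // a single worker: the spinner and the caller must share it
	defer runtime.GOMAXPROCS(prev)

	var stop atomic.Bool
	done := make(chan struct{})
	go func() {
		for !stop.Load() {
			// Busy loop: no I/O, no channel operation, no sleep.
		}
		close(done)
	}()

	// While the caller sleeps, the spinner owns the only worker. The caller can
	// resume only if the runtime preempts the spinner.
	time.Sleep(50 * time.Millisecond)
	stop.Store(true)
	<-done
	return true
}

func main() {
	fmt.Println("caller resumed:", survivesTightLoop())
}
```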
Pseudo-code of an M:N scheduler
    // Each OS worker thread runs roughly this loop:
    worker_loop(local_runq, global_runq):
        while not shutdown:
            g = local_runq.pop()
            if g is None:
                g = global_runq.pop()
            if g is None:
                g = work_stealing(other_workers)  // see /cs/work-stealing
            if g is None:
                sleep_until_event()
                continue

            // Resume the green thread on this OS worker
            save_worker_context()
            load_green_thread_context(g)
            run_until_yield_or_block(g)
            save_green_thread_context(g)
            load_worker_context()

            if g.state == BLOCKED_ON_IO:
                register_io_wakeup(g)
            elif g.state == RUNNABLE:
                local_runq.push(g)
            elif g.state == FINISHED:
                free(g)
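The core of that loop (pop a task, run it to its next yield point, requeue it if it is still runnable) can be tried out in a toy form. The sketch below is a deliberately simplified, single-worker version: there is no context switching, no I/O parking, and no work stealing, and the `task`, `runAll`, and `countdown` names are invented for this illustration. Each "green thread" is modelled as a step function that returns when it would yield.

```go
package main

import "fmt"

// task is a hypothetical stand-in for a green thread. Each call to step runs
// the task up to its next yield point and reports whether it has finished.
type task struct {
	name string
	step func() (finished bool)
}

// runAll is a single-worker run loop. It returns the order in which tasks
// were given the worker.
func runAll(runq []*task) []string {
	var trace []string
	for len(runq) > 0 {
		t := runq[0] // pop the next runnable task
		runq = runq[1:]
		trace = append(trace, t.name)
		if !t.step() {
			runq = append(runq, t) // still runnable: back of the queue
		}
	}
	return trace
}

// countdown builds a task that needs n steps to finish.
func countdown(name string, n int) *task {
	return &task{name: name, step: func() bool {
		n--
		return n <= 0
	}}
}

func main() {
	trace := runAll([]*task{countdown("a", 2), countdown("b", 1), countdown("c", 3)})
	fmt.Println(trace) // [a b c a c c]
}
```

The trace shows round-robin interleaving: short tasks drop out early and the longest one finishes alone.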
A Go example
    package main

    import (
        "fmt"
        "net/http"
        "sync"
    )

    // Fetch 1000 URLs concurrently — each fetch is one goroutine.
    func main() {
        var wg sync.WaitGroup
        urls := make([]string, 1000)
        for i := range urls {
            urls[i] = fmt.Sprintf("https://api.example.com/%d", i)
        }
        for _, u := range urls {
            wg.Add(1)
            go func(url string) {
                defer wg.Done()
                resp, err := http.Get(url)
                if err != nil {
                    return
                }
                resp.Body.Close()
            }(u)
        }
        wg.Wait()
    }
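1000 goroutines, each blocking on HTTP I/O. Total memory: at least 2 MB of goroutine stacks (1000 × 2 KB, more once stacks grow) plus connection state. The runtime keeps a few OS threads busy multiplexing them all through its network poller (epoll on Linux) under the hood. The goroutines themselves scale much further: a laptop can hold a million of them, although at that point file-descriptor limits and the remote server, not the scheduler, become the constraint.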
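When the bottleneck is an external resource rather than the goroutines, a common approach is to cap the number of operations in flight with a buffered channel used as a counting semaphore. The sketch below is one way to do that. `runBounded` is a hypothetical helper, the peak tracking exists only to make the bound observable, and the `work` callback stands in for something like an HTTP fetch. It assumes Go 1.19 or later for `atomic.Int64`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runBounded (hypothetical helper) calls work(i) for every i in [0, n) with at
// most limit calls in flight, and returns the highest concurrency it observed.
func runBounded(n, limit int, work func(i int)) int64 {
	sem := make(chan struct{}, limit) // counting semaphore with limit slots
	var wg sync.WaitGroup
	var inFlight, peak atomic.Int64
	for i := 0; i < n; i++ {
		sem <- struct{}{} // blocks while limit goroutines hold a slot
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when this call is done
			cur := inFlight.Add(1)
			for { // record a new peak if this is the highest count seen so far
				p := peak.Load()
				if cur <= p || peak.CompareAndSwap(p, cur) {
					break
				}
			}
			work(i)
			inFlight.Add(-1)
		}(i)
	}
	wg.Wait()
	return peak.Load()
}

func main() {
	var handled atomic.Int64
	peak := runBounded(1000, 16, func(i int) { handled.Add(1) })
	fmt.Printf("handled %d items, peak concurrency %d (limit 16)\n", handled.Load(), peak)
}
```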
Java virtual threads
    // JDK 21+
    try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
        for (int i = 0; i < 100_000; i++) {
            executor.submit(() -> {
                // Looks like an OS thread — isn't. Costs ~hundreds of bytes.
                HttpClient.newHttpClient()
                    .send(HttpRequest.newBuilder().uri(URI.create("...")).build(),
                          HttpResponse.BodyHandlers.discarding());
                return null;
            });
        }
    } // implicit join
Same Thread/ExecutorService API as Java has had since 1.5. The difference: each "thread" here is a virtual thread, scheduled by the JVM on a small carrier pool. The classic thread-per-request pattern is back, but now scales to hundreds of thousands of in-flight requests.
Performance and cost
- Goroutine creation: ~1 µs. Allocate a 2 KB stack and a goroutine struct. The runtime pools recently freed structs.
- Goroutine switch: ~200 ns. Save callee-saved regs, swap stack pointer, restore the other goroutine's regs. No syscall.
- OS thread switch (for comparison): ~1–5 µs. With syscall, scheduler, and TLB.
- Maximum goroutines: ~10 million tested in standard benchmarks; 1 million is a comfortable working size on a modern server.
- Carrier (OS thread) count: Go uses GOMAXPROCS (defaults to NUM_CPUS). Java Loom defaults similarly. Increasing N rarely helps for I/O-bound work.
- Erlang process memory: ~300 words (≈2.4 KB on 64-bit), GC'd per-process — one process's GC pause never affects others.
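Figures like the switch cost above are easy to sanity-check on your own hardware. The following is a rough sketch, not a rigorous benchmark: `pingPong` is a hypothetical helper that bounces an integer between two goroutines over unbuffered channels and divides the elapsed time by the number of hand-offs. Each hand-off includes channel overhead and, with several workers, possibly a cross-thread wakeup, so expect numbers somewhat above a bare context switch.

```go
package main

import (
	"fmt"
	"time"
)

// pingPong (hypothetical helper) passes a counter back and forth between two
// goroutines. It returns the average time per hand-off and the final counter,
// which equals the number of hand-offs. rounds must be positive.
func pingPong(rounds int) (perHop time.Duration, hops int) {
	ping, pong := make(chan int), make(chan int)
	go func() {
		for v := range ping {
			pong <- v + 1 // hand the value straight back
		}
		close(pong)
	}()

	start := time.Now()
	v := 0
	for i := 0; i < rounds; i++ {
		ping <- v     // first hand-off: caller to partner
		v = <-pong + 1 // second hand-off: partner back to caller
	}
	elapsed := time.Since(start)
	close(ping) // lets the partner goroutine exit
	return elapsed / time.Duration(2*rounds), v
}

func main() {
	perHop, hops := pingPong(1_000_000)
	fmt.Printf("%d hand-offs, about %v each\n", hops, perHop)
}
```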
Common pitfalls
- Blocking the carrier with syscalls the runtime doesn't intercept. A goroutine that calls an FFI / cgo function that internally calls `read()` on a non-Go-managed fd will block the whole OS worker. Go's runtime detects long blocking and spins up an extra worker; smaller runtimes can deadlock.
- Goroutine leaks. A goroutine waiting on a channel that no one will ever send to stays blocked forever. The runtime doesn't garbage-collect blocked goroutines, so the goroutine's stack and everything reachable from it stay alive. Use contexts with cancellation.
- Channel buffer overruns. Channels with small buffers can become bottlenecks. Pick the buffer size deliberately — and remember that an unbuffered channel is a rendezvous, not a queue.
- GOMAXPROCS misconfiguration in containers. Pre-Go 1.5 defaulted to 1. Even on modern Go in a CPU-limited container, GOMAXPROCS may still see the host's full CPU count, not your cgroup's limit (the runtime only became cgroup-aware in Go 1.25). Use `automaxprocs` in container environments on older releases.
- Stack growth surprises. Goroutines start with a small stack and grow on demand. A deep recursion or huge local array can trigger stack growth mid-function; under load the cost adds up. Generally invisible, but worth knowing.
Frequently asked questions
What's the difference between a green thread and an OS thread?
An OS thread is scheduled by the kernel and has a kernel-allocated 8 MB virtual stack. A green thread is scheduled by a language runtime in user space, gets a tiny stack (Go starts at 2 KB and grows as needed), and never goes through a syscall to switch. The runtime multiplexes many green threads onto a small pool of OS threads — the M:N model.
How many green threads can I have?
Millions. Go can comfortably run 1 million goroutines on a server with a few GB of RAM, since each goroutine takes ~2 KB minimum. Erlang systems can be configured to run millions of processes, and Java virtual threads claim similar numbers. Compare to OS threads, where 10,000 threads with default 8 MB stacks already reserve about 80 GB of virtual address space.
Are goroutines green threads?
Yes, with a twist. Goroutines are scheduled by the Go runtime in user space, multiplexed onto OS threads (the M:N model). They differ from classic green threads by supporting preemption — since Go 1.14, the runtime can preempt a goroutine at function call sites and signal-handler safe points, so a CPU-bound goroutine can't starve others. Classic green threads were purely cooperative.
What is Java's Project Loom?
Project Loom added virtual threads to Java (stable in JDK 21, 2023). A virtual thread is a green thread on top of the existing Thread API — your old thread-per-request code suddenly scales to millions of concurrent requests. The runtime parks a virtual thread when it blocks on I/O, freeing the carrier OS thread to run another. Same API; vastly better scaling.
Why can't green threads preempt CPU-bound code?
Classic green threads relied on cooperative yield points: every I/O, every channel operation, every sleep is a yield. A tight numeric loop with no yield points runs forever, blocking the whole carrier thread. Modern runtimes work around this by inserting yield checks at function call boundaries (Go before 1.14), by using signal-based preemption (Go since 1.14), or by running CPU-heavy work on a dedicated thread pool.
What's the M:N threading model?
M user-space threads scheduled onto N OS threads, with M >> N. The user-space runtime decides which green thread runs on which OS thread. Compare to 1:1 (one OS thread per user thread, e.g. Linux pthreads) and N:1 (all user threads on one OS thread, classic green threads circa Java 1.0). M:N gives you cheap creation, true parallelism, and good locality if the runtime is smart.
Are green threads the same as coroutines?
Closely related. A coroutine is a function that yields cooperatively; a green thread is a scheduler unit with its own stack and may include preemption. Goroutines are stackful coroutines with preemption — so they're both. Java virtual threads are similar. Python's asyncio coroutines, by contrast, aren't really green threads because they don't have stacks and aren't preempted — they're purely cooperative state machines.