Machine Learning
Speculative Decoding
A small model guesses ahead; the big model checks the guesses for free
Speculative decoding speeds up large-language-model inference by having a small draft model propose several tokens at once, then verifying them in a single parallel forward pass of the big model — accepting only the prefix that matches what the big model would have sampled, so the output is mathematically identical to ordinary decoding while running 2–3× faster.
- Typical speedup2–3×
- Output qualityLossless
- Draft length K4–8 tokens
- Tokens committed / step1 to K+1
- Bottleneck exploitedMemory bandwidth
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The intuition: guess cheap, verify in bulk
A 70-billion-parameter language model generating text one token at a time spends almost all of its time doing the same boring thing: reading 140 GB of weights out of GPU memory, multiplying them against a single column vector, and emitting one token. The arithmetic is trivial; the bottleneck is the memory read. The GPU's thousands of math units sit nearly idle, waiting on the memory bus. This is the memory-bandwidth wall, and it's why a big model feels slow even on a fast GPU.
Here's the trick. A forward pass that scores ten positions at once costs almost the same wall-clock time as a forward pass that scores one — because you read the weights once either way, and the extra math fills the idle compute units. So if we could somehow line up ten candidate tokens, we could check all of them in roughly the time it normally takes to produce a single token.
Speculative decoding lines them up with a second, much smaller model. The small draft model cheaply guesses the next handful of tokens. The big target model then verifies that whole guess in one parallel pass and keeps the longest correct prefix. When the draft is good, you commit five tokens for the price of one. When it's bad, you fall back to exactly the behavior you'd have had anyway. It was introduced concurrently in 2022–2023 by Yaniv Leviathan, Matan Kalman and Yossi Matias at Google ("Fast Inference from Transformers via Speculative Decoding") and by Charlie Chen and colleagues at DeepMind ("Accelerating Large Language Model Decoding with Speculative Sampling").
The draft-and-verify loop
One step of speculative decoding does this:
- Draft. Starting from the current text, run the small model autoregressively to propose
Ktokens (say K = 5). This is cheap — the draft model might be 10–20× smaller than the target. Save the draft model's probability for each proposed token:q(x₁), q(x₂), …, q(x_K). - Verify in parallel. Feed all K draft tokens to the big model in a single forward pass. Because transformers process a whole sequence at once, this one pass yields the target distribution at every one of the K+1 positions:
p(·), p(·|x₁), …, p(·|x₁…x_K). One pass, K+1 next-token distributions — that's the whole magic. - Accept or reject, left to right. For each draft token
xᵢ, accept it with probabilitymin(1, p(xᵢ) / q(xᵢ)). Stop at the first rejection. - Repair the rejection. At the first rejected position, sample one fresh token from the residual distribution
normalize(max(0, p − q)). If all K were accepted, sample a free bonus token straight frompat position K+1.
So a single step commits anywhere from 1 token (immediate rejection) to K+1 tokens (everything accepted plus the bonus). It can never stall — even a terrible draft model yields the same one-token-per-step throughput as plain decoding, just with extra overhead. The remarkable part is step 3 + 4: this exact accept/reject rule is what makes the result provably identical to sampling from the target model directly.
Why the accept/reject rule is lossless
The verification rule is speculative sampling, a form of rejection sampling. We want every committed token to be distributed as if drawn from the target p, even though it was proposed from the draft q. The rule:
draft proposes x ~ q
accept x with probability min(1, p(x) / q(x))
on reject, resample x' ~ norm(max(0, p − q))
The proof that the resulting token is exactly p-distributed: the probability of emitting some token x is the probability the draft proposes it and it's accepted, plus the probability the draft proposes something else, that gets rejected, and the residual happens to produce x. The accept term contributes q(x)·min(1, p(x)/q(x)) = min(q(x), p(x)). The residual term fills in exactly the gap max(0, p(x) − q(x)). Their sum is min(q,p) + max(0, p−q) = p(x). The draft's mistakes are mathematically erased.
The expected speedup has a clean closed form. Let α be the per-token acceptance rate (how often a draft token survives). With a draft of length K, the expected number of tokens committed per step is a geometric sum:
E[tokens per step] = (1 − α^(K+1)) / (1 − α)
If verifying K drafts costs c times one target step and drafting is free, the wall-clock speedup is roughly (1 − α^(K+1)) / ((1 − α)(K·c_draft + c)). The headline: speedup is governed almost entirely by α. At α = 0.7 with K = 5 you commit about 2.9 tokens per step; at α = 0.9 you commit about 4.7. The acceptance rate is itself a function of how close the draft distribution sits to the target — i.e., how good a mimic the small model is.
When speculative decoding pays off
- Low-batch, latency-sensitive serving. Single-user chat, autocomplete, agent loops — anywhere decoding is memory-bandwidth bound and the GPU's compute sits idle. This is the home-run case.
- Predictable, repetitive outputs. Code, structured JSON, and summarization (which echoes the prompt) have high acceptance rates, so the draft is right more often.
- You already have a smaller model in the same family. A 7B drafting for a 70B of the same tokenizer and training data tends to agree often. Mismatched tokenizers are a non-starter — the token IDs must align.
It's a poor fit when the GPU is already compute-bound: very large serving batches, or tiny models that are fast anyway. There's no idle compute to reclaim, so the extra verification work just competes with real requests. It also adds memory pressure — you're now holding two sets of weights and two KV caches resident.
How it compares to other inference speedups
| Speculative decoding | Quantization (INT8/INT4) | Knowledge distillation | KV-cache reuse | Beam search | |
|---|---|---|---|---|---|
| Output identical to base model? | Yes (lossless) | No (small quality drop) | No (new, smaller model) | Yes | N/A (changes objective) |
| What it attacks | Memory-bandwidth idle time | Weight memory + bandwidth | Total model size | Redundant prefix recompute | Search quality, not speed |
| Typical speedup | 2–3× latency | 2–4× memory, ~1.5–2× speed | 10×+ if you accept the smaller model | Avoids O(n²) recompute | Slower (B× the work) |
| Extra memory cost | Second model + 2nd KV cache | Lower (smaller weights) | Lower (smaller weights) | The cache itself | B parallel hypotheses |
| Helps at large batch? | Diminishes (compute-bound) | Yes | Yes | Yes | N/A |
| Training required? | No (with an existing draft) or light (Medusa/EAGLE) | Optional (QAT) | Yes (full distill run) | No | No |
| Composes with the others? | Yes — stack on top of all of them | Yes | Yes | Yes | Rarely combined |
The crucial column is the first row. Speculative decoding is the only entry that is provably lossless while still cutting latency, and it stacks with everything else: quantize both models, reuse both KV caches, and speculate on top. That orthogonality is why it became a default in production serving stacks (vLLM, TensorRT-LLM, llama.cpp) within two years of the papers.
What the numbers actually say
- 2–3× wall-clock latency reduction is the reproducible headline across the original papers and production stacks, for greedy or low-temperature decoding at small batch. Google's 2023 paper reported 2–3× on T5-XXL translation/summarization; DeepMind reported 2–2.5× on Chinchilla.
- A forward pass over K = 8 positions is only ~1.1–1.3× the cost of K = 1 on memory-bound hardware, because the dominant cost — reading the weights — is paid once. That near-free parallelism is the entire economic basis of the method.
- Acceptance rate drives everything. A well-matched 7B/70B pair hits α ≈ 0.8 on code, giving ~3.7 tokens committed per step at K = 5. A poorly matched pair at α ≈ 0.4 yields only ~1.6 tokens — barely worth the overhead.
- EAGLE-2 (2024) reports up to ~3.5–4× speedups by drafting in the target model's feature space and using a dynamic draft tree, materially beating a vanilla separate-model draft.
- Prompt-lookup decoding gives 2×+ on summarization/editing for zero extra parameters — it "drafts" by copying n-grams from the input, which the output frequently repeats verbatim.
JavaScript implementation
A faithful single-step of speculative sampling. The models are abstracted as functions returning next-token probability distributions; the accept/reject and residual logic is the part that must be exact.
// p, q are arrays over the vocabulary (probabilities summing to 1).
// draftModel(ctx) -> q distribution for the next token
// targetBatch(ctx, drafts) -> array of K+1 target distributions p
// (one parallel forward pass over all draft positions)
function sampleFrom(dist, rand = Math.random) {
let r = rand(), acc = 0;
for (let i = 0; i < dist.length; i++) { acc += dist[i]; if (r <= acc) return i; }
return dist.length - 1;
}
function residual(p, q) {
const out = new Array(p.length);
let sum = 0;
for (let i = 0; i < p.length; i++) { out[i] = Math.max(0, p[i] - q[i]); sum += out[i]; }
if (sum === 0) return p.slice(); // degenerate: fall back to p
for (let i = 0; i < out.length; i++) out[i] /= sum;
return out;
}
function speculativeStep(ctx, draftModel, targetBatch, K, rand = Math.random) {
// 1. DRAFT: small model proposes K tokens, remembering its own q(xᵢ)
const drafts = [], qProbs = [];
let dctx = ctx.slice();
for (let i = 0; i < K; i++) {
const q = draftModel(dctx);
const x = sampleFrom(q, rand);
drafts.push(x); qProbs.push(q[x]);
dctx.push(x);
}
// 2. VERIFY: ONE parallel pass gives target p at all K+1 positions
const P = targetBatch(ctx, drafts); // length K+1
// 3. ACCEPT left-to-right
const accepted = [];
for (let i = 0; i < K; i++) {
const x = drafts[i];
const ratio = qProbs[i] > 0 ? P[i][x] / qProbs[i] : 0;
if (rand() < Math.min(1, ratio)) {
accepted.push(x); // keep this token
} else {
// 4a. REPAIR: resample from residual max(0, p − q)
const q = draftModel(ctx.concat(accepted));
accepted.push(sampleFrom(residual(P[i], q), rand));
return accepted; // stop at first rejection
}
}
// 4b. BONUS: all K accepted — free token from p at position K+1
accepted.push(sampleFrom(P[K], rand));
return accepted; // committed 1..K+1 tokens
}
Two details that bite people. First, qProbs[i] must be the draft probability of the token the draft actually proposed, captured during drafting — not recomputed afterward. Second, the residual must be renormalized and guarded against the all-zero case (which happens when p and q agree exactly on the rejected token).
Python implementation
The same step in NumPy, closer to how a real tensor framework would express it, plus the outer generation loop.
import numpy as np
def sample_from(dist, rng):
return int(rng.choice(len(dist), p=dist))
def residual(p, q):
r = np.maximum(0.0, p - q)
s = r.sum()
return p.copy() if s == 0 else r / s
def speculative_step(ctx, draft_model, target_batch, K, rng):
# 1. DRAFT K tokens, remembering q(x_i)
drafts, q_probs, dctx = [], [], list(ctx)
for _ in range(K):
q = draft_model(dctx) # vocab-sized prob vector
x = sample_from(q, rng)
drafts.append(x); q_probs.append(q[x]); dctx.append(x)
# 2. VERIFY: one parallel pass -> K+1 target distributions
P = target_batch(ctx, drafts) # shape (K+1, vocab)
# 3-4. ACCEPT / REPAIR / BONUS
accepted = []
for i, x in enumerate(drafts):
ratio = P[i][x] / q_probs[i] if q_probs[i] > 0 else 0.0
if rng.random() < min(1.0, ratio):
accepted.append(x)
else:
q = draft_model(list(ctx) + accepted)
accepted.append(sample_from(residual(P[i], q), rng))
return accepted # first rejection ends the step
accepted.append(sample_from(P[K], rng)) # bonus token
return accepted
def generate(ctx, draft_model, target_batch, K, n_tokens, rng):
out = list(ctx)
while len(out) - len(ctx) < n_tokens:
out += speculative_step(out, draft_model, target_batch, K, rng)
return out[:len(ctx) + n_tokens]
In a production kernel the target_batch call is the only target forward pass per step, and it must also write the accepted tokens' keys/values into the KV cache while rolling back the rejected ones — getting that cache bookkeeping wrong is the most common correctness bug (see below).
Variants worth knowing
Speculative sampling (Leviathan et al. / Chen et al., 2023). The canonical two-model scheme described above. Lossless for both greedy and temperature sampling.
Medusa (2024). Drops the separate draft model entirely. It bolts several lightweight "Medusa heads" onto the target model that each predict a token a few positions into the future in parallel, then verifies candidate combinations with a tree-attention mask. Self-speculation — one model, no tokenizer-matching headache.
EAGLE / EAGLE-2 (2024). Drafts in the target model's feature (hidden-state) space rather than token space, which mimics the target far more faithfully and pushes acceptance rates up. EAGLE-2 adds a dynamic draft tree whose shape adapts to context-dependent confidence, reporting the highest published speedups (~3.5–4×).
Prompt-lookup / n-gram decoding. No neural draft at all: propose the next tokens by finding the most recent n-gram in the existing context and copying what followed it. Nearly free, and shockingly effective for summarization, RAG, and code editing where the output quotes the input.
Tree / batched speculation (SpecInfer, Sequoia). Instead of one linear draft, propose a tree of candidate continuations and verify them all in one batched pass with a custom attention mask. More candidates per verification step raises the odds that some path is fully accepted.
Common bugs and edge cases
- KV-cache rollback. The target's parallel pass populates cache entries for all K draft positions. On a rejection at position i, you must truncate the cache back to i (plus the one repaired token). Forgetting this poisons every subsequent step with stale keys/values — the single nastiest bug in real implementations.
- Recomputing q instead of caching it. The acceptance ratio uses the draft probability of the proposed token at draft time. If you recompute q after the fact (with a different context length) the ratio is wrong and the output is no longer lossless.
- Mismatched tokenizers. Draft and target must share a vocabulary; token ID 4021 has to mean the same string in both. A draft from a different tokenizer family can't be verified at all.
- Temperature/top-p mismatch. Both models must be sampled under the same temperature and truncation for the proof to hold. Apply top-k/top-p to both p and q identically before the accept test, or you silently break losslessness.
- Over-long drafts. Pushing K to 16 looks tempting but later positions almost always reject, so you pay draft-model time for tokens you throw away. Tune K to the measured acceptance rate.
- Assuming it always helps. At high serving batch sizes the GPU is compute-bound, the spare cycles vanish, and speculation can be net-negative. Benchmark at your actual batch size, not at batch 1.
- The all-accepted residual. When every draft token is accepted, don't forget the bonus token sampled from
pat position K+1 — dropping it leaves a tiny but real distributional bias and wastes a free token.
Frequently asked questions
Does speculative decoding change the model's output?
No. The accept/reject step is designed so the accepted tokens follow exactly the target model's distribution. With greedy decoding you get the identical token sequence; with sampling you get a sample from the identical distribution. It is a lossless speedup, not an approximation.
Why is verifying K tokens cheaper than generating them one at a time?
Autoregressive generation is memory-bandwidth bound: each token requires reading all of the model's weights from GPU memory, and the matrix-vector math barely uses the compute units. Verifying K draft tokens packs them into one forward pass as a batch of K positions, reading the weights once and turning the idle compute into useful work — so K tokens cost roughly the wall-clock time of one.
How many draft tokens should you propose per step?
Usually 4 to 8. Propose too few and you under-use the parallel verification; propose too many and the later tokens almost always get rejected, wasting draft-model time. The sweet spot depends on the acceptance rate, which depends on how well the draft model mimics the target.
What happens to the draft tokens that get rejected?
They are thrown away. When draft token i is rejected, every later draft token is discarded too, and the algorithm samples one corrected token from an adjusted residual distribution. So a step always commits at least one new token — it can never stall — and on average commits between one and K+1.
Do you need a separate small model to draft?
Not necessarily. Self-speculative methods like Medusa add extra prediction heads to the target model itself, and EAGLE drafts in the target's feature space. N-gram and prompt-lookup decoding skip the neural draft entirely by copying likely continuations from the prompt — great for summarization and code editing where output repeats the input.
Why doesn't speculative decoding help with very small batch-1 already-fast models?
The speedup comes from spare GPU compute that sits idle during memory-bound decoding. At large batch sizes the GPU is already compute-bound, so there is no idle headroom to fill — the extra verification work competes with real requests and the net gain shrinks or disappears.