Chain-of-Thought Prompting

Make the model show its work — and watch hard-problem accuracy jump

Chain-of-thought prompting asks a language model to write out its intermediate reasoning steps before the final answer, which sharply raises accuracy on multi-step arithmetic, logic, and word problems — at the cost of more output tokens.

Introduced: Wei et al., 2022
GSM8K (PaLM 540B): 18% → 57%
Zero-shot trigger: "Let's think step by step"
Emerges at: ≈ 100B parameters
Token cost: 5–20× more output

The intuition: thinking out loud buys computation

Ask a large language model "What is 17 × 24?" and it often blurts the wrong number. Ask it "What is 17 × 24? Let's think step by step," and it writes 17 × 24 = 17 × 20 + 17 × 4 = 340 + 68 = 408 — and gets it right. Nothing about the model's weights changed between those two prompts. The only difference is that the second one let it generate intermediate tokens. That is chain-of-thought prompting in one sentence: let the model write its reasoning before it commits to an answer.

The technique was named and benchmarked by Jason Wei and colleagues at Google in the January 2022 paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." A few months later, Takeshi Kojima and co-authors showed you don't even need worked examples: appending the single phrase "Let's think step by step" ("zero-shot CoT") triggers the same behavior. Both papers reported the same striking result: on multi-step problems, accuracy didn't creep up, it jumped.

Why should writing words on the way to an answer help at all? The answer is mechanical, and it's the most important idea on this page.

The mechanism: a transformer's compute is fixed per token

A decoder-only transformer does a constant amount of work to produce each token: the same number of layers, the same matrix multiplies, no loops, no recursion. If a model has L layers, then any single token is the result of exactly L sequential transformations — and that's the ceiling on how much serial reasoning can happen before that token appears.

Now suppose the correct answer to a problem genuinely requires more serial steps than the model has layers — a multiplication that needs several carries, a logic puzzle with a chain of deductions. There is no way to compute it inside the one forward pass that emits the answer directly. The model is forced to pattern-match a guess.

Chain-of-thought sidesteps this by spreading the computation across many tokens. Each token the model generates is appended to the context, and the next token gets a fresh forward pass that can read everything written so far. So if a problem needs k serial reasoning steps, the model can use roughly k generated tokens — each one another L-layer pass — instead of cramming all k steps into a single pass. The total serial compute available scales from O(L) to O(L · T) where T is the number of intermediate tokens.

This is not a metaphor. The 2024 theory paper "Chain of Thought Empowers Transformers to Solve Inherently Serial Problems" (Li, Liu, et al.) proved it: transformers with a polynomial-length chain of thought can solve problem classes that constant-depth transformers provably cannot. The scratchpad converts the model's fixed depth into a serial step budget that grows with the length of the chain. Generated tokens are the model's working memory.

One consequence worth internalizing: the gain comes from the existence of intermediate computation, not from the prose being pretty. A terse scratchpad of digits can do the job as well as polished sentences, because the bottleneck was serial depth, not eloquence. Whether meaningless filler tokens help is less settled: reported results vary with the task and with how the model was trained.
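
The depth accounting in this section can be made concrete with a back-of-the-envelope count. The sketch below is an accounting device, not a model: `serial_depth` is a hypothetical helper, and the 80-layer figure is an assumed example rather than a property of any particular model.

```python
def serial_depth(num_layers: int, intermediate_tokens: int = 0) -> int:
    """Upper bound on sequential layer applications before the answer token.

    A direct answer gets one forward pass (num_layers steps). Every
    intermediate token adds one more full pass whose output is fed back in.
    """
    if num_layers <= 0 or intermediate_tokens < 0:
        raise ValueError("num_layers must be positive, intermediate_tokens non-negative")
    return num_layers * (1 + intermediate_tokens)


LAYERS = 80  # assumed depth, for illustration only

direct = serial_depth(LAYERS)          # answer emitted immediately
with_cot = serial_depth(LAYERS, 100)   # 100 reasoning tokens first

print(f"direct: {direct} sequential layer steps")
print(f"CoT:    {with_cot} sequential layer steps")
```

With these assumed numbers the direct answer has 80 sequential steps behind it and the chain-of-thought answer has 8,080, which is the O(L) versus O(L · T) gap in miniature.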

When chain-of-thought helps — and when it hurts

CoT pays off when the task is compositional and multi-step, and the cost outweighs the benefit when it isn't:

Use it for: grade-school and competition math, symbolic logic, multi-hop question answering, code with non-trivial control flow, planning, and anything where a human would reach for scratch paper.
Skip it for: single-fact lookup, classification, extraction, sentiment, and short factual recall. Here the reasoning is pure latency and token cost, and CoT can occasionally talk the model out of a correct first instinct.
Watch the model size. CoT is an emergent ability — it does little for models under roughly 10B parameters and can degrade them, because small models generate fluent reasoning that is confidently wrong.
Mind the faithfulness gap. The printed reasoning is a useful scratchpad, not a transparent log of the model's internals. Don't treat it as an audit trail.
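
The checklist above can be folded into a small routing rule. This is a sketch under stated assumptions: the task labels and the `should_use_cot` function are hypothetical, and the 10B cutoff is the rough figure from the text, so treat it as a knob to tune rather than a hard boundary.

```python
# Task labels are illustrative; map your own task taxonomy onto them.
MULTI_STEP_TASKS = {"math", "logic", "multi_hop_qa", "code", "planning"}
ONE_STEP_TASKS = {"lookup", "classification", "extraction", "sentiment", "recall"}

MIN_PARAMS_B = 10.0  # rough threshold from the text, in billions of parameters


def should_use_cot(task_type: str, model_params_b: float) -> bool:
    """Return True only when the task is multi-step AND the model is large enough."""
    if model_params_b < MIN_PARAMS_B:
        return False  # small models tend to produce fluent but wrong chains
    # Unknown task types fall through to direct prompting, the cheaper default.
    return task_type in MULTI_STEP_TASKS


print(should_use_cot("math", 70))        # multi-step task, large model
print(should_use_cot("sentiment", 70))   # one-step task
print(should_use_cot("math", 7))         # model too small
```

Defaulting unknown tasks to direct prompting is a design choice: it keeps cost down and makes CoT an explicit opt-in per task type.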

Chain-of-thought vs other prompting and inference strategies

	Direct (standard)	Few-shot CoT	Zero-shot CoT	Self-consistency	Tree of Thoughts	RAG
Prompt overhead	None	K worked examples	One trigger phrase	Same as CoT	CoT + search prompts	Retrieved passages
Inference calls	1	1	1	N samples (e.g. 40)	Many (branch + prune)	1 (+ retrieval)
Output tokens	Fewest	5–20× direct	5–20× direct	N × CoT	10–100× CoT	≈ direct
Fixes serial-depth limit	No	Yes	Yes	Yes (+ vote)	Yes (+ backtrack)	No (fixes knowledge)
Best at	Lookup, classify	Multi-step reasoning	Reasoning, no exemplars	Hard math, outvoting stray errors	Search/planning problems	Knowledge-bound queries
Main cost	Accuracy on hard tasks	Latency, tokens	Latency, tokens	N× latency & cost	Large compute blow-up	Retrieval infra

The headline trade-off is compute for accuracy. Direct prompting is cheapest and fine for easy tasks; CoT trades tokens for correctness on hard ones; self-consistency and tree-of-thoughts trade many times the compute for a few more points; and retrieval-augmented generation solves a different axis entirely — it fixes what the model knows, not how many steps it can reason.

What the numbers actually say

GSM8K, the headline benchmark. On this set of grade-school math word problems, PaLM 540B with standard prompting scored about 18%. With few-shot CoT it reached about 57% — more than triple — and self-consistency on top pushed it past 74%. The same gap shows up across model families.
Zero-shot CoT. On MultiArith, simply adding "Let's think step by step" lifted GPT-3 (text-davinci-002) from roughly 18% to roughly 79% accuracy — a 60-point swing from one sentence.
It's emergent. Below ~10B parameters CoT often underperforms direct prompting; the curves cross only as scale grows into the tens-to-hundreds of billions. There is no CoT magic on a 1B model.
The bill. A direct answer to a math problem might be 5 to 10 tokens; the same problem solved with CoT is often 60–150 tokens. At, say, $10 per million output tokens, that is a real multiplier on generation cost, typically in the 5–20× range and sometimes more, and self-consistency multiplies it again by your sample count N.
Faithfulness is not guaranteed. A 2023 study by Turpin et al., with Anthropic co-authors, showed that models can write a confident chain that rationalizes an answer planted by a biasing cue the chain never mentions, so the reasoning can be right-looking and causally false at the same time.
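
To see what the token bill means at volume, here is a rough cost sketch. The $10 per million output tokens and the 5-token direct answer come from the text; the 100-token chain (inside the 60–150 range), the one million queries, and N = 20 samples are assumptions picked for round numbers, and `generation_cost` is a hypothetical helper that ignores prompt tokens.

```python
def generation_cost(tokens_per_answer: int, queries: int,
                    price_per_million: float = 10.0, samples: int = 1) -> float:
    """Dollar cost of output tokens only; prompt tokens are ignored."""
    total_tokens = tokens_per_answer * samples * queries
    return total_tokens * price_per_million / 1_000_000


QUERIES = 1_000_000  # assumed traffic volume

direct = generation_cost(5, QUERIES)                # direct answer
cot = generation_cost(100, QUERIES)                 # one chain of thought
voted = generation_cost(100, QUERIES, samples=20)   # self-consistency, N = 20

print(f"direct:           ${direct:,.0f}")
print(f"chain-of-thought: ${cot:,.0f}  ({cot / direct:.0f}x direct)")
print(f"self-consistency: ${voted:,.0f}  ({voted / direct:.0f}x direct)")
```

Under these assumptions the same traffic costs $50 direct, $1,000 with CoT, and $20,000 with 20-sample self-consistency, which is why the cheaper strategy should stay the default for easy queries.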

JavaScript implementation

Below is a minimal, dependency-light client. It shows both the zero-shot trigger and an answer-extraction step, plus self-consistency by majority vote — the two patterns you'll actually deploy.

// Zero-shot chain-of-thought: one call, trigger phrase, then extract the answer.
async function chainOfThought(question, callModel) {
  const prompt =
    `Q: ${question}\n` +
    `A: Let's think step by step.`;          // the magic trigger

  const reasoning = await callModel(prompt, { temperature: 0 });

  // Second, cheap call to pull the final answer out of the reasoning.
  const extractPrompt =
    `${prompt}${reasoning}\n` +
    `Therefore, the final answer (just the value) is:`;
  const answer = await callModel(extractPrompt, { temperature: 0 });

  return { reasoning, answer: answer.trim() };
}

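// Hypothetical fallback parser: treat the last number in the chain as the answer.
// Swap in the model-based extraction step above when you need more robustness.
function extractFinalAnswer(reasoning) {
  const numbers = reasoning.match(/-?\d[\d,]*(?:\.\d+)?/g);
  return numbers ? numbers[numbers.length - 1].replace(/,/g, "") : null;
}

// Self-consistency: sample N chains at temperature > 0, take the majority vote.
async function selfConsistency(question, callModel, n = 20) {
  const prompt = `Q: ${question}\nA: Let's think step by step.`;
  const tally = new Map();
  for (let i = 0; i < n; i++) {
    const reasoning = await callModel(prompt, { temperature: 0.7 });
    const ans = extractFinalAnswer(reasoning);
    if (ans === null) continue;                  // skip chains with no parseable answer
    tally.set(ans, (tally.get(ans) ?? 0) + 1);
  }
  // Argmax over votes: correct chains tend to agree, wrong ones scatter.
  let best = null, bestVotes = 0;
  for (const [ans, votes] of tally) {
    if (votes > bestVotes) { best = ans; bestVotes = votes; }
  }
  return { answer: best, votes: bestVotes, total: n };
}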

Two practical notes. First, generate the reasoning at temperature 0 for a single deterministic chain, but switch to temperature 0.7 for self-consistency — you want diverse paths so the vote is meaningful. Second, the separate extraction call exists because a long chain is hard to parse programmatically; asking the model to restate just the value is more robust than regexing the prose.

Python implementation

The same two patterns in Python, plus few-shot CoT, which prepends worked exemplars rather than relying on the trigger phrase. Few-shot tends to give cleaner, more consistently formatted reasoning.

from collections import Counter

# Few-shot exemplars: each shows the REASONING before the ANSWER.
FEW_SHOT = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many now?
A: Roger started with 5. Two cans of 3 balls is 6. 5 + 6 = 11. The answer is 11.

Q: There are 4 cars; 2 more arrive. Each car has 4 wheels. How many wheels?
A: 4 + 2 = 6 cars. 6 cars x 4 wheels = 24. The answer is 24.
"""

def chain_of_thought(question, call_model, few_shot=True):
    if few_shot:
        prompt = f"{FEW_SHOT}\nQ: {question}\nA:"
    else:
        prompt = f"Q: {question}\nA: Let's think step by step."   # zero-shot
    return call_model(prompt, temperature=0.0)

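def self_consistency(question, call_model, n=20, extract=None):
    """Sample n chains, majority-vote the final answers."""
    votes = Counter()
    for _ in range(n):
        chain = call_model(f"Q: {question}\nA: Let's think step by step.",
                           temperature=0.7)            # diversity for the vote
        if extract:
            answer = extract(chain)
        else:
            # Fallback: last word minus trailing punctuation, so "11." and "11" match.
            words = chain.split()
            answer = (words[-1].rstrip(".,;:!?") or None) if words else None
        if answer is not None:                         # skip chains with no answer
            votes[answer] += 1
    if not votes:
        return {"answer": None, "votes": 0, "total": n, "confidence": 0.0}
    answer, count = votes.most_common(1)[0]
    return {"answer": answer, "votes": count, "total": n,
            "confidence": count / n}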

# Why temperature matters here:
#   - chain_of_thought() at T=0 -> one deterministic reasoning path.
#   - self_consistency() at T=0.7 -> many paths; correct ones converge,
#     so the modal answer beats any single scattered wrong path.

Notice the asymmetry between the two paths, exactly mirroring the JS version: deterministic decoding for a single explanation, stochastic decoding for a vote. Running self-consistency at temperature 0 would sample the same chain N times and defeat the entire point.

Variants worth knowing

Zero-shot CoT (Kojima et al., 2022). No exemplars — just "Let's think step by step." Cheapest to deploy and astonishingly effective on big models, though slightly less reliable in formatting than few-shot.

Self-consistency (Wang et al., 2022). Sample many chains, majority-vote the answers. Pure compute-for-accuracy: it added roughly 17 points on GSM8K over plain CoT and is the single highest-leverage add-on.

Least-to-most prompting (Zhou et al., 2022). Explicitly decompose a hard problem into easier sub-problems, solve them in order, and feed each solution forward. Beats vanilla CoT on problems that are harder than the exemplars (compositional generalization).
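
The two stages of least-to-most prompting, decompose then solve in order, can be sketched as a short loop. Everything here is an assumption for illustration: `least_to_most` and its prompt wording are not taken from the paper, and `scripted_model` is a canned stand-in so the example runs without a real model.

```python
from typing import Callable, List


def least_to_most(question: str, call_model: Callable[[str], str]) -> str:
    """Decompose the problem, then solve sub-questions in order, feeding answers forward."""
    # Stage 1: ask for a decomposition, one sub-question per line.
    plan = call_model(
        "Break this problem into simpler sub-questions, one per line, "
        f"ending with the original question.\nProblem: {question}"
    )
    sub_questions: List[str] = [ln.strip() for ln in plan.splitlines() if ln.strip()]

    # Stage 2: each solved pair is appended to the context for the next one.
    context = f"Problem: {question}\n"
    answer = ""
    for sub in sub_questions:
        answer = call_model(f"{context}Q: {sub}\nA:").strip()
        context += f"Q: {sub}\nA: {answer}\n"
    return answer


def scripted_model(prompt: str) -> str:
    """Canned replies standing in for a real model call."""
    if prompt.startswith("Break this problem"):
        return "How many balls are in the cans?\nHow many balls does Roger have now?"
    if prompt.rstrip().endswith("How many balls are in the cans?\nA:"):
        return " 6"
    # The last sub-question only resolves if the earlier answer is in context.
    return " 11" if "A: 6" in prompt else " unknown"


question = "Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many now?"
print(least_to_most(question, scripted_model))
```

The point of the stub's last branch is the feed-forward step: the final sub-question is answerable only because the earlier answer was written into the context.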

Tree of Thoughts (Yao et al., 2023). Instead of one linear chain, explore a tree of partial reasoning states with lookahead and backtracking. Turns reasoning into search; powerful on planning and puzzles but with a large compute blow-up.
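
The search loop can be illustrated on a toy puzzle: starting from 1, reach 14 using only "+3" and "*2". This is a breadth-first sketch with pruning, one simple way to organize such a search. The `propose` and `score` functions are hypothetical stand-ins for what would be model calls in a real system, and the puzzle itself is invented for the example.

```python
from typing import List, Optional, Tuple

State = Tuple[int, Tuple[str, ...]]  # (current value, steps taken so far)


def propose(state: State) -> List[State]:
    """Stand-in for 'ask the model for candidate next thoughts'."""
    value, steps = state
    return [(value + 3, steps + ("+3",)), (value * 2, steps + ("*2",))]


def score(state: State, target: int) -> float:
    """Stand-in for 'ask the model to rate a partial solution'. Higher is better."""
    return -abs(target - state[0])


def tree_of_thoughts(start: int, target: int, beam_width: int = 2,
                     max_depth: int = 4) -> Optional[Tuple[str, ...]]:
    """Expand every kept state, return on success, otherwise keep the best few."""
    frontier: List[State] = [(start, ())]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        for value, steps in candidates:
            if value == target:
                return steps
        # Prune: keep only the most promising partial states.
        candidates.sort(key=lambda s: score(s, target), reverse=True)
        frontier = candidates[:beam_width]
    return None


print(tree_of_thoughts(1, 14, beam_width=2))  # ('+3', '+3', '*2')
print(tree_of_thoughts(1, 14, beam_width=1))  # None
```

With a beam of 2 the search finds the path; with a beam of 1 it degenerates into a single greedy chain, commits to 8 over 7 at the second step, and never recovers. Keeping more than one partial state alive is what a linear chain cannot do.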

ReAct (Yao et al., 2022). Interleave reasoning steps with actions — tool calls, web searches, calculator use. The chain isn't just thought; it grounds itself in external observations, reducing hallucination on knowledge-heavy tasks.
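
A ReAct loop alternates model output with tool observations until the model commits to an answer. The sketch below assumes a plain-text `Action: tool[input]` and `Final Answer:` convention, which is one common way to write it and not the only one. `react`, `calculator`, and `scripted_model` are all hypothetical names, and the scripted model replaces a real model call so the example runs as-is.

```python
import re
from typing import Callable, Dict, Optional


def calculator(expression: str) -> str:
    """Tiny tool: evaluates '<number> <op> <number>'. Not a general parser."""
    m = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*([-+*/])\s*(-?\d+(?:\.\d+)?)\s*", expression)
    if not m:
        return "error: expected '<number> <op> <number>'"
    a, op, b = float(m.group(1)), m.group(2), float(m.group(3))
    if op == "/" and b == 0:
        return "error: division by zero"
    result = {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]
    return str(int(result)) if result.is_integer() else str(result)


def react(question: str, call_model: Callable[[str], str],
          tools: Dict[str, Callable[[str], str]], max_steps: int = 5) -> Optional[str]:
    """Alternate model steps and tool observations until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_model(transcript).strip()
        transcript += step + "\n"
        final = re.search(r"Final Answer:\s*(.+)", step)
        if final:
            return final.group(1).strip()
        action = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if action:
            name, arg = action.group(1), action.group(2)
            tool = tools.get(name)
            observation = tool(arg) if tool else f"error: unknown tool {name!r}"
            transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted without a final answer


def scripted_model(transcript: str) -> str:
    """Canned stand-in: call the calculator once, then report what it returned."""
    if "Observation:" not in transcript:
        return "Thought: I should compute this exactly.\nAction: calculator[17 * 24]"
    observation = transcript.rsplit("Observation:", 1)[1].strip()
    return f"Thought: The tool returned {observation}.\nFinal Answer: {observation}"


print(react("What is 17 x 24?", scripted_model, {"calculator": calculator}))
```

The `max_steps` cap matters in practice: without it, a model that never emits a final answer would loop and spend tokens indefinitely.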

Trained-in reasoning (o1-style RL). The newest direction bakes the chain into the model itself via reinforcement learning on long reasoning traces, so the model "thinks" before answering without being asked. The intermediate tokens may be hidden, but the underlying mechanism — spend tokens to buy serial compute — is the same idea this page describes.

Common bugs and edge cases

Asking for the answer first, reasoning second. If the prompt format makes the model emit the answer before the steps, you get zero benefit — the answer token was produced in the one forward pass you were trying to escape. Reasoning must come before the answer.
Parsing the answer out of the prose by hand. Long chains are messy. Use a dedicated "the final answer is:" extraction step (as in the code above) rather than brittle regexes over free text.
Running self-consistency at temperature 0. You'll sample the identical chain N times and pay N× for one vote. Self-consistency needs temperature > 0 to generate diverse paths.
Using CoT on a model that's too small. Below ~10B parameters it often hurts. The fluent-but-wrong reasoning of a small model can be worse than its direct guess.
Trusting the chain as an explanation. A correct answer doesn't certify the reasoning. Models can post-hoc rationalize, so don't use the chain as a faithful audit log of why the model decided something.
Forgetting the cost. CoT is 5–20× the output tokens, and self-consistency multiplies that by N. On high-volume, easy traffic, that's real money spent for accuracy you didn't need.
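
Several of these mistakes are mechanical enough to catch before any tokens are spent. The sketch below is a hypothetical pre-flight check: `lint_cot_setup`, the `{reasoning}` and `{answer}` placeholder names, and the 10B cutoff (the rough figure from the text) are all assumptions for illustration.

```python
from typing import List


def lint_cot_setup(samples: int, temperature: float, model_params_b: float,
                   output_template: str) -> List[str]:
    """Return a warning for each common CoT misconfiguration that is detected."""
    warnings: List[str] = []
    if samples > 1 and temperature == 0:
        warnings.append("self-consistency at temperature 0 repeats one chain")
    if model_params_b < 10:
        warnings.append("model may be too small for CoT to help")
    # The answer slot must come after the reasoning slot in the template.
    answer_at = output_template.find("{answer}")
    reasoning_at = output_template.find("{reasoning}")
    if answer_at != -1 and reasoning_at != -1 and answer_at < reasoning_at:
        warnings.append("template puts the answer before the reasoning")
    return warnings


good = "{reasoning}\nThe answer is {answer}"
bad = "Answer: {answer}\nBecause: {reasoning}"

print(lint_cot_setup(samples=1, temperature=0.0, model_params_b=70, output_template=good))
print(lint_cot_setup(samples=20, temperature=0.0, model_params_b=7, output_template=bad))
```

The first call returns an empty list; the second trips all three checks. The faithfulness and cost items on the list are judgment calls and are deliberately left out of the linter.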

Frequently asked questions

Why does chain-of-thought prompting improve accuracy?

A transformer does a fixed amount of computation per token, so a hard problem that needs many serial steps cannot be solved in the single forward pass that produces the answer directly. Emitting intermediate steps as tokens gives the model extra forward passes (each generated word becomes new context the next step can read), effectively letting it "think longer" before committing to an answer.

What's the difference between zero-shot and few-shot chain-of-thought?

Few-shot CoT (Wei et al., 2022) puts a handful of worked examples in the prompt, each showing the reasoning before the answer. Zero-shot CoT (Kojima et al., 2022) skips the examples and just appends the phrase "Let's think step by step", which alone lifted GPT-3's accuracy on the MultiArith benchmark from about 18% to about 79%.

Does chain-of-thought work on small models?

Not really. The original paper found CoT is an emergent ability: it barely helps models under roughly 10 billion parameters and can even hurt them, because small models produce fluent-but-wrong reasoning. The accuracy gains appear sharply once models reach the 60B–100B+ range.

What is self-consistency in chain-of-thought?

Self-consistency (Wang et al., 2022) samples many independent chains of thought at non-zero temperature, then takes a majority vote over the final answers instead of trusting one greedy chain. On GSM8K it added roughly 17 percentage points over plain CoT, because correct reasoning paths tend to converge on the same answer while wrong ones scatter.
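
The "correct paths converge, wrong ones scatter" argument can be checked with a toy simulation. This is a sketch under strong assumptions: each simulated chain is right with a fixed probability, chains are independent, and wrong answers spread evenly over ten alternatives. Real chains make correlated mistakes, so expect smaller gains in practice. All function names are hypothetical.

```python
import random
from collections import Counter

CORRECT = "408"


def simulate_chain(rng: random.Random, p_correct: float, n_wrong: int = 10) -> str:
    """One simulated chain: the right answer with probability p_correct, else a random wrong one."""
    if rng.random() < p_correct:
        return CORRECT
    return str(rng.randrange(n_wrong))  # "0".."9", never equal to CORRECT


def vote(rng: random.Random, n: int, p_correct: float) -> str:
    """Majority (plurality) answer over n simulated chains."""
    answers = Counter(simulate_chain(rng, p_correct) for _ in range(n))
    return answers.most_common(1)[0][0]


def accuracy(trials: int, n: int, p_correct: float, seed: int = 0) -> float:
    """Fraction of trials in which the voted answer is the correct one."""
    rng = random.Random(seed)
    hits = sum(vote(rng, n, p_correct) == CORRECT for _ in range(trials))
    return hits / trials


single = accuracy(trials=500, n=1, p_correct=0.5)
voted = accuracy(trials=500, n=20, p_correct=0.5)
print(f"single chain: {single:.2f}")
print(f"20-way vote:  {voted:.2f}")
```

Even when each individual chain is right only half the time, the vote is right far more often, because the wrong half is split across many different answers while the right half all lands on one.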

Does the reasoning the model writes actually reflect how it solved the problem?

Not always. Studies have shown models can produce a plausible chain of thought that rationalizes an answer they were biased toward, while the real cause was something in the prompt the chain never mentions. A correct final answer does not guarantee the printed reasoning is faithful — treat it as a useful scratchpad, not a transparent log of the model's internals.

When should I NOT use chain-of-thought?

Skip it for lookups and one-step tasks (classification, extraction, simple recall) where the extra reasoning adds latency and token cost without improving accuracy. CoT can also make outputs harder to parse and occasionally talks a model out of a correct first instinct on trivial questions.