In-Context Learning

Teach a frozen model a new task with three lines of prompt

In-context learning is the ability of a large language model to learn a new task from a few examples placed in its prompt — adapting its behavior at inference time with no gradient steps and no weight updates at all.

Weight updates: zero
Adaptation point: inference time
Typical shots: 0–32 examples
Attention cost: O(n²) in tokens
Named in: GPT-3 paper, 2020

How in-context learning works

Hand a frozen language model a brand-new task it was never trained on — say, translating made-up words from a fictional language — and it can often do it correctly, just by reading three or four examples you typed into the prompt. No fine-tuning. No gradient descent. No weight changes. The behavior appears inside a single forward pass and vanishes the moment you clear the conversation. That phenomenon is in-context learning (ICL), and it's the capability that turned language models from autocomplete into general-purpose tools.

The mechanism is deceptively simple from the outside. You build a prompt that looks like a tiny worked dataset:

Translate to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe en peluche
cheese =>            <-- the model now completes this

The model reads the demonstrations, infers the latent task ("translate English to French, output only the translation"), and continues the pattern. It learned the task — not by updating parameters, but by conditioning its next-token distribution on the tokens already in the window. Brown et al. introduced this framing in the 2020 GPT-3 paper "Language Models are Few-Shot Learners," and the name stuck.

The slogan worth remembering: training writes to the weights; in-context learning writes to the context window. The weights are a read-only program; the prompt is the program's runtime input. ICL is the model executing a task specification it was handed at call time.

The mechanism: induction heads and attention

Why can a frozen network adapt at all? The leading mechanistic explanation is the induction head, described in Anthropic's interpretability work (Olsson et al., 2022). An induction head works as part of a two-head circuit spanning different layers: an earlier previous-token head records, at each position, which token came just before it, and the later induction head uses that record to implement a copy-and-continue rule. When the model sees a token A that earlier appeared as A B, the head attends back to the token that followed the earlier A and raises the probability of B next. In short: "the last time I saw A, B came after it, so predict B."
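
The copy-and-continue rule can be sketched in a few lines of Python. This is a toy illustration of the behavior only, not of how attention computes it: a real induction head works on learned vector representations and soft attention weights, while this hypothetical `induction_predict` function does an exact string match.

```python
def induction_predict(tokens):
    """Toy copy-and-continue rule: find the most recent earlier occurrence
    of the current (last) token and return the token that followed it."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence, so the rule has nothing to copy


sequence = ["sea", "otter", "swims", "and", "sea"]
print(induction_predict(sequence))  # otter
```

The rule needs at least one earlier occurrence of the current token. With none, it returns `None`, which mirrors why a pattern must appear in the prompt before the model can continue it.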

That circuit is enough to copy and extend repeated patterns, which is much of what few-shot prompting asks for. Olsson et al. report that induction heads form during a sharp phase change early in training: over a narrow span of training steps, loss drops and in-context learning ability jumps. That abruptness is over training time, not model size. A separate observation, that few-shot performance on many benchmarks rises sharply only in larger models, is why in-context learning is often described as an emergent ability, though how abrupt that scaling really is remains debated.

There is also a deeper theory: under the right conditions, a transformer's forward pass can simulate a learning algorithm. Several 2022–2023 results (von Oswald et al.; Akyürek et al.; Garg et al.) showed that on simple regression tasks, the predictions of a transformer doing ICL can closely match those of one step of gradient descent, or of ridge regression and least squares, computed implicitly inside the attention layers. The model isn't just copying; on structured tasks it appears to run a tiny optimizer in its activations. The weights encode a learner; the prompt supplies that learner's training set.
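
The simplest version of that correspondence can be checked numerically. The sketch below is a simplified illustration in the spirit of those results, not the exact setup of any of the papers. It assumes linear regression, a starting weight of zero, and softmax-free (linear) attention with keys x_i, values y_i, and the query x_query. All function names are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def one_step_gd_prediction(xs, ys, x_query, lr):
    """Start at w = 0, take one gradient step on the loss
    1/(2n) * sum_i (w.x_i - y_i)^2, then predict w.x_query."""
    n, d = len(xs), len(x_query)
    # At w = 0 the gradient is -(1/n) * sum_i y_i * x_i.
    grad = [-sum(y * x[j] for x, y in zip(xs, ys)) / n for j in range(d)]
    w = [-lr * g for g in grad]
    return dot(w, x_query)


def linear_attention_prediction(xs, ys, x_query, lr):
    """Softmax-free attention: score each key x_i against the query,
    then take the score-weighted sum of the values y_i."""
    n = len(xs)
    return (lr / n) * sum(y * dot(x, x_query) for x, y in zip(xs, ys))


random.seed(0)
true_w = [2.0, -1.0, 0.5]
xs = [[random.gauss(0, 1) for _ in range(3)] for _ in range(16)]
ys = [dot(true_w, x) for x in xs]
x_query = [1.0, 1.0, 1.0]

gd = one_step_gd_prediction(xs, ys, x_query, lr=0.5)
attn = linear_attention_prediction(xs, ys, x_query, lr=0.5)
print(abs(gd - attn) < 1e-9)  # True: the two agree up to floating-point rounding
```

The two functions are the same sum written two ways, which is the point: an attention layer with the right keys, values, and scaling has the arithmetic of a gradient step built into it.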

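The cost lives in attention. Self-attention compares every token to every other token, so a prompt of n tokens costs O(n²) attention time and, in a naive implementation, O(n²) attention memory. FlashAttention cuts that memory to linear without changing the quadratic compute, and KV-caching avoids recomputing past tokens during generation rather than shrinking the cost of reading the prompt. Doubling your number of examples roughly quadruples the attention compute for the prompt (the rest of the network scales linearly). That quadratic term is the real budget constraint on "just add more shots."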
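
A back-of-envelope calculation makes the scaling concrete. This sketch counts query-key scores for full self-attention, per head and per layer. It assumes roughly 40 tokens per example (an illustrative figure) and ignores causal masking, which halves the count without changing the growth rate.

```python
def attention_score_count(n_tokens):
    """Query-key scores computed by full self-attention, per head, per layer."""
    return n_tokens * n_tokens


def attention_growth(tokens_before, tokens_after):
    """How many times more attention scores the longer prompt needs."""
    return attention_score_count(tokens_after) / attention_score_count(tokens_before)


TOKENS_PER_EXAMPLE = 40  # assumption for illustration

print(attention_growth(8 * TOKENS_PER_EXAMPLE, 16 * TOKENS_PER_EXAMPLE))  # 4.0
print(attention_growth(8 * TOKENS_PER_EXAMPLE, 64 * TOKENS_PER_EXAMPLE))  # 64.0
```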

When to use in-context learning

Prototyping a task in minutes. No training pipeline, no labeled dataset of thousands — write five examples and ship.
Tasks that change often. Classification schemes, extraction formats, or tone guidelines that shift weekly are cheaper to edit in a prompt than to re-train.
Few labeled examples available. When you have ten gold examples, not ten thousand, ICL usually gets more out of each label than fine-tuning can.
One model, many tasks. A single deployed model serves dozens of tasks, each selected purely by its prompt — no per-task checkpoints to host.

Reach for fine-tuning or LoRA instead when the task is fixed, high-volume, and latency- or cost-sensitive: paying for the same long example block on every one of a billion calls is wasteful when those examples could be compiled into the weights once. ICL trades per-call tokens for zero setup; fine-tuning trades setup for cheap calls.

In-context learning vs other adaptation methods

	In-context learning	Full fine-tuning	LoRA / adapters	RAG	Prompt/prefix tuning
Changes weights?	No	Yes (all)	Yes (small added matrices)	No	No — model frozen, trains added soft-prompt vectors
Adaptation persists?	No — only in the live context	Yes	Yes	No — retrieved per query	Yes
Setup cost	None — write examples	Hours–days of GPU training	Minutes–hours	Build + index a corpus	Minutes–hours
Per-query token cost	High (examples in every prompt)	Low	Low	Medium (retrieved chunks)	Low (few soft tokens)
Labeled data needed	A handful (≤32)	Thousands+	Hundreds–thousands	A document corpus	Hundreds–thousands
Forgets when session ends?	Yes — instantly	No	No	Yes	No
Best for	Fast iteration, shifting tasks	Fixed high-volume task	Cheap task specialization	Fresh / private knowledge	Lightweight steering

The cleanest mental split: ICL and RAG inject information at inference time through the context; fine-tuning, LoRA, and prompt-tuning inject it at training time through gradients. ICL and RAG are often combined — retrieval picks which examples to put in the context, and ICL turns those examples into behavior.

What the numbers actually say

Shots help, then plateau. On many GPT-3 benchmarks, jumping from zero-shot to a few-shot prompt lifted accuracy by 10–30 absolute points; gains then flatten and can even reverse as the window fills with noise.
Quadratic token cost. Self-attention is O(n²). Going from 8 examples (~320 tokens) to 64 examples (~2,560 tokens) is an 8× length increase but roughly a 64× increase in raw attention compute for the prompt.
Per-call price, not one-time. A 2,000-token example block at ~$3 per million input tokens costs ~$0.006 per call. Over 100 million calls that's $600,000 spent re-reading examples — money a one-time fine-tune could have saved.
Labels matter less than you'd expect. Min et al. (2022) found that replacing the correct labels in few-shot examples with random labels often dropped accuracy by only a few points on the classification and multiple-choice tasks they tested. That is evidence the model leans heavily on the format, label set, and input distribution rather than the literal input→label mapping.
Order is fragile. Reordering the same examples has been shown to swing accuracy from near-random to near-state-of-the-art on some tasks — the model is sensitive to the literal token sequence, including recency bias toward the last example.
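
The per-call price figures above are easy to reproduce, and the same arithmetic gives a rough break-even point against a one-time fine-tune. In this sketch the $5,000 fine-tuning cost is a made-up placeholder. It also assumes the fine-tuned model needs no example block and charges the same per-token price, which may not hold for a given provider.

```python
import math


def example_block_cost(block_tokens, price_per_million_usd, calls):
    """Recurring spend on re-sending the same example block on every call."""
    per_call = block_tokens * price_per_million_usd / 1_000_000
    return per_call, per_call * calls


def break_even_calls(fine_tune_cost_usd, per_call_saving_usd):
    """Number of calls after which the one-time fine-tune has paid for itself."""
    return math.ceil(fine_tune_cost_usd / per_call_saving_usd)


per_call, total = example_block_cost(2_000, 3.0, 100_000_000)
print(round(per_call, 6), round(total))  # 0.006 600000

# Hypothetical fine-tuning cost, for illustration only.
print(break_even_calls(5_000, per_call))  # 833334
```

Under those assumptions the crossover sits below a million calls, far short of the 100 million in the example. With real prices the threshold moves, so it is worth recomputing with your own numbers.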

JavaScript implementation

In-context learning has no training loop to implement: the "algorithm" is constructing the prompt. Here's a few-shot prompt builder that assembles demonstrations and sends them to a completion-style endpoint, the way you'd actually wire ICL into an app:

// Build a few-shot prompt from labeled examples + a new query.
// No weights change — the "learning" is entirely in this string.
function buildFewShotPrompt({ instruction, examples, query, delimiter = '=>' }) {
  const shots = examples
    .map(ex => `${ex.input} ${delimiter} ${ex.output}`)
    .join('\n');
  // Trailing query has NO output — the model completes it.
  return `${instruction}\n\n${shots}\n${query} ${delimiter}`;
}

const prompt = buildFewShotPrompt({
  instruction: 'Classify the sentiment as positive or negative.',
  examples: [
    { input: 'I loved every minute.',      output: 'positive' },
    { input: 'A complete waste of time.',  output: 'negative' },
    { input: 'Best purchase this year.',   output: 'positive' },
  ],
  query: 'The plot dragged and I fell asleep.',
});

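// The endpoint and the request/response fields here are illustrative;
// adapt them to your provider's actual API.
async function classify(prompt) {
  const res = await fetch('https://api.example-llm.com/v1/complete', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, max_tokens: 4, stop: ['\n'] }),
  });
  // fetch() resolves even on HTTP errors, so check the status explicitly.
  if (!res.ok) throw new Error(`LLM request failed: ${res.status}`);
  const { text } = await res.json();
  return text.trim();          // expected here: "negative"
}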

Two things to notice. First, the final line ends with the delimiter and no answer — that dangling pattern is the cue for the model to continue. Second, a stop sequence on the newline keeps the model from inventing a fifth example after it answers.

Python implementation

The same idea, plus a touch of robustness most production ICL pipelines add: a consistent template, label normalization, and example selection. Selecting the examples nearest to the query (retrieval-augmented ICL) often beats a fixed set.

from dataclasses import dataclass

@dataclass
class Example:
    text: str
    label: str

def build_prompt(instruction, examples, query, sep="\n"):
    """Assemble a few-shot prompt. No model weights are touched."""
    lines = [instruction, ""]
    for ex in examples:
        lines.append(f"Review: {ex.text}")
        lines.append(f"Sentiment: {ex.label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")          # dangling -> model completes
    return sep.join(lines)

# Retrieval-augmented ICL: pick the k most similar examples to the query.
def select_examples(query_vec, pool, embed, k=3):
    import numpy as np
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    scored = sorted(pool, key=lambda ex: cos(query_vec, embed(ex.text)), reverse=True)
    return scored[:k]

pool = [
    Example("I loved every minute.", "positive"),
    Example("A complete waste of time.", "negative"),
    Example("Best purchase this year.", "positive"),
    Example("It broke after one use.", "negative"),
]

query = "The plot dragged and I fell asleep."
# shots = select_examples(embed(query), pool, embed, k=3)   # if you have embeddings
shots = pool[:3]
prompt = build_prompt("Classify the review sentiment.", shots, query)

# label normalization: map the raw completion onto your label set
def normalize(raw, labels=("positive", "negative")):
    raw = raw.strip().lower()
    for lab in labels:
        if raw.startswith(lab):
            return lab
    return "unknown"

# completion = call_model(prompt, max_tokens=4, stop=["\n"])
# print(normalize(completion))   # -> "negative"

The select_examples step is where retrieval and ICL meet: instead of a hard-coded handful, you embed the query, pull the most similar labeled examples from a larger pool, and let ICL generalize from those. This is the bridge to retrieval-augmented generation.

Variants worth knowing

Zero-shot, one-shot, few-shot. The classic axis from the GPT-3 paper: zero demonstrations (instruction only), one, or several. More shots generally help until the window saturates or noise dominates.

Chain-of-thought prompting. Wei et al. (2022) showed that adding worked reasoning steps to the examples — not just input→answer but input→reasoning→answer — sharply improves multi-step arithmetic and logic. The model learns to "think out loud" purely from the format of the demonstrations.
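
A chain-of-thought prompt differs from a plain few-shot prompt only in its template. The sketch below is one possible layout, not the format used by Wei et al. The `Q:` / `Reasoning:` / `Answer:` prefixes and all names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReasonedExample:
    question: str
    reasoning: str
    answer: str


def build_cot_prompt(instruction, examples, question):
    """Few-shot prompt whose demonstrations show reasoning before the answer."""
    lines = [instruction, ""]
    for ex in examples:
        lines.append(f"Q: {ex.question}")
        lines.append(f"Reasoning: {ex.reasoning}")
        lines.append(f"Answer: {ex.answer}")
        lines.append("")
    lines.append(f"Q: {question}")
    lines.append("Reasoning:")  # dangling cue: the model writes steps, then the answer
    return "\n".join(lines)


def extract_answer(completion):
    """Return the text after the last 'Answer:' marker, or None if it is missing."""
    marker = "Answer:"
    if marker not in completion:
        return None
    return completion.rsplit(marker, 1)[1].strip()


demos = [
    ReasonedExample(
        question="A shelf holds 3 boxes with 4 pens each. How many pens?",
        reasoning="3 boxes times 4 pens per box is 12 pens.",
        answer="12",
    ),
]
prompt = build_cot_prompt("Solve the problem step by step.", demos,
                          "A crate holds 5 bags with 6 apples each. How many apples?")
print(prompt)
```

Because the completion now spans several lines, a newline stop sequence would cut it off before the answer. Stopping on the next `Q:` prefix is a safer choice with this template.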

Instruction following / zero-shot ICL. Instruction-tuned models, including RLHF-aligned chat models, often need no examples: a plain natural-language instruction triggers the task. This is in-context learning where the "demonstration" is a description rather than examples.

Many-shot ICL. With 100K+ token windows you can fit hundreds or thousands of examples. Google DeepMind's 2024 many-shot work (Agarwal et al.) showed accuracy can keep climbing well past the few-shot regime, sometimes rivaling fine-tuning, at the price of an expensive, long prompt on every call.

Retrieval-augmented ICL. Rather than a fixed example set, retrieve the most relevant demonstrations per query from a large pool. This combines a vector index with the ICL forward pass.

Common bugs and pitfalls

Forgetting the dangling cue. If your last line already contains the answer, the model has nothing to complete. End with "Sentiment:" (no value), not a finished pair.
Inconsistent formatting. Mixing => in some examples and : in others breaks the pattern the model is keying on. Pick one delimiter and one layout, everywhere.
No stop sequence. Without a stop sequence, the model happily generates another fake example after answering. Stop on the newline, or on the prefix that starts the next example (such as "Review:").
Recency and majority bias. Models over-weight the last example and over-predict whichever label appears most. Balance your label counts and don't always end on the same class.
Assuming the model "understood." Random-label experiments show ICL often relies on format and label space more than the true mapping. Don't trust it on a task where the input→output relation is genuinely subtle without checking.
Paying token cost forever. If you call the same task millions of times, a long example block is recurring spend. Past a volume threshold, fine-tuning or LoRA is cheaper — measure the crossover.
Window overflow. Stuffing too many shots truncates the actual query. Always reserve room for the input and the reply; count tokens before sending.
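
The window-overflow pitfall can be guarded against with a small budgeting step before the prompt is assembled. This is a minimal sketch. `rough_token_count` is a crude stand-in (about four characters per token) that ignores separators, and a real pipeline should count with the model's own tokenizer. All names are hypothetical.

```python
def rough_token_count(text):
    """Crude stand-in for a real tokenizer: about one token per 4 characters."""
    return max(1, len(text) // 4)


def fit_shots(shots, instruction, query, window, reply_reserve,
              count=rough_token_count):
    """Keep as many leading shots as fit once room is reserved for the
    instruction, the query, and the reply."""
    budget = window - reply_reserve - count(instruction) - count(query)
    if budget < 0:
        raise ValueError("instruction + query + reply reserve exceed the window")
    kept = []
    for shot in shots:
        cost = count(shot)
        if cost > budget:
            break  # stop at the first shot that no longer fits
        kept.append(shot)
        budget -= cost
    return kept


shots = [
    "Review: I loved every minute.\nSentiment: positive",
    "Review: A complete waste of time.\nSentiment: negative",
    "Review: Best purchase this year.\nSentiment: positive",
]
kept = fit_shots(shots, "Classify the review sentiment.",
                 "The plot dragged and I fell asleep.",
                 window=64, reply_reserve=8)
print(len(kept), "of", len(shots), "shots fit")
```

Dropping shots from the end keeps the function simple. If the shots were ranked by similarity to the query, that also drops the least relevant ones first.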

Frequently asked questions

Does in-context learning change the model's weights?

No. Every weight is frozen. The only thing that changes is the contents of the context window — the examples you paste in. The model produces a different answer because the forward pass conditions on those tokens, not because anything was trained. Close the session and the model has learned nothing permanently.

What's the difference between zero-shot, one-shot, and few-shot prompting?

They differ only in how many worked examples you put before the real query. Zero-shot gives a bare instruction with no examples; one-shot gives exactly one input→output demonstration; few-shot gives a handful, typically 2 to 32. The GPT-3 paper showed accuracy generally climbs with more shots until the context window fills up.

Why do small models fail at in-context learning while large ones succeed?

In-context learning is often described as an emergent ability: few-shot performance on many benchmarks stays near chance in small models and rises sharply past a parameter and training-data threshold. Mechanistically it relies in part on induction heads, attention circuits that form during training and copy patterns of the form "A followed by B … so after A predict B". Small models can form induction heads and copy literal patterns, but they tend to lack the capacity to infer more abstract tasks from prompt examples.

Is in-context learning the same as fine-tuning?

No. Fine-tuning runs gradient descent and permanently rewrites weights, so the adaptation persists and costs nothing extra per query. In-context learning bakes the examples into every prompt, so it costs tokens on every single call and forgets the moment the context is cleared. Fine-tuning wins on volume and latency; in-context learning wins on flexibility and zero setup.

How many examples can I fit in a prompt?

As many as fit in the context window minus your query and the model's reply. With a 128K-token window and ~40-token examples, that's roughly 3,200 demonstrations before you reserve that room. But attention cost scales as O(n²) in sequence length, so doubling the examples roughly quadruples the attention compute for the prompt. There's a real price to stuffing the window.

Why does the order and format of examples change the answer so much?

Because the model is pattern-matching on the literal token sequence, not parsing a clean dataset. Studies found that example order, label balance, and even delimiter consistency can swing accuracy by tens of points. Notoriously, replacing the labels on the examples with random ones often barely hurts accuracy: the model leans on the format and label space more than the actual input-label mapping.
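
Order sensitivity is easy to measure on your own task: score every ordering of the same shots and compare the worst and best accuracy. In this sketch `predict` is a caller-supplied function that would wrap a real model call. The `recency_biased_stub` is a deliberately extreme, hypothetical stand-in used only to make the example runnable.

```python
from itertools import permutations


def order_sensitivity(shots, eval_set, predict):
    """Score every ordering of the same shots; return (worst, best) accuracy.
    predict(ordered_shots, text) -> label is supplied by the caller."""
    accuracies = []
    for ordering in permutations(shots):
        correct = sum(predict(list(ordering), text) == gold
                      for text, gold in eval_set)
        accuracies.append(correct / len(eval_set))
    return min(accuracies), max(accuracies)


# Hypothetical stand-in for a model with extreme recency bias:
# it always repeats the label of the last demonstration.
def recency_biased_stub(ordered_shots, text):
    return ordered_shots[-1][1]


shots = [
    ("I loved it.", "positive"),
    ("Awful.", "negative"),
    ("Great value.", "positive"),
]
eval_set = [
    ("Fell asleep halfway.", "negative"),
    ("Broke in a day.", "negative"),
]
print(order_sensitivity(shots, eval_set, recency_biased_stub))  # (0.0, 1.0)
```

The number of orderings grows factorially, so beyond a handful of shots it is more practical to sample random orderings than to enumerate them all.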