Machine Learning

Retrieval-Augmented Generation (RAG)

Give a language model an open book instead of a perfect memory

Retrieval-Augmented Generation (RAG) bolts a vector-search retriever onto a large language model so it answers from fetched documents instead of memorized weights — cutting hallucinations and adding citations without retraining.

  • IntroducedLewis et al., 2020
  • Retriever cost≈ O(log n) per query (ANN)
  • Typical chunk size200–500 tokens
  • Typical top-k3–10 chunks
  • Update knowledgere-index, no retrain

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How RAG works

A standalone large language model answers from its weights — a lossy, frozen compression of whatever text it saw during training. Ask it about a document it never read, or a fact that changed last week, and it does the only thing it can: it generates the most statistically plausible continuation, which is often a confident, fluent, wrong answer. That's a hallucination, and no amount of clever prompting removes it, because the knowledge simply isn't in the weights.

Retrieval-Augmented Generation, introduced by Patrick Lewis and colleagues at Facebook AI in 2020, fixes this by separating knowledge from reasoning. The model keeps its language and reasoning ability; the facts move out into an external, searchable store. At query time the system fetches the relevant text and hands it to the model as part of the prompt, so the model answers from an open book instead of from memory.

There are two phases:

  1. Indexing (offline, once per document). Split every source document into chunks, run each chunk through an embedding model that maps text to a vector (e.g. 384–3072 dimensions), and store those vectors in a vector index. Embeddings place semantically similar text close together in the vector space — "how do I cancel my plan" lands near "subscription termination policy" even with no shared words.
  2. Retrieval + generation (online, per query). Embed the user's question with the same model, find its nearest chunk vectors by cosine similarity, and inject the top-k chunks into the prompt alongside an instruction like "Answer using only the context below; cite the source." The LLM now reads the actual text and grounds its answer in it.

The whole pipeline is just five stages: chunk → embed → store → retrieve → generate. Everything else — rerankers, hybrid search, query rewriting — is a refinement bolted onto that spine.

The retrieval mechanism and its cost

Retrieval is a nearest-neighbor search in vector space. Given a query vector q and a corpus of chunk vectors {d₁ … dₙ}, you want the k vectors maximizing cosine similarity:

cos(q, d) = (q · d) / (‖q‖ · ‖d‖)

If vectors are L2-normalized at index time, the denominator is 1 and similarity reduces to a plain dot product, which is what hardware-optimized libraries actually compute.

Exact search is O(n·D). Comparing the query against all n chunks, each of dimension D, is linear in corpus size. At a million chunks of dimension 768 that's ~768 million multiply-adds per query — fine for a prototype, far too slow at scale and per request.

Approximate nearest-neighbor (ANN) search is ≈ O(D·log n). Production RAG uses an ANN index — most commonly HNSW (a navigable small-world graph) — that trades a sliver of recall for a logarithmic query path. HNSW builds at roughly O(n·log n) and answers queries in O(D·log n), so a billion-vector index answers in single-digit milliseconds. The cost: it returns the approximate top-k (typically 95–99% recall@10), and the graph itself eats RAM — on the order of the raw vectors plus the edge lists.

The generation step's cost is dominated by the LLM's attention over the prompt. Transformer self-attention is O(L²) in sequence length L, so doubling the retrieved context roughly quadruples the attention work and the latency. That quadratic is exactly why you retrieve few, high-quality chunks rather than dumping the whole corpus into the window.

When to use RAG (and when not to)

  • Knowledge that changes. Docs, policies, prices, tickets, news. Re-index the changed file and it's searchable in seconds — no retraining run.
  • Answers that must cite sources. Because you know exactly which chunks fed the answer, you can show the user the receipts. This is non-negotiable for legal, medical, and compliance use.
  • Private or proprietary corpora. Your data never has to enter anyone's training set; it lives in your index.
  • Long-tail facts. Information too rare or too recent for the model to have memorized reliably.

RAG is the wrong tool when the gap is skill, not facts: if the model can't follow your output format, reason in your domain's style, or speak a low-resource language, retrieval won't help — that's a fine-tuning problem. It's also overkill when the entire knowledge base fits comfortably in the context window; just paste it in. And if your queries are exact-match lookups by ID or keyword, a plain inverted index or SQL query beats embedding everything.

RAG vs other ways to give a model knowledge

RAGFine-tuningLong-context stuffingTool / function calling
Where knowledge livesExternal vector indexModel weightsThe prompt, per requestExternal API / DB
Update latencySeconds (re-index one doc)Hours–days (retrain)Instant but re-pasted every callInstant (live source)
Can cite sources?Yes — returns the chunksNoYes (it's all in-prompt)Yes (structured result)
Per-query costANN search + k chunks of tokensJust generationWhole corpus tokenized every callAPI call + result tokens
Scales to corpus of…Billions of chunksBounded by training dataBounded by context window (≤ ~1–2M tokens)Whatever the backend holds
Handles structured/live dataPoorly (it's text similarity)NoPoorlyExcellently
Teaches new style/skillNoYes — its core strengthA little (few-shot)No
Main failure modeRetriever misses the right chunkStale facts, hallucination"Lost in the middle"; token costBrittle schemas / API errors

These aren't mutually exclusive. The strongest production systems fine-tune for behavior, use RAG for facts, and call tools for live or structured data — three layers solving three different problems.

What the numbers actually say

  • Re-indexing beats retraining by orders of magnitude. Embedding a new 10-page document is a handful of API calls costing a fraction of a cent and finishing in under a second. A full fine-tune of a mid-size model is hours of GPU time and tens to thousands of dollars — and you'd repeat it every time a fact changed.
  • ANN turns a linear scan into a log walk. Exact cosine over 1M × 768-dim vectors is ~768M ops per query; HNSW visits a few hundred nodes instead, hitting single-digit-millisecond latency at 95–99% recall@10.
  • Context is quadratic, so chunk count matters. Attention is O(L²): retrieving 10 chunks of 400 tokens (~4k tokens of context) instead of 3 chunks (~1.2k) is ~11× more attention work on that span — plus the token bill scales linearly with what you send.
  • "Lost in the middle" is real. Liu et al. (2023) showed LLMs answer best when the relevant passage is at the very start or end of a long context and noticeably worse when it's buried in the middle — a direct argument for retrieving fewer, better-ranked chunks rather than padding the window.
  • Rerankers move the needle. Adding a cross-encoder reranker on top of a vector retriever commonly lifts answer relevance by double-digit percentages on retrieval benchmarks, for the price of one extra model pass over ~50 candidates.

JavaScript implementation

A minimal end-to-end RAG loop with in-memory cosine search. In production you'd swap the linear scan for a vector database (Pinecone, Weaviate, pgvector, FAISS), but the logic is identical.

// --- 1. INDEX: chunk + embed + store -------------------------------
function chunkText(text, size = 400, overlap = 60) {
  const words = text.split(/\s+/);
  const chunks = [];
  for (let i = 0; i < words.length; i += size - overlap) {
    chunks.push(words.slice(i, i + size).join(' '));
    if (i + size >= words.length) break;
  }
  return chunks;
}

// embed() calls your embedding model; returns a Float32-like vector.
async function buildIndex(docs, embed) {
  const store = [];
  for (const doc of docs) {
    for (const chunk of chunkText(doc.text)) {
      const v = normalize(await embed(chunk));   // L2-normalize once, at index time
      store.push({ source: doc.id, text: chunk, vec: v });
    }
  }
  return store;
}

function normalize(v) {
  const norm = Math.hypot(...v) || 1;
  return v.map(x => x / norm);
}

// --- 2. RETRIEVE: nearest-k by cosine (= dot product on unit vectors)
function dot(a, b) { let s = 0; for (let i = 0; i < a.length; i++) s += a[i] * b[i]; return s; }

async function retrieve(store, query, embed, k = 4) {
  const q = normalize(await embed(query));
  return store
    .map(item => ({ ...item, score: dot(q, item.vec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// --- 3. GENERATE: stuff retrieved context into the prompt ----------
async function ragAnswer(store, query, embed, llm) {
  const hits = await retrieve(store, query, embed, 4);
  const context = hits
    .map((h, i) => `[${i + 1}] (source: ${h.source})\n${h.text}`)
    .join('\n\n');
  const prompt =
    `Answer the question using ONLY the context below. ` +
    `Cite sources as [1], [2]. If the answer isn't present, say "I don't know."\n\n` +
    `Context:\n${context}\n\nQuestion: ${query}`;
  return llm(prompt);   // your chat-completion call
}

Two details carry their weight. First, the query and the documents must be embedded by the same model — mixing embedding spaces makes the cosine scores meaningless. Second, the instruction "using ONLY the context" plus an explicit "say I don't know" escape hatch is what converts a fluent guesser into a grounded answerer; without it the model happily falls back to its weights.

Python implementation

The same pipeline in Python, with a hybrid twist: retrieve a wide set with fast vector search, then optionally rerank. Vectors are stored as a single NumPy matrix so the "nearest-k" step is one matrix-vector product.

import numpy as np

def chunk_text(text, size=400, overlap=60):
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

class RAGIndex:
    def __init__(self, embed):
        self.embed = embed          # embed(str) -> np.ndarray
        self.matrix = None          # (n_chunks, dim), L2-normalized
        self.meta = []              # parallel list of {"source", "text"}

    def build(self, docs):
        vecs, meta = [], []
        for doc in docs:
            for chunk in chunk_text(doc["text"]):
                v = self.embed(chunk)
                v = v / (np.linalg.norm(v) + 1e-9)   # normalize once
                vecs.append(v)
                meta.append({"source": doc["id"], "text": chunk})
        self.matrix = np.vstack(vecs).astype("float32")
        self.meta = meta

    def retrieve(self, query, k=4):
        q = self.embed(query)
        q = q / (np.linalg.norm(q) + 1e-9)
        scores = self.matrix @ q                     # cosine == dot on unit vectors
        top = np.argpartition(-scores, k)[:k]        # O(n) selection, not full sort
        top = top[np.argsort(-scores[top])]
        return [{**self.meta[i], "score": float(scores[i])} for i in top]

def rag_answer(index, query, llm, k=4):
    hits = index.retrieve(query, k)
    context = "\n\n".join(
        f"[{i+1}] (source: {h['source']})\n{h['text']}" for i, h in enumerate(hits)
    )
    prompt = (
        "Answer using ONLY the context below. Cite sources as [1], [2]. "
        'If the answer is not present, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt), hits          # return hits so the UI can show citations

Note np.argpartition: it finds the top-k in O(n) instead of sorting all n scores in O(n·log n). At a few thousand chunks that's negligible; at a million it's the difference between a usable demo and a hung request — and it's exactly the kind of optimization a real vector DB does for you under the hood.

Variants worth knowing

Hybrid retrieval. Run dense (embedding) search and sparse (BM25 keyword) search in parallel, then fuse the rankings with Reciprocal Rank Fusion. Dense catches paraphrase; sparse catches exact tokens like error codes and product names that embeddings smear together. Hybrid reliably beats either alone.

Reranking (two-stage retrieval). Use a cheap bi-encoder to fetch ~50 candidates, then a cross-encoder that reads the query and each chunk together to re-score them, keeping the top 3–5. Slower per pair, dramatically more precise — the single highest-leverage upgrade for most systems.

HyDE (Hypothetical Document Embeddings). Ask the LLM to draft a fake answer to the query, embed that, and search with it. A hypothetical answer often sits closer in vector space to the real answer chunks than the terse question does.

Query rewriting / multi-query. Expand a vague question into several focused sub-queries, retrieve for each, and union the results — useful for multi-hop questions where no single chunk holds the whole answer.

GraphRAG. Build a knowledge graph of entities and relations over the corpus, then retrieve connected subgraphs instead of independent chunks. Better for "summarize the themes across all documents" questions that flat chunk retrieval handles poorly.

Agentic / iterative RAG. Let the model retrieve, read, decide it needs more, and retrieve again in a loop, rather than a single fixed fetch. Trades latency and token cost for the ability to answer questions that need several rounds of lookup.

Common bugs and edge cases

  • Mismatched embedding models. Indexing with one model and querying with another — or upgrading the embedding model without re-indexing — silently destroys retrieval. Query and corpus must share one vector space, so a model change means a full re-index.
  • Chunks too big or too small. Whole-document embeddings average away the specifics and match nothing; one-sentence chunks fragment the answer across many hits. 200–500 tokens with 10–20% overlap is the usual sweet spot, and overlap stops answers from being severed at a chunk boundary.
  • Retrieving on the raw question. Conversational queries ("what about the second one?") carry no standalone meaning; rewrite them into self-contained queries using the chat history before embedding.
  • No "I don't know" path. If the retriever returns junk and the prompt doesn't authorize abstention, the model fills the void with a confident hallucination. Always give it permission to refuse, and consider a similarity-score threshold below which you don't answer.
  • Ignoring "lost in the middle." Dumping top-20 chunks in arbitrary order buries the best evidence mid-context where the model attends least. Rerank, trim to the few best, and place them at the edges of the context.
  • Stale index. RAG's whole selling point is freshness, but only if your indexing pipeline actually re-runs when documents change. A delete in the source that never propagates to the index means the model keeps citing deleted, wrong, or retracted content.

Frequently asked questions

Does RAG eliminate hallucinations?

No — it reduces them. Grounding the model in retrieved text makes it far more likely to answer from facts, but the LLM can still misread a passage, blend two chunks, or hallucinate when the retriever returns nothing relevant. RAG shifts the failure mode from "the model invents an answer" to "the model answers from the wrong document," which is at least auditable because you can show the retrieved sources.

RAG vs fine-tuning — which should I use?

Use RAG when knowledge changes often or must be cited (docs, policies, product catalogs) — you re-index instead of retraining, so a new document is searchable in seconds. Use fine-tuning to teach a style, format, or skill the model lacks. They compose: fine-tune for behavior, RAG for facts. RAG is far cheaper to keep current; fine-tuning bakes knowledge into weights that go stale the moment the source changes.

Why split documents into chunks instead of embedding whole files?

An embedding is a fixed-size vector, so a 50-page PDF and a single sentence both collapse to, say, 768 numbers — the long document's vector becomes an average that matches nothing precisely. Chunking (typically 200–500 tokens with 10–20% overlap) keeps each vector semantically tight so the retriever can return the exact passage that answers the query, and keeps the prompt within the model's context window.

What is the difference between dense and sparse retrieval in RAG?

Sparse retrieval (BM25, TF-IDF) matches exact tokens — great for names, error codes, and rare keywords. Dense retrieval embeds query and documents into a shared vector space and matches by cosine similarity — great for paraphrase and meaning. Hybrid retrieval runs both and fuses the rankings (e.g. Reciprocal Rank Fusion), which in practice beats either alone.

How many chunks should I retrieve (what is top-k)?

Top-k is how many nearest chunks you stuff into the prompt. Typical values are 3 to 10. Too few and you miss the answer; too many and you blow the context budget, raise latency and token cost, and dilute the signal (the "lost in the middle" effect, where models attend poorly to text buried in long contexts). A reranker lets you retrieve many candidates cheaply, then keep only the best 3–5.

What is a reranker and do I need one?

A reranker is a second-stage model (usually a cross-encoder) that scores each (query, chunk) pair jointly instead of comparing pre-computed vectors. It is slower per pair but far more accurate, so the standard pattern is retrieve ~50 candidates with fast vector search, then rerank down to the top 5. It is optional, but it is the single highest-leverage quality fix in most RAG systems.