Databases

Inverted Index

Term → docs, the data structure behind every search engine

An inverted index maps each term to the list of documents containing it. It's the data structure underneath every full-text search engine — Lucene, Elasticsearch, OpenSearch, Postgres GIN, and Google's earliest crawlers — and the reason a query against a billion documents returns in milliseconds.

  • Lookup costO(1) hash + O(p) walk
  • AND query costO(min(p, q))
  • Storage at 1B docsLow single-digit GB compressed
  • Canonical engineApache Lucene
  • RankingTF-IDF / BM25

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How an inverted index works

Indexing has two stages: tokenize, then invert. Given a corpus of documents, the indexer walks each document and emits a stream of (term, doc_id, position) triples. It then groups by term and sorts by doc_id. The result, for each term, is a sorted posting list of the documents it appears in.

Tokenization is the part everyone underestimates. How you split determines what queries can match. A pipeline typically does:

  1. Normalize: lowercase, strip punctuation, fold accents (café → cafe).
  2. Split: on whitespace, word boundaries, or a CJK segmenter for languages without spaces.
  3. Filter: drop stop words (the, a, and) — saves space, blocks "to be or not to be"-style phrase queries.
  4. Stem or lemmatize: running → run, better → good — boosts recall, costs precision.

The choices stack: an English stemmer breaks Chinese, code search, and product SKUs. Production engines run different analyzers per field.

Querying inverts the trick. To answer apple AND banana, fetch the two posting lists and intersect. Score documents by TF-IDF (or BM25) summed across query terms.

Inverted index vs forward vs B-tree vs trigram

Inverted indexForward indexB-treeTrigram (n-gram) index
Mappingterm → docsdoc → termskey → row pointer3-char gram → docs
"Find docs with X"O(1) hash + O(p)O(N) scanPrefix onlyO(g) per gram
Full-text ANDList intersectScanNot supportedGram intersect + verify
Substring "%foo%"No (without n-grams)ScanNoYes
Range / sortNoSequentialYesNo
Storage cost~10-30% of corpus~corpus size~5-10% of indexed col~30-100% of corpus
Used forSearch, log analysisDoc retrieval, snippetsOLTP point/rangeLIKE-style search
EnginesLucene, Postgres GINMost DB row storesInnoDB, Postgres B-treePostgres pg_trgm, MySQL ngram

Real systems use several at once. Elasticsearch keeps an inverted index (matching) and a column store (sorting and aggregation). Postgres can have a GIN index (inverted) plus a B-tree on the same table. B-trees remain unmatched for range scans; the inverted index dominates "contains this word".

Tokenization + AND query — JavaScript

// Build an inverted index, intersect posting lists for an AND query,
// and score with a TF-IDF sketch.

const STOP_WORDS = new Set(['the', 'a', 'an', 'and', 'or', 'of', 'is', 'in']);

function tokenize(text) {
  return text.toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter(t => t && !STOP_WORDS.has(t));
}

class InvertedIndex {
  constructor() {
    this.postings = new Map();   // term -> sorted [docId, ...]
    this.tf = new Map();         // term -> Map(docId -> count)
    this.docCount = 0;
  }

  addDoc(docId, text) {
    this.docCount++;
    const tokens = tokenize(text);
    const counts = new Map();
    for (const t of tokens) counts.set(t, (counts.get(t) || 0) + 1);
    for (const [term, count] of counts) {
      if (!this.postings.has(term)) this.postings.set(term, []);
      const list = this.postings.get(term);
      // Posting lists must stay sorted to support fast intersection.
      if (list[list.length - 1] !== docId) list.push(docId);
      if (!this.tf.has(term)) this.tf.set(term, new Map());
      this.tf.get(term).set(docId, count);
    }
  }

  // Intersect two sorted posting lists in O(min(|a|, |b|)).
  intersect(a, b) {
    const out = [];
    let i = 0, j = 0;
    while (i < a.length && j < b.length) {
      if (a[i] === b[j]) { out.push(a[i]); i++; j++; }
      else if (a[i] < b[j]) i++;
      else j++;
    }
    return out;
  }

  // AND query with TF-IDF ranking. IDF = log(N / df).
  search(query) {
    const terms = tokenize(query);
    if (terms.length === 0) return [];
    let candidates = this.postings.get(terms[0]) || [];
    for (let k = 1; k < terms.length; k++) {
      candidates = this.intersect(candidates, this.postings.get(terms[k]) || []);
    }
    return candidates.map(docId => {
      let score = 0;
      for (const t of terms) {
        const df = (this.postings.get(t) || []).length;
        const tf = this.tf.get(t)?.get(docId) || 0;
        const idf = df ? Math.log(this.docCount / df) : 0;
        score += tf * idf;
      }
      return { docId, score };
    }).sort((a, b) => b.score - a.score);
  }
}

const idx = new InvertedIndex();
idx.addDoc(1, 'The quick brown fox jumps over the lazy dog');
idx.addDoc(2, 'A brown dog barks at the lazy fox');
idx.addDoc(3, 'A quick fox dies young');
idx.search('quick fox');     // [{docId: 3, ...}, {docId: 1, ...}]

The intersection inner loop is the algorithmic core. Real engines speed it up further with skip pointers (jump ahead by k entries when one list is much larger) and SIMD-accelerated decode of compressed posting blocks. Skip lists are the conceptual ancestor.

Python implementation with TF-IDF

import math
import re
from collections import defaultdict

STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'of', 'is', 'in'}

def tokenize(text: str) -> list[str]:
    return [t for t in re.split(r'[^a-z0-9]+', text.lower())
            if t and t not in STOP_WORDS]

class InvertedIndex:
    def __init__(self):
        self.postings: dict[str, list[int]] = defaultdict(list)
        self.tf: dict[str, dict[int, int]] = defaultdict(lambda: defaultdict(int))
        self.doc_count = 0

    def add_doc(self, doc_id: int, text: str) -> None:
        self.doc_count += 1
        tokens = tokenize(text)
        counts: dict[str, int] = defaultdict(int)
        for t in tokens:
            counts[t] += 1
        for term, c in counts.items():
            if not self.postings[term] or self.postings[term][-1] != doc_id:
                self.postings[term].append(doc_id)   # stays sorted by insertion order
            self.tf[term][doc_id] = c

    @staticmethod
    def _intersect(a: list[int], b: list[int]) -> list[int]:
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    def search(self, query: str) -> list[tuple[int, float]]:
        terms = tokenize(query)
        if not terms:
            return []
        candidates = list(self.postings.get(terms[0], []))
        for t in terms[1:]:
            candidates = self._intersect(candidates, self.postings.get(t, []))
        scored = []
        for doc_id in candidates:
            score = 0.0
            for t in terms:
                df = len(self.postings.get(t, []))
                tf = self.tf[t].get(doc_id, 0)
                idf = math.log(self.doc_count / df) if df else 0.0
                score += tf * idf
            scored.append((doc_id, score))
        return sorted(scored, key=lambda x: -x[1])

# Production tip: BM25 replaces raw TF-IDF in Lucene/Elasticsearch.
# BM25 saturates the TF term (tf / (tf + k)) so a 100-occurrence document
# doesn't dominate a 5-occurrence one, and normalizes by document length.

Variants — posting-list compression

The reason a billion-document index fits in memory is that posting lists compress brutally well. Three families dominate:

  • VByte: store each delta in 1-5 bytes; high bit signals continuation. Fast to decode, ~2-4 bytes per posting. Used in Lucene's earliest formats.
  • Simple9 / Simple16: pack as many deltas as possible into a 32-bit word using a 4-bit selector. Decoder is a table lookup; ~1-2 bytes per posting.
  • PFor / FastPFor: bit-pack a block of N deltas at the median bit-width, escape exceptions. Lucene's BlockPostingsFormat uses PForDelta over 128-doc blocks.
  • Roaring Bitmaps: hybrid array/bitmap for posting sets; powers Pinot, Druid, Lucene's SortedNumericDocValues.

Engine variants by use case:

  • Apache Lucene: the de facto reference. Per-segment immutable posting lists, periodic merge. Powers Elasticsearch, OpenSearch, Solr.
  • Postgres GIN: built-in inverted index for tsvector full-text, jsonb, array containment.
  • Tantivy: Rust port of Lucene — used by Quickwit, Meilisearch internals.
  • Vespa: Yahoo's search platform; combines inverted index with vector ANN.
  • FAISS / HNSW vectors: modern complement — exact match in the inverted index, semantic match in the dense-vector index, learned reranker on top.

What an inverted index costs

At 1 billion documents averaging 1KB of text, the raw posting count is ~100-200 billion postings after stop-word removal and stemming. With PFor compression at ~1-2 bytes/posting, that's 100-300GB — fits across a small Elasticsearch cluster. Without compression, 4-8× that — the difference between "in memory" and "thrashes disk on every query".

Indexing throughput is rarely the bottleneck (10-100MB/s/core), with merging dominating at high update rates. Query throughput is governed by which posting lists fit in page cache. Impact-ordered indexes — sorting posting lists by document score — let a query stop early and answer top-k in time proportional to k, not the full intersection.

Every feature has a posting-list cost: adding a search field doubles the index for it; phrase queries need positions, costing ~3-4× more bytes.

Common bugs and edge cases

  • Stop-word handling. Dropping stop words speeds queries but breaks phrase searches like "to be or not to be". Production engines often keep them but penalize their score.
  • Stemming bias. An aggressive stemmer maps "universe" and "university" both to "univers" — recall up, precision down. Never apply English stemming to product codes.
  • Tokenizer mismatch. Indexer normalizes "AT&T" to "att"; query parser leaves it as "at&t". Zero hits. Always run query and document text through the same analyzer.
  • Forgetting positions for phrase queries. Without positions, "New York" matches any document containing both terms anywhere.
  • Hot terms. A query for "the" (if not stripped) returns a posting list of every document. Block at parse time or rely on IDF to score it as worthless.
  • Segment merge storms. High-update workloads trigger cascading merges that throttle indexing throughput.
  • Delete hostility. "Delete" means tombstone-then-merge; until merge runs, deleted docs still occupy posting space.

When to reach for an inverted index

  • Full-text search over user-generated content, logs, documentation, code.
  • Tag / faceted filtering at scale (each facet is a term).
  • Boolean retrieval — find docs matching a complex AND/OR/NOT predicate over high-cardinality flags.
  • Trigram / substring search (%foo%) — a special case where the "term" is an n-gram.

Use a B-tree instead when you need range queries, ordered traversal, or point lookups by a sortable key. Use a vector index when you need semantic similarity. Use both — Elasticsearch's hybrid search (sparse inverted plus dense vector with a learned-weight blend) is now the default for any production search system that cares about recall.

Frequently asked questions

Why is it called "inverted"?

Because it inverts the natural document → terms relationship into term → documents. A forward index says "document 7 contains the words apple, banana, cherry"; the inverted index says "banana appears in documents 3, 7, 14, 22". The inverted form is what makes "find me docs containing X" a near-instant lookup.

What is a posting list?

The list of postings (document IDs, often with positions and term frequencies) for a single term. "banana → [3, 7, 14, 22]" is the posting list for "banana". Sorted, often delta-encoded, and frequently compressed with VByte or PFor to keep large indexes in memory.

How does an AND query work?

Intersect the posting lists. To answer "apple AND banana", take both lists (sorted) and walk them in lockstep — only emit document IDs present in both. Linear in the size of the smaller list. The "galloping" or "skip-list" variants get sublinear when one list is much larger than the other.

What is TF-IDF?

Term frequency × inverse document frequency: a per-term weight that's high when the term occurs often in this document but rarely across the corpus. Used to rank results — "apple" in a fruit catalog is less informative than "pomelo". Modern engines use BM25, a refined TF-IDF variant that tames long documents and very frequent terms.

How are huge indexes stored efficiently?

Sort the posting list, store deltas (1, +5, +9, +13 instead of 1, 6, 15, 28), then compress the deltas with VByte, Simple9, PFor, or the more recent Roaring Bitmaps. A billion-document corpus has total posting bytes in the low GB, fitting in RAM on a single machine.