Data Structures

Vector Search (HNSW)

Q: Why is HNSW only approximate, not exact?

A greedy walk through the graph can get stuck in a local minimum — a node whose neighbors are all farther from the query than itself, even though a closer node exists elsewhere. Widening the search beam (the ef parameter) explores more candidates and lowers the miss rate, but exact nearest neighbor in high dimensions costs a full O(n·d) scan, which is exactly what HNSW trades away for speed. Typical recall sits at 95–99% with ef tuned in the low hundreds.

Q: What do M and ef mean in HNSW?

M is the maximum number of neighbor links each node keeps per layer (base layer gets up to 2·M). Higher M means a denser graph: better recall and faster search, but more memory and slower builds. ef is the size of the dynamic candidate list the search keeps; efConstruction controls graph quality at build time and efSearch controls the speed–recall trade at query time. Raise ef to buy recall, lower it to buy latency.

Q: How does HNSW achieve O(log n) search?

It borrows the skip-list idea. Each node is assigned a random maximum layer from an exponentially decaying distribution, so the top layer is sparse with long-range links and lower layers are progressively denser. Search starts at the single top entry point, greedily descends taking the biggest useful jumps first, then refines locally on the dense base layer. The expected number of hops scales with the number of layers, which is logarithmic in n.

Q: Is HNSW better than IVF or product quantization?

HNSW usually wins on recall-at-fixed-latency for in-memory datasets up to tens of millions of vectors, which is why it is the default in FAISS, pgvector, Qdrant, Weaviate, and Milvus. IVF (inverted file) and product quantization win on memory: PQ can compress a 768-dim float vector from 3 KB to ~64 bytes. Production systems often combine them — IVF or PQ to shrink memory, HNSW or a coarse graph to route — as in FAISS's IVF-HNSW and PQ-augmented indexes.

Q: Can you delete vectors from an HNSW index?

Not cleanly. The original algorithm has no delete; the graph's neighbor lists assume every node stays reachable. Real systems use soft deletes (tombstone the vector, skip it in results) and periodically rebuild or compact the index. Hard deletion risks disconnecting the graph, because removing a hub node can strand the nodes that only reached the rest of the graph through it.

Q: Why is HNSW the heart of RAG?

Retrieval-augmented generation embeds your documents into vectors, then at query time embeds the question and fetches the nearest document chunks to stuff into the prompt. With millions of chunks, a brute-force cosine scan is too slow for an interactive request, so the retrieval step runs an approximate nearest-neighbor index — and HNSW is the most common choice because it returns the top-k neighbors in single-digit milliseconds at 95%+ recall.

A graph of shortcuts that finds your nearest neighbor in a million-vector haystack

HNSW (Hierarchical Navigable Small World) is a graph index for approximate nearest-neighbor search over embeddings, layering long-range shortcuts above dense local links so a greedy walk finds the closest vectors in roughly O(log n) hops — the engine behind RAG and vector databases.

Search time≈ O(log n) hops
Build timeO(n log n)
Memory per vectord floats + ~M·8 bytes links
Typical recall95–99%
Key parametersM, efConstruction, efSearch

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The problem: nearest neighbor in high dimensions

Modern AI turns everything — a sentence, an image, a product, a face — into a vector: a list of a few hundred to a few thousand floating-point numbers produced by an embedding model. Two things are "similar" when their vectors are close, usually by cosine similarity or Euclidean distance. So semantic search, recommendation, deduplication, and the retrieval step of RAG all reduce to one question: given a query vector, which stored vectors are closest?

The brute-force answer is to compare the query against every stored vector. For n vectors of dimension d, that's O(n·d) per query. At 10 million 768-dimension embeddings, each query touches 7.68 billion multiply-adds — tens of milliseconds even with SIMD, and it scales linearly as your corpus grows. Interactive search and high-throughput RAG can't afford that.

You might reach for a space-partitioning tree like a k-d tree. They're excellent in 2–3 dimensions and useless past about 20: the curse of dimensionality means almost all the volume of a high-dimensional ball sits near its surface, so the pruning that makes k-d trees fast collapses and they degrade to scanning nearly everything. The practical escape is to stop demanding the exact nearest neighbor and accept the approximate one — the right answer 95–99% of the time, hundreds of times faster. That's what HNSW delivers.

How the navigable small-world graph works

HNSW was published by Yu Malkov and Dmitry Yashunin in 2016 ("Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"). It builds on two ideas.

Navigable small world (NSW). Connect each vector to a handful of its nearest neighbors and you get a graph where you can find any target by greedy routing: from your current node, jump to whichever neighbor is closest to the query, and repeat until no neighbor is closer. The "small world" property — short average path length thanks to a few long-range links — means this walk reaches the target in surprisingly few hops, the same reason any two people are linked by ~6 handshakes.

Hierarchy (the skip-list trick). Plain NSW still wastes hops zig-zagging across the graph at the start. HNSW fixes this by stacking layers, exactly like a skip list stacks express lanes over a linked list. When a node is inserted it's assigned a maximum layer l drawn from an exponentially decaying distribution, l = ⌊−ln(uniform(0,1)) · mL⌋. Most nodes live only on layer 0; a few reach layer 1; a tiny number reach the top. Upper layers are sparse with long-range links; the base layer (layer 0) holds every node with dense local links.

Search is then a coarse-to-fine descent:

Start at the single entry point on the top layer.
On the current layer, greedily hop to neighbors closer to the query until you hit a local best. These hops are long, so they cover huge distances cheaply.
Drop down one layer using that best node as the new entry point, and repeat with progressively shorter hops.
On layer 0, run a beam search keeping the best ef candidates, then return the closest k.

The number of layers is O(log n) in expectation, and the work per layer is bounded by the degree M, giving the famous logarithmic search.

The mechanism and its complexity

Three parameters govern everything:

M — max neighbors per node per layer. Layer 0 allows up to M_max0 = 2·M; typical M is 12–48.
efConstruction — the candidate-list width used while building. Bigger = higher-quality graph, slower build. Typical 100–500.
efSearch (often just ef) — the candidate-list width at query time. Must be ≥ k. Bigger = higher recall, slower query.

The level assignment uses the normalization constant mL = 1 / ln(M), which makes the expected fraction of nodes on layer i+1 equal to 1/M of layer i. So the expected number of layers is log_M(n).

Operation	Cost	Why
Search (query)	O(log n) hops, each O(M·d) work	logarithmic layers × bounded degree × distance cost
Insert one vector	O(log n · M · d)	route down like a search, then link to M neighbors per layer
Build n vectors	O(n · log n · M · d)	n inserts
Memory	O(n · (d + M))	raw vectors plus neighbor lists
Delete	not supported natively	tombstone + rebuild (see edge cases)

One subtlety carries most of the quality: when you link a new node to its M closest candidates, you do not just take the M nearest. You run a heuristic neighbor selection that prefers a diverse spread of directions, dropping a candidate if it's closer to an already-chosen neighbor than to the new node. This keeps the graph navigable — without it, dense clusters form tight cliques with no shortcuts out, and greedy routing dead-ends.

When to use HNSW (and when not to)

Use it for in-memory approximate nearest-neighbor over embeddings: semantic search, RAG retrieval, recommendation candidate generation, image/audio similarity, dedup.
Use it when you need high recall at low latency and the dataset (up to tens of millions of vectors) fits in RAM.
Reconsider when memory is the binding constraint — HNSW stores full vectors plus links; product quantization or IVF compresses far better.
Reconsider for write-heavy workloads with frequent deletes; the graph doesn't support clean deletion and builds are not cheap.
Don't bother for exact search on small or low-dimensional data — a brute-force scan or a k-d tree is simpler and exact.

HNSW vs other ANN indexes

	HNSW (graph)	IVF (inverted file)	Product Quantization	LSH	Annoy (trees)	Brute force
Search time	O(log n) hops	O(n/nlist · d)	O(n · d') compressed	O(buckets)	O(log n · trees)	O(n · d)
Typical recall@10	95–99%	90–98% (tune nprobe)	80–95%	70–90%	85–95%	100%
Memory per vector	d floats + M links (high)	d floats + cluster id	~16–64 bytes (lowest)	hash codes	d floats + tree nodes	d floats
Build cost	O(n log n) — slow	O(n) + k-means train	O(n) + codebook train	O(n) fast	O(n log n)	none
Supports delete	tombstone + rebuild	yes (per list)	yes	yes	rebuild	trivial
Best for	recall@latency, RAG	large datasets, sharding	billion-scale, memory-bound	streaming, simple	read-only, mmap	tiny / exact

The headline trade is memory versus recall-at-latency. HNSW pays the most memory and the slowest build to buy the best recall at a given query latency, which is why FAISS, pgvector, Qdrant, Weaviate, and Milvus all ship it as a default. When datasets pass ~100M vectors and no longer fit in RAM, teams switch to IVF or PQ — often combining them, e.g. FAISS's IVF-HNSW that uses a coarse HNSW graph to pick IVF cells.

What the numbers actually say

Brute force at 10M × 768-dim: 7.68B FLOPs/query ≈ 20–50 ms even vectorized. HNSW returns top-10 at the same recall in under 1 ms per query on the same hardware — a 30–100× speedup. The ann-benchmarks project routinely shows HNSW at 5,000–20,000 queries/sec at 95% recall on SIFT1M.
Memory overhead: with M = 16, each node stores up to 16 neighbor IDs per layer. Almost all nodes are layer-0-only, so ~32 IDs × 4 bytes ≈ 128 bytes of links, on top of 768 × 4 = 3,072 bytes of vector. The graph adds roughly 4% memory — cheap. The expensive part is keeping the 3 KB raw vectors in RAM.
Recall is a dial, not a constant: on SIFT1M, raising efSearch from 16 → 64 → 256 typically moves recall@10 from ~90% → ~97% → ~99.5%, while per-query latency rises roughly linearly. You choose your point on that curve.
Build is the tax: indexing 1M vectors with M = 16, efConstruction = 200 takes on the order of 1–3 minutes single-threaded, scaling as O(n log n). Doubling efConstruction roughly doubles build time for a few points of recall.

JavaScript implementation

A compact but faithful HNSW: random level assignment, layered greedy descent, beam search on the base layer, and bidirectional linking on insert. Distance is squared Euclidean (monotonic with Euclidean, cheaper).

class HNSW {
  constructor({ M = 16, efConstruction = 200 } = {}) {
    this.M = M;
    this.Mmax0 = M * 2;
    this.efC = efConstruction;
    this.mL = 1 / Math.log(M);
    this.nodes = [];          // { vec, neighbors: [Set per layer] }
    this.entry = -1;          // id of top entry point
    this.maxLayer = -1;
  }

  dist(a, b) {                // squared Euclidean
    let s = 0;
    for (let i = 0; i < a.length; i++) { const d = a[i] - b[i]; s += d * d; }
    return s;
  }

  randomLevel() {
    return Math.floor(-Math.log(Math.random()) * this.mL);
  }

  // Greedy beam search on one layer; returns up to ef closest node ids.
  searchLayer(q, entryPoints, ef, layer) {
    const visited = new Set(entryPoints);
    // candidates: min-heap by distance (we use a sorted array for clarity)
    const cand = entryPoints.map(id => ({ id, d: this.dist(q, this.nodes[id].vec) }));
    const found = [...cand];
    cand.sort((a, b) => a.d - b.d);
    found.sort((a, b) => a.d - b.d);

    while (cand.length) {
      const c = cand.shift();                 // nearest unexpanded
      const worst = found[found.length - 1];
      if (c.d > worst.d && found.length >= ef) break;
      for (const n of this.nodes[c.id].neighbors[layer] || []) {
        if (visited.has(n)) continue;
        visited.add(n);
        const d = this.dist(q, this.nodes[n].vec);
        if (d < found[found.length - 1].d || found.length < ef) {
          cand.push({ id: n, d });  cand.sort((a, b) => a.d - b.d);
          found.push({ id: n, d }); found.sort((a, b) => a.d - b.d);
          if (found.length > ef) found.pop();
        }
      }
    }
    return found;
  }

  insert(vec) {
    const id = this.nodes.length;
    const level = this.randomLevel();
    const neighbors = Array.from({ length: level + 1 }, () => new Set());
    this.nodes.push({ vec, neighbors });

    if (this.entry === -1) { this.entry = id; this.maxLayer = level; return id; }

    let ep = [this.entry];
    // Phase 1: descend from top to level+1 with ef = 1 (pure greedy).
    for (let lc = this.maxLayer; lc > level; lc--) {
      ep = [this.searchLayer(vec, ep, 1, lc)[0].id];
    }
    // Phase 2: at each layer ≤ level, find efC candidates and link M of them.
    for (let lc = Math.min(level, this.maxLayer); lc >= 0; lc--) {
      const W = this.searchLayer(vec, ep, this.efC, lc);
      const Mlc = lc === 0 ? this.Mmax0 : this.M;
      const chosen = this.selectNeighbors(vec, W, Mlc);
      for (const nb of chosen) {
        this.nodes[id].neighbors[lc].add(nb);
        this.nodes[nb].neighbors[lc].add(id);
        // prune the neighbor back down to its cap
        if (this.nodes[nb].neighbors[lc].size > Mlc) {
          const reranked = this.selectNeighbors(
            this.nodes[nb].vec,
            [...this.nodes[nb].neighbors[lc]].map(x => ({ id: x, d: this.dist(this.nodes[nb].vec, this.nodes[x].vec) })),
            Mlc);
          this.nodes[nb].neighbors[lc] = new Set(reranked);
        }
      }
      ep = W.map(w => w.id);
    }
    if (level > this.maxLayer) { this.maxLayer = level; this.entry = id; }
    return id;
  }

  // Diversity heuristic: keep a candidate only if it's closer to the
  // query than to any already-selected neighbor.
  selectNeighbors(q, candidates, M) {
    const sorted = [...candidates].sort((a, b) => a.d - b.d);
    const result = [];
    for (const c of sorted) {
      if (result.length >= M) break;
      const closerToResult = result.some(r =>
        this.dist(this.nodes[c.id].vec, this.nodes[r].vec) < c.d);
      if (!closerToResult) result.push(c.id);
    }
    return result;
  }

  search(q, k = 10, ef = 50) {
    if (this.entry === -1) return [];
    let ep = [this.entry];
    for (let lc = this.maxLayer; lc > 0; lc--) {
      ep = [this.searchLayer(q, ep, 1, lc)[0].id];
    }
    const W = this.searchLayer(q, ep, Math.max(ef, k), 0);
    return W.slice(0, k).map(w => ({ id: w.id, dist: Math.sqrt(w.d) }));
  }
}

Two details that production libraries optimize but this version makes explicit: the selectNeighbors diversity heuristic (lines that drop redundant neighbors) is what keeps the graph navigable, and re-pruning a neighbor's list after adding a back-link is what stops popular hub nodes from accumulating unbounded degree. Swap the sorted-array candidate lists for real binary heaps and you have essentially the FAISS / hnswlib structure.

Python implementation

In practice you'd never write this — you'd call a library. The idiomatic one-liner uses hnswlib (the reference implementation by the paper's author):

import hnswlib
import numpy as np

dim, n = 768, 1_000_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space='cosine', dim=dim)   # or 'l2', 'ip'
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(data, np.arange(n))              # build: O(n log n)

index.set_ef(64)                                 # query-time recall dial (ef ≥ k)
query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10) # ~sub-millisecond
print(labels, distances)

And the algorithm itself, mirroring the JS above so the structure is clear in a language without classes-everywhere boilerplate:

import math, random, heapq

class HNSW:
    def __init__(self, M=16, ef_construction=200):
        self.M, self.Mmax0, self.efC = M, 2 * M, ef_construction
        self.mL = 1 / math.log(M)
        self.vecs, self.links = [], []     # links[id][layer] = set of ids
        self.entry, self.max_layer = -1, -1

    def dist(self, a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))   # squared L2

    def random_level(self):
        return int(-math.log(random.random()) * self.mL)

    def search_layer(self, q, entry_points, ef, layer):
        visited = set(entry_points)
        # candidates: min-heap by distance; found: max-heap (store -d)
        cand = [(self.dist(q, self.vecs[e]), e) for e in entry_points]
        heapq.heapify(cand)
        found = [(-d, e) for d, e in cand]
        heapq.heapify(found)
        while cand:
            d, c = heapq.heappop(cand)
            if -found[0][0] < d and len(found) >= ef:
                break
            for nb in self.links[c][layer] if layer < len(self.links[c]) else []:
                if nb in visited:
                    continue
                visited.add(nb)
                dn = self.dist(q, self.vecs[nb])
                if dn < -found[0][0] or len(found) < ef:
                    heapq.heappush(cand, (dn, nb))
                    heapq.heappush(found, (-dn, nb))
                    if len(found) > ef:
                        heapq.heappop(found)
        return sorted([(-md, i) for md, i in found])     # (dist, id) ascending

    def select_neighbors(self, q_id, candidates, M):
        result = []
        for d, cid in sorted(candidates):
            if len(result) >= M:
                break
            if all(self.dist(self.vecs[cid], self.vecs[r]) >= d for r in result):
                result.append(cid)
        return result

    def insert(self, vec):
        i = len(self.vecs)
        level = self.random_level()
        self.vecs.append(vec)
        self.links.append([set() for _ in range(level + 1)])
        if self.entry == -1:
            self.entry, self.max_layer = i, level
            return i
        ep = [self.entry]
        for lc in range(self.max_layer, level, -1):
            ep = [self.search_layer(vec, ep, 1, lc)[0][1]]
        for lc in range(min(level, self.max_layer), -1, -1):
            W = self.search_layer(vec, ep, self.efC, lc)
            Mlc = self.Mmax0 if lc == 0 else self.M
            for nb in self.select_neighbors(i, W, Mlc):
                self.links[i][lc].add(nb)
                self.links[nb][lc].add(i)
            ep = [w[1] for w in W]
        if level > self.max_layer:
            self.max_layer, self.entry = level, i
        return i

    def search(self, q, k=10, ef=50):
        if self.entry == -1:
            return []
        ep = [self.entry]
        for lc in range(self.max_layer, 0, -1):
            ep = [self.search_layer(q, ep, 1, lc)[0][1]]
        W = self.search_layer(q, ep, max(ef, k), 0)
        return [(idx, math.sqrt(d)) for d, idx in W[:k]]

Variants worth knowing

NSW (flat navigable small world). HNSW's single-layer ancestor — same greedy routing, no hierarchy. Simpler and lower-memory, but search degrades to polylogarithmic-with-a-bad-constant on large or clustered data because early hops are short.

IVF-HNSW. Use a small HNSW graph as the coarse quantizer that assigns a query to inverted-file cells, then scan those cells. Ships in FAISS; combines HNSW's routing with IVF's memory profile for billion-scale corpora.

HNSW + product quantization (HNSWPQ / "HNSW with SQ/PQ"). Store compressed codes instead of raw vectors to cut memory 10–50×, re-ranking the top candidates with full precision. The standard way to fit large indexes in RAM at a small recall cost.

DiskANN / Vamana. Microsoft's graph index designed for SSD, not RAM — a single flat graph with a different pruning rule (the Vamana algorithm) that keeps the graph navigable while spilling vectors to disk. Wins when the corpus is far larger than memory.

ScaNN. Google's anisotropic-quantization approach. Not a graph at all — it's a learned, partition-plus-quantize index that beats HNSW on some recall/latency points, especially with inner-product (maximum inner product search).

Common bugs and edge cases

Mixing distance metrics. If you trained embeddings for cosine similarity but build the index with L2, recall craters. Normalize vectors to unit length and L2 becomes monotonic with cosine, or just set the metric explicitly. This is the single most common production bug.
ef < k. efSearch must be at least k, or the beam can't hold k results. Many wrappers silently clamp it; if your top-k looks short or low-quality, raise ef.
Deleting nodes. Removing a vector — especially a high-layer hub — can disconnect the graph and strand whole regions. Use soft deletes (tombstones) and rebuild periodically; don't surgically unlink nodes from a live HNSW.
Forgetting the diversity heuristic. If you link each new node to its M absolute-nearest neighbors instead of running selectNeighbors, dense clusters form cliques with no exits, greedy routing dead-ends inside them, and recall silently drops on clustered data.
Unbounded degree. Bidirectional linking means a popular node accumulates back-links from everyone who chose it. Without re-pruning each neighbor back to M_max, hub nodes balloon, memory blows up, and per-hop cost rises.
Expecting determinism. Level assignment and tie-breaking are randomized, so two builds of the same data produce different graphs and slightly different approximate results. Seed the RNG if you need reproducible recall in tests.
Indexing before vectors fit in RAM. HNSW assumes random access to every vector during build and search. If the index spills to swap, latency goes from microseconds to milliseconds per hop — reach for DiskANN or PQ instead.

Frequently asked questions

Why is HNSW only approximate, not exact?