Machine Learning

Word Embeddings (word2vec)

Where meaning becomes geometry — and arithmetic on words actually works

Word embeddings map every word to a dense vector — typically 100 to 300 dimensions — so that words with similar meanings sit close together, and the geometry itself encodes analogies: the vector for king − man + woman lands nearest to queen.

IntroducedMikolov et al., 2013
Typical dimensions100–300
Training cost / stepO(k) with negative sampling
Naive softmax costO(V) — millions
Similarity metriccosine

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The idea: you are the company you keep

For decades, computers treated words as atomic symbols. "cat" was one slot in a dictionary, "dog" was another, and to a machine they were exactly as related as "cat" and "thermodynamics" — which is to say, not at all. This is the one-hot representation: a vocabulary of V words becomes V orthogonal axes, every word a unit vector pointing down its own axis. Every pair of words is equidistant. There's no geometry of meaning, just a lookup table.

Word embeddings throw that away. The idea rests on the distributional hypothesis, stated by linguist J.R. Firth in 1957: "You shall know a word by the company it keeps." Words that appear in similar contexts — "the ___ barked", "I fed the ___" — probably mean similar things. So if we force a model to predict context from words (or words from context), the internal representation it learns to do that job has to place "dog" and "puppy" near each other, because they're interchangeable in text.

In 2013 Tomáš Mikolov and colleagues at Google published word2vec — a pair of shallow models (skip-gram and CBOW) that learn these vectors fast enough to train on billions of words on a single machine. The shocking result wasn't just that synonyms clustered. It was that relationships became directions. The displacement from "man" to "woman" turned out to be nearly the same vector as "king" to "queen", "uncle" to "aunt", "brother" to "sister". Meaning had become arithmetic.

How word2vec learns the vectors

Take the skip-gram model, the more popular of the two. You slide a window across the corpus. At each position, the center word is the input, and the words within a window of size c (say ±5) are the targets it must predict. The model has two matrices of learnable parameters:

Input embeddings W — a V × d matrix; row w is the d-dimensional vector for word w when it's the center word. These are the embeddings you ship.
Output embeddings W' — a second V × d matrix used to score context words. Usually discarded after training.

For a center word w and a context word o, the probability under the full softmax is

P(o | w) = exp(u_o · v_w) / Σ_{j=1..V} exp(u_j · v_w)

where v_w is the input vector of the center word and u_o the output vector of the context word. Maximize the log-probability of the real context across the corpus and you have your objective. The dot product u · v is high when two words tend to co-occur, so gradient descent pulls co-occurring words' vectors into alignment.

The catch is the denominator: that sum runs over every word in the vocabulary. With V = 1,000,000, every single training step does a million dot products. That's where negative sampling comes in — the trick that made word2vec practical.

Negative sampling: the speed trick

Instead of computing a normalized probability over the whole vocabulary, reframe the problem as binary classification. For each true (center, context) pair, draw k random "negative" words that probably aren't in the context. Train a logistic classifier to say "yes, this pair co-occurs" for the real pair and "no" for the k fakes. The per-pair loss becomes

L = −log σ(u_o · v_w) − Σ_{i=1..k} log σ(−u_{n_i} · v_w)

where σ is the sigmoid. Now each step touches only k + 1 output vectors instead of V. With k = 5–20, that's the difference between 20 dot products and 1,000,000 — a 50,000× reduction in arithmetic per step on a million-word vocabulary.

Two details that matter. Negatives are sampled from a smoothed unigram distribution, P(w) ∝ count(w)^0.75, which dampens the dominance of "the" and "of" while still drawing common words more often. And word2vec subsamples frequent words during training: a word with frequency f is dropped with probability roughly 1 − √(t/f) for threshold t ≈ 10⁻⁵, so "the" doesn't drown out the signal from rare, informative words.

When to reach for static embeddings

You need speed and a tiny footprint. A 300-dim, 1M-word table is ~1.2 GB in float32, but a lookup is a single array index — nanoseconds. No GPU, no forward pass.
Semantic similarity and clustering — "find me documents like this one", deduplication, recommendation by averaged word vectors.
Feature initialization for a downstream model on a small dataset, where pretrained vectors inject knowledge you couldn't learn from your own labels.
Interpretability experiments — analogy probing, bias auditing, exploring the geometry of a domain vocabulary.

Reach for contextual embeddings instead when word sense matters (polysemy, negation, long-range syntax) and you can afford a transformer forward pass. word2vec gives "bank" one vector; BERT gives it a different vector in "river bank" than in "central bank".

word2vec vs GloVe vs fastText vs contextual

	word2vec	GloVe	fastText	BERT / GPT (contextual)
Trained on	local windows (predictive)	global co-occurrence matrix	local windows + subwords	full sentences (self-attention)
Vector per word	one fixed	one fixed	one fixed (sum of n-grams)	context-dependent
Out-of-vocabulary words	fails — unknown token	fails — unknown token	handles via char n-grams	handles via subword tokenizer
Typical dimensions	100–300	50–300	100–300	768–4096
Training cost	hours on CPU	hours (matrix factorization)	hours on CPU	days–weeks on many GPUs
Inference cost	O(1) lookup	O(1) lookup	O(n-grams) lookup	full forward pass
Captures word sense	no	no	no	yes
Year / origin	2013, Google	2014, Stanford	2016, Facebook	2018+, Google/OpenAI

GloVe (Pennington, Socher, Manning, 2014) reaches a similar geometry from a different angle: factor the global word-word co-occurrence matrix so that v_i · v_j ≈ log(count of i near j). fastText extends word2vec by representing a word as the sum of its character n-grams, so it can embed words it never saw in training — useful for morphologically rich languages and typos.

What the numbers actually say

Negative sampling is the headline win. Full softmax: V = 1,000,000 dot products per step. Negative sampling with k = 15: 16 dot products. That's a ~62,500× reduction, and it's what let word2vec train on a 100-billion-word Google News corpus in under a day on a single multicore machine in 2013.
The analogy benchmark. On the original Google analogy set of ~19,500 questions, 300-dim skip-gram with negative sampling hit roughly 61% accuracy on semantic analogies (capital-of-country, currency, family relations) — a number that startled the field, because nobody had asked the model to learn analogies at all.
Memory. 1M words × 300 dims × 4 bytes = 1.2 GB. Quantize to int8 and it's 300 MB; product-quantize and it fits in tens of MB with a small recall hit.
Lookup vs forward pass. An embedding lookup is one memory fetch, ~100 ns. A BERT-base forward pass for one sentence is ~10 ms on CPU — five orders of magnitude more. That gap is exactly why static embeddings are still everywhere in latency-sensitive systems.

JavaScript: similarity and analogy queries

You rarely train word2vec in the browser, but querying a pretrained table is just linear algebra. Here's cosine similarity and the famous analogy solver:

// vectors: Map<string, Float32Array> of L2-unnormalized embeddings
function dot(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}
function norm(a) { return Math.sqrt(dot(a, a)); }

function cosine(a, b) {
  return dot(a, b) / (norm(a) * norm(b) + 1e-9);
}

// Most similar words to a query vector, excluding a set of words.
function nearest(vectors, query, exclude = new Set(), topK = 5) {
  const scored = [];
  for (const [word, vec] of vectors) {
    if (exclude.has(word)) continue;
    scored.push([word, cosine(query, vec)]);
  }
  scored.sort((a, b) => b[1] - a[1]);
  return scored.slice(0, topK);
}

// king - man + woman ≈ ?
function analogy(vectors, a, b, c, topK = 3) {
  const va = vectors.get(a), vb = vectors.get(b), vc = vectors.get(c);
  if (!va || !vb || !vc) throw new Error('word not in vocabulary');
  const target = new Float32Array(va.length);
  for (let i = 0; i < target.length; i++) target[i] = va[i] - vb[i] + vc[i];
  // exclude the three input words — they'd otherwise win trivially
  return nearest(vectors, target, new Set([a, b, c]), topK);
}

// analogy(vectors, 'king', 'man', 'woman')  →  [['queen', 0.78], ...]

Note the exclude set in analogy. Without it the nearest word to "king − man + woman" is almost always "king" itself, because the input still dominates — a classic gotcha when people first reproduce the demo and get a "broken" result.

Python: training skip-gram with negative sampling

A minimal-but-real training loop in NumPy. In production you'd use Gensim or PyTorch, but the math is exactly this:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30, 30)))

class SkipGramNS:
    def __init__(self, vocab_size, dim=300, k=15, lr=0.025, seed=0):
        rng = np.random.default_rng(seed)
        # input (center) and output (context) embedding tables
        self.W  = (rng.random((vocab_size, dim)) - 0.5) / dim   # the vectors you keep
        self.Wp = np.zeros((vocab_size, dim))                   # output vectors
        self.k, self.lr, self.V = k, lr, vocab_size

    def train_pair(self, center, context, neg_sampler):
        v = self.W[center]                       # (dim,)
        # one positive sample (label 1) + k negatives (label 0)
        targets = [(context, 1.0)]
        targets += [(neg_sampler(), 0.0) for _ in range(self.k)]
        grad_v = np.zeros_like(v)
        for word, label in targets:
            u = self.Wp[word]
            score = sigmoid(v @ u)
            g = self.lr * (label - score)        # gradient of log-loss
            grad_v += g * u
            self.Wp[word] += g * v               # update output vector
        self.W[center] += grad_v                 # update input vector

# neg_sampler draws from P(w) ∝ count(w) ** 0.75 (build an alias table once)
# Iterate train_pair over every (center, context) window in the corpus.

The whole model is two embedding tables and a sigmoid. There is no hidden layer, no nonlinearity beyond the sigmoid in the loss — which is why word2vec is often called a "shallow" model. The intelligence lives entirely in the geometry of W.

Variants worth knowing

CBOW (continuous bag-of-words). The mirror image of skip-gram: average the context vectors and predict the center word. Faster (one prediction per window instead of 2c), slightly better on frequent words, worse on rare ones. word2vec ships both; you pick with a flag.

GloVe. Skips the predictive framing entirely. Build the global co-occurrence count matrix once, then learn vectors so that v_i · v_j + b_i + b_j ≈ log X_ij, weighting rare and frequent pairs differently. Same geometry, different objective; trains by matrix factorization rather than streaming windows.

fastText. Represents each word as the sum of vectors for its character n-grams (e.g. "where" → <wh, whe, her, ere, re>). It can therefore embed out-of-vocabulary words and typos, and it shines on agglutinative languages where word2vec's whole-word vectors fragment the vocabulary.

Doc2vec / paragraph vectors. Add a per-document vector to the context so the model learns an embedding for an entire paragraph or document, not just words.

Contextual embeddings (ELMo, BERT, GPT). The successors. The vector for a word is computed by a deep network from the whole sentence, so word sense is resolved. They subsume static embeddings for accuracy but cost a forward pass per query rather than a table lookup.

Common bugs and edge cases

Forgetting to exclude the query words in analogies. "king − man + woman" returns "king" unless you blacklist the three inputs. The classic first-run "it's broken" bug.
Ranking by Euclidean distance instead of cosine. Frequent words get longer vectors; magnitude correlates with frequency, not meaning. Always normalize or use cosine, or "the" and "of" pollute every neighbor list.
Out-of-vocabulary crashes. word2vec and GloVe have no vector for a word they never saw. Guard the lookup, fall back to a subword model (fastText), or map unknowns to a learned <unk> token.
Treating analogies as exact. The arithmetic is approximate and resolved by nearest-neighbor search. "queen" is the closest word, not an exact landing point — and on harder relations accuracy drops to chance.
Inheriting bias. Vectors absorb the statistics of the training text, including stereotypes: "man : computer programmer :: woman : homemaker" famously fell out of Google News word2vec. If you ship these vectors downstream, you ship the bias too.
Comparing vectors from different runs. The axes are arbitrary and not aligned across training runs — "dimension 42" means nothing, and two separately trained tables aren't directly comparable without an alignment step.
Using the output matrix by mistake. word2vec learns two tables; the embeddings you want are the input matrix W. Some libraries average the two, but using W' alone gives noticeably worse similarity.

Frequently asked questions

Why does king − man + woman ≈ queen actually work?

Because word2vec is trained on co-occurrence, certain directions in the vector space come to encode consistent semantic contrasts. The offset from man to woman is roughly the same vector as the offset from king to queen — a "gender" direction. Subtracting man and adding woman shifts king along that direction, landing near queen. The relationship is approximate, and the answer is found by nearest-neighbor search, not an exact match.

What's the difference between skip-gram and CBOW?

Skip-gram predicts the surrounding context words from a single center word; CBOW (continuous bag-of-words) predicts the center word from the averaged context. Skip-gram trains slower but produces better vectors for rare words and small corpora; CBOW is faster and slightly better on frequent words. Mikolov's word2vec ships both.

What is negative sampling and why is it needed?

The naive softmax over the whole vocabulary costs O(V) per training step — millions of words means millions of dot products per word pair. Negative sampling replaces it with a binary classification: push the true (word, context) pair toward 1 and a handful of random "negative" pairs toward 0. With k = 5 to 20 negatives, each step costs O(k) instead of O(V), a tens-of-thousands-fold speedup on a 1-million-word vocabulary.

How many dimensions should a word embedding have?

Typically 100 to 300. Google's published word2vec vectors are 300-dimensional; GloVe ships 50, 100, 200, and 300-dimensional versions. More dimensions capture finer distinctions but cost more memory and risk overfitting on small corpora. Below ~50 the vectors lose too much nuance; above ~300 returns diminish sharply for most tasks.

How is word2vec different from the embeddings inside BERT or GPT?

word2vec gives every word one fixed vector regardless of sentence — "bank" (river) and "bank" (money) share a single point. Transformer models like BERT and GPT produce contextual embeddings: the vector for "bank" depends on the whole sentence, so the two senses separate. word2vec is static and cheap; contextual embeddings are dynamic and far more expensive to compute.

Why use cosine similarity instead of Euclidean distance?

Cosine similarity measures the angle between two vectors and ignores their length. In word2vec, frequent words tend to get longer vectors, so Euclidean distance would conflate "rare" with "far apart". The angle captures direction — i.e. meaning — independent of magnitude, which is why almost all embedding lookups rank by cosine.