Sampling Algorithms

Alias Method Sampling

O(1) per draw from any discrete distribution, after O(n) setup

Walker's alias method samples from a discrete distribution in O(1) time after O(n) preprocessing. Two arrays, one uniform draw, one biased coin — no binary search, no tree walks.

  • PreprocessingO(n)
  • Sample timeO(1)
  • SpaceO(n) (2 arrays)
  • Static distribution?Yes — rebuild on change
  • Numerically stableVose's variant
  • Best alternativeCDF + binary search O(log n)

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How the alias method works

Suppose you have a five-letter distribution: A with probability 0.05, B with 0.40, C with 0.10, D with 0.30, E with 0.15. The naive way to sample is to compute the cumulative distribution function, pick a uniform u ∈ [0, 1), and binary-search to find the right bucket. That's O(log n) per draw.

The alias method does it in two array lookups, regardless of n. The trick is the preprocessing: rearrange the probability mass so every outcome lives in a slot of equal height 1/n. Each slot can hold mass from at most two different outcomes — one "primary," one "alias." Once that structure exists, sampling is mechanical:

  1. Pick a uniform integer i ∈ [0, n) — that selects the slot.
  2. Pick a uniform fraction u ∈ [0, 1) and compare it to prob[i]: if u < prob[i], return outcome i; otherwise return alias[i].

Each slot has total probability 1/n, split between the primary and the alias. Picking the slot uniformly gives each slot 1/n of the total mass, and within the slot the biased coin allocates the right fraction to each outcome. Sum across slots and the marginals come out exactly equal to the original probabilities.

Trace: build the alias table for [0.05, 0.40, 0.10, 0.30, 0.15]

Step 1: scale each probability by n = 5 so the average is 1.

scaled = [0.25, 2.00, 0.50, 1.50, 0.75]    // A, B, C, D, E

Small (< 1):  A (0.25), C (0.50), E (0.75)
Large (≥ 1):  B (2.00), D (1.50)

Step 2: repeatedly pair one small with one large. The small keeps its own mass and stores the large as its alias; the large is reduced by what it donated.

Pair (A, B):  prob[A] = 0.25, alias[A] = B
              B → 2.00 − (1 − 0.25) = 1.25  (still large)

Pair (C, B):  prob[C] = 0.50, alias[C] = B
              B → 1.25 − (1 − 0.50) = 0.75  (becomes small!)

Pair (E, D):  prob[E] = 0.75, alias[E] = D
              D → 1.50 − (1 − 0.75) = 1.25  (still large)

Pair (B, D):  prob[B] = 0.75, alias[B] = D
              D → 1.25 − (1 − 0.75) = 1.00  (becomes done)

Final:        prob[D] = 1.00, alias[D] = D  (no alias needed)

Step 3: every slot now sums to exactly 1/n in probability terms:

slot   prob   alias
 A     0.25     B
 B     0.75     D
 C     0.50     B
 D     1.00     D    (self-alias, never used)
 E     0.75     D

Total work: 4 pair operations for n = 5 outcomes, each constant time. The pairing list shrinks by one per step, giving exactly n − 1 pairings overall.

Verifying the marginals

P(return A) = chance we picked slot A AND took the primary = (1/5) × 0.25 = 0.05

P(return B) = primary in B + alias in A + alias in C = (1/5)(0.75) + (1/5)(1 − 0.25) + (1/5)(1 − 0.50) = 0.15 + 0.15 + 0.10 = 0.40

P(return D) = primary in D + alias in B + alias in E = (1/5)(1.00) + (1/5)(1 − 0.75) + (1/5)(1 − 0.75) = 0.20 + 0.05 + 0.05 = 0.30

The construction is a bijection between (random slot, random coin flip) and (original outcome). No probability mass is invented or lost.

When to use the alias method

  • Many samples from a fixed distribution. The break-even point against CDF binary search lands around 50–100 samples per build for typical n = 1000. Past that, alias dominates.
  • Latency-sensitive hot loops. Two array lookups translate to roughly 4 ns including the branch. Binary search through a CDF of 1000 items takes ~40 ns — 10× slower per draw.
  • Large n. Word2vec's negative sampling table has ~10⁶ entries; alias gives O(1) per training step versus O(log₂ 10⁶) ≈ 20 comparisons.
  • Independent draws. If you need without-replacement sampling, the alias structure invalidates after each draw — use a reservoir or Fenwick tree instead.

If your distribution changes every few draws, the O(n) rebuild swamps the O(1) sample advantage. For genuinely dynamic distributions, a Fenwick tree over the CDF gives O(log n) update and O(log n) sample — a better fit.

Alias method vs alternatives

Alias methodCDF + binary searchLinear scanRejection samplingFenwick treeReservoir
PreprocessingO(n)O(n)NoneO(n)O(n)None
Per sampleO(1)O(log n)O(n)O(1) expectedO(log n)O(n)
Per updateO(n) rebuildO(n)O(1)O(1)O(log n)O(n)
SpaceO(n) × 2 arraysO(n)O(n)O(n)O(n)O(k)
Streaming-friendlyNoNoNoNoNoYes
Best forMany draws, staticFew drawsTiny nBounded acceptance rateDynamic weightsPass-once stream

What "constant time" means in nanoseconds

The alias sampler is two memory reads plus a comparison. On modern hardware with the table in L1 cache, each draw clocks in around 4–8 nanoseconds. Compare to CDF binary search with the same n:

nAlias per sampleCDF binary searchSpeedup
100~5 ns~25 ns (log₂≈7)
1,000~5 ns~40 ns (log₂≈10)
10,000~6 ns~60 ns (log₂≈13)10×
1,000,000~6 ns~100 ns (log₂≈20)17×
10,000,000~7 ns~140 ns (log₂≈23)20×

The alias method's per-sample time is essentially flat — log n doesn't matter. Binary search's grows as log n, and worse, the branch-mispredict rate climbs because each comparison's outcome is unpredictable. word2vec's authors switched to alias sampling specifically for this reason; the inner training loop sped up roughly 3× on a 70,000-word vocabulary.

JavaScript implementation

// Vose's alias method — O(n) construction, O(1) sampling.
class AliasSampler {
  constructor(probs) {
    const n = probs.length;
    this.n = n;
    this.prob = new Float64Array(n);
    this.alias = new Int32Array(n);

    const scaled = probs.map(p => p * n);
    const small = [], large = [];
    for (let i = 0; i < n; i++) {
      (scaled[i] < 1 ? small : large).push(i);
    }

    while (small.length && large.length) {
      const s = small.pop();
      const l = large.pop();
      this.prob[s] = scaled[s];
      this.alias[s] = l;
      scaled[l] = scaled[l] - (1 - scaled[s]);
      (scaled[l] < 1 ? small : large).push(l);
    }
    // Drain remainders (floating-point cleanup)
    while (large.length) { const l = large.pop(); this.prob[l] = 1; this.alias[l] = l; }
    while (small.length) { const s = small.pop(); this.prob[s] = 1; this.alias[s] = s; }
  }

  sample() {
    const i = Math.floor(Math.random() * this.n);
    return Math.random() < this.prob[i] ? i : this.alias[i];
  }
}

// Example: A=0.05, B=0.40, C=0.10, D=0.30, E=0.15
const sampler = new AliasSampler([0.05, 0.40, 0.10, 0.30, 0.15]);
const labels = ['A', 'B', 'C', 'D', 'E'];
const counts = new Array(5).fill(0);
for (let i = 0; i < 1_000_000; i++) counts[sampler.sample()]++;
console.log(counts.map((c, i) => `${labels[i]}: ${(c / 1e4).toFixed(2)}%`));
// → A: 5.00%, B: 39.96%, C: 10.04%, D: 29.98%, E: 15.02% (within ±0.1%)

Python implementation

import random
from collections import deque

class AliasSampler:
    def __init__(self, probs):
        n = len(probs)
        self.n = n
        self.prob = [0.0] * n
        self.alias = [0] * n
        scaled = [p * n for p in probs]
        small, large = deque(), deque()
        for i, s in enumerate(scaled):
            (small if s < 1 else large).append(i)
        while small and large:
            s = small.pop()
            l = large.pop()
            self.prob[s] = scaled[s]
            self.alias[s] = l
            scaled[l] -= (1 - scaled[s])
            (small if scaled[l] < 1 else large).append(l)
        for q in (large, small):
            while q:
                i = q.pop(); self.prob[i] = 1.0; self.alias[i] = i

    def sample(self):
        i = random.randrange(self.n)
        return i if random.random() < self.prob[i] else self.alias[i]

# Heavy duty in production:
import numpy as np
probs = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
samples = np.random.choice(len(probs), size=1_000_000, p=probs)
# numpy.random.choice uses CDF + searchsorted (O(log n) per sample)

Famous problem: word2vec negative sampling

word2vec trains word embeddings by maximizing the dot product between a target word and its context, while minimizing it against random "negative" words drawn from the corpus. The negative draws use a unigram distribution raised to the 3/4 power — typically over a vocabulary of 10⁵ to 10⁶ words.

def build_negative_sampler(word_counts, power=0.75):
    total = sum(c ** power for c in word_counts.values())
    probs = [c ** power / total for c in word_counts.values()]
    return AliasSampler(probs)

# In training: 5-20 negative samples per positive — millions of draws
neg_sampler = build_negative_sampler(corpus_counts)
for positive_word in stream:
    negatives = [neg_sampler.sample() for _ in range(5)]
    update_embeddings(positive_word, negatives)

With ~10⁶ training examples per second on modern GPUs, the alias method's O(1) draw saves roughly an order of magnitude over CDF binary search on a 70k-word vocabulary. The original word2vec paper used a 100-million-entry expanded table for the same effect; the alias method gets the same O(1) draw with 10⁻⁴ the memory.

Numerical stability with Vose's two-queue construction

Naive alias construction can produce prob[i] slightly > 1 or < 0 due to floating-point error, breaking sampling. Vose's two-queue variant guarantees correctness:

def vose_alias(probs):
    n = len(probs)
    prob = [0.0] * n
    alias = [0] * n
    scaled = [p * n for p in probs]
    # Two queues, not stacks — preserves FIFO order for stability
    small = deque(i for i, s in enumerate(scaled) if s < 1)
    large = deque(i for i, s in enumerate(scaled) if s >= 1)

    while small and large:
        s = small.popleft()    # popleft, not pop: queue, not stack
        l = large.popleft()
        prob[s] = scaled[s]
        alias[s] = l
        scaled[l] = (scaled[l] + scaled[s]) - 1   # rearranged for stability
        (small if scaled[l] < 1 else large).append(l)
    while large: i = large.popleft(); prob[i] = 1.0
    while small: i = small.popleft(); prob[i] = 1.0
    return prob, alias

The key fix is the rearrangement scaled[l] + scaled[s] − 1 instead of scaled[l] − (1 − scaled[s]). Mathematically identical; numerically more stable because it avoids subtracting two close-to-1 numbers.

Common bugs and edge cases

  • Sampling from an outdated table. The alias structure assumes static probabilities. If your application updates weights between samples (e.g., dynamic loot tables), you must rebuild — silently using a stale table produces biased draws.
  • Probabilities that don't sum to 1. The construction scales by n assuming the input sums to exactly 1. Floating-point drift can leave probabilities summing to 1 + ε or 1 − ε, producing tiny but real bias. Normalize the input first.
  • Single-outcome distribution. If n = 1, both queues are empty after the first scaling; ensure prob[0] = 1, alias[0] = 0 as a base case.
  • Confusing prob[i] with the original probability. prob[i] in the alias table is the fraction of slot i that stays as outcome i — not pi. Reading prob[i] as if it were the marginal probability is a classic newcomer trap.
  • Using cheap RNGs. Two random calls per draw means your RNG must be fast. Math.random() in V8 is fine; cryptographic RNGs (e.g., crypto.getRandomValues) are 100× slower and erase the alias method's speed advantage.

Frequently asked questions

Why is the alias method O(1) per sample?

After preprocessing, sampling is exactly two operations: pick a uniform integer i in [0, n) for the column, then read prob[i] and compare to a uniform fraction u. If u < prob[i], return i; else return alias[i]. Two array lookups, one comparison — constant time, no matter how many outcomes the distribution has.

How does Walker's construction redistribute probability?

Multiply each probability by n so the average is exactly 1. Split outcomes into "small" (height < 1) and "large" (height ≥ 1) lists. Pair a small with a large: the small keeps its own mass, the large donates the rest to fill the small's slot to 1, then the large is reduced by what it gave. Repeat until both lists are empty. Each pair is O(1), total O(n).

What's the difference between Walker's and Vose's alias methods?

Walker's original 1974 paper described the structure but used an O(n log n) construction. Michael Vose's 1991 refinement uses two queues of small and large indices, achieving guaranteed O(n) construction with numerical stability. Modern implementations almost always use Vose's variant — the structure is the same, only the preprocessing differs.

Why not just use a CDF and binary search?

CDF + binary search is O(log n) per sample, which is fine for one-off draws but matters when you sample millions of times. For a 10,000-outcome distribution, log₂(10000) ≈ 13 comparisons per sample versus alias's 2 array lookups — roughly a 6-8× speedup in tight loops. Particle systems, MCMC samplers, and game RNGs all care about this constant factor.

Does the alias method work with dynamic distributions?

No — any change to a probability requires O(n) rebuild of the alias table. For dynamic weights, use a Fenwick tree over CDF (O(log n) update and sample) or a weighted reservoir. The alias method assumes static distributions sampled many times, which fits most real-world cases.

How is the alias method used in real systems?

word2vec uses it for negative sampling — drawing vocabulary words from a unigram distribution millions of times during training. Particle systems in games sample emitters by intensity. MCMC libraries like PyMC use it under the hood for Gibbs steps. Loot tables in MMOs and weighted matchmaking systems use it for O(1) draws from thousands of options.