Question 1

What is the Earth Mover's Distance intuitively?

Accepted Answer

Imagine two distributions as piles of dirt arranged on a 1D line (or a 2D plane, or any metric space). The Earth Mover's Distance is the minimum amount of work, measured as mass times distance moved and summed over all the dirt you shovel, needed to transform one pile into the other. For two equal-mass piles, every grain must move from where it sits in pile μ to where it belongs in pile ν, and the optimal plan minimizes the total work. The Wasserstein-1 distance W₁ is this optimal cost. Wasserstein-p charges distance raised to the p-th power per unit mass and takes the p-th root of the optimal total, so W₂ penalizes long moves much more heavily than W₁, whose cost grows linearly with distance. The Earth Mover's Distance name was popularized by Rubner et al. (2000) in the image retrieval community.
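As a concrete check of the piles-of-dirt picture, here is a minimal Python sketch for two piles on a shared 1D grid. It relies on the 1D fact that the minimum work equals the area between the two cumulative distribution functions. The function name w1_on_grid and the example piles are illustrative assumptions, not part of any library.

```python
from itertools import accumulate

def w1_on_grid(positions, mu, nu):
    """W1 between two histograms mu and nu on the same sorted 1D positions.

    In 1D the minimum work is the sum over grid gaps of
    |CDF_mu - CDF_nu| * gap_length, i.e. the mass that must cross each gap.
    """
    if abs(sum(mu) - sum(nu)) > 1e-12:
        raise ValueError("piles must have equal total mass")
    cdf_mu = list(accumulate(mu))
    cdf_nu = list(accumulate(nu))
    work = 0.0
    for i in range(len(positions) - 1):
        gap = positions[i + 1] - positions[i]
        work += abs(cdf_mu[i] - cdf_nu[i]) * gap
    return work

positions = [0.0, 1.0, 2.0, 3.0]
mu = [0.5, 0.5, 0.0, 0.0]   # dirt sits at x = 0 and x = 1
nu = [0.0, 0.0, 0.5, 0.5]   # holes to fill at x = 2 and x = 3

print(w1_on_grid(positions, mu, nu))  # 2.0
```

Here each half-unit of dirt travels a distance of 2, so the total work is 0.5 · 2 + 0.5 · 2 = 2.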

Question 2

Why is Wasserstein better than KL divergence?

Accepted Answer

KL divergence has three problems that Wasserstein avoids. First, KL is asymmetric: in general KL(μ‖ν) ≠ KL(ν‖μ). Second, KL(μ‖ν) is infinite whenever μ puts positive mass where ν puts none, so in particular it is infinite when μ and ν have disjoint supports, even if the distributions are 'close' in any geometric sense. Third, KL ignores the geometry of the underlying space: shifting a distribution by a small amount does not change KL by a correspondingly small amount. Wasserstein-p is symmetric and satisfies the triangle inequality (it is a true metric), is finite for any two distributions with finite p-th moments regardless of support, and respects geometry: W(δ_x, δ_y) = |x − y|, the actual distance between the point masses. This makes Wasserstein a natural choice for training generative models such as WGAN, where the model and target distributions may have non-overlapping supports early in training.
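The disjoint-support point can be seen numerically. The sketch below, with helper names chosen for this example only, compares KL and W₁ for a point mass that is shifted along a grid of unit-spaced bins: KL is infinite for every nonzero shift, while W₁ equals the size of the shift.

```python
import math

def kl(p, q):
    """KL(p || q) for histograms on the same bins; inf if q misses mass of p."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue            # 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf     # p has mass where q has none
        total += pi * math.log(pi / qi)
    return total

def w1_unit_grid(p, q):
    """W1 for histograms on bins spaced 1 apart (area between the CDFs)."""
    work, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p[:-1], q[:-1]):
        cdf_gap += pi - qi
        work += abs(cdf_gap)
    return work

def point_mass(index, n_bins=10):
    return [1.0 if i == index else 0.0 for i in range(n_bins)]

base = point_mass(0)
for shift in (1, 2, 5):
    moved = point_mass(shift)
    print(shift, kl(base, moved), w1_unit_grid(base, moved))
```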

Question 3

What is the Kantorovich-Rubinstein duality?

Accepted Answer

Kantorovich-Rubinstein duality is the variational formula W_1(μ, ν) = sup over 1-Lipschitz functions f of |∫f d(μ − ν)|. The primal problem defines W_1 as an infimum over couplings π of μ and ν of the expected distance ∫ d(x, y) dπ(x, y), and computing that infimum is hard for general distributions. The supremum over Lipschitz functions is more tractable because it can be approximated by optimizing over a parametric family of functions, and that is exactly what Wasserstein GAN does. The discriminator (critic) in WGAN learns an approximately 1-Lipschitz function approximating f*, the maximizer of the supremum, and the resulting estimate of the Wasserstein distance drives the generator's update. The 1-Lipschitz constraint is enforced by weight clipping (original WGAN, Arjovsky et al. 2017) or by a gradient penalty (WGAN-GP, Gulrajani et al. 2017). This is a large part of why WGAN training is more stable in cases where the original GAN diverged.
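To make the duality concrete, the following sketch evaluates the dual objective for a few hand-picked 1-Lipschitz functions on a small discrete example. The example distributions and candidate functions are assumptions made for illustration. Every candidate gives a lower bound on W_1, and one of them attains it.

```python
def dual_value(f, positions, mu, nu):
    """|integral of f d(mu - nu)| for discrete mu, nu on shared positions."""
    return abs(sum(f(x) * (m - n) for x, m, n in zip(positions, mu, nu)))

positions = [0.0, 1.0, 2.0, 3.0]
mu = [0.5, 0.5, 0.0, 0.0]
nu = [0.0, 0.0, 0.5, 0.5]

# Primal value for this example: move 0.5 units of mass from 0 to 2 and
# 0.5 units from 1 to 3, for a total work of 0.5 * 2 + 0.5 * 2 = 2.0.
primal_w1 = 2.0

# Hand-picked 1-Lipschitz candidates (slopes never exceed 1 in magnitude).
candidates = {
    "f(x) = x": lambda x: x,
    "f(x) = 0.5 * x": lambda x: 0.5 * x,
    "f(x) = |x - 1.5|": lambda x: abs(x - 1.5),
    "f(x) = min(x, 2)": lambda x: min(x, 2.0),
}

for name, f in candidates.items():
    print(name, dual_value(f, positions, mu, nu))
```

In this example f(x) = x reaches the primal value 2.0, while the other candidates fall below it, which is what the sup in the formula predicts.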

Question 4

How is Wasserstein used in WGAN training?

Accepted Answer

Wasserstein GAN replaces the JS divergence that the original GAN implicitly minimizes with the Wasserstein-1 distance W_1 between the generator distribution P_g and the target P_data. The critic (a neural network replacing the discriminator) maximizes E_(x~P_data)[f(x)] − E_(x~P_g)[f(x)] subject to f being 1-Lipschitz, which is the dual form of W_1. The generator minimizes the same quantity. Crucially, even when P_g and P_data have disjoint supports (as in early training), W_1 is finite and provides a meaningful gradient, unlike JS divergence, which saturates at a constant. In practice this reduces vanishing gradients and mode collapse and makes training less sensitive to hyperparameters, although it does not eliminate these problems. WGAN-GP enforces the Lipschitz constraint via a gradient penalty λ·E[(‖∇f(x̂)‖ − 1)²], where x̂ is sampled uniformly along straight lines between pairs of real and generated samples. WGAN-GP remains a widely used objective for GAN training.
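The critic objective with gradient penalty can be sketched without a deep learning framework by assuming a 1D linear critic f(x) = w·x, whose input gradient is w everywhere. This is a toy stand-in for a neural critic, and the function names and the data are hypothetical. A real implementation would use automatic differentiation to obtain ∇f(x̂).

```python
import random

def critic(x, w):
    """Hypothetical 1D linear critic f(x) = w * x (a stand-in for a network)."""
    return w * x

def critic_grad_x(x, w):
    """d f / d x for the linear critic; it equals w at every x."""
    return w

def wgan_gp_critic_loss(real, fake, w, lam=10.0, rng=None):
    """Return (critic loss, W1 estimate) for paired real and fake samples.

    loss = E[f(fake)] - E[f(real)] + lam * E[(|grad f(x_hat)| - 1)^2]
    """
    rng = rng or random.Random(0)
    mean_real = sum(critic(x, w) for x in real) / len(real)
    mean_fake = sum(critic(x, w) for x in fake) / len(fake)
    penalties = []
    for x_r, x_f in zip(real, fake):
        eps = rng.random()                      # uniform in [0, 1)
        x_hat = eps * x_r + (1.0 - eps) * x_f   # point between real and fake
        penalties.append((abs(critic_grad_x(x_hat, w)) - 1.0) ** 2)
    penalty = sum(penalties) / len(penalties)
    w1_estimate = mean_real - mean_fake
    return mean_fake - mean_real + lam * penalty, w1_estimate

real = [2.0, 3.0]
fake = [0.0, 1.0]

for w in (0.5, 1.0, 2.0):
    loss, estimate = wgan_gp_critic_loss(real, fake, w)
    print(f"w={w}: critic loss={loss:.2f}, W1 estimate={estimate:.2f}")
```

Among these three slopes the loss is lowest at w = 1, where the penalty vanishes and the estimate equals the true W_1 of 2 between the two samples. Steeper slopes inflate the estimate but are punished by the penalty term.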

Question 5

What's the difference between W_1 and W_2?

Accepted Answer

W_p uses |x − y|^p as the cost per unit mass and takes the p-th root of the optimal total. W_1 (linear cost) is the most common in machine learning: it has the Kantorovich-Rubinstein dual, is computable by linear programming for discrete distributions, and is the foundation of WGAN. W_2 (quadratic cost) has rich geometric structure: it makes the space of distributions into a Riemannian-like manifold. Gradient flows of suitable energy functionals in W_2 space correspond to physical PDEs such as the heat equation, the porous medium equation, and the Fokker-Planck equation (Jordan-Kinderlehrer-Otto 1998). McCann's displacement interpolation provides geodesics between distributions in W_2 space, used in fluid dynamics and shape interpolation. For Gaussian distributions N(μ_1, Σ_1) and N(μ_2, Σ_2), with means μ_1, μ_2 and covariances Σ_1, Σ_2, W_2² has the closed form ‖μ_1 − μ_2‖² + tr(Σ_1 + Σ_2 − 2(Σ_1^(1/2)Σ_2Σ_1^(1/2))^(1/2)). The covariance term defines the Bures-Wasserstein metric used in covariance comparison and matrix means.
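For Gaussians with diagonal covariances the closed form simplifies, because the covariances commute and the matrix square root in the trace term becomes an elementwise product of standard deviations. The sketch below covers only that special case. The function name is an assumption, and general covariances would need a matrix square root routine.

```python
import math

def w2_squared_diag_gaussians(mean1, std1, mean2, std2):
    """Squared W2 between N(mean1, diag(std1^2)) and N(mean2, diag(std2^2)).

    For diagonal (hence commuting) covariances the matrix square root in the
    trace term is diag(std1 * std2), so the trace reduces to
    sum_i (std1_i - std2_i)^2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mean1, mean2))
    cov_term = sum((s - t) ** 2 for s, t in zip(std1, std2))
    return mean_term + cov_term

w2_sq = w2_squared_diag_gaussians([0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [2.0, 2.0])
print(w2_sq, math.sqrt(w2_sq))  # squared distance is 9 + 16 + 1 + 0 = 26
```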

Question 6

How do you compute Wasserstein distance numerically?

Accepted Answer

For discrete distributions with n and m points, computing W_p reduces to a linear program: minimize Σ_(i, j) c_(ij) π_(ij) subject to row marginals = μ, column marginals = ν, π ≥ 0. With c_(ij) = |x_i − y_j|^p, this is the optimal transport linear program, whose optimal value is W_p^p; exact solvers such as the network simplex method take roughly cubic time in the number of points. For large problems, the Sinkhorn algorithm (Cuturi 2013) uses entropic regularization: solve min ⟨π, c⟩ − (1/λ)·H(π) over the same couplings, where H(π) = −Σ_(i, j) π_(ij) log π_(ij) is the entropy. This becomes a matrix scaling problem that costs O(nm) per iteration and usually converges quickly when the regularization is not too weak, which makes it practical on problems too large for exact LP solvers. The Python POT library implements both. For 1D distributions, W_p has a closed form: W_p(μ, ν)^p = ∫₀¹ |F⁻¹_μ(u) − F⁻¹_ν(u)|^p du, the integrated p-th power of differences between quantile functions. For two samples of equal size this reduces to sorting both samples and averaging |x_(i) − y_(i)|^p over the sorted pairs, which costs O(n log n).
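A minimal pure-Python sketch of both ideas follows: the sorted-sample formula for the exact 1D value, and Sinkhorn matrix scaling with the kernel K = exp(−λ·c), matching the convention in which larger λ means weaker regularization. Function names and the tiny example are assumptions for illustration. For real workloads a library such as POT, with vectorized and log-domain stabilized solvers, is the better choice.

```python
import math

def wp_p_sorted(xs, ys, p=1):
    """Exact W_p^p between equal-size 1D samples with uniform weights."""
    assert len(xs) == len(ys)
    return sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def sinkhorn_cost(a, b, cost, lam, n_iter=2000):
    """Entropic OT by matrix scaling; returns the transport cost <pi, cost>.

    Uses the convention min <pi, c> - (1/lam) * H(pi), so larger lam means
    weaker regularization. The kernel is K = exp(-lam * c).
    """
    n, m = len(a), len(b)
    K = [[math.exp(-lam * cost[i][j]) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # The plan is pi = diag(u) K diag(v); report its transport cost.
    return sum(u[i] * K[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))

xs, ys = [0.0, 1.0], [2.0, 3.0]
a = b = [0.5, 0.5]
cost = [[(x - y) ** 2 for y in ys] for x in xs]   # squared cost, so this is W2^2

exact = wp_p_sorted(xs, ys, p=2)
for lam in (2.0, 5.0):
    print(lam, sinkhorn_cost(a, b, cost, lam), "exact:", exact)
```

The regularized cost sits slightly above the exact value and approaches it as λ grows, which is the bias that entropic regularization trades for speed.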

Question 7

What's the Word Mover's Distance?

Accepted Answer

Word Mover's Distance (WMD, Kusner et al. 2015) is W_1 applied to documents represented as distributions over word embeddings. Each document is a normalized bag of words; each word has a vector in some embedding space (Word2Vec, GloVe, etc.). The 'cost' to move mass from word w_1 to word w_2 is the Euclidean distance between their embeddings. WMD is then the optimal transport cost between the two document distributions. Document similarity becomes geometric: similar documents have low WMD because mass moves short distances in embedding space. In the original paper's nearest-neighbor document classification experiments, WMD outperformed bag-of-words similarity baselines, and it has been reported to remain competitive with some neural similarity scores. The exact computation is expensive, roughly cubic (O(n³ log n)) in the number n of distinct words in the two documents; Relaxed WMD and Sinkhorn variants improve speed.
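A toy illustration of WMD follows. The 2D embeddings are invented for this example and do not come from Word2Vec or GloVe, and the brute-force matching only works for very short documents with the same number of equally weighted words. It is meant to show the geometry, not to be a practical implementation.

```python
import math
from itertools import permutations

# Hypothetical 2D "embeddings", invented for this example only.
EMBEDDING = {
    "obama": (0.0, 0.0), "president": (0.0, 1.0), "band": (0.0, 5.0),
    "speaks": (3.0, 0.0), "greets": (3.0, 1.0), "plays": (3.0, 5.0),
    "media": (6.0, 0.0), "press": (6.0, 1.0), "concert": (6.0, 5.0),
}

def wmd_equal_length(doc_a, doc_b):
    """WMD for two documents with the same number of equally weighted words.

    With uniform weights and equal sizes an optimal transport plan can be
    taken to be a one-to-one matching, so brute force over permutations is
    exact (and only feasible for a handful of words).
    """
    assert len(doc_a) == len(doc_b)
    best = math.inf
    for perm in permutations(doc_b):
        total = sum(math.dist(EMBEDDING[w], EMBEDDING[v])
                    for w, v in zip(doc_a, perm))
        best = min(best, total / len(doc_a))
    return best

doc1 = ["obama", "speaks", "media"]
doc2 = ["president", "greets", "press"]
doc3 = ["band", "plays", "concert"]

print(wmd_equal_length(doc1, doc2))  # related documents: small distance
print(wmd_equal_length(doc1, doc3))  # unrelated documents: larger distance
```

The two related documents share no words, so a bag-of-words cosine similarity would be zero, yet their WMD is small because each word has a close neighbor in the other document.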

Property	KL divergence	Wasserstein
Symmetric	No — KL(μ‖ν) ≠ KL(ν‖μ)	Yes — true metric
Disjoint supports	Infinite	Finite (depends on geometry)
Triangle inequality	No	Yes
Sensitive to geometry	No — ignores distances	Yes — distance is the cost
Closed form for Gaussians	Yes	Yes (W₂)
Computation	Sum or integral of the density ratio when densities are known	Linear program / Sinkhorn

Wasserstein Distance

Piles of dirt

Wasserstein vs KL — why it matters

Kantorovich-Rubinstein duality

WGAN — Wasserstein in deep learning

Worked example — two discrete distributions

Gaussian closed form (Bures-Wasserstein)

Sinkhorn — fast computation

Where Wasserstein appears

Python — discrete optimal transport

Common pitfalls

History

Frequently asked questions