Optimal Transport

Wasserstein Distance

Earth Mover's Distance — the geometric cost of morphing one distribution into another

W₁(μ, ν) = inf ∫|x − y| dπ(x, y) over couplings — the minimum work to transform μ into ν. Foundation of WGAN, image generation, and optimal transport theory.

  • Primal definitionW_p(μ,ν)^p = inf_π ∫|x−y|^p dπ(x,y)
  • W₁ dualW₁(μ,ν) = sup_(f: 1-Lip) |∫f d(μ−ν)|
  • vs KL divergenceSymmetric, finite even with disjoint supports
  • Gaussian closed formW₂² = ‖μ₁−μ₂‖² + tr(Σ₁+Σ₂−2(Σ₁^(1/2)Σ₂Σ₁^(1/2))^(1/2))
  • Compute viaLinear program O((n+m)³); Sinkhorn O(nm)/iter
  • HistoryMonge 1781, Kantorovich 1942, Vasershtein 1969, WGAN 2017

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

Piles of dirt

Imagine two distributions as piles of dirt on a number line. One has 100 kg piled at position 0; the other has 100 kg piled at position 10. To turn the first pile into the second, you must move 100 kg through a distance of 10 — total work 1000 kg-meters.

More generally: μ is a probability distribution, ν is another. The Wasserstein distance W_p(μ, ν) is the minimum work to transport mass from μ to ν, where work for a mass moving distance d is d^p per unit mass. The infimum is taken over all transport plans:

W_p(μ, ν)^p = inf_π ∫ |x - y|^p dπ(x, y)

inf over joint distributions π with marginals μ and ν

A coupling π is any joint distribution on (x, y) whose marginals are μ and ν. Picking the optimal π is the optimal transport problem.

For 1D distributions, the optimal coupling is the quantile coupling: pair the u-th quantile of μ with the u-th quantile of ν. The Wasserstein distance reduces to integrating |F⁻¹_μ(u) − F⁻¹_ν(u)|^p over u ∈ [0, 1]. This makes 1D Wasserstein extremely fast — sort the samples, then compute.

Wasserstein vs KL — why it matters

The KL divergence is the standard way to compare distributions in statistics. But KL has three problems:

PropertyKL divergenceWasserstein
SymmetricNo — KL(μ‖ν) ≠ KL(ν‖μ)Yes — true metric
Disjoint supportsInfinityFinite (depends on geometry)
Triangle inequalityNoYes
Sensitive to geometryNo — ignores distancesYes — distance is the cost
Closed form for GaussiansYesYes (W₂)
ComputableClosed form when densities existLinear program / Sinkhorn

The third row makes Wasserstein indispensable for generative models. Early in GAN training, the generator distribution P_g and target P_data have nearly disjoint supports — KL is infinite, gives no useful gradient. Wasserstein is finite and decreases as P_g moves closer to P_data, providing a meaningful loss signal throughout training.

Kantorovich-Rubinstein duality

The infimum-over-couplings definition is hard to compute directly. The W₁ case has an equivalent dual representation:

W_1(μ, ν) = sup_{f: 1-Lipschitz} |∫ f dμ - ∫ f dν|

A function f is 1-Lipschitz if |f(x) − f(y)| ≤ |x − y| for all x, y. The supremum is over all such f.

This dual is striking: instead of computing optimal transport plans, you compute the worst-case 1-Lipschitz function distinguishing μ from ν. The maximizer f* is the "Kantorovich potential" — geometrically, it's the function pointing in the direction of mass flow.

WGAN — Wasserstein in deep learning

Wasserstein GAN (Arjovsky, Chintala, Bottou, 2017) replaced the JS divergence of vanilla GANs with W₁. The critic network f (replacing the discriminator) is trained to maximize:

L_critic = E_(x~P_data)[f(x)] - E_(x~P_g)[f(x)]    subject to f being 1-Lipschitz

This is exactly the Kantorovich-Rubinstein dual, and the optimum equals W₁(P_data, P_g). The generator G then minimizes the same quantity — pulling P_g toward P_data along the W₁ gradient.

Enforcing the 1-Lipschitz constraint requires care:

  • Weight clipping (original WGAN). Clip each weight to [−c, c]. Crude but effective; can cause vanishing gradients.
  • Gradient penalty (WGAN-GP, Gulrajani et al.). Add λ·E[(‖∇f(x̂)‖ − 1)²] to the loss, with x̂ a random interpolation between real and fake samples. Cleaner and more stable.
  • Spectral normalization (Miyato et al.). Normalize each weight matrix by its largest singular value. Strict bound, computationally cheap.

WGAN's training stability over vanilla GAN was a major step in generative modeling. It directly motivated diffusion models, normalizing flows, and continuous-time generative models — all of which lean on Wasserstein geometry.

Worked example — two discrete distributions

μ has mass at x = 0, 1, 2 with weights (1/3, 1/3, 1/3). ν has mass at y = 5, 6, 7 with weights (1/3, 1/3, 1/3). Find W₁.

By the 1D quantile coupling, pair x = 0 with y = 5, x = 1 with y = 6, x = 2 with y = 7. Each transfers 1/3 of mass through distance 5:

W_1 = (1/3)·5 + (1/3)·5 + (1/3)·5 = 5

The two distributions have identical shapes shifted by 5, so the Wasserstein-1 distance is exactly the shift. Makes intuitive sense: minimum effort to morph one into the other is to translate every grain of dirt 5 units.

For W₂ same example:

W_2^2 = (1/3)·25 + (1/3)·25 + (1/3)·25 = 25
W_2 = 5

For pure translations W₁ = W₂. They diverge when transport involves spreading or compressing mass.

Gaussian closed form (Bures-Wasserstein)

For Gaussian distributions N(μ₁, Σ₁) and N(μ₂, Σ₂):

W_2(N(μ_1, Σ_1), N(μ_2, Σ_2))^2 = ‖μ_1 - μ_2‖^2 + tr(Σ_1 + Σ_2 - 2(Σ_1^(1/2) Σ_2 Σ_1^(1/2))^(1/2))

This is the Bures-Wasserstein formula. It combines the mean displacement with a covariance term that's symmetric in Σ₁ and Σ₂.

Special cases:

  • Σ₁ = Σ₂ = Σ: W₂² = ‖μ₁ − μ₂‖² (pure translation).
  • μ₁ = μ₂ = 0, diagonal Σᵢ: W₂² = Σⱼ (σ_(1j) − σ_(2j))² (sum of standard deviation differences).
  • 1D Gaussians: W₂² = (μ₁ − μ₂)² + (σ₁ − σ₂)².

This closed form is used in covariance matrix interpolation, generative modeling with Gaussian priors, and shape analysis.

Sinkhorn — fast computation

For discrete distributions with n and m points, exact W_p is a linear program — solvable but slow for large n, m. Cuturi (2013) introduced the Sinkhorn algorithm, which adds an entropy regularizer to the transport plan:

min_π   <π, C>  -  (1/λ) H(π)        subject to marginals

This entropic problem has a unique solution computable by matrix scaling (Sinkhorn-Knopp iterations). Each iteration is O(nm) instead of O((n + m)³) for exact transport. The Sinkhorn distance S_λ converges to W_p as λ → ∞. With λ moderate, Sinkhorn is differentiable (used for end-to-end training of neural networks with transport loss).

For 2D image distributions with n = m = 10⁴ pixels, exact transport takes hours; Sinkhorn finishes in seconds. This is what made optimal transport practical at scale.

Where Wasserstein appears

  • Wasserstein GAN. Stable generative model training. Foundation for image synthesis and style transfer.
  • Diffusion models. Score-based generative models can be viewed as W₂ gradient flow toward the data distribution.
  • Domain adaptation. Align distributions across source and target domains by minimizing Wasserstein distance.
  • Word Mover's Distance. Document similarity using W₁ on bags of word embeddings (Kusner et al. 2015).
  • Distributional reinforcement learning. Algorithms like QR-DQN (Quantile Regression DQN) and IQN use Wasserstein-aware value-distribution updates.
  • Color transfer / image processing. Match color distributions between images via optimal transport.
  • Shape analysis. Compare anatomical shapes by treating them as point cloud distributions; W₂ gives Bures-Wasserstein-style mean shapes.
  • Climate science. Compare modeled vs observed weather distributions, satellite versus reanalysis fields.
  • PDE analysis. Heat equation, Fokker-Planck, porous medium equation are all gradient flows in W₂ (Jordan-Kinderlehrer-Otto 1998).
  • Topological data analysis. Comparing persistence diagrams uses a Wasserstein-like distance.

Python — discrete optimal transport

import numpy as np
from scipy.stats import wasserstein_distance
import ot   # Python Optimal Transport library

# 1D Wasserstein-1 via scipy (closed-form via sorted samples)
mu = np.random.normal(0, 1, 1000)
nu = np.random.normal(3, 1, 1000)
w1 = wasserstein_distance(mu, nu)
print(f'W_1 (scipy 1D) = {w1:.3f}')   # ≈ 3.0 (pure shift)

# Multi-dim: linear program via POT
# Two distributions in 2D
xs = np.random.randn(100, 2)
xt = np.random.randn(100, 2) + np.array([5.0, 0.0])
a = np.ones(100) / 100   # uniform weights
b = np.ones(100) / 100
M = ot.dist(xs, xt, metric='euclidean')   # 100×100 cost matrix
T = ot.emd(a, b, M)   # optimal transport plan
w1_exact = (T * M).sum()
print(f'W_1 (POT, exact) = {w1_exact:.3f}')

# Sinkhorn — much faster for large n
T_sinkhorn = ot.sinkhorn(a, b, M, reg=0.1)
w1_sinkhorn = (T_sinkhorn * M).sum()
print(f'Sinkhorn approximation = {w1_sinkhorn:.3f}')

# Gaussian closed form for W_2
def w2_gaussian(mu1, Sigma1, mu2, Sigma2):
    S1_sqrt = np.linalg.cholesky(Sigma1) if Sigma1.ndim > 0 else np.sqrt(Sigma1)
    inner = S1_sqrt @ Sigma2 @ S1_sqrt
    inner_sqrt = np.linalg.cholesky(inner + 1e-12 * np.eye(len(inner)))
    cov_term = np.trace(Sigma1 + Sigma2 - 2 * inner_sqrt)
    mean_term = np.sum((mu1 - mu2) ** 2)
    return np.sqrt(mean_term + cov_term)

w2 = w2_gaussian(np.zeros(2), np.eye(2), np.array([5., 0.]), np.eye(2))
print(f'W_2 (Gaussian, pure shift) = {w2:.3f}')   # ≈ 5.0

Common pitfalls

  • Confusing W_p with KL. W_p is symmetric and finite; KL is asymmetric and can be infinite. They measure different things — W_p is geometric, KL is informational.
  • Choosing the wrong cost function. Euclidean for spatial data, embedding-distance for text, Manhattan for histograms with grid structure. The cost shape changes the optimal transport entirely.
  • Forgetting normalization. Wasserstein assumes both inputs are probability measures (total mass = 1). For unnormalized signals, use unbalanced optimal transport or normalize first.
  • Sinkhorn regularization bias. Entropic Sinkhorn underestimates true Wasserstein for moderate λ. The difference vanishes as λ → ∞ but at λ = 0.1 it can be substantial.
  • Curse of dimensionality. Empirical W_p between n samples in d dimensions has error O(n^(−1/d)) — slow convergence in high d. Estimators like sliced Wasserstein and entropic Sinkhorn mitigate this.
  • Lipschitz violations in WGAN. Forgetting to enforce 1-Lipschitz on the critic (or using wrong λ in gradient penalty) breaks the dual formulation. The reported "Wasserstein loss" is no longer a true Wasserstein distance.

History

Gaspard Monge posed the original optimal transport problem in 1781 — "How to move a pile of earth to fill a hole using least work?" He couldn't solve it generally. Leonid Kantorovich relaxed the problem from deterministic maps to probability couplings (1942), proving existence and duality — this earned him the 1975 Nobel in economics for the closely related work on resource allocation. Leonid Vasershtein (1969) studied the metric properties of what later became known as the Wasserstein distance (the spelling "Wasserstein" is the German transliteration adopted in the Western literature). Cédric Villani's 2003 and 2009 books became the canonical references; Villani won the 2010 Fields Medal partly for this work. The deep learning revolution came with Wasserstein GAN (Arjovsky et al. 2017), which showed Wasserstein-1 was the right loss for generative modeling. Cuturi's Sinkhorn algorithm (2013) made large-scale computation practical. Marco Cuturi and Gabriel Peyré's 2019 book "Computational Optimal Transport" is the modern reference for ML practitioners.

Frequently asked questions

What is the Earth Mover's Distance intuitively?

Imagine two distributions as piles of dirt arranged on a 1D line. The Earth Mover's Distance is the minimum amount of work — mass times distance moved — needed to transform one pile into the other. For two equal-mass piles, you must move every grain from where it is in pile μ to where it should be in pile ν, and the optimal plan minimizes the total work. The Wasserstein-1 distance is this optimal cost. Wasserstein-p uses distance raised to the p-th power as the cost. The Earth Mover's name was coined by Rubner et al. (2000) in the image retrieval community.

Why is Wasserstein better than KL divergence?

KL divergence has three problems Wasserstein avoids. First, KL is asymmetric. Second, KL is infinite when μ and ν have disjoint supports. Third, KL ignores the geometry of the underlying space. Wasserstein is symmetric (a metric), is finite for any distributions with finite p-th moments regardless of support, and respects geometry (W(δ_x, δ_y) = |x − y|, the actual distance between point masses). This makes Wasserstein a natural choice for training generative models like WGAN, where the model and target distribution may have non-overlapping supports during early training.

What is the Kantorovich-Rubinstein duality?

Kantorovich-Rubinstein duality is the variational formula W_1(μ, ν) = sup over 1-Lipschitz functions f of |∫f d(μ − ν)|. Computing the infimum over couplings (the primal problem) is hard for general distributions. Computing the supremum over Lipschitz functions is more tractable — and that's exactly what Wasserstein GAN does. The discriminator (critic) in WGAN learns a 1-Lipschitz function approximating f*, the supremum maximizer, and reports back the Wasserstein distance estimate to drive the generator's update.

How is Wasserstein used in WGAN training?

Wasserstein GAN replaces the original GAN's JS divergence with the Wasserstein-1 distance between the generator distribution P_g and target P_data. The critic maximizes E_(x~P_data)[f(x)] − E_(x~P_g)[f(x)] subject to f being 1-Lipschitz — the dual form of W_1. Crucially, even when P_g and P_data are disjoint (early training), W_1 is finite and provides a meaningful gradient — unlike JS divergence which collapses to a constant. This eliminated mode collapse and finicky hyperparameter tuning. WGAN-GP enforces Lipschitz via gradient penalty.

What's the difference between W_1 and W_2?

W_p uses |x − y|^p as the cost. W_1 (linear cost) is the most common in machine learning — it has the Kantorovich-Rubinstein dual, is computable by linear programming, and is the foundation of WGAN. W_2 (quadratic cost) has rich geometric structure: gradient flow in W_2 space corresponds to physical PDEs. For Gaussian distributions, W_2² has the closed form ‖μ_1 − μ_2‖² + tr(Σ_1 + Σ_2 − 2(Σ_1^(1/2)Σ_2Σ_1^(1/2))^(1/2)) — the Bures-Wasserstein metric.

How do you compute Wasserstein distance numerically?

For discrete distributions with n and m points, W_p reduces to a linear program: minimize Σ_(i, j) c_(ij) π_(ij) subject to marginals. With c_(ij) = |x_i − y_j|^p, this is solvable in O((n + m)³) time by network simplex. For large problems, the Sinkhorn algorithm (Cuturi 2013) uses entropic regularization, solvable in O(nm) per iteration with fast convergence. For 1D distributions, W_p has a closed form: ∫₀¹ |F⁻¹_μ(u) − F⁻¹_ν(u)|^p du. Fast to compute via sorted samples.

What's the Word Mover's Distance?

Word Mover's Distance (WMD, Kusner et al. 2015) is W_1 applied to documents represented as distributions over word embeddings. Each document is a normalized bag of words; each word has a vector in some embedding space. The 'cost' to move mass from word w_1 to word w_2 is the Euclidean distance between their embeddings. WMD is then the optimal transport cost between two document distributions. Document similarity becomes geometric: similar documents have low WMD because mass moves short distances in embedding space.