Normalizing Flows
A generative model that gives you the exact probability of a sample
A normalizing flow turns a simple distribution like a Gaussian into a complex one by pushing samples through a chain of invertible, differentiable layers, where the change-of-variables formula gives an exact, tractable log-likelihood.
- Likelihood: exact log p(x)
- Latent dimension: = data dimension
- Coupling-layer det cost: O(d)
- Naive det cost: O(d³)
- Invented: Tabak & Vanden-Eijnden 2010; Rezende & Mohamed 2015
The idea: reshape a Gaussian like clay
Suppose you can sample from a boring distribution — a unit Gaussian, a uniform cube — but you actually want samples from something complicated: the distribution of handwritten digits, of molecule conformations, of tomorrow's electricity prices. A normalizing flow says: keep the easy distribution, and learn a function that bends its samples into the shape you want.
Picture the Gaussian as a lump of clay sitting on a table. You squeeze, twist, and stretch it, but you never tear it and you never fold two pieces into the same spot. Every point on the final shape came from exactly one point on the original lump, and you can always trace it back. That "no tearing, no overlap, always reversible" property is what mathematicians call a diffeomorphism — a smooth, invertible map with a smooth inverse. A normalizing flow is a deep neural network constrained to be exactly that kind of map.
The word normalizing runs the map the other way: take a complicated data sample and flow it backward until it lands back inside the clean, "normalized" Gaussian. Flow is the forward direction, composing many small invertible steps. The payoff for accepting the invertibility straitjacket is enormous: unlike a VAE (which only optimizes a lower bound) or a GAN (which has no density at all), a flow gives you the exact probability of any data point, in closed form, by tracking how much the clay was stretched at each point.
The mechanism: change of variables and the log-det Jacobian
Let z be a latent vector drawn from a simple base density p_Z(z), usually N(0, I). Let x = f(z) where f is invertible, with inverse z = f⁻¹(x). The change-of-variables formula from multivariable calculus tells us exactly how the density transforms:
p_X(x) = p_Z(f⁻¹(x)) · |det J_{f⁻¹}(x)|
where J_{f⁻¹}(x) = ∂ f⁻¹ / ∂ x (the d×d Jacobian matrix)
The Jacobian determinant is the local volume change. Where the flow squeezes space (the map compresses a region), the density piles up; where it stretches space, the density thins out. Taking logs to keep the numbers stable gives the training objective:
log p_X(x) = log p_Z(z) + log |det (∂z/∂x)| with z = f⁻¹(x)
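As a quick sanity check of that formula, here is a minimal sketch assuming a one-dimensional affine map x = a·z + b. The names `a`, `b`, `log_normal`, and `flow_log_prob` are illustrative and not taken from any library. Pushing N(0, 1) through this map gives N(b, a²), so the flow log-density should agree with the closed-form Gaussian.

```python
import math

def log_normal(v, mean=0.0, std=1.0):
    """Log-density of a 1-D Gaussian N(mean, std^2) at v."""
    return -0.5 * ((v - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

def flow_log_prob(x, a, b):
    """log p_X(x) for x = a*z + b with z ~ N(0, 1), via change of variables."""
    z = (x - b) / a                    # z = f^{-1}(x)
    log_det_inv = -math.log(abs(a))    # log |dz/dx| = log |1/a|
    return log_normal(z) + log_det_inv

a, b, x = 2.0, 1.0, 0.3
flow_lp = flow_log_prob(x, a, b)
exact_lp = log_normal(x, mean=b, std=abs(a))   # closed form for N(b, a^2)
print(flow_lp, exact_lp)                        # the two values agree
```

The `-log|a|` term is the whole story in one dimension: stretching by a factor of 2 halves the density everywhere.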
Now stack K invertible layers, f = f_K ∘ … ∘ f_2 ∘ f_1. Determinants of composed maps multiply, so their logs add:
log p_X(x) = log p_Z(z_0) + Σ_{k=1}^{K} log |det (∂z_{k-1}/∂z_k)|
where z_0 = z is the latent, z_k = f_k(z_{k-1}) is the output of layer k, and z_K = x
Train by maximizing this log-likelihood directly with gradient ascent — there is no bound, no adversary, no sampling trick. Sample by drawing z ~ N(0, I) and running the layers forward.
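The additivity of log-determinants can be checked numerically. The sketch below assumes three hand-picked monotone 1-D layers (they stand in for learned layers and are purely illustrative). It sums the forward log-derivatives and compares the result with a finite-difference slope of the whole composed map. Note the sign: the density formula uses the inverse Jacobians, so the forward sum is subtracted.

```python
import math

# Each layer is (forward map, derivative); in 1-D, |det J| is just |f'(z)|.
layers = [
    (lambda z: 2.0 * z + 1.0,          lambda z: 2.0),
    (lambda z: z + 0.5 * math.tanh(z), lambda z: 1.0 + 0.5 / math.cosh(z) ** 2),
    (lambda z: math.sinh(z),           lambda z: math.cosh(z)),
]

def forward_with_log_det(z0):
    """Push z0 through every layer, summing log|f_k'(z_{k-1})| along the way."""
    z, total = z0, 0.0
    for f, df in layers:
        total += math.log(abs(df(z)))
        z = f(z)
    return z, total

def log_prob_of_sample(z0):
    """Return x = f(z0) and log p_X(x): base log-density minus forward log-dets."""
    x, log_det = forward_with_log_det(z0)
    log_pz = -0.5 * z0 * z0 - 0.5 * math.log(2 * math.pi)
    return x, log_pz - log_det

z0, eps = 0.4, 1e-6
_, log_det = forward_with_log_det(z0)
# Chain rule: the slope of the composed map equals exp(sum of per-layer log-dets).
slope = (forward_with_log_det(z0 + eps)[0] - forward_with_log_det(z0 - eps)[0]) / (2 * eps)
print(slope, math.exp(log_det))
```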
The catch is hiding in det J. For an arbitrary d×d matrix, computing a determinant costs O(d³) by LU decomposition. For a single 256×256 RGB image, d = 196,608, so a naive determinant is roughly 7.6 × 10¹⁵ operations per sample per layer — completely hopeless. Every successful flow architecture is, at heart, a trick to make that determinant cheap. The dominant trick is to force the Jacobian to be triangular, because the determinant of a triangular matrix is just the product of its diagonal — an O(d) computation.
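To see the shortcut in action, the sketch below compares a general-purpose determinant (Gaussian elimination, cubic cost) with the product of the diagonal on a small lower-triangular matrix, then reproduces the d³ arithmetic for a 256×256 RGB image. The helper names and the example matrix are illustrative assumptions.

```python
def det_gaussian_elimination(m):
    """Determinant by Gaussian elimination with partial pivoting: O(d^3)."""
    a = [row[:] for row in m]
    n, det = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                      # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def det_triangular(m):
    """Determinant of a triangular matrix: product of the diagonal, O(d)."""
    prod = 1.0
    for i in range(len(m)):
        prod *= m[i][i]
    return prod

lower = [[2.0, 0.0, 0.0],
         [5.0, 3.0, 0.0],
         [1.0, 4.0, 0.5]]
print(det_gaussian_elimination(lower), det_triangular(lower))  # both close to 3.0

d = 256 * 256 * 3                 # dimensions of a 256x256 RGB image
print(d, f"{d ** 3:.2e}")         # 196608 7.60e+15
```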
The coupling layer (RealNVP), step by step
The most influential trick is the affine coupling layer from RealNVP (Dinh, Sohl-Dickstein & Bengio, 2017). Split the d dimensions into two halves, x = (x_a, x_b). Pass the first half through untouched, and use it to compute a scale and shift for the second half:
Forward (z → x):
x_a = z_a # identity on first half
s, t = NN(z_a) # any network: deep, convolutional, whatever
x_b = z_b ⊙ exp(s) + t # affine transform of second half
Inverse (x → z):
z_a = x_a
s, t = NN(x_a) # SAME network, SAME inputs — recoverable!
z_b = (x_b − t) ⊙ exp(−s)
The inverse never needs to invert the neural network NN — it only ever runs NN forward, because the network reads only the half that passed through unchanged. That is the whole sleight of hand. The Jacobian is block-triangular:
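J = ∂x/∂z = [ ∂x_a/∂z_a  ∂x_a/∂z_b ; ∂x_b/∂z_a  ∂x_b/∂z_b ] = [ I  0 ; ∂x_b/∂z_a  diag(exp(s)) ]
(rows are separated by ";". The lower-left block, ∂x_b/∂z_a = diag(z_b ⊙ exp(s)) · ∂s/∂z_a + ∂t/∂z_a, is dense but never enters the determinant.)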
The determinant is the product of the diagonal of the lower-right block: det J = Π exp(s_i), so log|det J| = Σ s_i, a single sum over half the dimensions. One coupling layer only transforms half the variables, so flows alternate which half is frozen (a checkerboard or channel mask) and stack many layers. RealNVP relies on those fixed alternating masks; later architectures add a permutation between couplings, most notably the learned 1×1 invertible convolution that is the headline contribution of Glow (Kingma & Dhariwal, 2018), so that information mixes across all dimensions over depth.
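The claim that log|det J| is just the sum of the scales can be verified without trusting the algebra. This is a minimal sketch assuming a 4-D coupling with a hand-written toy `conditioner` in place of a real network (all function names are illustrative). It builds the Jacobian by finite differences, confirms that everything above the diagonal is zero, and compares the diagonal's log-product with Σ s_i.

```python
import math

def conditioner(za):
    """Toy stand-in for NN(z_a): returns (s, t), one entry per transformed dim."""
    s = [math.tanh(0.7 * za[0] - 0.2 * za[1]), math.tanh(0.3 * za[1])]
    t = [0.5 * za[0] * za[1], math.sin(za[0])]
    return s, t

def forward(z):
    """z -> x: dims 0-1 pass through, dims 2-3 get an affine transform."""
    za, zb = z[:2], z[2:]
    s, t = conditioner(za)
    xb = [zb[i] * math.exp(s[i]) + t[i] for i in range(2)]
    return za + xb, sum(s)

def inverse(x):
    """x -> z: the conditioner is only ever run forward, on the untouched half."""
    xa, xb = x[:2], x[2:]
    s, t = conditioner(xa)
    zb = [(xb[i] - t[i]) * math.exp(-s[i]) for i in range(2)]
    return xa + zb, -sum(s)

def numerical_jacobian(z, eps=1e-6):
    """Central finite differences: jac[i][j] = d x_i / d z_j."""
    n = len(z)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        zp, zm = list(z), list(z)
        zp[j] += eps
        zm[j] -= eps
        xp, _ = forward(zp)
        xm, _ = forward(zm)
        for i in range(n):
            jac[i][j] = (xp[i] - xm[i]) / (2 * eps)
    return jac

z = [0.3, -1.1, 0.8, 2.0]
x, log_det = forward(z)
z_back, _ = inverse(x)
jac = numerical_jacobian(z)
# The Jacobian is lower triangular, so its determinant is the diagonal product.
diag_log_det = sum(math.log(abs(jac[i][i])) for i in range(4))
print(log_det, diag_log_det)
```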
When a flow is the right tool
- You need the actual density, not just samples. Anomaly and out-of-distribution detection, importance sampling, and likelihood-ratio tests all require p(x). A flow gives it exactly; GANs and diffusion models do not give it cheaply.
- You need a flexible posterior in variational inference. The original 2015 paper used flows (planar and radial) to enrich the approximate posterior of a VAE, turning a crude diagonal Gaussian into an arbitrarily expressive distribution.
- Fast, high-quality audio. WaveGlow (trained directly by maximum likelihood) and Parallel WaveNet (an inverse-autoregressive flow distilled from a sequential autoregressive teacher) both generate raw audio in parallel — orders of magnitude faster than autoregressive WaveNet.
- Physics and chemistry sampling. Boltzmann generators use flows to draw equilibrium configurations of molecular systems, where the exact density lets you reweight to the true Boltzmann distribution.
- Density modeling on tabular or moderate-dimension data, where the equal-dimension latent constraint is not a problem.
Reach for something else when raw image fidelity is the only goal (diffusion models dominate), when you want lossy compression to a small latent code (a VAE bottleneck), or when you only need samples and never the density (a GAN is lighter).
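To make the importance-sampling and reweighting use cases from the list above concrete, here is a minimal sketch. It assumes a hypothetical already-trained 1-D flow x = A·z + B as the proposal and an unnormalized Gaussian as the target; both are stand-ins chosen so the answer is known. Because the flow reports its exact log q(x), each sample can be reweighted by target/proposal.

```python
import math
import random

random.seed(0)

A, B = 1.0, 0.5   # hypothetical "trained" flow x = A*z + B (an assumption)

def flow_sample_and_log_prob():
    """Draw x from the flow and return it with its exact log q(x)."""
    z = random.gauss(0.0, 1.0)
    x = A * z + B
    log_q = -0.5 * z * z - 0.5 * math.log(2 * math.pi) - math.log(abs(A))
    return x, log_q

def log_target_unnormalized(x):
    """Unnormalized log-density of the target: N(1, 0.5^2) without its constant."""
    return -0.5 * ((x - 1.0) / 0.5) ** 2

n = 20000
xs, log_ws = [], []
for _ in range(n):
    x, log_q = flow_sample_and_log_prob()
    xs.append(x)
    log_ws.append(log_target_unnormalized(x) - log_q)

# Self-normalized weights; subtracting the max keeps exp() from overflowing.
m = max(log_ws)
ws = [math.exp(lw - m) for lw in log_ws]
total = sum(ws)
mean_estimate = sum(w * x for w, x in zip(ws, xs)) / total   # estimates E_target[x] = 1
ess = total ** 2 / sum(w * w for w in ws)                    # effective sample size
print(mean_estimate, ess)
```

The effective sample size is a useful diagnostic here: the closer the flow is to the target, the closer `ess` gets to `n`.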
Flows vs other deep generative models
| | Normalizing flow | VAE | GAN | Diffusion | Autoregressive |
|---|---|---|---|---|---|
| Exact log-likelihood | Yes (closed form) | No (ELBO lower bound) | No (none) | No (variational/ELBO) | Yes (chain rule) |
| Latent dimension | = data dim (forced) | < data dim (bottleneck) | < data dim | = data dim | n/a |
| Sampling speed | 1 pass (coupling) / d passes (AR) | 1 pass | 1 pass | 10–1000 denoising steps | d sequential steps |
| Training stability | Stable (just MLE) | Stable | Unstable (minimax) | Stable | Stable |
| Image sample quality | Moderate | Blurry | Sharp | State of the art | Sharp but slow |
| Invertible by design | Yes | No | No | No | No |
The defining trade is the equal-dimension latent. It buys the exact likelihood and exact reversibility but forbids any compressing bottleneck, which is why flows lag diffusion on pure image quality yet remain a strong choice when you genuinely need p(x): coupling flows are the only entry in the table that pairs an exact density with single-pass sampling.
What the numbers actually say
- Determinant cost drops from O(d³) to O(d). For a 256×256 RGB image (d ≈ 196k), a triangular Jacobian replaces roughly 7.6 × 10¹⁵ operations with roughly 2 × 10⁵, a reduction by a factor of d² ≈ 4 × 10¹⁰ per layer.
- Glow on CelebA-HQ used ~200 million parameters across ~600 convolution layers (Kingma & Dhariwal, 2018) to reach competitive 256×256 face samples. That is far heavier than a comparable GAN, a direct consequence of the no-bottleneck constraint.
- Parallel WaveNet's flow generated 24 kHz audio at over 500,000 samples/second on a GPU — roughly 20× faster than real time and about 1,000× faster than the sequential autoregressive WaveNet it was distilled from.
- RealNVP reported ~3.49 bits/dim on CIFAR-10 (lower is better), versus ~3.35 for Glow — small log-likelihood gaps that nonetheless tracked visibly better samples.
- Continuous flows (FFJORD) estimate the trace of the Jacobian with a single random-vector Hutchinson probe, turning an O(d²) trace into one O(d) vector-Jacobian product, which is what makes free-form (non-triangular) Jacobians affordable at all.
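The Hutchinson estimator from the last bullet is easy to try on a small explicit matrix. This sketch assumes Rademacher (±1) probe vectors and a hand-picked 4×4 matrix standing in for the Jacobian ∂f/∂z; in FFJORD the product would come from automatic differentiation rather than from a stored matrix. A single probe is unbiased but noisy, so the example averages many probes to show convergence.

```python
import random

random.seed(0)

# Illustrative stand-in for a Jacobian; a real model never materializes it.
A = [[ 1.0,  0.5, -0.3,  0.2],
     [ 0.1, -2.0,  0.4,  0.0],
     [ 0.3,  0.2,  0.5, -0.1],
     [-0.2,  0.1,  0.3,  1.5]]
d = len(A)

def matvec(m, v):
    """Plain matrix-vector product."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def hutchinson_trace(m, n_probes):
    """Average v^T (m v) over Rademacher probes v; unbiased for tr(m)."""
    total = 0.0
    for _ in range(n_probes):
        v = [random.choice((-1.0, 1.0)) for _ in range(len(m))]
        mv = matvec(m, v)
        total += sum(vi * mvi for vi, mvi in zip(v, mv))
    return total / n_probes

exact = sum(A[i][i] for i in range(d))   # 1.0 - 2.0 + 0.5 + 1.5 = 1.0
estimate = hutchinson_trace(A, 20000)
print(exact, estimate)
```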
JavaScript implementation
A minimal 2-D affine coupling flow, enough to bend a Gaussian into a banana once its weights are trained. The "network" here is a single linear unit per output, with a tanh bounding the log-scale; in practice it would be a much deeper MLP.
// A single affine coupling layer over 2-D inputs.
// mask = 1 means "pass through and condition on"; mask = 0 means "transform".
class Coupling {
  constructor(mask) {
    this.mask = mask; // e.g. [1, 0] then [0, 1] in the next layer
    // Toy conditioner: maps the frozen coord to a (scale, shift) pair.
    this.w = [Math.random() * 0.1, Math.random() * 0.1];
    this.b = [0, 0];
  }

  // Returns [scale s, shift t] from the frozen dimension only.
  _st(x) {
    const cond = this.mask[0] ? x[0] : x[1]; // the untouched coord
    const s = Math.tanh(this.w[0] * cond + this.b[0]); // bounded log-scale
    const t = this.w[1] * cond + this.b[1];
    return [s, t];
  }

  // z -> x. Also returns log|det J| = the active scale.
  forward(z) {
    const [s, t] = this._st(z);
    const x = [...z];
    const i = this.mask[0] ? 1 : 0; // the transformed coord
    x[i] = z[i] * Math.exp(s) + t;
    return { x, logDet: s };
  }

  // x -> z. Same _st call: the network is never inverted.
  inverse(x) {
    const [s, t] = this._st(x);
    const z = [...x];
    const i = this.mask[0] ? 1 : 0;
    z[i] = (x[i] - t) * Math.exp(-s);
    return { z, logDet: -s };
  }
}

class Flow {
  constructor() {
    // Alternate which coordinate is frozen so both eventually transform.
    this.layers = [new Coupling([1, 0]), new Coupling([0, 1]),
                   new Coupling([1, 0]), new Coupling([0, 1])];
  }

  // Exact log-density of a data point under a standard-Gaussian base.
  logProb(x) {
    let z = x, logDet = 0;
    for (let k = this.layers.length - 1; k >= 0; k--) { // run layers in reverse
      const r = this.layers[k].inverse(z);
      z = r.z;
      logDet += r.logDet;
    }
    // log N(z; 0, I) for 2-D plus the accumulated volume change.
    const base = -0.5 * (z[0] * z[0] + z[1] * z[1]) - Math.log(2 * Math.PI);
    return base + logDet;
  }

  sample(z) { // pass z ~ N(0, I)
    let x = z;
    for (const layer of this.layers) x = layer.forward(x).x;
    return x;
  }
}

const flow = new Flow();
console.log(flow.logProb([0.3, -1.2])); // exact log p(x), no approximation
Two details matter. First, logProb runs the layers in reverse (data → latent) and accumulates the log-determinants, mirroring log p_X(x) = log p_Z(z) + Σ log|det ∂z/∂x|. Second, the inverse calls the exact same conditioner as the forward pass — the network is never inverted, which is the entire reason coupling layers are tractable.
PyTorch implementation
A trainable RealNVP-style coupling layer. The scale and shift come from one MLP whose output is split in two.
import math
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.dim = dim
        self.half = dim // 2  # size of the pass-through block
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),  # outputs s and t stacked
        )

    def _split(self, v):
        # Slice explicitly: v.chunk(2, dim=1) puts the larger piece first when
        # dim is odd, which would not match the network's input size.
        return v[:, :self.half], v[:, self.half:]

    def forward(self, z):  # z -> x, returns (x, log|det J|)
        za, zb = self._split(z)
        s, t = self.net(za).chunk(2, dim=1)
        s = torch.tanh(s)  # stabilize the log-scale
        xb = zb * torch.exp(s) + t
        x = torch.cat([za, xb], dim=1)
        return x, s.sum(dim=1)  # diagonal -> sum is the log-det

    def inverse(self, x):  # x -> z, returns (z, log|det J^-1|)
        xa, xb = self._split(x)
        s, t = self.net(xa).chunk(2, dim=1)
        s = torch.tanh(s)
        zb = (xb - t) * torch.exp(-s)
        z = torch.cat([xa, zb], dim=1)
        return z, -s.sum(dim=1)


class RealNVP(nn.Module):
    def __init__(self, dim, n_layers=8):
        super().__init__()
        self.dim = dim
        self.couplings = nn.ModuleList(AffineCoupling(dim) for _ in range(n_layers))
        # Fixed permutations mix the two halves between couplings. Registering
        # them as a buffer keeps them in the state_dict and on the model's device.
        self.register_buffer(
            "perms", torch.stack([torch.randperm(dim) for _ in range(n_layers)]))

    def log_prob(self, x):  # exact log-likelihood of data x
        log_det = torch.zeros(x.size(0), device=x.device)
        z = x
        for k in reversed(range(len(self.couplings))):  # last layer first
            z, ld = self.couplings[k].inverse(z)
            log_det += ld
            z = z[:, torch.argsort(self.perms[k])]  # undo the permutation
        base = -0.5 * (z ** 2).sum(1) - 0.5 * self.dim * math.log(2 * math.pi)
        return base + log_det

    @torch.no_grad()
    def sample(self, n):
        z = torch.randn(n, self.dim, device=self.perms.device)
        for k, coupling in enumerate(self.couplings):
            z = z[:, self.perms[k]]
            z, _ = coupling.forward(z)
        return z


# Training is plain maximum likelihood: minimize the negative log-likelihood.
model = RealNVP(dim=2, n_layers=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):
    # sample_target_batch is a placeholder for your data loader; it must
    # return a float tensor of shape (512, 2).
    x = sample_target_batch(512)
    loss = -model.log_prob(x).mean()  # NLL, that's the whole objective
    opt.zero_grad()
    loss.backward()
    opt.step()
Notice there is no reconstruction term, no KL term, and no discriminator: the loss is a single negative log-likelihood. The permutation between layers is essential. Every coupling passes the first half of its input through unchanged and only transforms the second half, so without a reshuffle the first half of the vector would never be transformed at all, and the second half would never serve as the conditioner.
Variants worth knowing
Planar and radial flows (Rezende & Mohamed, 2015). The original flows. A planar flow is f(z) = z + u·h(wᵀz + b), a single hidden unit applied as a perturbation. The matrix-determinant lemma makes its Jacobian determinant O(d). Per-layer expressiveness is weak, so they are mostly of historical and pedagogical interest.
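For a planar flow the Jacobian is I + h′(wᵀz + b)·u·wᵀ, and the matrix-determinant lemma collapses its determinant to 1 + h′(wᵀz + b)·uᵀw. The sketch below assumes h = tanh and arbitrary illustrative values for u, w, b, and z, and checks the lemma against a brute-force 3×3 determinant.

```python
import math

u = [0.5, -0.3, 0.8]
w = [1.0, 0.4, -0.6]
b = 0.1
z = [0.2, -1.0, 0.7]

a = sum(wi * zi for wi, zi in zip(w, z)) + b     # pre-activation w^T z + b
h_prime = 1.0 - math.tanh(a) ** 2                # derivative of tanh at a

# O(d): matrix-determinant lemma, det(I + h'(a) u w^T) = 1 + h'(a) u^T w.
det_lemma = 1.0 + h_prime * sum(ui * wi for ui, wi in zip(u, w))

# Brute-force reference: build the full Jacobian and expand the 3x3 determinant.
J = [[(1.0 if i == j else 0.0) + h_prime * u[i] * w[j] for j in range(3)]
     for i in range(3)]
det_full = (J[0][0] * (J[1][1] * J[2][2] - J[1][2] * J[2][1])
            - J[0][1] * (J[1][0] * J[2][2] - J[1][2] * J[2][0])
            + J[0][2] * (J[1][0] * J[2][1] - J[1][1] * J[2][0]))
print(det_lemma, det_full)
```

With tanh, the map stays invertible as long as uᵀw ≥ −1, which holds for the values chosen here (uᵀw = −0.1).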
Autoregressive flows — MAF and IAF. Masked Autoregressive Flow (MAF) makes density evaluation a single pass but sampling sequential (O(d)); Inverse Autoregressive Flow (IAF) flips that — fast parallel sampling, slow density. Choose based on whether your bottleneck is training or generation. Both have triangular Jacobians via causal masking, the same idea as autoregressive transformers.
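The asymmetry between the two directions shows up even in a toy version. This is a minimal sketch of an affine autoregressive transform in MAF's orientation, with a hand-written `conditioner` standing in for a masked network (an assumption for illustration). The density pass only reads known values of x, so its terms are independent; the sampling pass has to build x one coordinate at a time.

```python
import math

def conditioner(prefix):
    """Toy stand-in for a masked network: (s_i, t_i) from x_1..x_{i-1} only."""
    h = sum(prefix)
    return math.tanh(0.5 * h), 0.3 * h

def density_pass(x):
    """x -> z (MAF's fast direction). Every z_i depends only on known x,
    so all d terms could be computed in parallel; the loop is for clarity."""
    z, log_det = [], 0.0
    for i in range(len(x)):
        s, t = conditioner(x[:i])
        z.append((x[i] - t) * math.exp(-s))
        log_det -= s                  # log|det dz/dx| = -sum of s_i
    return z, log_det

def sampling_pass(z):
    """z -> x (MAF's slow direction). x_i needs x_1..x_{i-1} first,
    so the d steps are inherently sequential."""
    x = []
    for i in range(len(z)):
        s, t = conditioner(x)         # x currently holds x_1..x_{i-1}
        x.append(z[i] * math.exp(s) + t)
    return x

z = [0.5, -1.0, 0.25, 2.0]
x = sampling_pass(z)
z_back, log_det = density_pass(x)
print(x, z_back, log_det)
```

Swapping which direction is called "forward" gives the IAF orientation, with fast sampling and sequential density evaluation.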
Glow (2018). Replaced RealNVP's fixed permutation with a learned invertible 1×1 convolution and added activation normalization, scaling flows up to convincing high-resolution faces.
Neural spline flows (2019). Swap the affine transform for a monotonic rational-quadratic spline, giving each coupling far more expressive power per layer — fewer layers for the same fit.
Continuous normalizing flows / FFJORD. Define the flow by an ODE dz/dt = f(z, t) and integrate it. The log-density change becomes −∫ tr(∂f/∂z) dt, estimated with the Hutchinson trace trick. This lifts the triangular-Jacobian constraint and is the ancestor of today's flow-matching and rectified-flow training methods, which power modern diffusion-style image and video generators.
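A one-dimensional example shows the continuous change-of-variables rule at work. The sketch assumes the hand-picked dynamics dz/dt = −z³ in place of a neural network, and a plain Euler integrator in place of an adaptive ODE solver; both are simplifications for illustration. This ODE has a closed-form solution, so the integrated state and log-density change can be checked exactly.

```python
import math

def f(z, t):
    """Hypothetical 1-D dynamics dz/dt = -z^3 (stands in for a neural network)."""
    return -z ** 3

def trace_df_dz(z, t):
    """In 1-D the Jacobian trace is just df/dz = -3 z^2."""
    return -3.0 * z ** 2

def integrate(z0, T=1.0, n_steps=10000):
    """Euler-integrate z together with d(log p)/dt = -tr(df/dz)."""
    z, delta_log_p, dt = z0, 0.0, T / n_steps
    for k in range(n_steps):
        t = k * dt
        delta_log_p -= trace_df_dz(z, t) * dt
        z += f(z, t) * dt
    return z, delta_log_p

z0, T = 1.0, 1.0
z_T, delta_log_p = integrate(z0, T)

# Closed form for this ODE: z(T) = z0 / sqrt(1 + 2 z0^2 T),
# and log p changes by 1.5 * log(1 + 2 z0^2 T).
z_exact = z0 / math.sqrt(1.0 + 2.0 * z0 ** 2 * T)
delta_exact = 1.5 * math.log(1.0 + 2.0 * z0 ** 2 * T)
print(z_T, z_exact, delta_log_p, delta_exact)
```

The log-density rises along the trajectory because these dynamics contract space toward the origin.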
Common bugs and edge cases
- Forgetting to accumulate the log-determinant. If you only track the base density and drop the Σ s_i terms, the model trivially collapses to mapping everything to the origin. The volume term is what penalizes that collapse.
- Sign error between forward and inverse log-det. The inverse direction contributes −s, not +s. Mixing them up produces a loss that decreases while samples get worse.
- Not permuting between coupling layers. Without an alternating mask or permutation, half the dimensions are never transformed, so the flow can only model a distribution that is Gaussian in those coordinates.
- Unbounded log-scale. Letting s grow without a tanh or clamp causes exp(s) to overflow and the log-det to explode; training diverges in a handful of steps.
- Treating dequantization as optional. Continuous flows on discrete pixel data (0–255 integers) will assign infinite density to the integer grid unless you add uniform noise first ("dequantization"). Skipping it yields a meaningless, ever-improving log-likelihood.
- Expecting a compressed latent. The latent has the same dimension as the data by construction. If you wanted a small code for compression or interpolation, you wanted a VAE, not a flow.
- Numerical drift in the inverse. Forward then inverse should return the input to within float precision. A large round-trip error usually means a permutation index or a chunk-split mismatch between the two directions.
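The dequantization step from the list above, and the bits/dim metric used to report image likelihoods, fit in a few lines. This sketch assumes 8-bit pixels, uniform noise in [0, 1), and rescaling to the unit interval; the helper names are illustrative. The rescaling itself changes the density, so the conversion adds a d·log 256 correction.

```python
import math
import random

random.seed(0)

def dequantize(pixels):
    """Map integer pixels in {0..255} to continuous values in [0, 1)."""
    return [(p + random.random()) / 256.0 for p in pixels]

def bits_per_dim(log_prob_unit_scale, d):
    """Convert log p(x) in nats (x scaled to [0, 1)) to bits/dim on the 0-255 scale.

    Dividing by 256 multiplies the density by 256 per dimension, which is
    the d * log(256) correction applied below."""
    log_prob_pixel_scale = log_prob_unit_scale - d * math.log(256.0)
    return -log_prob_pixel_scale / (d * math.log(2.0))

pixels = [0, 17, 128, 255]
x = dequantize(pixels)

# Sanity check with a uniform density on [0, 1)^d, i.e. log p(x) = 0:
# a model that knows nothing should cost 8 bits per 8-bit pixel.
bpd_uniform = bits_per_dim(0.0, d=len(pixels))
print(x, bpd_uniform)
```

In a real pipeline the first argument would be the flow's log-likelihood for the dequantized, rescaled image.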
Frequently asked questions
Why do normalizing flows give an exact likelihood when VAEs and GANs don't?
A flow is an invertible map between a sample x and a latent z. The change-of-variables formula then equates p(x) to p(z) times the absolute determinant of the Jacobian of the inverse. Because the map is a bijection, no part of the probability mass is lost or approximated, so log p(x) is computed exactly. A VAE only has a lower bound (the ELBO), and a GAN has no explicit density at all.
Why must every flow layer be invertible?
Training maximizes log-likelihood, which requires mapping data x back to latent z (the inference, or normalizing, direction). Sampling requires mapping z forward to x (the generative direction). A single function can only serve both directions if it is a bijection, so each layer must be invertible. That constraint is the central design tension of every flow architecture.
What makes computing the Jacobian determinant cheap in RealNVP?
An affine coupling layer leaves half the dimensions unchanged and scales/shifts the other half using only the untouched half as input. Its Jacobian is triangular, so the determinant is just the product of the diagonal, and the log-determinant is the sum of the scale outputs. That turns an O(d³) determinant into an O(d) sum, and the scale/shift networks can be arbitrarily deep without affecting that cost.
What is the difference between a coupling flow and an autoregressive flow?
Both have triangular Jacobians, but an autoregressive flow conditions each dimension on all previous dimensions, so one direction (either sampling or density evaluation) is inherently sequential and costs O(d) passes. A coupling flow splits dimensions into two blocks updated in parallel, so both directions are a single network pass — faster, but each layer transforms fewer dimensions, so you need more layers.
Why are normalizing flows often worse than diffusion models at image quality?
Invertibility forces the latent space to have exactly the same dimension as the data: you cannot compress the 196,608 values of a 256×256 RGB image into a small bottleneck the way a VAE, or a latent diffusion model built on one, can. Maintaining a smooth bijection across that full dimension is hard, so flows historically produced blurrier samples and need far more parameters per unit of sample quality.
What are continuous normalizing flows?
Instead of stacking discrete layers, a continuous flow defines the transformation by an ordinary differential equation dz/dt = f(z, t) and integrates it with an ODE solver. The log-density change becomes an integral of the trace of the Jacobian, estimated cheaply with the Hutchinson trace estimator. This is the FFJORD model, and it underpins the flow-matching training used in modern image and audio generators.