Variational Autoencoder (VAE)
The autoencoder that learns a latent space you can sample from
A variational autoencoder (VAE) is a generative neural network that encodes data into a probability distribution over a continuous latent space, then decodes samples from it — trained by maximizing the evidence lower bound (ELBO) so the latent space stays smooth enough to generate new data.
- Introduced: Kingma & Welling, 2013
- Loss: −ELBO = recon + KL
- Latent prior: N(0, I)
- Key trick: z = μ + σ·ε
- Sampling cost: 1 forward pass
How a VAE works
A regular autoencoder squeezes an image down to a few numbers — a code — and then expands the code back into the image. Train it on 60,000 handwritten digits and it learns to compress each one into, say, 16 numbers and reconstruct it. Useful for compression and denoising. Useless for generation: pick 16 random numbers, decode them, and you get noise. The encoder only ever learned to place real digits at scattered points, and it has no idea what lives in the gaps between them.
The variational autoencoder, introduced by Diederik Kingma and Max Welling in their December 2013 paper Auto-Encoding Variational Bayes, fixes the gaps. Instead of mapping an input to a single point, the encoder maps it to a probability distribution — a small Gaussian blob with a mean μ and a standard deviation σ. To decode, you sample a point from that blob and feed it to the decoder. Because the same input now lands on a fuzzy cloud of nearby points (all of which must decode to roughly the same digit), the decoder is forced to make the region around each code meaningful too. Fill enough of the space with overlapping blobs and you get a smooth, continuous latent space where every point decodes to something plausible.
That continuity is the whole payoff. Once the space is smooth you can do three things a plain autoencoder can't: sample a fresh point from the prior and generate a brand-new digit; interpolate along a line between two codes and watch one digit morph into another; and do arithmetic in latent space ("smiling face" − "neutral face" + "man" ≈ "smiling man").
The mechanism: ELBO, KL, and the reparameterization trick
A VAE is a probabilistic model. It assumes data x is generated from a hidden latent variable z drawn from a simple prior p(z) = N(0, I), then passed through a decoder p(x|z). We'd love to maximize the data likelihood p(x) = ∫ p(x|z)p(z) dz, but that integral is intractable. So we introduce an encoder q(z|x) that approximates the true posterior, and optimize a tractable lower bound on log p(x) instead — the Evidence Lower Bound (ELBO):
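log p(x) ≥ ELBO(x) = E_{q(z|x)}[ log p(x|z) ] − KL( q(z|x) ‖ p(z) )
The first term, E_{q(z|x)}[ log p(x|z) ], is the reconstruction term; the second, KL( q(z|x) ‖ p(z) ), is the regularizer. For a diagonal Gaussian encoder the KL has a closed form, summed over the latent dimensions j:
KL( N(μ, σ²) ‖ N(0, I) ) = −½ Σ_j ( 1 + log σ²_j − μ²_j − σ²_j )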
The two terms pull against each other. The reconstruction term wants the decoded output to match the input, which encourages the encoder to spread codes far apart so they're easy to tell apart. The KL term wants every encoded blob to look like the standard normal prior — centered at the origin with unit variance — which packs the codes together and overlapping. The equilibrium between "spread out to reconstruct" and "pack together to match the prior" is exactly what produces a useful, gap-free latent space.
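As a sanity check on the closed-form KL, the sketch below compares it with a Monte Carlo estimate of E_q[log q(z) − log p(z)] for a diagonal Gaussian. It uses plain Python only; the function names and the example values of `mu` and `log_var` are illustrative assumptions, not taken from any library.

```python
import math
import random

def kl_closed_form(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return sum(-0.5 * (1.0 + lv - m * m - math.exp(lv))
               for m, lv in zip(mu, log_var))

def kl_monte_carlo(mu, log_var, n_samples=50_000, seed=0):
    """Estimate the same KL as the mean of log q(z) - log p(z) with z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        for m, lv in zip(mu, log_var):
            var = math.exp(lv)
            z = m + math.sqrt(var) * rng.gauss(0.0, 1.0)
            log_q = -0.5 * (math.log(2 * math.pi) + lv + (z - m) ** 2 / var)
            log_p = -0.5 * (math.log(2 * math.pi) + z * z)
            total += log_q - log_p
    return total / n_samples

mu = [0.5, -1.0]        # illustrative encoder means
log_var = [0.0, -1.0]   # illustrative encoder log-variances

exact = kl_closed_form(mu, log_var)
approx = kl_monte_carlo(mu, log_var)
at_prior = kl_closed_form([0.0, 0.0], [0.0, 0.0])  # q equals the prior

print(f"closed form: {exact:.4f}, Monte Carlo: {approx:.4f}, at prior: {at_prior:.4f}")
```

The two estimates should agree to within sampling noise, and the KL is exactly zero when the encoder outputs μ = 0 and log σ² = 0, which is the point the regularizer pulls every blob toward.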
The catch: training needs gradients, and you can't backpropagate through a random sampling step. The reparameterization trick solves this. Instead of sampling z ~ N(μ, σ²) directly, write it as a deterministic function of a fixed-distribution noise variable:
ε ~ N(0, 1) # randomness lives here, no gradient needed
z = μ + σ · ε # deterministic in μ and σ — gradients flow through
Now μ and σ are ordinary network outputs and the gradients ∂z/∂μ = 1 and ∂z/∂σ = ε are well-defined. The randomness has been pushed out of the path the gradient travels. In practice the network outputs log σ² (the log-variance) rather than σ, because that is unconstrained and numerically stable: a variance must be positive, but its log can be any real number.
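To see the trick produce usable gradients, here is a small sketch that estimates the gradient of E[z²] with respect to μ and σ by averaging pathwise derivatives. The test function f(z) = z² is an assumption chosen only because the true answer is known: E[z²] = μ² + σ², so the gradients are 2μ and 2σ.

```python
import random

def reparam_gradients(mu, sigma, n_samples=50_000, seed=0):
    """Pathwise gradient estimates of E[f(z)] for f(z) = z**2, z = mu + sigma*eps."""
    rng = random.Random(seed)
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # randomness lives here
        z = mu + sigma * eps        # deterministic in mu and sigma
        df_dz = 2.0 * z             # derivative of f(z) = z**2
        g_mu += df_dz * 1.0         # chain rule with dz/dmu = 1
        g_sigma += df_dz * eps      # chain rule with dz/dsigma = eps
    return g_mu / n_samples, g_sigma / n_samples

grad_mu, grad_sigma = reparam_gradients(mu=1.5, sigma=0.8)
print(f"d/dmu ~ {grad_mu:.3f} (exact 3.0), d/dsigma ~ {grad_sigma:.3f} (exact 1.6)")
```

An autodiff framework performs the same chain rule automatically once z is written as μ + σ·ε.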
Cost and complexity
A VAE's compute cost is dominated by the encoder and decoder forward/backward passes, exactly like any other feed-forward network — there is no expensive iterative inference at training or generation time, which is the headline advantage over older Bayesian generative models.
- Generation: a single forward pass. Sample z ~ N(0, I) (microseconds) and run the decoder once. For a small MNIST decoder that's well under a millisecond on a GPU; contrast with a diffusion model that needs 50–1000 sequential denoising steps to produce one image.
- The reparameterized loss is unbiased with a single sample. You don't need a Monte-Carlo average of many z draws per input: one ε per example per step gives an unbiased gradient estimate, so the per-step cost is the same as a plain autoencoder.
- The KL term is closed-form. For diagonal Gaussians it's the formula above: a cheap elementwise sum over the latent dimensions, no sampling required for the regularizer.
- Latent dimension is a knob, not a cost driver. MNIST works with 2–20 latent dims; a 256×256 face VAE might use 128–512. The decoder, not the latent size, dominates FLOPs.
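A rough way to check the last point is to count parameters for a small fully-connected VAE at two latent sizes. The layer sizes below (784 inputs, 400 hidden units) are assumptions for a typical MNIST-scale model; convolutional architectures will give different numbers.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out

def vae_param_count(in_dim=784, hidden=400, latent=20):
    """Parameter count of a one-hidden-layer MLP encoder and decoder."""
    # Encoder: input -> hidden, then two heads (mu and log-variance)
    encoder = linear_params(in_dim, hidden) + 2 * linear_params(hidden, latent)
    # Decoder: latent -> hidden -> output
    decoder = linear_params(latent, hidden) + linear_params(hidden, in_dim)
    return encoder + decoder

small = vae_param_count(latent=2)
large = vae_param_count(latent=20)
relative_increase = (large - small) / small

print(small, large, f"{100 * relative_increase:.1f}% more parameters")
```

In this sketch a tenfold larger latent adds only a few percent to the parameter count, because the layers touching the 784-dimensional input and output dominate.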
When to choose a VAE
- You need a structured, navigable latent space — interpolation, attribute arithmetic, smooth morphing, or a continuous representation for downstream control. This is where VAEs beat GANs.
- You need a likelihood or anomaly score. The ELBO gives a (lower bound on) log p(x), so low-ELBO inputs flag out-of-distribution data, which is useful for anomaly detection.
- You need fast, single-shot generation and can tolerate softer samples: drug-molecule generation, latent representations for reinforcement learning, fast prototyping.
- Stable, mode-covering training. VAEs optimize a single likelihood objective and almost never suffer the training collapse or mode-dropping that plagues GANs.
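For the anomaly-score use case, one simple recipe is to compute the negative ELBO on held-out normal data, take a high quantile as the threshold, and flag anything above it. The sketch below assumes the scores have already been computed by a trained model; the numbers and the 0.9 quantile are hypothetical.

```python
def fit_threshold(normal_scores, quantile=0.95):
    """Return the score at the given quantile of in-distribution data."""
    ordered = sorted(normal_scores)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]

def flag_anomalies(scores, threshold):
    """True where the negative ELBO exceeds the threshold (low ELBO)."""
    return [score > threshold for score in scores]

# Hypothetical per-example negative ELBO values (nats) on held-out normal data
validation_scores = [92.1, 95.4, 97.0, 98.3, 99.9, 101.2, 103.5, 104.8, 106.0, 110.7]
threshold = fit_threshold(validation_scores, quantile=0.9)

# Hypothetical scores for three new inputs
flags = flag_anomalies([96.0, 150.3, 100.0], threshold)
print(threshold, flags)
```

Because the ELBO is only a lower bound on log p(x), the threshold is best treated as an empirical cutoff rather than a calibrated probability.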
If your only goal is photorealistic sharpness, reach for a GAN or a diffusion model. If you need both crisp output and a latent space, the modern answer is a latent diffusion model — which literally runs a diffusion process inside a VAE's latent space (that's the architecture behind Stable Diffusion).
VAE vs other generative models
| | VAE | Plain autoencoder | GAN | Diffusion model | Normalizing flow |
|---|---|---|---|---|---|
| Latent space | Smooth, continuous, samplable | Holes between codes | Samplable but tangled | Noise schedule, not a code | Smooth, invertible |
| Training objective | Maximize ELBO | Reconstruction only | Adversarial minimax | Denoising score matching | Exact log-likelihood |
| Exact likelihood? | Lower bound only | No | No | Lower bound (variational) | Yes, exact |
| Sample quality | Soft / blurry | N/A (no sampling) | Sharp | State of the art | Good |
| Generation cost | 1 forward pass | N/A | 1 forward pass | 50–1000 steps | 1 pass (but constrained arch) |
| Training stability | Very stable | Very stable | Fragile (collapse, mode drop) | Stable | Stable |
| Interpolation | Smooth & meaningful | Decodes garbage | Possible, less reliable | Not native | Smooth |
The defining trade-off: a VAE explicitly models the latent distribution and pays for it with softer samples; a GAN ignores the latent distribution and buys sharpness with a fragile training loop. Diffusion models won the photorealism race but at 50–1000× the per-sample cost, which is precisely why latent diffusion puts the diffusion process inside a VAE — to shrink the data first.
What the numbers actually say
- Original results, 2013. Kingma & Welling reported a marginal log-likelihood around −96 nats on binarized MNIST with a modest fully-connected VAE — competitive with the best methods of the day, at a fraction of the inference cost.
- The blur is averaging, not bad training. A Gaussian likelihood decoder is an L2 loss. When a code is ambiguous between, say, a 4 and a 9, L2 is minimized by outputting the pixel-wise average of both — a fuzzy ghost. This is a property of the objective, not a bug.
- β-VAE: one number tunes disentanglement. Multiply the KL term by β > 1 and the latent dimensions specialize — one for rotation, one for width, one for thickness. Higgins et al.'s 2017 β-VAE paper showed β ≈ 4 gives clean disentanglement on dSprites at the cost of reconstruction fidelity.
- Posterior collapse is measurable. Track per-dimension KL; a dimension whose KL stays near 0 is "dead." On strong autoregressive decoders, the majority of latent dims can collapse without intervention.
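Tracking per-dimension KL can be sketched in a few lines: average the closed-form KL over a batch separately for each latent dimension and list the ones that sit near zero. The encoder outputs and the 0.01 cutoff below are illustrative assumptions, not recommended values.

```python
import math

def per_dim_kl(mu_batch, log_var_batch):
    """Batch-averaged KL to N(0, 1), reported separately per latent dimension."""
    n = len(mu_batch)
    dims = len(mu_batch[0])
    kl = [0.0] * dims
    for mu, log_var in zip(mu_batch, log_var_batch):
        for j in range(dims):
            kl[j] += -0.5 * (1.0 + log_var[j] - mu[j] ** 2 - math.exp(log_var[j]))
    return [value / n for value in kl]

def dead_dimensions(kl_per_dim, threshold=0.01):
    """Indices of dimensions whose average KL is below the (assumed) cutoff."""
    return [j for j, value in enumerate(kl_per_dim) if value < threshold]

# Illustrative encoder outputs: 3 inputs, 3 latent dims; dim 1 matches the prior
mu_batch = [[1.2, 0.0, -0.8], [-0.9, 0.0, 0.5], [0.4, 0.0, 1.1]]
log_var_batch = [[-2.0, 0.0, -1.0], [-1.5, 0.0, -1.2], [-1.8, 0.0, -0.9]]

kl = per_dim_kl(mu_batch, log_var_batch)
dead = dead_dimensions(kl)
print([round(v, 3) for v in kl], "dead dims:", dead)
```

A dimension that always outputs μ = 0 and log σ² = 0 contributes zero KL and carries no information about the input, which is what "dead" means here.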
JavaScript: the loss in one place
The two parts every VAE implementation must get right are the reparameterization trick and the KL term. Here they are in plain JS, computing the per-example loss given encoder outputs mu and logVar:
// Box–Muller standard-normal sample
function randn() {
  // 1 - Math.random() lies in (0, 1], so Math.log never receives 0
  const u = 1 - Math.random(), v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Reparameterization: z = mu + sigma * eps, with eps ~ N(0,1)
function reparameterize(mu, logVar) {
  return mu.map((m, j) => {
    const sigma = Math.exp(0.5 * logVar[j]); // sigma = exp(½ log σ²)
    return m + sigma * randn();
  });
}

// Closed-form KL( N(mu, σ²) ‖ N(0,1) ), summed over latent dims
function klDivergence(mu, logVar) {
  let kl = 0;
  for (let j = 0; j < mu.length; j++) {
    kl += -0.5 * (1 + logVar[j] - mu[j] * mu[j] - Math.exp(logVar[j]));
  }
  return kl;
}

// Bernoulli reconstruction loss (binary cross-entropy) for [0,1] pixels
function bce(x, xHat) {
  let r = 0;
  for (let i = 0; i < x.length; i++) {
    // clamp predictions away from 0 and 1 so the logs stay finite
    const p = Math.min(Math.max(xHat[i], 1e-7), 1 - 1e-7);
    r -= x[i] * Math.log(p) + (1 - x[i]) * Math.log(1 - p);
  }
  return r;
}

// Total VAE loss = reconstruction + beta * KL (beta = 1 is the vanilla VAE)
function vaeLoss(x, xHat, mu, logVar, beta = 1) {
  return bce(x, xHat) + beta * klDivergence(mu, logVar);
}
Note logVar everywhere: the network never outputs σ directly. σ = exp(½·logVar) is always positive and the log keeps the optimization unconstrained and numerically stable.
Python: a full VAE in PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, in_dim=784, hidden=400, latent=20):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc_mu = nn.Linear(hidden, latent)   # outputs mu
        self.fc_lv = nn.Linear(hidden, latent)   # outputs log-variance
        self.fc2 = nn.Linear(latent, hidden)
        self.fc3 = nn.Linear(hidden, in_dim)

    def encode(self, x):
        h = F.relu(self.fc1(x))
        return self.fc_mu(h), self.fc_lv(h)      # mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)            # sigma = exp(½ logσ²)
        eps = torch.randn_like(std)              # eps ~ N(0, I)
        return mu + std * eps                    # the trick

    def decode(self, z):
        h = F.relu(self.fc2(z))
        return torch.sigmoid(self.fc3(h))        # pixels in [0,1]

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    # reconstruction: summed binary cross-entropy
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
    # closed-form KL to N(0, I)
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kld

# --- training step ---
# `loader` is assumed to be a DataLoader yielding (images, labels),
# with images already flattened to [batch, 784] and scaled to [0,1]
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for x, _ in loader:
    x_hat, mu, logvar = model(x)
    loss = vae_loss(x_hat, x, mu, logvar)
    opt.zero_grad()
    loss.backward()
    opt.step()

# --- generate brand-new samples (no input needed) ---
with torch.no_grad():
    z = torch.randn(64, 20)          # sample the prior directly
    new_images = model.decode(z)     # 64 fresh digits, one forward pass
The generation step is the proof of the whole idea: we never touch a real image. We draw 64 vectors straight from the prior N(0, I) and decode them. Because the KL term trained the latent space to look like that prior, these random points land in populated regions and decode to recognizable digits.
Variants worth knowing
β-VAE. Scale the KL term by a coefficient β. With β > 1 the model trades reconstruction fidelity for a more disentangled latent space — individual dimensions come to control single, interpretable factors of variation. The standard VAE is just β-VAE with β = 1.
Conditional VAE (CVAE). Feed a label y into both encoder and decoder, so you can ask for "a 7" rather than getting a random digit. The model learns p(x | z, y) — controllable generation.
VQ-VAE. Replace the continuous Gaussian latent with a discrete codebook (vector quantization). It avoids posterior collapse and produces sharper output; VQ-VAE-2 generated ImageNet samples competitive with GANs, and the idea underpins many modern tokenized image/audio models.
Importance-Weighted Autoencoder (IWAE). Use k latent samples and an importance-weighted bound that is never looser than the single-sample ELBO and tightens toward log p(x) as k grows, at the cost of k× the compute.
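The effect of the importance-weighted bound can be seen on a toy model where log p(x) is known exactly. The sketch below assumes p(z) = N(0, 1) and p(x|z) = N(z, 1), so p(x) = N(0, 2), and deliberately uses a mismatched encoder q(z|x) = N(0, 1). All names and values are illustrative.

```python
import math
import random

def log_normal(value, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (value - mean) ** 2 / var)

def log_weight(x, z, q_mean, q_var):
    """log [ p(z) p(x|z) / q(z|x) ] for the toy model."""
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, q_mean, q_var)

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def iwae_bound(x, k, q_mean, q_var, n_repeats=5000, seed=0):
    """Monte Carlo estimate of the k-sample bound; k = 1 is the ordinary ELBO."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_repeats):
        log_ws = []
        for _ in range(k):
            z = q_mean + math.sqrt(q_var) * rng.gauss(0.0, 1.0)
            log_ws.append(log_weight(x, z, q_mean, q_var))
        total += log_mean_exp(log_ws)
    return total / n_repeats

x = 1.0
true_log_px = log_normal(x, 0.0, 2.0)
elbo = iwae_bound(x, k=1, q_mean=0.0, q_var=1.0)
iwae = iwae_bound(x, k=20, q_mean=0.0, q_var=1.0)
print(f"ELBO {elbo:.3f} <= IWAE(k=20) {iwae:.3f} <= log p(x) {true_log_px:.3f}")
```

With a poorly matched encoder the single-sample ELBO sits well below log p(x), while the 20-sample bound closes most of the gap.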
Latent diffusion. Train a VAE to compress images into a small latent grid, then run a diffusion model in that latent space rather than pixel space. This is the architecture behind Stable Diffusion — the VAE provides the cheap, smooth space; diffusion provides the photorealism.
Common bugs and edge cases
- Outputting σ instead of log σ². A raw σ head can go negative or explode. Always output log-variance and recover σ = exp(½·logVar); the KL formula above assumes log-variance.
- Mismatched reduction in the loss. If reconstruction is summed over pixels but KL is averaged over the batch (or vice versa), the β you think you're using is wrong by a factor of the batch size or pixel count. Keep both terms on the same scale.
- Posterior collapse. A too-powerful decoder learns to ignore z while the KL drives q(z|x) to the prior. Fight it with KL annealing (ramp β from 0 to 1 over training) or free-bits (don't penalize KL below a small floor per dimension).
- Sigmoid + BCE only works for [0,1] data. For unbounded or non-binary pixels use a Gaussian or discretized-logistic likelihood; forcing a sigmoid on raw pixel values distorts the loss.
- Sampling the prior with the wrong scale. At generation you draw z ~ N(0, I), the prior, not from the encoder. Reusing an encoded μ "samples" only known data; the point of a VAE is that the standard-normal prior already covers the populated space.
- Expecting GAN-sharp output. Blur is inherent to the L2/Gaussian likelihood, not a training failure. If you need sharpness, switch the decoder likelihood (VQ-VAE) or move to latent diffusion.
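The two posterior-collapse remedies named in the list can each be sketched in a few lines. The warm-up length and the free-bits floor below are hypothetical values chosen for illustration; suitable settings depend on the model and dataset.

```python
def kl_weight(step, warmup_steps=10_000, beta_max=1.0):
    """Linear KL annealing: ramp beta from 0 to beta_max over warmup_steps."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, floor=0.5):
    """Free bits: each dimension is charged at least `floor` nats, so pushing
    its KL below the floor no longer lowers the loss."""
    return sum(max(kl, floor) for kl in kl_per_dim)

# Annealing schedule at the start, midpoint, and after warm-up
betas = [kl_weight(step) for step in (0, 5_000, 20_000)]

# Hypothetical per-dimension KL values; the first one is below the floor
penalty = free_bits_kl([0.1, 0.8, 2.0], floor=0.5)
print(betas, penalty)
```

In a training loop the annealed weight would multiply the KL term of the loss at each step, and the free-bits sum would replace the plain sum over latent dimensions.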
Frequently asked questions
What is the difference between a VAE and a plain autoencoder?
A plain autoencoder maps each input to a single point in latent space and only minimizes reconstruction error, so the space between encoded points is undefined garbage. A VAE encodes each input to a Gaussian distribution and adds a KL term pulling those distributions toward a standard normal, which forces the latent space to be smooth and continuous — so sampling a new point and decoding it produces a plausible new datum.
What is the reparameterization trick and why is it needed?
Sampling z from N(μ, σ²) is a random operation you can't backpropagate through. The reparameterization trick rewrites the sample as z = μ + σ·ε where ε ~ N(0, 1) is drawn independently of the network. The randomness now lives in ε, while μ and σ are deterministic outputs, so gradients flow straight to the encoder. It was the key contribution of Kingma and Welling's 2013 paper.
What is the ELBO and what are its two terms?
The Evidence Lower Bound is the objective a VAE maximizes: ELBO = E[log p(x|z)] − KL(q(z|x) ‖ p(z)). The first term is reconstruction quality (decode z back to x). The second is a regularizer pulling the encoder's per-input distribution toward the prior N(0, I). Maximizing the ELBO is equivalent to minimizing reconstruction loss plus a KL penalty.
Why are VAE samples blurrier than GAN samples?
VAEs typically maximize a per-pixel Gaussian likelihood, which is mathematically an L2 reconstruction loss. When several plausible outputs exist, L2 minimizes error by averaging them, producing a blurry mean image. GANs instead use a discriminator that rewards sharp, realistic-looking output, so they produce crisper but less mode-covering samples.
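The averaging effect is easy to reproduce with two toy "images". The sketch below assumes a code that is equally likely to mean either of two 4-pixel targets and compares the expected squared error of predicting one target, the other, or their pixel-wise mean.

```python
def expected_l2(prediction, targets):
    """Squared error of one prediction, averaged over equally likely targets."""
    total = 0.0
    for target in targets:
        total += sum((p - t) ** 2 for p, t in zip(prediction, target))
    return total / len(targets)

# Two toy 4-pixel images that differ in their first two pixels
image_a = [1.0, 0.0, 1.0, 0.0]
image_b = [0.0, 1.0, 1.0, 0.0]
average = [(a + b) / 2 for a, b in zip(image_a, image_b)]

loss_a = expected_l2(image_a, [image_a, image_b])
loss_b = expected_l2(image_b, [image_a, image_b])
loss_avg = expected_l2(average, [image_a, image_b])
print(loss_a, loss_b, loss_avg)
```

Committing to either sharp image costs twice as much as the gray average, so an L2-trained decoder prefers the blur.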
What is posterior collapse in a VAE?
Posterior collapse happens when the KL term drives the encoder to output q(z|x) ≈ the prior for every input — the latent code carries no information and the decoder ignores z entirely. It's common when the decoder is very powerful (e.g. an autoregressive PixelCNN or LSTM). Fixes include KL annealing, free-bits, and β-VAE scheduling.
Can you interpolate between two data points in a VAE?
Yes — that's the headline feature. Encode two inputs to their latent means z₁ and z₂, then decode points along the line (1−t)·z₁ + t·z₂ for t from 0 to 1. Because the KL term keeps the space continuous, the decoded outputs morph smoothly — a face rotating, a digit 3 bending into an 8 — instead of jumping between unrelated images.
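The interpolation path itself is just a straight line in latent space. A minimal sketch, assuming two 2-dimensional latent means chosen for illustration; in a real model each point on the path would be passed to the decoder (for example `model.decode` in the PyTorch code above).

```python
def interpolate(z1, z2, steps=5):
    """Return `steps` latent codes evenly spaced on the line from z1 to z2."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z1, z2)])
    return path

z1 = [0.0, 2.0]    # illustrative latent mean of the first input
z2 = [4.0, -2.0]   # illustrative latent mean of the second input
path = interpolate(z1, z2, steps=5)
for point in path:
    print(point)
```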