Diffusion Models

Teach a network to undo noise, then hand it static and watch a picture appear

A diffusion model generates data by learning to reverse a step-by-step noising process: it trains a neural network to predict the noise added at each timestep, then starts from pure Gaussian noise and denoises it back into a clean image over many steps (1000 in the original DDPM, 10–50 with DDIM).

Forward process: add Gaussian noise · fixed
Reverse process: learned denoiser · iterative
Training loss: MSE on predicted noise
DDPM sampling steps: 1000 (DDIM: 10–50)
Network: U-Net or DiT, time-conditioned

How a diffusion model works

A diffusion model is built from two processes that mirror each other. The forward process is a fixed, hand-designed recipe that gradually destroys an image by adding a little Gaussian noise at each of T timesteps until, at step T, nothing is left but pure static. The reverse process is what you actually learn: a neural network that, given a noisy image and the timestep it came from, predicts how to undo one step of that corruption. Run the reverse process from t = T down to t = 0 and a clean sample crystallizes out of noise.

The forward process is the clever part, because it never needs training. At each step you blend the current image with fresh noise according to a variance schedule β₁ … β_T. The key algebraic trick is that you can jump directly to any timestep in closed form — you don't have to simulate all the intermediate steps:

x_t = √(ᾱ_t) · x₀ + √(1 − ᾱ_t) · ε        where ε ~ N(0, I)
ᾱ_t = ∏_{s=1..t} (1 − β_s)

So a noisy sample x_t is just a known linear combination of the clean image x₀ and a single noise vector ε, with weights fixed by the schedule. That closed form is why training is cheap: pick a random image, pick a random timestep t, sample one noise vector, compute x_t in one line, and ask the network to predict the ε you used. The loss is a plain mean-squared error between the true noise and the predicted noise.
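
The closed-form jump can be checked numerically. The sketch below assumes the linear schedule used in the implementations later in this article (β from 1e-4 to 0.02, T = 1000) and represents a sample as a flat list of floats; the helper names make_alpha_bar, q_sample, and iterate_coefficients are illustrative, not from any library. It applies the noising steps one at a time, tracking only the signal scale and the accumulated noise variance, so they can be compared with √(ᾱ_t) and 1 − ᾱ_t.

```python
import math
import random

def make_alpha_bar(T=1000, beta0=1e-4, betaT=0.02):
    """Linear beta schedule and its cumulative products (alpha_bar)."""
    betas = [beta0 + (betaT - beta0) * t / (T - 1) for t in range(T)]
    alpha_bar, cum = [], 1.0
    for b in betas:
        cum *= 1.0 - b
        alpha_bar.append(cum)
    return betas, alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Jump straight to timestep t: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    ab = alpha_bar[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(ab) * xi + math.sqrt(1.0 - ab) * ei for xi, ei in zip(x0, eps)]
    return x_t, eps

def iterate_coefficients(betas, t):
    """Apply steps 0..t one at a time, tracking signal scale and noise variance."""
    signal, var = 1.0, 0.0
    for b in betas[: t + 1]:
        signal *= math.sqrt(1.0 - b)      # each step shrinks the surviving signal
        var = (1.0 - b) * var + b         # old noise shrinks, fresh noise is added
    return signal, var

betas, alpha_bar = make_alpha_bar()
signal, var = iterate_coefficients(betas, 499)
x_t, eps = q_sample([0.5, -0.25, 1.0], 499, alpha_bar, random.Random(0))
```

Under these assumptions the step-by-step signal scale matches √(ᾱ_t) and the step-by-step variance matches 1 − ᾱ_t up to floating-point error, which is why the intermediate steps never need to be simulated during training.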

At sampling time you reverse it. Start from x_T ~ N(0, I) — literally random static — and at each step subtract the network's predicted noise (scaled), then add back a small amount of fresh noise to keep the chain stochastic. After T steps you land on x₀. The network never sees a clean target during generation; it only ever nudges the current estimate slightly toward the data manifold, one small denoising step at a time.

The score-matching view

There are two equivalent ways to understand what the network learns, and seeing both is the fastest route to intuition. The first is the DDPM "predict the noise" framing above. The second, from the score-based literature (Song & Ermon, 2019), says the network learns the score — the gradient of the log data density, ∇ₓ log p(x). The score points toward regions of higher probability, so following it uphill moves a noisy point toward where real data lives.

These views coincide exactly: predicting the noise ε is, up to a known timestep-dependent scaling factor, the same as estimating the score of the noised distribution. That equivalence is why diffusion models are sometimes called score-based generative models, and why the continuous-time formulation expresses both forward and reverse processes as stochastic differential equations. For practical purposes you can think of generation as noisy gradient ascent on the log data density: the same conceptual move as gradient descent, with the sign flipped and carried out in the space of images.
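
The scaling factor can be read off from the closed-form forward process. A short derivation using the symbols defined earlier; s_θ and ε_θ are notation introduced here for the learned score and the learned noise predictor, not symbols from the text above:

```latex
\begin{aligned}
q(x_t \mid x_0) &= \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right) \\
\nabla_{x_t} \log q(x_t \mid x_0)
  &= -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t}
   = -\frac{\varepsilon}{\sqrt{1-\bar\alpha_t}} \\
s_\theta(x_t, t) &\approx -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}}
\end{aligned}
```

The minus sign matches the intuition: the noise points away from the data, and the score points back toward it.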

When to use a diffusion model

High-fidelity, diverse generation. Diffusion currently leads on image quality and, crucially, on mode coverage — it doesn't collapse onto a few outputs the way GANs can.
Stable training. The loss is a simple regression; there's no adversarial min-max game to balance, no discriminator to tune. This is the single biggest practical reason diffusion overtook GANs.
Conditional generation. Text-to-image, inpainting, super-resolution, and image-to-image all drop in cleanly by conditioning the denoiser, without retraining from scratch.
Beyond images. The same machinery generates audio (waveforms, spectrograms), video, 3D shapes, molecules, and robot trajectories.

Avoid diffusion when latency is the constraint and a single forward pass matters — real-time generation on the edge, or any setting where you can't afford dozens of network evaluations per sample. There a GAN, a VAE, or a distilled few-step diffusion model fits better. Diffusion's Achilles' heel is and always has been sampling speed.

Diffusion vs other generative models

	Diffusion (DDPM)	GAN	VAE	Normalizing flow	Autoregressive
Sampling cost	10–1000 passes	1 pass	1 pass	1 pass	1 pass per token
Training stability	High (MSE loss)	Low (adversarial)	High	High	High
Sample quality	State of the art	High, can be sharp	Blurry	Moderate	High
Mode coverage	Excellent	Poor (collapse)	Good	Excellent	Excellent
Exact likelihood	Variational bound	None	Variational bound	Exact	Exact
Architecture freedom	Any (U-Net, DiT)	Any	Any	Invertible only	Causal only
Latent dimension	= data dimension	Small bottleneck	Small bottleneck	= data dimension	= data dimension

The headline trade is quality and stability versus speed. Diffusion wins decisively on the first two and loses on the third; almost all recent research on diffusion sampling is an attempt to claw back that single-pass speed without giving up quality.

What the numbers actually say

1000 sequential passes per image, by default. The original 2020 DDPM samples with T = 1000 denoising steps. On a single image that is 1000 forward passes through a U-Net with tens to hundreds of millions of parameters — seconds to minutes of GPU time, versus a GAN's sub-second single pass.
DDIM cuts steps 20–100×. The deterministic DDIM sampler (2021) produces comparable quality in 50 steps, and usable images in 10–20, by recasting the forward process as non-Markovian, which keeps the training objective unchanged while permitting a reverse update that skips timesteps. DPM-Solver and other higher-order ODE solvers push good results into the 10–25 step range.
Latent diffusion shrinks the working tensor ~48×. Stable Diffusion encodes a 512×512×3 image (≈786k numbers) into a 64×64×4 latent (≈16k numbers) before diffusing, an order-of-magnitude cut in both compute and memory that put text-to-image on consumer GPUs.
Distillation reaches 1–4 steps. Progressive distillation, consistency models, and adversarial diffusion distillation compress the 50-step sampler into 1–4 passes, approaching GAN latency while keeping diffusion's diversity.
Training scale. Stable Diffusion v1 trained for roughly 150,000 A100 GPU-hours — on the order of thousands of GPU-days — across the LAION dataset (billions of image-text pairs); the published v1 model carried ~860M U-Net parameters plus its text encoder and autoencoder.
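
The ratios quoted in this list follow from simple arithmetic. A quick check, using only the figures stated above:

```python
# Latent diffusion: values per image in pixel space vs. latent space.
pixel_values = 512 * 512 * 3
latent_values = 64 * 64 * 4
compression = pixel_values / latent_values

# DDIM: speedup relative to the 1000-step DDPM sampler.
ddpm_steps = 1000
ddim_speedups = {steps: ddpm_steps / steps for steps in (50, 20, 10)}

# Training scale: GPU-hours converted to GPU-days.
gpu_hours = 150_000
gpu_days = gpu_hours / 24

print(pixel_values, latent_values, compression)  # 786432 16384 48.0
print(ddim_speedups)                             # {50: 20.0, 20: 50.0, 10: 100.0}
print(gpu_days)                                  # 6250.0
```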

JavaScript implementation

A real diffusion model needs a deep network and a GPU, but the sampling loop is short and language-agnostic. Here is the DDPM sampler skeleton in JavaScript, with the denoiser stubbed out — swap in any model that maps (x_t, t) → predicted noise.

// Build a linear beta schedule and precompute the alpha cumulative products.
function makeSchedule(T = 1000, beta0 = 1e-4, betaT = 0.02) {
  const beta = [], alpha = [], alphaBar = [];
  let cum = 1;
  for (let t = 0; t < T; t++) {
    const b = beta0 + (betaT - beta0) * (t / (T - 1));
    beta.push(b); alpha.push(1 - b);
    cum *= (1 - b); alphaBar.push(cum);
  }
  return { beta, alpha, alphaBar, T };
}

const randn = () => {                 // Box–Muller standard normal
  const u = 1 - Math.random();        // shift to (0, 1] so Math.log(u) stays finite
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
};

// model(xt, t) must return predicted noise eps_hat with the same shape as xt.
// A sample here is a flat array, so `shape` is simply its element count.
function ddpmSample(model, shape, sched) {
  const { beta, alpha, alphaBar, T } = sched;
  let x = Array.from({ length: shape }, randn);   // start from pure noise

  for (let t = T - 1; t >= 0; t--) {
    const epsHat = model(x, t);
    const aBar = alphaBar[t], a = alpha[t], b = beta[t];
    const coef = b / Math.sqrt(1 - aBar);
    const invSqrtA = 1 / Math.sqrt(a);
    // mean of p(x_{t-1} | x_t)
    const mean = x.map((xi, i) => invSqrtA * (xi - coef * epsHat[i]));
    if (t === 0) { x = mean; break; }            // no noise on the last step
    const sigma = Math.sqrt(b);                   // simple fixed variance
    x = mean.map(m => m + sigma * randn());
  }
  return x;                                       // approximately x0
}

Two details carry the whole algorithm. First, the final step adds no noise: sampling a clean output means stopping the stochasticity at t = 0. Second, the coefficient β_t / √(1 − ᾱ_t) (coef in the code) is exactly the scaling that converts a noise prediction into the right step size; get it wrong and samples drift to gray or blow up.

Python implementation

The same sampler in PyTorch, plus the one-line training objective that is the real heart of DDPM. The training loop is shorter than people expect: noise a real image to a random timestep, predict the noise, regress.

import torch

def make_schedule(T=1000, beta0=1e-4, betaT=0.02, device="cpu"):
    beta = torch.linspace(beta0, betaT, T, device=device)
    alpha = 1.0 - beta
    alpha_bar = torch.cumprod(alpha, dim=0)
    return beta, alpha, alpha_bar

# ---- Training step (the entire learning objective) ----
def training_loss(model, x0, schedule):
    beta, alpha, alpha_bar = schedule
    T = beta.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)   # random timestep
    eps = torch.randn_like(x0)                                  # target noise
    ab = alpha_bar[t].view(-1, 1, 1, 1)                         # broadcast over (B, C, H, W)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps                # closed-form noising
    eps_hat = model(x_t, t)                                     # predict the noise
    return torch.nn.functional.mse_loss(eps_hat, eps)          # plain MSE

# ---- Sampling (reverse process) ----
@torch.no_grad()
def ddpm_sample(model, shape, schedule, device="cpu"):
    beta, alpha, alpha_bar = schedule
    T = beta.shape[0]
    x = torch.randn(shape, device=device)                       # pure noise
    for t in reversed(range(T)):
        t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_hat = model(x, t_b)
        a, ab, b = alpha[t], alpha_bar[t], beta[t]
        coef = b / (1 - ab).sqrt()
        mean = (x - coef * eps_hat) / a.sqrt()
        if t > 0:
            x = mean + b.sqrt() * torch.randn_like(x)
        else:
            x = mean                                            # last step: no noise
    return x

Notice that training_loss never references the reverse process at all — you train the denoiser purely on the forward (noising) direction, which is why training parallelizes perfectly across timesteps while sampling stays stubbornly sequential.

Variants worth knowing

DDIM (denoising diffusion implicit models). Replaces the stochastic reverse chain with a deterministic one, derived from a non-Markovian forward process, that can be read as discretizing an ODE. Same trained network, but you can skip timesteps: 50 steps for near-identical quality, or 10–20 for fast previews. Determinism also makes generation reproducible and the latent space interpolatable.
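
A minimal sketch of the deterministic (η = 0) DDIM update, assuming each sample is a flat list of floats and that model(x, t) is a callable returning predicted noise of the same length. The names ddim_step and ddim_sample are illustrative, and the evenly spaced timestep subset is one simple choice among several.

```python
import math

def ddim_step(x_t, eps_hat, ab_t, ab_prev):
    """One deterministic DDIM update from noise level ab_t to ab_prev (alpha_bar values)."""
    out = []
    for xi, ei in zip(x_t, eps_hat):
        # Clean value implied by the current noise prediction.
        x0_pred = (xi - math.sqrt(1.0 - ab_t) * ei) / math.sqrt(ab_t)
        # Re-noise that estimate to the earlier timestep using the same predicted noise.
        out.append(math.sqrt(ab_prev) * x0_pred + math.sqrt(1.0 - ab_prev) * ei)
    return out

def ddim_sample(model, x_T, alpha_bar, num_steps=50):
    """Walk an evenly spaced subset of timesteps from T-1 down to 0 (needs num_steps >= 2)."""
    T = len(alpha_bar)
    ts = sorted({round(i * (T - 1) / (num_steps - 1)) for i in range(num_steps)}, reverse=True)
    x = list(x_T)
    for i, t in enumerate(ts):
        eps_hat = model(x, t)
        # After the last visited timestep, target alpha_bar = 1, i.e. the clean sample.
        ab_prev = alpha_bar[ts[i + 1]] if i + 1 < len(ts) else 1.0
        x = ddim_step(x, eps_hat, alpha_bar[t], ab_prev)
    return x
```

Because no fresh noise is injected, the same starting x_T always produces the same output, which is what makes the sampler reproducible.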

Latent diffusion (Stable Diffusion). Diffuse in the latent space of a pretrained VAE instead of pixel space. The autoencoder handles high-frequency detail; the diffusion model only has to learn semantic structure in a much smaller tensor. This is the design that made open text-to-image practical.

Classifier-free guidance. Train the denoiser to handle both conditional and unconditional inputs (drop the condition randomly during training), then at sampling time extrapolate: ε = ε_uncond + w · (ε_cond − ε_uncond). The guidance scale w trades prompt adherence for diversity — the single knob most users actually touch.
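
The extrapolation formula is short enough to sketch directly. The version below assumes noise predictions are flat lists of floats; the model(x_t, t, cond) signature and the null_cond placeholder for "no condition" are hypothetical and stand in for whatever interface a real denoiser exposes.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def guided_eps(model, x_t, t, cond, w, null_cond=None):
    """Query the denoiser with and without the condition, then extrapolate.

    `model` and `null_cond` are hypothetical: any callable taking
    (x_t, t, cond) and returning predicted noise would fit here.
    """
    eps_uncond = model(x_t, t, null_cond)
    eps_cond = model(x_t, t, cond)
    return cfg_combine(eps_uncond, eps_cond, w)
```

With this formula, w = 0 reproduces the unconditional prediction, w = 1 the conditional one, and w > 1 pushes past the conditional prediction. Each guided step costs two denoiser evaluations.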

Diffusion Transformers (DiT). Swap the U-Net for a transformer operating on latent patches. DiT scales cleanly with compute and underpins large modern systems; it is the architecture behind several frontier image and video models.

Consistency & distillation models. Train a network to map any point on the diffusion trajectory directly to its endpoint, enabling 1–4 step generation. The frontier of "diffusion quality at GAN speed."

Common bugs and edge cases

Adding noise on the final step. The reverse step samples from a Gaussian — but at t = 0 you must return the mean only. Adding that last bit of noise leaves visible grain on every output.
Wrong variance schedule. A linear schedule that's too aggressive destroys signal too early; the cosine schedule (Nichol & Dhariwal, 2021) fixes low-resolution training and is the safer default. Mismatched β between training and sampling silently degrades quality.
Forgetting the timestep embedding. The network is conditioned on t via a sinusoidal or learned embedding. Drop it and the model can't tell early steps (coarse structure) from late steps (fine detail), and outputs turn to mush.
Guidance scale too high. Cranking classifier-free guidance past ~15 over-saturates colors, blows out contrast, and collapses diversity — every sample starts to look the same.
Latent/pixel scaling mismatch. In latent diffusion the VAE latents must be scaled by the model's fixed factor (≈0.18215 for SD v1) before diffusing and unscaled after. Skip it and you get noise or washed-out color.
Training data duplication. Images repeated thousands of times get memorized and can be regurgitated near-verbatim. Deduplicate aggressively if you care about novelty or copyright exposure.
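
For the variance-schedule item above, a sketch of the cosine schedule as described by Nichol & Dhariwal (2021): ᾱ_t is defined directly from a squared cosine with a small offset s, and the per-step β values are recovered from consecutive ratios and clipped. The function names are illustrative; s = 0.008 and the 0.999 clip are the values from that paper as best recalled here, so treat them as assumptions.

```python
import math

def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0), with f(t) = cos^2(((t / T + s) / (1 + s)) * pi / 2)."""
    def f(t):
        return math.cos(((t / T + s) / (1.0 + s)) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(T + 1)]   # index 0 is the clean image

def cosine_betas(T=1000, s=0.008, max_beta=0.999):
    """Per-step betas from consecutive alpha_bar ratios, clipped near t = T."""
    ab = cosine_alpha_bar(T, s)
    return [min(1.0 - ab[t] / ab[t - 1], max_beta) for t in range(1, T + 1)]

alpha_bar_cos = cosine_alpha_bar()
betas_cos = cosine_betas()
```

Halfway through, this schedule still retains roughly half the signal (ᾱ ≈ 0.49 at t = 500), whereas the linear schedule from the implementations above has already dropped below 0.1 at the same point.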

Frequently asked questions

How is a diffusion model different from a GAN?

A GAN trains a generator against a discriminator in an adversarial min-max game and produces an image in a single forward pass: fast to sample, but unstable to train and prone to mode collapse. A diffusion model trains a single network with a simple regression loss to denoise, then generates by iterating that network dozens to thousands of times. Diffusion is far more stable to train and covers the data distribution better, at the cost of slow, multi-step sampling.

Why does a diffusion model predict noise instead of the clean image?

Because of how the math factorizes. The forward process adds Gaussian noise, so a noisy sample x_t is a known linear blend of the clean image x₀ and a noise vector ε. Predicting ε turns out to be equivalent (up to a known, timestep-dependent scaling) to estimating the score, the gradient of the log density of the noised data, which is exactly the quantity the reverse step needs. Empirically, the ε-prediction parameterization from the 2020 DDPM paper also trains more stably and produces sharper samples than predicting x₀ directly.

Why are diffusion models so slow to sample from?

Sampling runs the denoising network once per timestep, and the original DDPM uses 1000 steps. That is 1000 sequential forward passes; you cannot parallelize across steps because each depends on the previous output. Faster samplers like DDIM and DPM-Solver cut this to 10–50 steps, and distillation to 1–4, but the fundamental cost is the iterative loop, unlike a GAN's single pass.

What is latent diffusion and why does Stable Diffusion use it?

Latent diffusion runs the entire diffusion process in the compressed latent space of a pretrained autoencoder rather than on raw pixels. A 512×512×3 image (about 786k values) is encoded to roughly a 64×64×4 latent (about 16k values), a ~48× reduction. The U-Net then denoises that small tensor, cutting compute and memory by more than an order of magnitude, which is what made Stable Diffusion runnable on consumer GPUs.

How does text guide image generation in a diffusion model?

The text prompt is encoded into embeddings (by CLIP or a T5 text encoder) and injected into the denoiser through cross-attention layers. Classifier-free guidance then sharpens the effect: at each step the model predicts noise both with and without the prompt, and extrapolates away from the unconditional prediction using a guidance scale (typically 5–12). Higher guidance follows the prompt more tightly but reduces diversity and can over-saturate.

Do diffusion models memorize their training images?

They can, especially for images that are duplicated many times in the training set. Studies of Stable Diffusion have extracted near-verbatim copies of training images from a small fraction of prompts. Memorization rises with duplication and with model capacity, which is why deduplicating training data and limiting per-image repetition are standard mitigations.