
Generative Adversarial Network (GAN)

Two networks at war — and the forgeries win

A generative adversarial network (GAN) trains two neural networks against each other — a generator that fabricates samples and a discriminator that judges them — until the fakes are statistically indistinguishable from real data.

Introduced: Goodfellow et al., 2014
Training: Two-player minimax game
Optimal discriminator at equilibrium: D(x) = 0.5 everywhere
Implicit divergence: Jensen–Shannon (vanilla)
Quality metric: FID (lower is better)

The forger and the detective

Picture an art forger and a museum detective locked in a room. The forger paints fakes; the detective inspects each canvas and stamps it "real" or "fake." Early on the forgeries are laughable and the detective catches every one. But each time the detective explains why a painting looked fake, the forger improves. The detective sharpens in response. Run this loop long enough and the forger produces canvases the detective can only guess at — a coin flip. That coin flip is the whole point.

That is a generative adversarial network. Ian Goodfellow and colleagues introduced it in 2014. The forger is the generator G; the detective is the discriminator D. The generator never sees a single real sample — it only ever receives feedback through the discriminator's verdict. It learns the data distribution implicitly, by being told over and over how a critic distinguishes its output from the genuine article.

The generator takes a random noise vector z (typically drawn from a standard normal, say 100 or 128 dimensions) and maps it through a neural network to a sample G(z) — an image, an audio clip, a row of tabular data. The discriminator takes a sample and outputs a single number in [0, 1]: the probability it thinks the sample is real. The two are trained simultaneously, and their objectives are exact opposites. One network's loss is the other's gain.

The minimax objective

The original GAN objective is a single value function that the two players push in opposite directions:

min_G max_D V(D, G) = E_{x~p_data}[ log D(x) ] + E_{z~p_z}[ log(1 − D(G(z))) ]

Read it left to right. The discriminator D wants to maximize V: push D(x) toward 1 on real data and D(G(z)) toward 0 on fakes. The generator G wants to minimize V — it can only touch the second term, and it does so by making D(G(z)) large, i.e. fooling the detective.
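
To make the two directions concrete, here is a minimal sketch of a minibatch estimate of V(D, G). It assumes the discriminator's outputs are already available as plain Python floats; the function name value_estimate and the example scores are illustrative, not taken from the paper.

```python
import math

def value_estimate(d_real, d_fake):
    """Minibatch estimate of V(D, G).

    d_real: D(x) for real samples, each strictly inside (0, 1)
    d_fake: D(G(z)) for generated samples, each strictly inside (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator pushes V up toward its ceiling of 0.
v_strong = value_estimate([0.99, 0.98], [0.01, 0.02])
# A discriminator reduced to guessing sits at -log 4.
v_coin = value_estimate([0.5, 0.5], [0.5, 0.5])
# A fooled discriminator (fakes scored as real) drags V far below that.
v_fooled = value_estimate([0.5, 0.5], [0.99, 0.98])
```

The ordering v_strong > v_coin > v_fooled is the tug-of-war in miniature: D climbs toward the first value, G pulls toward the last.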

Goodfellow's paper proves two clean results. First, for a fixed generator the optimal discriminator is

D*(x) = p_data(x) / ( p_data(x) + p_g(x) )

Second, plug that back in and the generator is effectively minimizing the Jensen–Shannon divergence between the real distribution p_data and its own distribution p_g. The global minimum is reached exactly when p_g = p_data, at which point D*(x) = 1/2 everywhere — the detective is reduced to a coin flip. That is the Nash equilibrium of the game.
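
Both results can be checked numerically on a small discrete distribution. The sketch below assumes a finite support with strictly positive probabilities (so every logarithm is defined); the helper names and the example distributions are made up for illustration. It evaluates V at the optimal discriminator and compares it with −log 4 + 2·JSD(p_data ‖ p_g), the identity behind the Jensen–Shannon claim.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) on a finite support."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def value_at_optimum(p_data, p_g):
    """V(D*, G) = E_data[log D*(x)] + E_g[log(1 - D*(x))]."""
    d_star = optimal_discriminator(p_data, p_g)
    return sum(pd * math.log(d) + pg * math.log(1.0 - d)
               for pd, pg, d in zip(p_data, p_g, d_star))

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = [0.5, 0.3, 0.2]   # assumed "real" distribution
p_g = [0.2, 0.3, 0.5]      # assumed generator distribution

lhs = value_at_optimum(p_data, p_g)
rhs = -math.log(4) + 2 * jsd(p_data, p_g)

# When the generator matches the data, D* is 1/2 everywhere and V = -log 4.
d_matched = optimal_discriminator(p_data, p_data)
v_matched = value_at_optimum(p_data, p_data)
```

Here lhs and rhs agree to floating-point precision, and v_matched is the global minimum because the JSD term is zero only when the two distributions coincide.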

In practice you do not train D to optimality between each generator step — that's only the theoretical setup. The standard recipe alternates one (or a few) discriminator updates with one generator update, each on a minibatch, using stochastic gradient descent. Crucially, Goodfellow noticed the generator term log(1 − D(G(z))) saturates: when D easily rejects fakes early on, that term's gradient vanishes precisely when the generator needs signal. The fix used in essentially every real implementation is the non-saturating loss: instead of minimizing log(1 − D(G(z))), maximize log D(G(z)). Same fixed point, far healthier gradients.
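
The saturation is easy to see by differentiating both generator losses with respect to the discriminator's pre-sigmoid output. The notation is an assumption of this sketch: write D(G(z)) = sigmoid(a), where a is the logit. Then the original loss log(1 − sigmoid(a)) has derivative −sigmoid(a), and the non-saturating loss −log sigmoid(a) has derivative sigmoid(a) − 1.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the original generator loss."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the non-saturating generator loss."""
    return sigmoid(a) - 1.0

# a = -6 means D(G(z)) is roughly 0.0025: the fake is confidently rejected.
a = -6.0
g_sat = grad_saturating(a)      # tiny: almost no learning signal
g_ns = grad_non_saturating(a)   # close to -1: strong learning signal
```

Both gradients are negative, so both losses push the logit up, but only the non-saturating one pushes hard while the discriminator is winning.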

When to reach for a GAN

You want sharp, high-fidelity samples — photorealistic faces, textures, super-resolution. GANs produce crisper images than likelihood-based models, which tend to hedge and blur.
You don't need an explicit likelihood or a latent encoder. If your task is "draw me more things that look like this," not "tell me how probable this sample is," a GAN fits.
Image-to-image translation and editing — pix2pix, CycleGAN (unpaired translation), style transfer, inpainting, and photo enhancement are GAN-native problems.
Data augmentation in low-data regimes, where you synthesize plausible extra training examples.

Reach for something else when you need calibrated probabilities or stable, easy-to-monitor training (use a VAE or a normalizing flow), when you want best-in-class likelihood and sample diversity at scale (diffusion models now dominate text-to-image), or when you simply want a fast, reliable pipeline — GAN training is notoriously finicky and can diverge or collapse with no warning.

GAN vs other generative models

	GAN	VAE	Diffusion model	Normalizing flow	Autoregressive (PixelCNN/Transformer)
Training objective	Adversarial minimax (implicit)	Evidence lower bound (ELBO)	Denoising score / variational bound	Exact log-likelihood	Exact log-likelihood (chain rule)
Sample quality	Very sharp	Blurry	State of the art	Moderate	High
Sample speed	One forward pass (fast)	One forward pass (fast)	10–1000 denoising steps (slow)	One pass (fast)	One token at a time (slow)
Likelihood available	No (implicit)	Lower bound only	Lower bound / approx	Exact	Exact
Training stability	Fragile (collapse, oscillation)	Stable	Stable	Stable	Stable
Mode coverage	Prone to mode collapse	Good	Excellent	Good	Excellent
Latent space	Smooth, no encoder	Smooth, has encoder	Noise schedule, no compact latent	Invertible, exact	None (pixel/token space)

The one-line summary: GANs trade stability and likelihood for the sharpest single-pass samples. Diffusion models have since overtaken GANs on raw fidelity and diversity at the cost of slow, multi-step sampling, which is why text-to-image systems mostly use diffusion while real-time and on-device generation still favor GANs.

What the numbers actually say

Sampling cost is one forward pass. A trained StyleGAN2 generates a 1024×1024 image in a single network evaluation — on an A100 that's on the order of milliseconds. A diffusion model at comparable quality needs 25–250 sequential network calls, so GANs are roughly 1–2 orders of magnitude faster at inference. That gap is why GANs still win for real-time avatars and games.
FID is the standard yardstick. The Fréchet Inception Distance fits a Gaussian to the Inception-v3 features of each set (typically 50,000 real and 50,000 generated samples) and measures the distance between the two Gaussians. On the CelebA-HQ face benchmark, strong GANs reach single-digit FID; a random or collapsed model scores in the hundreds.
Discriminator-to-generator update ratio matters. Vanilla GANs often use 1:1; Wasserstein GANs typically run the critic 5 steps per generator step (n_critic = 5) because the critic must approximate the Wasserstein distance well before each generator move.
The 2014 paper trained on MNIST, TFD, and CIFAR-10, mostly with multilayer perceptrons; only one of its CIFAR-10 models used a convolutional discriminator with a "deconvolutional" generator. DCGAN (2015) established the convolutional architecture that made GANs work reliably on images, and progressive growing plus StyleGAN (2018–2019) pushed them to megapixel photorealism.
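
The FID bullet above can be sketched in a few lines under one loud simplification: this version assumes diagonal covariance, so each feature dimension is compared independently. Real FID uses the full covariance matrices of Inception-v3 features and a matrix square root. The function name and the toy feature vectors are hypothetical.

```python
import math
from statistics import fmean, pvariance

def diagonal_frechet_distance(feats_a, feats_b):
    """Fréchet distance between two Gaussians fitted per feature dimension.

    feats_a, feats_b: lists of equal-length feature vectors.
    Simplification: diagonal covariance, which real FID does NOT assume.
    """
    dims = len(feats_a[0])
    total = 0.0
    for i in range(dims):
        col_a = [f[i] for f in feats_a]
        col_b = [f[i] for f in feats_b]
        mu_a, mu_b = fmean(col_a), fmean(col_b)
        sd_a = math.sqrt(pvariance(col_a))
        sd_b = math.sqrt(pvariance(col_b))
        # squared mean gap plus squared standard-deviation gap
        total += (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2
    return total

real = [[0.0, 0.0], [2.0, 2.0]]
shifted = [[1.0, 0.0], [3.0, 2.0]]      # right spread, wrong mean in one dimension
collapsed = [[1.0, 1.0], [1.0, 1.0]]    # right mean, no diversity at all

d_same = diagonal_frechet_distance(real, real)
d_shifted = diagonal_frechet_distance(real, shifted)
d_collapsed = diagonal_frechet_distance(real, collapsed)
```

The collapsed set is penalized even though its mean is exactly right, which is why the metric catches mode collapse as well as poor fidelity.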

JavaScript implementation

The training loop is the part worth seeing in code — it's where every GAN bug lives. This is the canonical alternating update with the non-saturating generator loss, written against a tiny autograd-style API so the structure is clear:

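// G: noise z -> fake sample.  D: sample -> probability(real) in (0,1).
// optG / optD are separate optimizers so each net only updates its own weights.
// tensorRandn, bce, ones and zeros are assumed helpers of that illustrative API,
// not functions from a real library.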

function sampleNoise(batch, dim) {
  // standard normal latent vector z
  return tensorRandn([batch, dim]);
}

function trainStep(realBatch, G, D, optG, optD, zDim) {
  const m = realBatch.shape[0];

  // ---- 1. Update the discriminator: maximize log D(x) + log(1 - D(G(z))) ----
  const z      = sampleNoise(m, zDim);
  const fake   = G.forward(z).detach();          // detach: no G gradient here
  const dReal  = D.forward(realBatch);           // want -> 1
  const dFake  = D.forward(fake);                // want -> 0
  // binary cross-entropy: real labelled 1, fake labelled 0
  const lossD  = bce(dReal, ones(m)).add(bce(dFake, zeros(m)));
  optD.zeroGrad(); lossD.backward(); optD.step();

  // ---- 2. Update the generator: NON-SATURATING, maximize log D(G(z)) ----
  const z2     = sampleNoise(m, zDim);
  const dGen   = D.forward(G.forward(z2));        // gradients flow into G
  // label the fakes as 1 so G is pushed to fool D
  const lossG  = bce(dGen, ones(m));
  optG.zeroGrad(); lossG.backward(); optG.step();

  return { lossD: lossD.item(), lossG: lossG.item() };
}

Two details carry the whole algorithm. First, the .detach() on the fake samples during the discriminator step: the discriminator update needs no gradient with respect to G, so detaching skips a wasted backward pass through the generator and keeps stale gradients off G's parameters. Because optD.step() only touches D's weights, G is not updated either way, but any gradients left on G would be added to the generator step if optG.zeroGrad() were ever skipped. Second, the generator step re-runs G on fresh noise and labels its fakes as 1. That label flip is the non-saturating loss, since BCE against label 1 is exactly −log D(G(z)).
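
The link between binary cross-entropy and the GAN objective can be verified by hand. This is a small sketch with a scalar bce helper written for the purpose; the scores 0.9 and 0.2 are arbitrary example values.

```python
import math

def bce(p, target):
    """Binary cross-entropy for one prediction p in (0, 1) and one target."""
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

d_real, d_fake = 0.9, 0.2   # example discriminator outputs

# Discriminator step: real labelled 1, fake labelled 0.
loss_d = bce(d_real, 1.0) + bce(d_fake, 0.0)
# Generator step: the same kind of fake score, but labelled 1.
loss_g = bce(d_fake, 1.0)

# The quantities the text says these should equal.
neg_value = -(math.log(d_real) + math.log(1.0 - d_fake))
neg_log_d_fake = -math.log(d_fake)
```

Minimizing loss_d is therefore the same as maximizing log D(x) + log(1 − D(G(z))), and minimizing loss_g is the same as maximizing log D(G(z)).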

Python (PyTorch) implementation

The same loop in idiomatic PyTorch, which is how you'd actually write it:

import torch
import torch.nn as nn

bce = nn.BCELoss()

def train_step(real, G, D, opt_g, opt_d, z_dim, device):
    m = real.size(0)
    real_label = torch.ones(m, 1, device=device)
    fake_label = torch.zeros(m, 1, device=device)

    # ---- 1. Discriminator ----
    z    = torch.randn(m, z_dim, device=device)
    fake = G(z).detach()                 # block G's gradients
    loss_d = bce(D(real), real_label) + bce(D(fake), fake_label)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # ---- 2. Generator (non-saturating) ----
    z2     = torch.randn(m, z_dim, device=device)
    loss_g = bce(D(G(z2)), real_label)   # label fakes as 1 -> maximize log D(G(z))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()


class Generator(nn.Module):
    def __init__(self, z_dim=100, out=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 512),   nn.LeakyReLU(0.2),
            nn.Linear(512, out),   nn.Tanh())          # outputs in [-1, 1]
    def forward(self, z): return self.net(z)


class Discriminator(nn.Module):
    def __init__(self, inp=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(inp, 512), nn.LeakyReLU(0.2), nn.Dropout(0.3),
            nn.Linear(512, 256), nn.LeakyReLU(0.2), nn.Dropout(0.3),
            nn.Linear(256, 1),   nn.Sigmoid())         # probability real
    def forward(self, x): return self.net(x)

Note the architectural conventions that the DCGAN paper found make training stable: LeakyReLU in the discriminator (a dying ReLU starves the generator of gradient), a Tanh output on the generator so samples match real data normalized to [-1, 1], and the Adam optimizer with betas=(0.5, 0.999), which lowers the first-moment decay β1 from its default of 0.9.

Variants worth knowing

DCGAN (2015). Deep Convolutional GAN. Replaced the MLPs with strided/transposed convolutions, batch norm, and the LeakyReLU/Tanh conventions above. This is the architecture that made GANs work on images and the template most later work builds on.

Conditional GAN (cGAN). Feed a class label (or any side information) into both G and D. Now you can ask for "a 7" instead of "some digit." pix2pix conditions on an entire input image for image-to-image translation.
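
One common way to feed the label in (an assumption here, since the mechanism is not specified above) is to concatenate a one-hot vector to each network's input. The helper names below are hypothetical, and real implementations often use learned label embeddings instead.

```python
def one_hot(label, num_classes):
    """Return a list with 1.0 at index `label` and 0.0 elsewhere."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

def generator_input(z, label, num_classes):
    """Noise vector z with the requested class appended."""
    return list(z) + one_hot(label, num_classes)

def discriminator_input(sample, label, num_classes):
    """Sample with the class it is claimed to belong to appended."""
    return list(sample) + one_hot(label, num_classes)

z = [0.3, -1.2, 0.7]                       # toy 3-dimensional noise
g_in = generator_input(z, label=7, num_classes=10)
```

Because D sees the label too, a perfect "3" presented with label 7 can still be rejected, which is what forces G to respect the condition.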

Wasserstein GAN (WGAN / WGAN-GP). Swaps Jensen–Shannon divergence for the Earth-Mover distance, which stays smooth even when the distributions don't overlap, easing vanishing gradients and much of the mode collapse. The discriminator becomes a "critic" outputting an unbounded real score. The math requires the critic to be 1-Lipschitz: the original WGAN approximates this by clipping the critic's weights, and WGAN-GP replaces clipping with a gradient penalty.
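
The WGAN losses themselves are short. The sketch below works on plain lists of critic scores and assumes the gradient norms at the interpolated points have already been computed by an autograd library, which is the part omitted here. The function names are illustrative; lam=10 is the coefficient used in the WGAN-GP paper.

```python
def critic_loss(scores_real, scores_fake):
    """Minimizing this maximizes mean f(real) - mean f(fake)."""
    mean_real = sum(scores_real) / len(scores_real)
    mean_fake = sum(scores_fake) / len(scores_fake)
    return mean_fake - mean_real

def generator_loss(scores_fake):
    """The generator tries to raise the critic's score on its samples."""
    return -sum(scores_fake) / len(scores_fake)

def gradient_penalty(grad_norms, lam=10.0):
    """lam * mean((||grad f(x_hat)|| - 1)^2) over interpolated points.

    grad_norms: gradient norms supplied by the caller (assumed input).
    """
    return lam * sum((g - 1.0) ** 2 for g in grad_norms) / len(grad_norms)

c_loss = critic_loss([2.0, 4.0], [0.0, 1.0])
g_loss = generator_loss([0.0, 1.0])
gp_ok = gradient_penalty([1.0, 1.0])     # norms already at 1: no penalty
gp_bad = gradient_penalty([2.0, 1.0])    # one norm too large: penalized
```

Note that there are no logarithms and no sigmoid: the critic's outputs are raw scores, which is why its loss does not saturate the way the vanilla discriminator's does.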

CycleGAN. Unpaired image-to-image translation (horses ↔ zebras, summer ↔ winter) using two GANs plus a cycle-consistency loss so translating there-and-back returns the original.
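
The cycle-consistency term can be sketched with toy translators. Here G maps domain X to Y and F maps Y back to X; both are stand-in functions on lists of floats, not trained networks, and the L1 distance is averaged per element. In the full method this term is weighted and added to the two adversarial losses.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(x, y, G, F):
    """Round-trip error in both directions: F(G(x)) vs x, and G(F(y)) vs y."""
    return l1(F(G(x)), x) + l1(G(F(y)), y)

# Toy translators: G doubles, F_exact halves (a true inverse), F_off does not.
def G(x):
    return [2.0 * v for v in x]

def F_exact(y):
    return [0.5 * v for v in y]

def F_off(y):
    return [0.5 * v + 1.0 for v in y]

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
loss_consistent = cycle_consistency_loss(x, y, G, F_exact)
loss_inconsistent = cycle_consistency_loss(x, y, G, F_off)
```

The loss is zero only when the two mappings undo each other, which is what stands in for the missing paired supervision.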

StyleGAN / StyleGAN2 / StyleGAN3. NVIDIA's line that produces photorealistic megapixel faces, introducing a style-based generator with per-layer modulation and the now-famous latent W space that enables smooth, disentangled editing.

Common bugs and failure modes

Mode collapse. The generator finds one or a few outputs that reliably fool D and stops covering the rest of the distribution — generating only one digit, or near-identical faces. Mitigations: minibatch discrimination, unrolled GANs, WGAN loss, and two-time-scale update rules.
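Forgetting to .detach() in the discriminator step. If the fake tensor still carries the generator's graph, the discriminator's backward pass also backpropagates through G. That wastes compute and leaves gradients on G's parameters, which corrupt the generator step unless they are zeroed first. Detach the fakes, and always zero G's gradients before the generator's backward pass.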
Using the saturating generator loss. Minimizing log(1 − D(G(z))) gives near-zero gradient while D is winning — the generator stalls at the start. Use the non-saturating form (maximize log D(G(z))).
Discriminator overpowering the generator. If D becomes perfect, its gradient to G vanishes and learning stops. Balance with label smoothing (label reals as 0.9), instance noise, or fewer D steps.
Watching the loss curves for convergence. GAN losses oscillate by design and don't trend to zero — a decreasing generator loss can even mean D is getting worse. Judge progress with FID and by eyeballing samples, not the loss.
Batch-norm statistics leaking real/fake. Mixing real and fake samples in one batch-norm pass lets the discriminator cheat off batch statistics. Keep real and fake batches separate, or use other normalizations in the critic.
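
The label-smoothing mitigation mentioned above can be illustrated with the scalar binary cross-entropy. This sketch assumes one-sided smoothing, meaning only the real target moves from 1.0 to 0.9; the probe values 0.9 and 0.999 are arbitrary.

```python
import math

def bce(p, target):
    """Binary cross-entropy for one prediction p in (0, 1) and one target."""
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# Hard label 1.0: the more confident D is on a real sample, the lower the loss.
hard_modest = bce(0.9, 1.0)
hard_confident = bce(0.999, 1.0)

# Smoothed label 0.9: the loss is lowest at D(x) = 0.9 and rises beyond it.
smooth_modest = bce(0.9, 0.9)
smooth_confident = bce(0.999, 0.9)
```

With the hard label, D is rewarded for pushing its output toward 1; with the smoothed label, overconfidence costs more than a calibrated 0.9, which keeps D's outputs away from the saturated region where G's gradient dies.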

Frequently asked questions

Why is a GAN called a minimax game?

The two networks optimize the same objective in opposite directions. The discriminator maximizes its accuracy at telling real from fake; the generator minimizes that same accuracy. At the theoretical optimum the value function reaches a saddle point — a Nash equilibrium where the discriminator outputs 0.5 everywhere because it can no longer distinguish the two distributions.

What is mode collapse in GANs?

Mode collapse is when the generator learns to produce only a few outputs that reliably fool the discriminator, ignoring the rest of the data distribution — for example generating one convincing digit instead of all ten. The generator wins locally but the samples lack diversity. Minibatch discrimination, unrolled GANs, and Wasserstein loss all reduce it.

What's the difference between a GAN and a VAE?

A VAE maximizes a likelihood lower bound and reconstructs data through an explicit encoder–decoder, giving stable training and blurry samples. A GAN has no explicit likelihood — it learns implicitly through the adversarial signal, producing sharper samples but with unstable, harder-to-monitor training and no built-in encoder.

Why does the non-saturating generator loss work better than the original?

The original generator loss log(1 − D(G(z))) saturates when the discriminator is confident early in training: its gradient goes to zero exactly when the generator needs it most. Goodfellow's fix is to instead maximize log D(G(z)), which has the same fixed point but a strong gradient when the discriminator is winning.

What does the Wasserstein GAN fix?

WGAN replaces the Jensen–Shannon divergence with the Earth-Mover (Wasserstein-1) distance, which stays smooth even when the real and fake distributions don't overlap. That greatly reduces vanishing gradients and mode collapse in practice, though it does not eliminate them, and the critic's loss becomes a meaningful signal that tends to track sample quality. WGAN-GP enforces the required 1-Lipschitz constraint with a gradient penalty instead of weight clipping.

How do you know when a GAN has finished training?

There is no loss curve that monotonically decreases to convergence — the losses oscillate by design. Practitioners track sample quality with the Fréchet Inception Distance (FID): generate ~50,000 samples, embed real and fake sets through Inception-v3, and compare the two Gaussians. Lower FID means closer distributions; on many tasks you stop when FID stops improving.