Autoencoder

Q: Why doesn't an autoencoder just learn the identity function?

It would if the bottleneck were as wide as the input — then the network could copy values through untouched. The whole trick is making the middle layer narrower than the input (an undercomplete autoencoder), so the network is forced to throw information away and keep only what best reconstructs the data. Regularized variants like sparse or denoising autoencoders achieve the same pressure without shrinking the layer.

Q: What is the difference between an autoencoder and PCA?

A linear autoencoder with one bottleneck layer of width k and squared-error loss learns the same subspace as PCA: the one spanned by the top-k principal components. Add nonlinear activations and depth and the autoencoder learns a curved manifold that PCA, being strictly linear, cannot. The cost is that autoencoder axes are not orthogonal or ordered by variance the way principal components are.

Q: Can you generate new data by sampling the latent space?

Not reliably with a plain autoencoder — its latent space has gaps and clusters, so a random point usually decodes to garbage. A variational autoencoder (VAE) fixes this by forcing the latent distribution toward a standard Gaussian with a KL-divergence term, so sampling from N(0, I) and decoding produces plausible new samples.

Q: How is a denoising autoencoder trained?

You corrupt each input — add Gaussian noise, zero out random pixels, mask tokens — feed the corrupted version in, but compute the loss against the clean original. The network can't memorize a copy because the input no longer matches the target, so it learns the underlying structure needed to undo the corruption. Masked language models like BERT are this idea applied to text.

Q: What loss function does an autoencoder use?

Mean squared error for real-valued inputs like images normalized to [0,1], and binary cross-entropy per pixel or feature when inputs are in [0,1] and treated as probabilities. A VAE adds a KL-divergence regularizer to the reconstruction loss. The choice matters: MSE blurs fine texture because it averages plausible reconstructions, which is why generative work often swaps in perceptual or adversarial losses.
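As a rough illustration of the two reconstruction losses named in this answer, here is a minimal pure-Python sketch. The function names and the sample values are illustrative assumptions, not taken from any library.

```python
import math

def mse(x, x_hat):
    """Mean squared error, averaged over features."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def bce(x, x_hat, eps=1e-7):
    """Binary cross-entropy per feature; x and x_hat are assumed to lie in [0, 1]."""
    total = 0.0
    for a, b in zip(x, x_hat):
        b = min(max(b, eps), 1 - eps)          # clamp to avoid log(0)
        total += -(a * math.log(b) + (1 - a) * math.log(1 - b))
    return total / len(x)

x     = [1.0, 0.0, 1.0, 0.0]    # hypothetical target pixels
x_hat = [0.9, 0.1, 0.6, 0.4]    # hypothetical reconstruction
print(mse(x, x_hat), bce(x, x_hat))
```

Both losses are zero only for a perfect reconstruction; BCE grows much faster than MSE as a prediction moves confidently toward the wrong end of [0, 1].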

Q: Are autoencoders still used now that we have transformers?

Yes, heavily, just not as standalone generators. The VAE is the compression front-end of latent diffusion models like Stable Diffusion: it shrinks a 512×512×3 image to a 64×64×4 latent, so the diffusion process works on about 48× fewer values. Autoencoders also power anomaly detection, learned image and audio codecs, and the masked-prediction pretraining behind BERT and Masked Autoencoders for vision.

Squeeze data through a bottleneck and learn what matters

An autoencoder is a neural network trained to copy its input to its output through a narrow bottleneck, forcing it to learn a compressed latent representation of the data without any labels.

Training signal: Self-supervised (input = target)
Core constraint: Bottleneck dim < input dim
Reconstruction loss: MSE or BCE
Linear AE equals: PCA subspace
Generative cousin: VAE (KL-regularized)

How an autoencoder works

An autoencoder is two networks bolted together at the waist. The encoder takes an input x — say a 784-pixel MNIST digit — and squeezes it down through progressively smaller layers into a tiny vector z, the latent code or bottleneck. The decoder takes z and expands it back out to a reconstruction x̂ the same shape as the input. The only thing you train it to do is make x̂ match x.

That sounds pointless — why train a network to output what you already gave it? The answer is the bottleneck. If z has 32 numbers and x has 784, the network physically cannot copy the input through. It has to discover which 32 numbers best summarize a handwritten digit so the decoder can redraw it. Those 32 numbers are a learned, lossy compression — and crucially, the network learns it with no labels at all. The target is the input itself, which is why autoencoders are called self-supervised.

Two design choices define an autoencoder:

The bottleneck size. Smaller forces harder compression and a more abstract code, but past a point reconstructions blur. This is the bias–variance dial of the whole model.
The capacity constraint. An undercomplete autoencoder constrains by making the bottleneck literally narrower than the input. A regularized autoencoder can keep the bottleneck wide but penalize the code — sparsity penalties, noise injection, or a KL term — so it still can't cheat.

The mechanism and the math

Write the encoder as a function f and the decoder as g, each a stack of affine maps and nonlinearities:

z  = f(x) = σ(W₂ · σ(W₁ x + b₁) + b₂)      // encoder → latent code z ∈ ℝᵏ
x̂  = g(z) = σ(W₄ · σ(W₃ z + b₃) + b₄)      // decoder → reconstruction x̂ ∈ ℝⁿ
L(x) = ‖ x − x̂ ‖²                            // reconstruction loss (MSE)

You minimize the loss over the dataset with ordinary backpropagation and gradient descent — the same machinery as any supervised net, except the "label" is the input. A forward pass costs O(W) where W is the total number of weights; for a dense autoencoder mapping n inputs through hidden width h to a k-dim code, that's O(n·h + h·k) multiply-adds per example for the encoder (the decoder mirrors it). Training is O(E · N · W) for E epochs over N examples.
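The cost formula above can be checked with a few lines of arithmetic. This is a sketch under simple assumptions: a dense n → h → k → h → n network, biases and activations ignored, and hypothetical function names.

```python
def dense_ae_madds(n, h, k):
    """Multiply-adds for one forward pass through n -> h -> k -> h -> n."""
    encoder = n * h + h * k
    decoder = k * h + h * n          # mirrored layer sizes
    return encoder, decoder, encoder + decoder

def training_forward_madds(n, h, k, n_examples, epochs):
    """Forward-pass multiply-adds over a whole run.

    The backward pass adds a constant factor on top, which is not counted here.
    """
    return epochs * n_examples * dense_ae_madds(n, h, k)[2]

print(dense_ae_madds(784, 128, 32))   # (104448, 104448, 208896)
```

For the 784 → 128 → 32 layout used later in the PyTorch example, almost all of the cost sits in the two 784×128 layers; the bottleneck itself is cheap.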

The cleanest theoretical result: a linear autoencoder is PCA. Strip out all nonlinearities (set σ to identity), use a single bottleneck of width k, and minimize squared error. The optimum spans exactly the same k-dimensional subspace as the top k principal components — Baldi and Hornik proved this in 1989. The catch is that the learned axes are an arbitrary rotation within that subspace, not the ordered orthonormal eigenvectors PCA gives you. Add nonlinearities back and the autoencoder can wrap the data around a curved manifold that no linear projection can capture — that's the whole reason to use one over PCA.
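A small numerical check of the linear-autoencoder claim, as a sketch only: the data is synthetic 2-D data invented for the example, the bottleneck has width 1, and the weights are untied with fixed hand-picked initial values. It compares the direction of the trained decoder with the first principal component computed in closed form from the 2×2 covariance matrix.

```python
import math
import random

rng = random.Random(0)
# Hypothetical 2-D data with one dominant direction, then centered.
pts = []
for _ in range(200):
    t = rng.gauss(0, 1)
    pts.append((t, 0.5 * t + rng.gauss(0, 0.1)))
mx = sum(p[0] for p in pts) / len(pts)
my = sum(p[1] for p in pts) / len(pts)
pts = [(a - mx, b - my) for a, b in pts]

# Closed-form first principal component of a 2x2 covariance matrix.
sxx = sum(a * a for a, _ in pts) / len(pts)
syy = sum(b * b for _, b in pts) / len(pts)
sxy = sum(a * b for a, b in pts) / len(pts)
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
pc1 = (math.cos(theta), math.sin(theta))

# Linear autoencoder 2 -> 1 -> 2: no biases, no activation, untied weights.
w = [0.3, 0.1]   # encoder row
v = [0.2, 0.4]   # decoder column
lr = 0.05
n = len(pts)
for _ in range(2000):
    gw, gv = [0.0, 0.0], [0.0, 0.0]
    for a, b in pts:
        z = w[0] * a + w[1] * b                  # 1-D code
        e = (v[0] * z - a, v[1] * z - b)         # reconstruction error
        ve = v[0] * e[0] + v[1] * e[1]
        gv[0] += 2 * e[0] * z
        gv[1] += 2 * e[1] * z
        gw[0] += 2 * ve * a
        gw[1] += 2 * ve * b
    w = [w[i] - lr * gw[i] / n for i in range(2)]
    v = [v[i] - lr * gv[i] / n for i in range(2)]

# |cosine| between the decoder direction and PC1 (sign is arbitrary).
cosine = abs(v[0] * pc1[0] + v[1] * pc1[1]) / math.hypot(v[0], v[1])
print(round(cosine, 4))
```

The decoder column is the direction the reconstructions lie along, so a cosine close to 1 means the autoencoder projects onto the same line as PCA. Its length and sign are not pinned down, which is the 1-D version of the "arbitrary rotation" caveat.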

When to use an autoencoder

Anomaly detection. Train only on normal data. At inference, a high reconstruction error flags an input the model has never learned to compress: fraud, a defective part, a machine about to fail. The threshold is usually a high percentile of the reconstruction error on normal data, ideally measured on a held-out split rather than the training set.
Dimensionality reduction when the structure is nonlinear and PCA leaves too much variance unexplained. The latent code feeds a downstream classifier or a 2-D t-SNE plot.
Denoising and restoration. A denoising autoencoder cleans corrupted images, audio, or sensor streams.
Learned compression / codecs. The latent code is the compressed file; ship z and the decoder reconstructs.
Generative pretraining. The VAE is the workhorse front-end of latent diffusion, and masked autoencoding pretrains BERT and vision transformers.

Reach for something else when: you have labels and a specific target (train a supervised net directly — the autoencoder code is unsupervised and won't optimize for your task); you need crisp generated samples (a plain AE's latent space is gappy — use a VAE or diffusion model); or your data is genuinely low-dimensional and linear (PCA is faster, deterministic, and needs no GPU).
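The anomaly-detection recipe from the list above can be sketched as follows. The error values are made up for illustration, the helper names are hypothetical, and the 90th percentile is an arbitrary choice; in practice the errors would come from running a trained autoencoder on normal data.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def flag_anomalies(errors, threshold):
    """True where the reconstruction error exceeds the threshold."""
    return [e > threshold for e in errors]

# Hypothetical reconstruction errors measured on normal data.
normal_errors = [0.010, 0.012, 0.009, 0.015, 0.011,
                 0.013, 0.008, 0.014, 0.010, 0.020]
threshold = percentile(normal_errors, 90)

new_errors = [0.011, 0.180, 0.016]      # hypothetical errors at inference
print(threshold, flag_anomalies(new_errors, threshold))   # 0.015 [False, True, True]
```

The percentile sets the false-positive rate on normal data directly: a 90th-percentile threshold flags roughly one in ten normal inputs, so production systems usually pick something much higher.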

Autoencoder vs other representation learners

	Plain AE	PCA	VAE	t-SNE / UMAP	Diffusion model
Mapping	Nonlinear, learned	Linear, closed-form	Nonlinear, probabilistic	Nonlinear, non-parametric	Nonlinear, iterative
Has a decoder (can reconstruct)	Yes	Yes	Yes	No	Yes
Can generate new samples	Poorly (gappy latent)	No	Yes (sample N(0,I))	No	Yes (state of the art)
Latent space structure	Arbitrary, clustered	Orthogonal, variance-ordered	Smooth, ~Gaussian	Preserves local neighborhoods	Pixel/latent space
Training cost	Moderate (GPU)	Cheap (one SVD)	Moderate (GPU)	Expensive per-dataset, no reuse	Very high
Inference cost	One forward pass	One matrix multiply	One forward pass	t-SNE must re-fit for new points; UMAP can transform them	Many denoising steps
Typical use	Anomaly, denoise, pretrain	Quick reduction, whitening	Generation, latent diffusion AE	Mostly 2-D visualization	Image/audio generation

The headline split: PCA is the fast linear baseline you should always try first; the plain autoencoder buys nonlinearity at the cost of a training loop and a messier latent space; the VAE buys a usable latent space for generation at the cost of a KL term and slightly blurrier reconstructions.

What the numbers actually say

Compression ratio. A 784-pixel MNIST digit through a 32-unit bottleneck is a 24.5× reduction in dimensionality. Reconstructions stay legible because handwritten digits live on a far-lower-dimensional manifold than 784 free pixels would suggest.
Why latent diffusion uses one. Stable Diffusion's VAE maps a 512×512×3 image (786,432 numbers) to a 64×64×4 latent (16,384 numbers) — a 48× reduction in element count (8× per spatial side, with channels growing 3→4). Running the expensive diffusion process in that latent space rather than on raw pixels is the single change that made high-resolution text-to-image practical on consumer GPUs.
Linear AE recovers PCA's variance. On a dataset where the top 32 principal components explain, say, 90% of variance, a 32-unit linear autoencoder converges to the same 90% — no more, no less. Going nonlinear is what pushes past that ceiling.
Anomaly scoring is cheap. Detection is a single forward pass plus a squared-error sum, typically microseconds per sample on a GPU when batched, versus retraining a supervised classifier every time the definition of "normal" drifts.
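The two compression ratios quoted above are plain element counts, which a short check makes explicit. The helper name is an assumption made for this example.

```python
import math

def reduction(in_shape, latent_shape):
    """Ratio of input element count to latent element count."""
    return math.prod(in_shape) / math.prod(latent_shape)

print(reduction((784,), (32,)))                  # 24.5  (MNIST through a 32-unit code)
print(reduction((512, 512, 3), (64, 64, 4)))     # 48.0  (786,432 -> 16,384 values)
```

These ratios count numbers, not bits: the latent values are floats, so the figure describes how much smaller the working representation is rather than a file-size saving.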

JavaScript implementation

A minimal dense autoencoder trained by hand — no framework — to show the full forward/backward loop. One hidden bottleneck, sigmoid activations, MSE loss, plain SGD.

// Undercomplete autoencoder: nIn -> nLatent -> nIn
const sigmoid  = z => 1 / (1 + Math.exp(-z));
const dsigmoid = a => a * (1 - a);          // a is already sigmoid(z)
const randMat  = (r, c) => Array.from({length: r}, () =>
  Array.from({length: c}, () => (Math.random() - 0.5) * 0.4));

function makeAE(nIn, nLatent) {
  return {
    W1: randMat(nLatent, nIn), b1: new Array(nLatent).fill(0),  // encoder
    W2: randMat(nIn, nLatent), b2: new Array(nIn).fill(0),      // decoder
  };
}

function forward(ae, x) {
  const z = ae.b1.map((b, j) =>
    sigmoid(b + ae.W1[j].reduce((s, w, i) => s + w * x[i], 0)));   // latent code
  const xhat = ae.b2.map((b, k) =>
    sigmoid(b + ae.W2[k].reduce((s, w, j) => s + w * z[j], 0)));   // reconstruction
  return { z, xhat };
}

function trainStep(ae, x, lr = 0.1) {
  const { z, xhat } = forward(ae, x);
  // dL/dpre for output layer (MSE through sigmoid):
  const dOut = xhat.map((o, k) => (o - x[k]) * dsigmoid(o));
  // backprop into latent layer:
  const dLat = z.map((zj, j) =>
    dsigmoid(zj) * dOut.reduce((s, d, k) => s + d * ae.W2[k][j], 0));
  // update decoder then encoder:
  for (let k = 0; k < xhat.length; k++) {
    for (let j = 0; j < z.length; j++) ae.W2[k][j] -= lr * dOut[k] * z[j];
    ae.b2[k] -= lr * dOut[k];
  }
  for (let j = 0; j < z.length; j++) {
    for (let i = 0; i < x.length; i++) ae.W1[j][i] -= lr * dLat[j] * x[i];
    ae.b1[j] -= lr * dLat[j];
  }
  return xhat.reduce((s, o, k) => s + (o - x[k]) ** 2, 0);  // loss
}

// Squeeze an 8-dim input through a 3-dim bottleneck:
const ae = makeAE(8, 3);
const data = [[1,0,0,1,1,0,0,1], [0,1,1,0,0,1,1,0], [1,1,0,0,1,1,0,0]];
for (let epoch = 0; epoch < 5000; epoch++)
  for (const x of data) trainStep(ae, x);
console.log(forward(ae, data[0]).z);   // the learned 3-number code

The encoder weights W1 and decoder weights W2 are independent here. Tying them — forcing W2 = W1ᵀ — halves the parameters and acts as a regularizer; it was standard in the 2006–2012 stacked-autoencoder era.
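To make the tied-weight idea concrete, here is a minimal sketch in Python rather than JavaScript. It assumes the same single-layer sigmoid architecture as above, and the function names are hypothetical. The decoder reads the encoder matrix transposed instead of owning a second matrix.

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def tied_forward(W, b_enc, b_dec, x):
    """W has shape (n_latent, n_in); the decoder reuses its transpose."""
    z = [sigmoid(b + sum(w * xi for w, xi in zip(row, x)))
         for row, b in zip(W, b_enc)]
    x_hat = [sigmoid(b_dec[i] + sum(W[j][i] * z[j] for j in range(len(z))))
             for i in range(len(x))]
    return z, x_hat

def param_counts(n_in, n_latent):
    """(untied, tied) parameter totals, biases included."""
    biases = n_in + n_latent
    return 2 * n_in * n_latent + biases, n_in * n_latent + biases

print(param_counts(8, 3))   # (59, 35)
```

The weight count halves exactly; the total is a little more than half because the two bias vectors are not shared.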

Python implementation

The same idea in PyTorch, where autograd handles the backward pass and you'd actually train one in practice.

import torch, torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_in=784, n_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_in, 128), nn.ReLU(),
            nn.Linear(128, n_latent),            # the bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 128), nn.ReLU(),
            nn.Linear(128, n_in), nn.Sigmoid(),  # outputs in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = Autoencoder()
opt   = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

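# `loader` is assumed to be a DataLoader yielding (images, labels) batches
# with pixel values already scaled to [0, 1], e.g. MNIST with ToTensor().
for x, _ in loader:                       # labels (_) are ignored: self-supervised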
    x = x.view(x.size(0), -1)             # flatten 28x28 -> 784
    x_hat, z = model(x)
    loss = loss_fn(x_hat, x)              # target IS the input
    opt.zero_grad(); loss.backward(); opt.step()

# --- Variational autoencoder: the generative upgrade ---
class VAE(nn.Module):
    def __init__(self, n_in=784, n_latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, 128), nn.ReLU())
        self.mu      = nn.Linear(128, n_latent)
        self.logvar  = nn.Linear(128, n_latent)
        self.dec = nn.Sequential(
            nn.Linear(n_latent, 128), nn.ReLU(),
            nn.Linear(128, n_in), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)   # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    recon = nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')
    kl    = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl       # reconstruction + KL pulls latent toward N(0, I)

The VAE's two changes are the whole story: the encoder outputs a distribution (a mean and a variance) rather than a point, and the KL term forces that distribution toward a standard Gaussian. The reparameterization trick — sampling z = μ + σ·ε with ε ~ N(0,I) — keeps the randomness outside the gradient path so backprop still works.
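The KL term and the reparameterization step can also be written without any framework, which makes it easy to see what each one does. This is a pure-Python sketch with hypothetical function names; the formula is the same closed form used in vae_loss above.

```python
import math
import random

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for one latent vector."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps, with eps drawn from N(0, 1) per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0, 1) for m, lv in zip(mu, logvar)]

print(kl_to_standard_normal([1.0], [0.0]))    # 0.5: the mean is one unit off the prior
z = reparameterize([0.0, 0.0], [0.0, 0.0], random.Random(0))   # one draw from N(0, I)
```

The penalty is zero only when the mean is 0 and the log-variance is 0, that is, when the encoder's output already equals the prior. Any shift in the mean or change in the variance costs something.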

Variants worth knowing

Denoising autoencoder (DAE). Corrupt the input, reconstruct the clean original. Can't memorize a copy because input ≠ target, so it learns robust structure. Vincent et al. introduced it in 2008; masked language models like BERT are the same idea on text.
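A sketch of how a denoising training pair might be built, assuming inputs are plain lists of floats. The two corruption helpers are hypothetical names, and the drop probability and noise level are arbitrary.

```python
import random

def mask_corrupt(x, drop_prob, rng):
    """Zero each feature independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]

def gaussian_corrupt(x, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [xi + rng.gauss(0, sigma) for xi in x]

rng = random.Random(0)
clean = [0.2, 0.9, 0.5, 0.7, 0.1, 0.8]
noisy = mask_corrupt(clean, 0.5, rng)

# The network is fed `noisy`; the loss compares its output with `clean`.
training_pair = (noisy, clean)
```

The corruption is usually resampled every time an example is drawn, so the network never sees the same noisy version twice.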

Sparse autoencoder. Keep the bottleneck wide but add an L1 or KL sparsity penalty so only a few latent units fire per input. Recently central to mechanistic interpretability — sparse autoencoders are used to pull human-readable features out of large language models' activations.
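In its L1 form, the sparsity penalty is one extra term added to the reconstruction loss. A minimal sketch, where the function name and the weight value are illustrative assumptions:

```python
def sparse_ae_loss(x, x_hat, z, l1_weight=1e-3):
    """Squared reconstruction error plus an L1 penalty on the latent code z."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    penalty = l1_weight * sum(abs(v) for v in z)
    return recon + penalty

x = [1.0, 0.0]
dense_code  = [0.5, -0.5, 0.4, 0.3]    # many active units
sparse_code = [0.9,  0.0, 0.0, 0.0]    # one active unit
print(sparse_ae_loss(x, x, dense_code, 0.1), sparse_ae_loss(x, x, sparse_code, 0.1))
```

With identical reconstructions, the code that uses fewer and smaller activations gets the lower loss, which is the pressure that pushes most units toward zero.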

Variational autoencoder (VAE). Kingma and Welling, 2013. Makes the latent space a smooth probability distribution so you can sample it to generate new data. The compression front-end of latent diffusion.

Contractive autoencoder (CAE). Penalizes the Frobenius norm of the encoder's Jacobian, so the code barely changes when the input is perturbed — explicitly trading reconstruction for robustness.
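For a single sigmoid encoder layer the Jacobian penalty has a closed form, sketched below. The single-layer assumption matters: deeper encoders need the full Jacobian, typically via automatic differentiation. The function names are hypothetical.

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def contractive_penalty(W, b, x):
    """Squared Frobenius norm of dz/dx for z = sigmoid(W x + b).

    Row j of the Jacobian is z_j * (1 - z_j) * W[j], so the squared norm is
    the sum over j of (z_j * (1 - z_j))**2 times the squared norm of W[j].
    """
    total = 0.0
    for row, bias in zip(W, b):
        z = sigmoid(bias + sum(w * xi for w, xi in zip(row, x)))
        total += (z * (1 - z)) ** 2 * sum(w * w for w in row)
    return total

print(contractive_penalty([[2.0, 0.0]], [0.0], [0.0, 0.0]))   # 0.25
```

The penalty is added to the reconstruction loss with a weighting coefficient. It shrinks either by making weights small or by pushing units into the flat, saturated part of the sigmoid.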

Masked autoencoder (MAE). He et al., 2021. Mask 75% of image patches and reconstruct them. A scalable self-supervised pretraining recipe for vision transformers.
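The masking step can be sketched independently of any model. The example assumes 196 patches, which is what a 224×224 image cut into 16×16 patches yields; the patch count and the helper name are assumptions for illustration.

```python
import random

def random_patch_mask(n_patches, mask_ratio, rng):
    """Split patch indices into (visible, masked) by uniform random choice."""
    n_masked = int(n_patches * mask_ratio)
    idx = list(range(n_patches))
    rng.shuffle(idx)
    return sorted(idx[n_masked:]), sorted(idx[:n_masked])

visible, masked = random_patch_mask(196, 0.75, random.Random(0))
print(len(visible), len(masked))   # 49 147
```

Only the visible quarter of the patches goes through the encoder, which is a large part of why the recipe scales well; the decoder then predicts the pixels of the masked patches.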

Common bugs and edge cases

Bottleneck too wide → identity function. If the code is as large as the input, the network learns to copy and the loss goes to zero while learning nothing useful. Shrink the bottleneck or add a regularizer.
Activation–loss mismatch. A sigmoid output layer can only produce values in (0,1), so the targets must be scaled to [0,1]; pair it with BCE or MSE. With unnormalized pixel values (0–255) the outputs saturate near 1, the error can never reach zero, and training stalls.
Expecting to sample a plain AE. Decoding a random latent vector from a non-variational autoencoder almost always produces garbage — the latent space has holes between the clusters the encoder mapped data into. Use a VAE to sample.
Blurry reconstructions blamed on the model. MSE averages over all plausible reconstructions, which mathematically produces blur. It's the loss, not a bug — switch to perceptual or adversarial loss if sharpness matters.
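VAE posterior collapse. When the decoder is powerful enough to model the data without the code, or the KL weight is too high, the KL term wins: the encoder's posterior q(z|x) collapses to the prior N(0,I) for every input, the code carries no information about x, and a simple decoder falls back to something like an average digit. Mitigate with KL annealing or a weaker decoder.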
Anomaly detection that flags nothing. If you train on data that already contains anomalies, the model learns to reconstruct them too, so their reconstruction error stays low and they slip under the threshold. Train on a clean, normal-only set.
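The blur item above can be shown with a two-pixel toy case. The data is invented and the helper names are hypothetical: when two sharp outputs are equally plausible for the same input, the single prediction that minimizes MSE is their average.

```python
def best_single_prediction(targets):
    """Feature-wise mean: the one output that minimizes summed squared error."""
    n = len(targets)
    return [sum(t[i] for t in targets) / n for i in range(len(targets[0]))]

def total_sq_error(pred, targets):
    return sum(sum((p - ti) ** 2 for p, ti in zip(pred, t)) for t in targets)

plausible = [[1.0, 0.0], [0.0, 1.0]]           # two equally plausible sharp outputs
blurry = best_single_prediction(plausible)     # [0.5, 0.5]
print(total_sq_error(blurry, plausible))       # 1.0
print(total_sq_error(plausible[0], plausible)) # 2.0
```

Committing to either sharp answer costs twice as much as the grey average, so a model trained with MSE is rewarded for blur. Perceptual and adversarial losses change that incentive.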

Frequently asked questions

Why doesn't an autoencoder just learn the identity function?