Dropout

Randomly kill half the neurons each step — and the network stops memorizing

Dropout is a regularization technique that randomly switches off a fraction of neurons on each training step, forcing the network to spread its representation across redundant units and stop co-adapting — which sharply reduces overfitting at almost no extra cost.

Introduced: Srivastava et al., 2014
Typical rate p: 0.5 (dense), 0.1–0.2 (conv)
Training cost: ≈ one extra mask multiply
Test-time cost: zero (dropout off)
Implicit ensemble: 2ⁿ shared-weight sub-nets

The intuition: a network that can't rely on any one neuron

Train a big neural network on a small dataset and it cheats. Instead of learning the general shape of the problem, it memorizes the training examples — building delicate chains of neurons that fire together for one specific image and nothing else. This is co-adaptation: a unit learns to fix the mistakes of another specific unit, and the pair becomes useless outside the exact training context. The training loss keeps dropping; the validation loss starts climbing. That gap is overfitting.

Dropout breaks the chains. On every training step, each neuron is independently silenced with probability p — its output forced to zero for that forward and backward pass. A neuron can never count on a particular teammate being present, because that teammate might be gone next step. So the network is pushed to make each unit individually useful and to spread the same information across many redundant paths. Geoffrey Hinton, who co-invented the method, described the inspiration as a kind of evolutionary argument: sexual reproduction breaks up co-adapted genes, and dropout breaks up co-adapted features.

The payoff is that no single neuron, and no single fragile pathway, can dominate. When you then evaluate the full network with every neuron switched back on, you get a model that has effectively been trained to work even when half its parts are missing — which is exactly the robustness that generalizes to unseen data.

The precise mechanism

Take a layer with activation vector h. Dropout draws a binary mask m the same shape as h, where each entry is an independent Bernoulli draw: mᵢ = 1 with probability 1 − p (kept) and 0 with probability p (dropped). The layer's output during training becomes h ⊙ m, an element-wise product.

That zeroing changes the expected magnitude of the signal. If a fraction p of units are zeroed, the expected sum flowing into the next layer shrinks by a factor of (1 − p). At test time nothing is dropped, so the next layer would suddenly see signals that are 1/(1−p) times larger than it was trained on. Two fixes exist:

Classic dropout: keep the mask as {0, 1} at train time, then at test time multiply every weight (or activation) by (1 − p) to scale the magnitude down.
Inverted dropout (the modern default): divide the surviving activations by (1 − p) during training, so the expected value is preserved on the spot. Test time then needs no change at all — you just remove the dropout op. Every major framework uses this.
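The two fixes can be compared numerically. The sketch below is a minimal illustration, not a framework implementation: the activation values, the trial count, and the variable names are all assumptions chosen for the demo. It averages many masked copies of one activation vector and checks that classic dropout's train-time mean matches its scaled-down test-time output, while inverted dropout's train-time mean already matches the unscaled activations.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                               # drop probability
h = np.array([1.0, 2.0, 3.0, 4.0])    # toy activations (illustrative values)
n_trials = 200_000

# One Bernoulli mask per trial: True = kept (prob 1 - p), False = dropped (prob p).
masks = rng.random((n_trials, h.size)) < (1 - p)

# Average train-time output under each scheme.
classic_train = (h * masks).mean(axis=0)              # close to (1 - p) * h
inverted_train = (h * masks / (1 - p)).mean(axis=0)   # close to h

# What each scheme feeds the next layer at test time.
classic_test = h * (1 - p)    # classic: scale down at test time
inverted_test = h             # inverted: test time is the identity

print("classic :", classic_train, "vs", classic_test)
print("inverted:", inverted_train, "vs", inverted_test)
```

In both schemes the train-time average agrees with the test-time value; the only difference is where the (1 − p) correction is applied.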

Computationally, dropout is nearly free: it is one mask sample plus one element-wise multiply per layer, both O(d) in the layer width d. There are no extra parameters to learn and the asymptotic cost of training is unchanged. The backward pass simply routes gradients through the surviving units — the dropped units receive zero gradient, because they contributed zero to the output.

Why it works: an exponential ensemble for free

Here is the deeper reason dropout regularizes. A network with n droppable units defines 2ⁿ possible "thinned" sub-networks — one for every subset of units you could keep. These sub-networks all share the same underlying weights. Each training step samples one of them at random and takes a gradient step on it. Over many steps you are training an astronomically large family of networks, all tied together through weight sharing.

At test time you cannot average 2ⁿ models explicitly. But using the full network with weights scaled by (1 − p) turns out to be a remarkably good approximation to the normalized geometric mean of all those sub-networks' predictions: it is exact when the dropped units feed a softmax output layer directly, and close enough in practice for deep nets. So dropout buys you ensemble-like variance reduction for the price of training one model. A real ensemble of, say, 10 independently-trained networks would cost 10× the compute and 10× the memory; dropout costs essentially nothing extra.
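For a layer small enough to enumerate, the ensemble claim can be checked directly. The sketch below assumes a toy setup that is not from the original text: four droppable units feeding a three-class softmax layer, with random weights `W`, bias `b`, and input `x`. It loops over all 2⁴ = 16 masks, forms the geometric mean of the sub-network predictions weighted by each mask's probability, renormalizes, and compares the result with a single pass through the weight-scaled full network.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_in, n_classes, p = 4, 3, 0.5            # toy sizes (assumptions)
W = rng.normal(size=(n_classes, n_in))    # softmax-layer weights
b = rng.normal(size=n_classes)            # softmax-layer bias
x = rng.normal(size=n_in)                 # activations of the droppable units

def softmax(z):
    z = z - z.max()                       # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Accumulate the probability-weighted log predictions of every sub-network.
log_geo = np.zeros(n_classes)
n_subnets = 0
for bits in itertools.product([0, 1], repeat=n_in):
    m = np.array(bits, dtype=float)                   # 1 = kept, 0 = dropped
    prob_m = np.prod(np.where(m == 1, 1 - p, p))      # probability of this mask
    log_geo += prob_m * np.log(softmax(W @ (m * x) + b))
    n_subnets += 1

geo_mean = np.exp(log_geo)
geo_mean /= geo_mean.sum()                # renormalize to a distribution

# Classic test-time rule: full network, weights scaled by (1 - p).
scaled = softmax((1 - p) * (W @ x) + b)

print(n_subnets, np.allclose(geo_mean, scaled))
```

The agreement is exact here because the log of a softmax is linear in the logits up to a term shared by all classes, which the renormalization removes. With nonlinear hidden layers in between, the match is only approximate.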

When to use dropout — and when not to

Large fully-connected layers on limited data — dropout's original sweet spot. Dense layers have the most parameters per unit and overfit the fastest.
The classifier head of a CNN — the final dense layers, where most modern image models still place a single dropout.
RNN/LSTM hidden-to-hidden connections — but only with a fixed per-sequence mask (variational dropout), not a fresh mask at every timestep, which would destroy the recurrent signal.
Uncertainty estimation — keep dropout on at test time and run many forward passes (Monte Carlo dropout) to get a distribution over predictions.

Skip or minimize dropout when: your model is underfitting (dropout only makes that worse); you already use batch normalization heavily in convolutional blocks; or you train a large Transformer on a massive corpus, where weight decay, augmentation, and normalization usually do the regularizing and only light dropout (≈0.1) on attention and feed-forward sublayers helps.

Dropout vs other regularizers

	Dropout	L2 / weight decay	Batch norm	Data augmentation	Early stopping	Label smoothing
What it perturbs	Activations (zeros units)	Weights (shrinks them)	Activations (normalizes)	Inputs	Training duration	Targets
Adds parameters?	No	No	Yes (γ, β)	No	No	No
Train-time cost	One mask multiply	One penalty term	Mean/var per batch	Per-sample transform	None	None
Test-time cost	Zero (off)	Zero	Running stats applied	Zero	Zero	Zero
Primary effect	Anti co-adaptation, implicit ensemble	Smaller-norm solutions	Stable gradients, mild regularization	More effective data	Stop before overfit	Less overconfident logits
Modern usage	Dense heads, Transformers (low p)	Nearly universal	Nearly universal in CNNs	Universal in vision/audio	Common	Common in classification

These are complementary, not competing. A typical modern recipe stacks weight decay + augmentation + batch/layer norm, and adds dropout selectively where dense layers still overfit. The one combination to be careful with is dropout directly before batch norm — see the variance-shift warning below.

What the numbers actually say

The 2014 paper's headline result: on the MNIST handwritten-digit benchmark, a standard feed-forward net's test error dropped from about 1.6% to 1.35% with dropout — and on harder tasks the relative improvement was larger. On TIMIT speech and CIFAR-10/100 image classification, dropout set new state-of-the-art error rates at the time.
AlexNet, 2012: the network that ignited the deep-learning era used dropout (p = 0.5) on the first two of its three fully-connected layers and reported that "without dropout, our network exhibits substantial overfitting." It won ImageNet 2012 with a top-5 error of 15.3% against 26.2% for the runner-up, a margin of roughly 10 points.
The exponential math is real: a single dense layer of 4,096 units (AlexNet's FC size) defines 2⁴⁰⁹⁶ sub-networks — more than the number of atoms in the observable universe (≈2²⁶⁵). You sample a vanishingly tiny fraction of them, yet weight sharing makes that enough.
Cost in wall-clock terms: dropout adds roughly the cost of one element-wise multiply per layer — typically under 1% of a layer's matmul time on a GPU — and zero parameters and zero memory at inference.
Convergence trade-off: dropout usually needs 2–3× more training epochs to converge because each step optimizes a different noisy sub-network. You pay in training time, not in model size.
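The sub-network count quoted above is easy to check with exact integer arithmetic. This is a small sketch; the figure of 10⁸⁰ atoms is a common order-of-magnitude estimate and is treated here as an assumption.

```python
import math

n_units = 4096
n_subnets = 2 ** n_units      # exact: Python integers have arbitrary precision
atoms = 10 ** 80              # order-of-magnitude estimate (assumption)

print(len(str(n_subnets)))    # number of decimal digits in 2**4096
print(math.log2(atoms))       # the atom count expressed as a power of two
print(n_subnets > atoms)      # is the sub-network count larger?
```

The atom estimate lands between 2²⁶⁵ and 2²⁶⁶, so the number of sub-networks exceeds it by thousands of binary orders of magnitude.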

JavaScript implementation

// Inverted dropout: scale during training so test time needs no change.
// Returns the masked+scaled activations and the mask (needed for backprop).
function dropoutForward(h, p, training) {
  if (!training || p === 0) {
    return { out: h.slice(), mask: null }; // identity at inference
  }
  const keep = 1 - p;
  const mask = new Float64Array(h.length);
  const out  = new Float64Array(h.length);
  for (let i = 0; i < h.length; i++) {
    mask[i] = Math.random() < keep ? 1 / keep : 0; // 1/keep folds in the rescale
    out[i]  = h[i] * mask[i];
  }
  return { out, mask };
}

// Backward: gradient flows only through the kept units, with the same scale.
function dropoutBackward(gradOut, mask) {
  if (!mask) return gradOut.slice();          // inference: pass-through
  const gradIn = new Float64Array(gradOut.length);
  for (let i = 0; i < gradOut.length; i++) {
    gradIn[i] = gradOut[i] * mask[i];          // same mask, same 1/keep factor
  }
  return gradIn;
}

// Example
const h = new Float64Array([1, 2, 3, 4, 5, 6]);
const { out, mask } = dropoutForward(h, 0.5, true);
console.log([...out]);   // e.g. [2, 0, 6, 0, 0, 12] — survivors doubled (1/0.5)

Two details to notice. First, the 1/keep factor is folded directly into the mask, so the forward pass is a single multiply and the rescaling is automatic — this is what "inverted" dropout means. Second, the same mask must be reused in the backward pass; the dropped units contributed nothing to the output, so they must receive zero gradient.

Python implementation

import numpy as np

class Dropout:
    """Inverted dropout layer (NumPy), matching PyTorch/Keras semantics."""
    def __init__(self, p=0.5):
        assert 0 <= p < 1, "drop probability must be in [0, 1)"
        self.p = p
        self.mask = None

    def forward(self, x, training=True):
        if not training or self.p == 0:
            self.mask = None                       # clear any stale mask from training
            return x                               # identity at inference
        keep = 1.0 - self.p
        # 1/keep scaling baked into the mask (inverted dropout)
        self.mask = (np.random.rand(*x.shape) < keep) / keep
        return x * self.mask

    def backward(self, grad_out):
        if self.mask is None:                      # eval mode or p == 0: pass-through
            return grad_out
        return grad_out * self.mask                # route grads through survivors


# The PyTorch one-liner does exactly the above:
#
#   import torch.nn as nn
#   layer = nn.Dropout(p=0.5)
#   y = layer(x)            # scales by 1/(1-p) when layer.training is True
#   layer.eval()           # <-- sets training=False; dropout becomes identity
#                          #     (model.eval() does the same for every submodule)
#
# Forgetting model.eval() at inference is the #1 dropout bug.

d = Dropout(p=0.5)
x = np.arange(1, 7, dtype=float)
print(d.forward(x, training=True))   # survivors scaled by 2; rest zeroed
print(d.forward(x, training=False))  # unchanged: [1. 2. 3. 4. 5. 6.]

The model.eval() / model.train() toggle is what flips the training flag for every dropout (and batch-norm) layer at once in PyTorch; Keras handles it automatically through the training argument it threads through fit versus predict.

Variants worth knowing

DropConnect (Wan et al., 2013). Instead of zeroing whole neuron outputs, zero individual weights. It is a strict generalization — dropout is the special case where you drop all the weights leaving a unit together. More flexible, but messier to implement and rarely faster in practice.

Spatial dropout (DropBlock and SpatialDropout2D). In a CNN, dropping individual pixels in a feature map barely hurts because neighbors are highly correlated. Spatial dropout instead drops entire feature-map channels, and DropBlock drops contiguous square regions, so the regularization actually bites.
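The channel-wise variant comes down to the shape of the mask. The following NumPy sketch assumes an (N, C, H, W) tensor layout and a hypothetical helper named `spatial_dropout`; it is not the Keras or PyTorch implementation. It draws one keep/drop decision per sample and channel and broadcasts it over the spatial dimensions.

```python
import numpy as np

def spatial_dropout(x, p, rng):
    """Drop whole channels of an (N, C, H, W) array, with inverted scaling."""
    keep = 1.0 - p
    n, c = x.shape[:2]
    # One Bernoulli draw per (sample, channel); broadcasting covers H and W.
    mask = (rng.random((n, c, 1, 1)) < keep) / keep
    return x * mask

rng = np.random.default_rng(0)
x = np.ones((2, 8, 4, 4))                 # toy feature maps (illustrative)
y = spatial_dropout(x, p=0.25, rng=rng)

# Flatten each channel: every channel is either all zeros or all 1 / keep.
per_channel = y.reshape(2, 8, -1)
print(per_channel.min(axis=2))
print(per_channel.max(axis=2))
```

Because the mask has singleton spatial dimensions, a dropped channel is removed everywhere at once, so correlated neighboring pixels cannot fill in for it.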

Variational / recurrent dropout. For RNNs and LSTMs, sample one mask per sequence and reuse it at every timestep, rather than a fresh mask each step. This regularizes the recurrent weights without erasing the memory signal — the basis of the dropout used in many production sequence models.
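The per-sequence mask can be sketched with a plain tanh RNN. Everything in this example is an illustrative assumption: the sizes, the random weights `W_xh` and `W_hh`, and the choice to apply the mask to the hidden state before the recurrent multiply. The point is only where the mask is sampled, which is once, outside the time loop.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_inputs, hidden, steps, p = 2, 3, 5, 4, 0.3   # toy sizes (assumptions)
keep = 1.0 - p

W_xh = rng.normal(scale=0.5, size=(n_inputs, hidden))  # input-to-hidden weights
W_hh = rng.normal(scale=0.5, size=(hidden, hidden))    # hidden-to-hidden weights
xs = rng.normal(size=(steps, batch, n_inputs))         # one toy input sequence per row

# Sample ONE inverted-dropout mask per sequence, before the time loop.
mask = (rng.random((batch, hidden)) < keep) / keep

h = np.zeros((batch, hidden))
for t in range(steps):
    # The same mask gates the recurrent connection at every timestep.
    h = np.tanh(xs[t] @ W_xh + (h * mask) @ W_hh)

print(mask)
print(h.shape)
```

Moving the mask sampling inside the loop would give the per-timestep variant, which the text above warns against for recurrent connections.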

Stochastic depth. Used in very deep residual networks: randomly drop entire layers (replacing them with the identity) during training. It is dropout applied at the granularity of whole residual blocks, and it lets ResNets exceed 1,000 layers.

Monte Carlo dropout (Gal & Ghahramani, 2016). Keep dropout on at test time and run the same input through the network many times. The spread of the outputs is a principled estimate of model uncertainty, connecting dropout to approximate Bayesian inference.
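A minimal sketch of the Monte Carlo procedure, assuming a tiny two-layer network with random (untrained) weights and a hypothetical `forward` function: dropout stays active, the same input is passed through `T` times, and the mean and standard deviation of the outputs are reported. With untrained weights the numbers are meaningless; the example only shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 16))     # toy first-layer weights (assumption)
b1 = np.zeros(16)
W2 = rng.normal(size=(16, 1))     # toy output weights (assumption)
b2 = np.zeros(1)

def forward(x, p, rng, dropout_on):
    """Two-layer ReLU network with optional inverted dropout on the hidden layer."""
    h = np.maximum(0.0, x @ W1 + b1)
    if dropout_on:
        keep = 1.0 - p
        h = h * (rng.random(h.shape) < keep) / keep
    return h @ W2 + b2

x = rng.normal(size=(1, 4))       # a single input example
T = 100                           # number of stochastic forward passes

# Keep dropout ON and collect T predictions for the same input.
samples = np.stack([forward(x, 0.5, rng, dropout_on=True) for _ in range(T)])
mc_mean = samples.mean(axis=0)    # predictive mean
mc_std = samples.std(axis=0)      # spread, used as the uncertainty estimate

# For comparison: the ordinary deterministic pass with dropout off.
deterministic = forward(x, 0.5, rng, dropout_on=False)
print(mc_mean, mc_std, deterministic)
```

More passes give a smoother estimate at a proportional inference cost, which is the main practical trade-off of the method.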

Common bugs and edge cases

Forgetting model.eval(). The most frequent dropout mistake by far: running inference with dropout still active randomly degrades and destabilizes predictions. Always switch to eval mode (and back to model.train() before resuming training).
Applying dropout to the output layer. Dropping logits or class probabilities adds noise to the very thing you are predicting. Dropout belongs on hidden activations, not the final outputs.
Stacking dropout right before batch norm. The "variance shift" problem: dropout changes activation variance at train time but not test time, which throws off batch norm's running statistics and can lower accuracy. Put dropout after the norm, or skip it in normalized blocks.
Using a fresh mask per timestep in an RNN. This injects independent noise at every step and corrupts the recurrent state. Use a single per-sequence mask (variational dropout) instead.
Using dropout on an underfitting model. Dropout only helps when the model has capacity to spare. On an underfit network it just slows learning and raises the training loss with no validation benefit.
Cranking p too high on convolutions. A 0.5 rate that is fine for dense layers will starve a conv layer of signal. Convolutional blocks want 0.1–0.2, or spatial dropout, or nothing.

Frequently asked questions

What dropout rate should I use?

The original paper recommends p = 0.5 for hidden layers and p = 0.2 (or none) on the input layer, and those defaults still work for fully-connected nets. Convolutional layers tolerate far less — 0.1 to 0.2 — because weight sharing already regularizes them. If your model isn't overfitting, lower the rate or remove dropout entirely; dropout on an underfit model just slows learning.

Why does dropout multiply activations by 1/(1-p) during training?

That is inverted dropout. Zeroing a fraction p of units lowers the expected sum entering the next layer by a factor of (1-p). Dividing the surviving activations by (1-p) restores the expected magnitude, so test time — where nothing is dropped — needs no rescaling. Without inversion you would have to multiply every weight by (1-p) at inference instead.

Do I apply dropout at test time?

No. At inference dropout is turned off and the full network is used, which approximates averaging over the exponentially many sub-networks seen during training. Forgetting to switch to eval mode — model.eval() in PyTorch, or training=False in Keras — is the single most common dropout bug and silently degrades accuracy. The exception is Monte Carlo dropout, where you deliberately keep it on to estimate uncertainty.

Does dropout work with batch normalization?

They interact badly when stacked naively. Batch norm's statistics shift between train and test, and dropout changes the variance of activations, so combining them can hurt accuracy — a 2018 study called this the "variance shift" problem. The common practice in modern CNNs is to rely on batch norm and skip dropout in convolutional blocks, reserving dropout for the final fully-connected head if at all.
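The variance change behind this interaction can be measured in a few lines. The sketch assumes zero-mean, unit-variance activations, which is an idealization. Inverted dropout preserves the mean, but for such inputs it multiplies the variance by 1/(1 − p) at train time, while at test time the variance is untouched. A batch-norm layer placed afterwards therefore collects its running statistics on a distribution it will not see at inference.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
keep = 1.0 - p
x = rng.normal(size=1_000_000)    # idealized zero-mean, unit-variance activations

# Train time: inverted dropout (zero with prob p, scale survivors by 1 / keep).
train_out = x * (rng.random(x.shape) < keep) / keep
# Test time: dropout is the identity.
test_out = x

ratio = train_out.var() / test_out.var()
print(ratio, 1 / keep)            # measured ratio vs the predicted 1 / (1 - p)
```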

Is dropout the same as an ensemble?

It approximates one. A network with n droppable units defines 2ⁿ possible thinned sub-networks that all share weights. Training samples a different sub-network each step, and the full network at test time acts as a geometric-mean ensemble of all of them. The catch: a real ensemble trains independent models, while dropout's sub-networks are tied together by shared weights, so it is a cheap approximation rather than a true ensemble.

Why has dropout fallen out of favor in Transformers and modern CNNs?

Large models trained on huge datasets overfit less, and other regularizers crowded dropout out: batch/layer normalization, weight decay, data augmentation, and stochastic depth. Standard dropout on every layer can even slow convergence at scale. Transformers still use it, but at low rates (0.1) and mostly on attention weights and the feed-forward sublayer, not everywhere.