Weight Initialization (Xavier & He)
The two numbers that decide whether a deep net trains at all
Weight initialization sets a neural network's starting weights so signal variance stays roughly 1 as it flows through layers; Xavier scales by 1/fan_in for tanh, He by 2/fan_in for ReLU, preventing vanishing or exploding activations.
- Xavier (Glorot) variance: 1 / fan_in
- He (Kaiming) variance: 2 / fan_in
- Target per-layer variance factor: 1.0
- Init cost: O(fan_in · fan_out)
- Use Xavier for: tanh / sigmoid
- Use He for: ReLU / Leaky ReLU
Why the starting weights decide everything
A deep network is a tall stack of matrix multiplies. Each layer takes the previous activations, multiplies by a weight matrix, and applies a nonlinearity. Picture the variance of those activations (how spread out the numbers are) as a signal flowing upward through the stack. If each layer multiplies that variance by 1.1, then after 50 layers the signal has grown by a factor of 1.1⁵⁰ ≈ 117. If each layer multiplies it by 0.9, after 50 layers it has shrunk to 0.9⁵⁰ ≈ 0.005 of its original size. The first case is exploding activations; the second is vanishing activations. Both kill training: explode and you get NaNs, vanish and the gradients are too small to move the weights.
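The compounding arithmetic above is easy to reproduce. The snippet below is a minimal sketch; `compounded_variance` is a hypothetical helper name introduced here for illustration, not something from a library.

```python
def compounded_variance(per_layer_factor, depth):
    """Variance multiplier after `depth` layers (hypothetical helper)."""
    return per_layer_factor ** depth


# Compare a slightly-too-large, an ideal, and a slightly-too-small factor.
for factor in (1.1, 1.0, 0.9):
    total = compounded_variance(factor, depth=50)
    print(f"x{factor} per layer -> x{total:.3g} after 50 layers")
```

A per-layer error of only 10% in either direction is enough to move the signal by two orders of magnitude over 50 layers, which is why the multiplier has to sit very close to 1.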
The whole job of weight initialization is to make that per-layer multiplier equal to exactly 1, so the signal neither grows nor shrinks as it propagates through dozens of layers. The trick is that the multiplier depends on the variance of the weights and the fan-in — the number of inputs being summed into each neuron. Set the weight variance correctly relative to the fan-in and the layers compose into a stable signal path. Set it wrong and no amount of clever optimizer or learning-rate schedule will save you, because the first forward pass already destroyed the signal.
Before 2010 this was a black art — people scaled weights by hand and deep nets were notoriously hard to train. Xavier Glorot and Yoshua Bengio's 2010 paper, and Kaiming He et al.'s 2015 follow-up for ReLU, turned it into two clean formulas. Those two formulas are the difference between a 30-layer net that converges in an hour and one that never leaves its random starting loss.
The variance math, derived
Consider one neuron computing y = Σ wᵢ xᵢ over its n inputs (where n is the fan-in). Assume the weights wᵢ and inputs xᵢ are independent, zero-mean, and identically distributed. The variance of a sum of independent zero-mean products is:
Var(y) = Σ Var(wᵢ · xᵢ) = n · Var(w) · Var(x)
For the activation variance to be preserved across the layer — Var(y) = Var(x) — that whole expression must equal Var(x), which forces:
n · Var(w) = 1 ⟹ Var(w) = 1 / fan_in
That is Xavier (Glorot) initialization. The original paper wanted to balance both the forward pass (fan-in) and the backward pass for gradients (fan-out), so it averaged the two: Var(w) = 2 / (fan_in + fan_out). The simplified 1/fan_in form (sometimes called "LeCun initialization") only stabilizes the forward pass, which is usually enough.
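The identity Var(y) = n · Var(w) · Var(x) can be checked empirically. The following is a rough Monte Carlo sketch using only the Python standard library; the helper name, trial count, and fan-in of 32 are assumptions chosen to keep it fast, and the measured values will only approximate the prediction.

```python
import random
import statistics


def neuron_output_variance(fan_in, weight_var, input_var=1.0, trials=5000, seed=0):
    """Empirical Var(y) for y = sum_i w_i * x_i (hypothetical helper).

    Weights and inputs are drawn as independent zero-mean Gaussians,
    matching the assumptions of the derivation.
    """
    rng = random.Random(seed)
    w_std = weight_var ** 0.5
    x_std = input_var ** 0.5
    outputs = []
    for _ in range(trials):
        y = sum(rng.gauss(0.0, w_std) * rng.gauss(0.0, x_std) for _ in range(fan_in))
        outputs.append(y)
    return statistics.pvariance(outputs)


fan_in = 32
for label, weight_var in (("1/fan_in", 1 / fan_in), ("unscaled", 1.0)):
    measured = neuron_output_variance(fan_in, weight_var)
    predicted = fan_in * weight_var * 1.0  # n * Var(w) * Var(x)
    print(f"Var(w) = {label}: measured {measured:.2f}, predicted {predicted:.2f}")
```

With Var(w) = 1/fan_in the output variance lands near 1; with unit-variance weights it lands near the fan-in itself, which is the per-layer blow-up the formula is designed to remove.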
He (Kaiming) initialization fixes Xavier's hidden assumption. The derivation above assumed the nonlinearity preserves variance, which is true for the identity and roughly true for tanh near zero. But ReLU sets all negative inputs to zero, discarding half of a symmetric distribution and halving the mean square E[x²] of what passes through (the quantity the next layer's variance depends on, since ReLU outputs are no longer zero-mean). To compensate, He doubles the weight variance:
Var(w) = 2 / fan_in (He / Kaiming, for ReLU)
So the entire difference between the two famous methods is a factor of 2, and that factor is the reciprocal of the fraction of the signal (one half) that survives ReLU. The cost to initialize a layer is O(fan_in · fan_out): you fill one random number per weight, once before training, which is dwarfed by the cost of a single training step.
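The halving that motivates He's factor of 2 can be measured directly. This sketch assumes zero-mean Gaussian pre-activations and compares the mean square before and after ReLU; the function name is hypothetical and the result is a sampling estimate, not an exact value.

```python
import random


def relu_second_moment_ratio(samples=100_000, seed=0):
    """Estimate E[relu(z)^2] / E[z^2] for z ~ N(0, 1) (hypothetical helper)."""
    rng = random.Random(seed)
    pre_sq = 0.0
    post_sq = 0.0
    for _ in range(samples):
        z = rng.gauss(0.0, 1.0)
        pre_sq += z * z
        post_sq += max(0.0, z) ** 2  # ReLU zeroes the negative half
    return post_sq / pre_sq


ratio = relu_second_moment_ratio()
print(f"fraction of mean square surviving ReLU: {ratio:.3f}")
print(f"compensating variance gain: {1 / ratio:.2f}")
```

The ratio comes out close to one half, so the compensating gain on the weight variance is close to 2.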
Which init to reach for
- ReLU, Leaky ReLU, ELU, GELU and friends → He. Any activation that zeros or attenuates the negative half. This is the default for modern CNNs and most MLPs.
- tanh, sigmoid, softsign → Xavier. Symmetric, roughly variance-preserving activations near the origin.
- Linear layers / no activation → Xavier (or LeCun 1/fan_in). The identity preserves variance exactly, so Xavier's assumption holds literally.
- SELU (self-normalizing nets) → LeCun normal. SELU is designed around Var(w) = 1/fan_in with a normal distribution; using He here breaks the self-normalizing property.
- Transformers → scaled variants. Attention and residual stacks need extra care; see the variants section. A naive He on a 96-layer transformer still explodes through the residual path.
The rule of thumb: match the init to the activation's variance behavior. He overestimates variance for tanh (signal slowly explodes); Xavier underestimates it for ReLU (signal slowly vanishes). Neither is catastrophic for a shallow net, but the error compounds with depth.
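The selection rules above can be condensed into a small lookup. This is an illustrative sketch only: `init_variance` and its activation names are hypothetical, it uses the simplified 1/fan_in form for the Xavier and LeCun cases, and it deliberately leaves out transformers, which need the depth-aware scaling discussed later.

```python
# Activation groups as listed in the text (names are illustrative assumptions).
HE_ACTIVATIONS = {"relu", "leaky_relu", "elu", "gelu"}
UNIT_GAIN_ACTIVATIONS = {"tanh", "sigmoid", "softsign", "linear", "selu"}


def init_variance(activation, fan_in):
    """Return the target weight variance for a layer (hypothetical helper)."""
    if activation in HE_ACTIVATIONS:
        return 2.0 / fan_in  # He / Kaiming
    if activation in UNIT_GAIN_ACTIVATIONS:
        return 1.0 / fan_in  # Xavier (simplified) / LeCun
    raise ValueError(f"no init rule for activation {activation!r}")


for name in ("relu", "tanh", "selu"):
    print(name, init_variance(name, fan_in=512))
```

Raising on an unknown activation is a design choice here: silently falling back to one of the two formulas would hide exactly the mismatch the rule of thumb warns about.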
Xavier vs He vs the naive baselines
| | Xavier / Glorot | He / Kaiming | LeCun | All zeros | N(0, 1) fixed | Small N(0, 0.01) |
|---|---|---|---|---|---|---|
| Weight variance | 1/fan_in (or 2/(fan_in+fan_out)) | 2/fan_in | 1/fan_in | 0 | 1 | 0.0001 |
| Designed for | tanh, sigmoid, linear | ReLU family | SELU, linear | — | — | — |
| Year / authors | 2010, Glorot & Bengio | 2015, He, Zhang, Ren, Sun | 1998, LeCun et al. | — | — | — |
| Symmetry broken? | Yes | Yes | Yes | No — fatal | Yes | Yes |
| Deep ReLU net (50 layers) | Slowly vanishes (~2× per layer too small) | Stable, Var≈1 | Slowly vanishes | Never learns | Explodes to NaN | Vanishes to ~0 |
| Deep tanh net (50 layers) | Stable, Var≈1 | Slowly explodes / saturates | Stable | Never learns | Saturates instantly | Vanishes to ~0 |
| Per-element cost | O(1) | O(1) | O(1) | O(1) | O(1) | O(1) |
The headline: there is no init that works for every activation, but matching the init's variance assumption to the activation removes the single biggest failure mode of deep networks. The naive baselines are instructive — all-zeros never breaks symmetry, fixed unit variance explodes, and the once-popular "tiny N(0, 0.01)" trick vanishes by depth 10 — which is exactly why people thought deep nets were untrainable before 2010.
What the numbers actually say
- The variance factor compounds geometrically. A layer that multiplies activation variance by 0.5 leaves 0.5³⁰ ≈ 9 × 10⁻¹⁰ of the signal after 30 layers: a billion-fold attenuation. That is still representable in float32, but gradients scaled down this far produce updates too small to change the weights meaningfully.
- He vs Xavier on a 30-layer ReLU network. He et al.'s experiments showed that a 30-layer ReLU network stalls under Xavier init (the gradient signal decays toward zero) while the same network trains normally under He init. The fix is literally the factor of 2.
- Init is essentially free. Initializing a 1024×1024 weight matrix means drawing ~1M random numbers, which takes on the order of milliseconds and happens once. A single forward+backward step on that same matrix at batch size 256 is on the order of hundreds of millions of multiply-adds, repeated at every training step. Init cost is rounding error against training cost.
- Biases start at zero. Unlike weights, zero biases are correct and standard: the weights already break symmetry, so there is nothing to gain from random biases. The one common exception is a small positive constant like 0.01 for ReLU layers, sometimes used to avoid dead units at step 0.
JavaScript implementation
// Box–Muller: one standard-normal sample N(0, 1)
function randn() {
  let u = 0, v = 0;
  while (u === 0) u = Math.random(); // avoid log(0)
  while (v === 0) v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Initialize a (fanOut × fanIn) weight matrix.
// mode: 'xavier' (tanh/sigmoid) or 'he' (ReLU)
// dist: 'normal' or 'uniform'
function initWeights(fanIn, fanOut, mode = 'he', dist = 'normal') {
  const gain = mode === 'he' ? 2 : 1; // He doubles the variance
  const variance = gain / fanIn;      // forward-pass preserving
  const W = [];
  for (let o = 0; o < fanOut; o++) {
    const row = [];
    for (let i = 0; i < fanIn; i++) {
      if (dist === 'uniform') {
        // U(-limit, +limit) has variance = limit² / 3, so:
        const limit = Math.sqrt(3 * variance);
        row.push((Math.random() * 2 - 1) * limit);
      } else {
        row.push(randn() * Math.sqrt(variance)); // std = sqrt(Var)
      }
    }
    W.push(row);
  }
  return W; // biases are initialized to 0
}

// Sanity check: empirical variance should land near `gain / fanIn`.
const W = initWeights(512, 256, 'he');
const flat = W.flat();
const mean = flat.reduce((a, b) => a + b, 0) / flat.length;
const v = flat.reduce((a, b) => a + (b - mean) ** 2, 0) / flat.length;
console.log(v.toFixed(5), 'vs target', (2 / 512).toFixed(5)); // ≈ 0.00391
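Two details worth flagging. First, the standard deviation passed to the sampler is Math.sqrt(variance), not the variance itself. Mixing them up is a common off-by-a-square-root bug: because the target variance is below 1 for any fan-in above 2, using it as the standard deviation silently makes the weights far too small (for He with fan_in = 512, a std of 0.0039 instead of 0.0625). Second, the uniform form needs limit = sqrt(3 · Var) because a uniform distribution on [-L, L] has variance L²/3, not L².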
Python implementation
import numpy as np


def init_weights(fan_in, fan_out, mode="he", dist="normal", rng=None):
    """Return a (fan_out, fan_in) weight matrix; biases stay zero."""
    rng = rng or np.random.default_rng()
    gain = 2.0 if mode == "he" else 1.0   # He doubles the variance for ReLU
    variance = gain / fan_in              # forward-pass-preserving
    if dist == "uniform":
        limit = np.sqrt(3 * variance)     # U(-L, L) has variance L**2 / 3
        return rng.uniform(-limit, limit, size=(fan_out, fan_in))
    return rng.normal(0.0, np.sqrt(variance), size=(fan_out, fan_in))


# --- Famous demonstration: signal variance through a deep net -------------
# Show why He beats Xavier on a deep ReLU stack. Track Var(activations)
# layer by layer for a 30-layer network and watch one vanish, one survive.
def trace_variance(mode, depth=30, width=512, rng=None):
    rng = rng or np.random.default_rng(0)
    x = rng.normal(0, 1, size=(width, 1))  # unit-variance input
    variances = []
    for _ in range(depth):
        W = init_weights(width, width, mode=mode, rng=rng)
        x = np.maximum(0, W @ x)           # ReLU
        variances.append(float(x.var()))
    return variances


he = trace_variance("he")
xav = trace_variance("xavier")
print("He layer-30 variance:", round(he[-1], 4))    # stays O(1)
print("Xav layer-30 variance:", round(xav[-1], 6))  # decays toward 0

# PyTorch ships these directly:
# torch.nn.init.kaiming_normal_(w, nonlinearity="relu")  # He
# torch.nn.init.xavier_uniform_(w)                       # Glorot
# nn.Linear default IS kaiming_uniform_(a=sqrt(5))       # a quirky variant
trace_variance is the canonical experiment: run the same deep ReLU stack twice, once He-initialized and once Xavier-initialized, and watch the activation variance hold steady under He but slide toward zero under Xavier. That single plot is the entire argument for the factor of 2.
Variants worth knowing
Uniform vs normal. Both target the same variance. Xavier-uniform draws from U(-√(6/(fan_in+fan_out)), +√(6/(fan_in+fan_out))); He-uniform from U(-√(6/fan_in), +√(6/fan_in)). The 6 comes from 3 × 2 (the uniform-variance factor times He's gain). Frameworks default to uniform.
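To see that the uniform limit really targets the same variance, the sketch below draws He-uniform weights and compares the sample variance with 2/fan_in. The helper names and sample count are assumptions for illustration; only the standard library is used.

```python
import math
import random
import statistics


def uniform_limit(variance):
    """Half-width L of U(-L, L) with the given variance (since Var = L**2 / 3)."""
    return math.sqrt(3.0 * variance)


def sample_uniform_weights(fan_in, count, gain=2.0, seed=0):
    """Draw `count` weights from the uniform form; gain=2.0 corresponds to He."""
    rng = random.Random(seed)
    limit = uniform_limit(gain / fan_in)
    return [rng.uniform(-limit, limit) for _ in range(count)]


fan_in = 256
weights = sample_uniform_weights(fan_in, count=50_000)
print("limit:          ", round(uniform_limit(2.0 / fan_in), 5))  # equals sqrt(6 / fan_in)
print("sample variance:", round(statistics.pvariance(weights), 5))
print("target variance:", round(2.0 / fan_in, 5))
```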
The gain parameter. PyTorch's calculate_gain generalizes the factor as a multiplier on the standard deviation: 1 for linear and sigmoid layers, 5/3 for tanh (not 1, even though tanh is roughly linear near zero), √2 for ReLU (i.e. a variance factor of 2), and √(2/(1+a²)) for Leaky ReLU with negative slope a.
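The gain values can be written out as a plain lookup to make the relationship between gain, standard deviation, and variance explicit. This is a standalone sketch that mirrors the commonly documented values rather than calling PyTorch; `gain_for` and `weight_std` are hypothetical names.

```python
import math


def gain_for(activation, negative_slope=0.01):
    """Std-dev gain per activation (hypothetical stand-in for a framework helper)."""
    if activation in ("linear", "sigmoid"):
        return 1.0
    if activation == "tanh":
        return 5.0 / 3.0
    if activation == "relu":
        return math.sqrt(2.0)
    if activation == "leaky_relu":
        return math.sqrt(2.0 / (1.0 + negative_slope ** 2))
    raise ValueError(f"unsupported activation {activation!r}")


def weight_std(activation, fan_in, negative_slope=0.01):
    """Standard deviation gain / sqrt(fan_in); squaring it gives the variance."""
    return gain_for(activation, negative_slope) / math.sqrt(fan_in)


for name in ("linear", "tanh", "relu", "leaky_relu"):
    std = weight_std(name, fan_in=512)
    print(f"{name:10s} gain={gain_for(name):.4f} std={std:.5f} var={std ** 2:.6f}")
```

Note that the gain multiplies the standard deviation, so ReLU's √2 becomes the factor of 2 only after squaring.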
Orthogonal initialization. Instead of i.i.d. random entries, make the weight matrix orthogonal (via QR or SVD). Orthogonal matrices preserve vector norms exactly, which helps very deep and recurrent nets where even He's approximate variance preservation drifts.
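The norm-preservation claim can be illustrated without a linear-algebra library. The sketch below builds an orthogonal matrix by Gram-Schmidt orthonormalization of random Gaussian rows, which is a simpler stand-in for the QR or SVD route mentioned above; all function names are hypothetical.

```python
import math
import random


def orthogonal_matrix(n, seed=0):
    """Build an n x n matrix with orthonormal rows via Gram-Schmidt (sketch)."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:  # remove the components along the rows found so far
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        length = math.sqrt(sum(a * a for a in v))
        if length > 1e-8:  # skip the (vanishingly unlikely) degenerate draw
            rows.append([a / length for a in v])
    return rows


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


Q = orthogonal_matrix(8)
x = [random.Random(1).gauss(0.0, 1.0) for _ in range(8)]
print("input norm: ", norm(x))
print("output norm:", norm(matvec(Q, x)))
```

The two printed norms agree up to floating-point error, whereas an i.i.d. Gaussian matrix only preserves the norm on average.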
LSUV (Layer-Sequential Unit-Variance). Initialize orthogonally, then run a mini-batch forward and rescale each layer's weights so the measured output variance is exactly 1 — a data-dependent correction that fixes any residual drift the analytic formulas miss.
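The data-dependent rescaling step can be sketched for a single linear layer. This simplified version skips the orthogonal starting point and starts from deliberately oversized Gaussian weights; the function names, tolerance, and layer sizes are assumptions for illustration.

```python
import math
import random
import statistics


def layer_outputs(W, batch):
    """All pre-activation outputs W @ x over the mini-batch, flattened."""
    return [sum(w * x for w, x in zip(row, sample)) for sample in batch for row in W]


def lsuv_rescale(W, batch, tol=0.01, max_iters=10):
    """Rescale W until the measured output variance is within `tol` of 1."""
    for _ in range(max_iters):
        var = statistics.pvariance(layer_outputs(W, batch))
        if abs(var - 1.0) < tol:
            break
        scale = 1.0 / math.sqrt(var)
        W = [[w * scale for w in row] for row in W]
    return W


rng = random.Random(0)
fan_in, fan_out = 16, 8
batch = [[rng.gauss(0.0, 1.0) for _ in range(fan_in)] for _ in range(64)]
# Deliberately too large: unit-variance weights instead of 1 / fan_in.
W = [[rng.gauss(0.0, 1.0) for _ in range(fan_in)] for _ in range(fan_out)]

print("variance before:", round(statistics.pvariance(layer_outputs(W, batch)), 2))
W = lsuv_rescale(W, batch)
print("variance after: ", round(statistics.pvariance(layer_outputs(W, batch)), 2))
```

For a purely linear layer one rescale is enough; in a full network the procedure would be applied layer by layer, feeding each corrected layer's output to the next.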
Residual / transformer scaling. In a residual stack the variance accumulates along the skip path, so the sum grows with depth even under He. Fixes include zero-initializing the last layer of each residual block (Fixup, ReZero, the "zero-gamma" trick in batch norm) and scaling by 1/√(2N) for an N-layer transformer (GPT-2's approach) so the residual sum stays unit-variance.
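A toy model shows why the residual sum needs depth-aware scaling. The sketch below makes a strong simplifying assumption: every residual branch adds independent unit-variance noise to the stream, and an N-layer transformer has 2N such branches (attention plus MLP per layer). Real branches depend on the stream, so this only illustrates the accumulation.

```python
import math
import random
import statistics


def residual_stream_variance(num_layers, scaled, width=2000, seed=0):
    """Variance of the residual stream after all branches (idealized sketch)."""
    rng = random.Random(seed)
    branches = 2 * num_layers  # assumed: attention + MLP branch per layer
    scale = 1.0 / math.sqrt(branches) if scaled else 1.0  # 1 / sqrt(2N)
    stream = [rng.gauss(0.0, 1.0) for _ in range(width)]  # unit-variance input
    for _ in range(branches):
        stream = [s + scale * rng.gauss(0.0, 1.0) for s in stream]
    return statistics.pvariance(stream)


for depth in (12, 48):
    plain = residual_stream_variance(depth, scaled=False)
    fixed = residual_stream_variance(depth, scaled=True)
    print(f"{depth} layers: unscaled variance ~{plain:.1f}, scaled ~{fixed:.2f}")
```

Without scaling the variance grows roughly as 1 + 2N; with the 1/√(2N) factor the branches contribute about 1 in total, so the stream stays O(1) regardless of depth.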
Common bugs and edge cases
- Passing variance where the sampler wants std. The normal sampler's second argument is the standard deviation, so np.random.normal(0, variance) instead of np.random.normal(0, sqrt(variance)) gives the wrong scale. With variance = 2/fan_in below 1, the weights come out far too small and the signal vanishes; they come out too large only if the variance exceeds 1.
- Using He with tanh or Xavier with ReLU. Not catastrophic for a shallow net, but the factor-of-2 mismatch compounds with depth: He+tanh slowly saturates, Xavier+ReLU slowly vanishes.
- All-zeros (or any constant) weights. Every neuron in a layer becomes identical and stays identical — the symmetry never breaks. Zero biases are fine; zero weights are fatal.
- Confusing fan-in and fan-out. For a weight matrix shaped (out, in), fan-in is in. Convolutions are trickier: fan-in = in_channels × kernel_h × kernel_w, not just in_channels.
- Assuming batch norm makes init irrelevant. Batch norm masks a bad init after the first few steps, but the very first forward pass and any unnormalized layers still depend on it.
- Forgetting the residual path. A correct per-layer He init can still explode through a deep residual sum; you need depth-aware scaling (Fixup / 1/√(2N)) on top.
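The fan-in confusion in particular is easy to guard against with a small helper. The sketch below assumes weights are laid out as (out, in, *kernel_dims), the convention used in the text for dense layers and extended here to convolutions; `fan_in_and_fan_out` is a hypothetical name.

```python
import math


def fan_in_and_fan_out(shape):
    """Compute (fan_in, fan_out) for a weight of shape (out, in, *kernel_dims)."""
    if len(shape) < 2:
        raise ValueError("need at least (out, in) dimensions")
    out_channels, in_channels, *kernel = shape
    receptive_field = math.prod(kernel)  # 1 for a dense layer (no kernel dims)
    return in_channels * receptive_field, out_channels * receptive_field


print(fan_in_and_fan_out((256, 512)))      # dense layer
print(fan_in_and_fan_out((64, 3, 7, 7)))   # 7x7 convolution over 3 channels
```

For the 7×7 convolution the fan-in is 3 × 7 × 7 = 147, not 3, so using in_channels alone would overstate the He variance by a factor of 49.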
Frequently asked questions
Why can't I just initialize all the weights to zero?
Because every neuron in a layer would then compute the same output and receive the same gradient, so they update identically and stay identical forever — the symmetry never breaks. You need random weights so neurons learn different features. Zero biases are fine; zero weights are fatal.
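The symmetry argument can be watched in a tiny network. The sketch below trains a one-hidden-layer tanh net (no biases) by plain gradient descent on a single made-up example, starting from a constant weight value; the data, sizes, and function name are all illustrative assumptions.

```python
import math


def train_constant_init(constant, steps=20, lr=0.1):
    """Train a 3-4-1 tanh network whose weights all start at `constant`."""
    x, target = [0.5, -1.0, 2.0], 1.0  # one made-up training example
    hidden = 4
    W1 = [[constant] * len(x) for _ in range(hidden)]
    w2 = [constant] * hidden
    for _ in range(steps):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
        err = sum(w * hj for w, hj in zip(w2, h)) - target  # dLoss/dOutput
        grad_w2 = [err * hj for hj in h]
        grad_W1 = [[err * w2[j] * (1 - h[j] ** 2) * xi for xi in x]
                   for j in range(hidden)]
        w2 = [w - lr * g for w, g in zip(w2, grad_w2)]
        W1 = [[w - lr * g for w, g in zip(row, grow)]
              for row, grow in zip(W1, grad_W1)]
    return W1, w2


W1, w2 = train_constant_init(0.5)
print("hidden units identical after training:", all(row == W1[0] for row in W1))
print("weights moved from 0.5:", W1[0] != [0.5, 0.5, 0.5])

W1_zero, w2_zero = train_constant_init(0.0)
print("all-zero weights unchanged:", all(w == 0 for row in W1_zero for w in row))
```

With a nonzero constant the weights do change, but every hidden unit receives the same update and the layer behaves like a single neuron. With all zeros, this particular network receives zero gradient everywhere and does not move at all.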
What is the difference between Xavier and He initialization?
Both scale the weight variance by the layer's fan-in, but He uses twice the variance of Xavier. Xavier (Glorot, 2010) assumes a linear or tanh/sigmoid activation that preserves variance, so Var(W) = 1/fan_in. He (2015) accounts for ReLU zeroing out half its inputs, so it doubles the variance to Var(W) = 2/fan_in to compensate. Use Xavier with tanh/sigmoid, He with ReLU and its variants.
What is fan-in and fan-out?
Fan-in is the number of inputs to a neuron (the previous layer's width); fan-out is the number of outputs it feeds (the next layer's width). For a weight matrix shaped (out_features, in_features), fan_in = in_features and fan_out = out_features. Xavier's original formula uses 2/(fan_in + fan_out) to balance the forward and backward passes; the common 1/fan_in form only balances the forward pass.
Does weight initialization still matter if I use batch normalization?
It matters less, because batch norm rescales activations to unit variance at every layer, papering over a bad init. But init still affects the very first forward pass, the early-training dynamics before the norm statistics settle, and any layers that lack a norm. He init plus batch norm is the standard recipe for deep CNNs precisely because they are complementary, not redundant.
What happens if the weights are too large or too small?
Too large and the activation variance grows by a constant factor at every layer, so a 50-layer net blows up exponentially — exploding activations and NaN gradients. Too small and the variance shrinks the same way, so by the output layer the signal has vanished to near-zero and gradients are too tiny to train. Correct init keeps the per-layer variance factor at exactly 1.
Should I use the uniform or normal version of Xavier/He?
Both target the same variance, so it rarely matters in practice. The normal version samples from N(0, Var); the uniform version samples from U(-limit, +limit) where limit = sqrt(3 · Var) so that the uniform distribution has exactly that variance. Frameworks default to the uniform form (PyTorch's kaiming_uniform_ is the default Linear init); the normal form is marginally heavier-tailed.