Softmax & Cross-Entropy
Turn raw scores into probabilities, then measure how wrong they are
Softmax turns a vector of raw scores (logits) into a probability distribution; cross-entropy measures how far that distribution is from the true label. Together they form the standard loss for multi-class classification, with a gradient that simplifies to prediction minus target.
- Softmax forward cost: O(K)
- Outputs: sum to 1, each in (0, 1)
- Fused gradient: p − y
- Loss when right: → 0
- Loss when confidently wrong: → ∞
From scores to a verdict
A neural network classifier ends with a layer that spits out one number per class — the logits. For a 1000-class ImageNet model that's a vector of 1000 raw scores like [2.1, -0.4, 5.8, …]. These numbers are unbounded, can be negative, and don't sum to anything meaningful. Two jobs remain: turn them into probabilities you can act on, and turn the gap between your guess and the truth into a single number you can minimize.
Softmax does the first job. It exponentiates every logit (making them all positive and amplifying the gaps) and then normalizes by the sum, so the outputs land in (0, 1) and add up to exactly 1 — a genuine probability distribution over the classes:
softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
The exponential is the whole personality of softmax. Because exp grows fast, a logit that's only 2 larger than its neighbor ends up roughly e² ≈ 7.4× more probable. The biggest logit wins decisively but never claims 100%: every class keeps a nonzero sliver. And because softmax is built entirely from smooth operations (exp and a division by a positive sum), it is differentiable everywhere, unlike a hard argmax.
Cross-entropy does the second job. Given the predicted distribution p and the true distribution y (usually one-hot — a 1 on the correct class, 0 elsewhere), it measures the average surprise of the truth under your prediction:
H(y, p) = − Σᵢ yᵢ · log(pᵢ)
With a one-hot label, all but one term vanishes and the loss collapses to −log(p_correct) — the negative log-likelihood of the right answer. Predict the correct class with probability 1 and the loss is −log(1) = 0. Predict it with probability 0.01 and you pay −log(0.01) ≈ 4.6. Confidently wrong is punished without mercy: as p_correct → 0, the loss heads to infinity.
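To make the one-hot collapse concrete, here is a minimal pure-Python sketch. The function name `cross_entropy` and the two example distributions are illustrative assumptions, not part of any library.

```python
import math

def cross_entropy(y, p):
    """H(y, p) = -sum_i y_i * log(p_i); terms with y_i == 0 contribute nothing."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

y = [0.0, 1.0, 0.0]          # one-hot label: class 1 is correct
good = [0.05, 0.90, 0.05]    # confident and right
bad = [0.98, 0.01, 0.01]     # confident and wrong

loss_good = cross_entropy(y, good)   # only -log(0.90) survives
loss_bad = cross_entropy(y, bad)     # only -log(0.01) survives
print(round(loss_good, 3), round(loss_bad, 3))  # 0.105 4.605
```

Only the probability assigned to the true class enters the result; the other two entries of each prediction are ignored by the sum.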
Why the two are always fused
Softmax and cross-entropy are conceptually separate, but every framework — PyTorch's CrossEntropyLoss, TensorFlow's softmax_cross_entropy_with_logits — fuses them into one op. The reason is the gradient.
The Jacobian of softmax alone is a dense K×K matrix: ∂pᵢ/∂zⱼ = pᵢ(δᵢⱼ − pⱼ), where δᵢⱼ is 1 when i = j and 0 otherwise. The derivative of cross-entropy with respect to p brings a 1/p. When you chain them through the loss L = −log(p_correct), the pᵢ in the Jacobian cancels the 1/pᵢ from the log, and the whole thing collapses to one of the most satisfying results in deep learning:
∂L/∂zᵢ = pᵢ − yᵢ   (prediction minus target)
That's it. The gradient flowing back into the logits is just "how much you predicted minus how much you should have." It's cheap (O(K), one subtraction per class), numerically stable, and never vanishes when you're confidently wrong, which is exactly the property that keeps training moving. Computing softmax and cross-entropy as two separate steps would mean back-propagating through the K×K softmax Jacobian (or at least a product with it) and risking log(0) = −∞ when a probability underflows. Fusing them sidesteps both problems.
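One way to convince yourself of the p − y result is a finite-difference check. The sketch below is a minimal pure-Python version; the helper names (`loss_fn`, `analytic_grad`, `numeric_grad`) and the step size `h` are assumptions chosen for illustration.

```python
import math

def loss_fn(z, target):
    """Fused softmax + cross-entropy: logsumexp(z) - z[target]."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return lse - z[target]

def analytic_grad(z, target):
    """p - y, with y one-hot at index `target`."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s - (1.0 if i == target else 0.0) for i, e in enumerate(exps)]

def numeric_grad(z, target, h=1e-6):
    """Central finite differences, perturbing one logit at a time."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + h] + z[i + 1:]
        down = z[:i] + [z[i] - h] + z[i + 1:]
        grad.append((loss_fn(up, target) - loss_fn(down, target)) / (2 * h))
    return grad

z, target = [2.0, 1.0, 0.1], 0
ga = analytic_grad(z, target)
gn = numeric_grad(z, target)
max_err = max(abs(a - n) for a, n in zip(ga, gn))
print(max_err < 1e-6)  # True
```

The two gradients agree to within finite-difference error, and the analytic gradient sums to zero because both p and y sum to 1.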
Numerical stability. The naive forward pass blows up because exp overflows: exp(1000) is inf in float64, and inf/inf is NaN. The fix is the log-sum-exp trick: subtract the maximum logit from every element before exponentiating. The largest exponent becomes exp(0) = 1, the result is identical (the shared factor cancels top and bottom), and you never overflow.
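The overflow and its fix can be reproduced in a few lines. This is a sketch with hypothetical function names `softmax_naive` and `softmax_stable`. One caveat: Python's `math.exp` raises `OverflowError` instead of returning inf, so the naive version fails with an exception here, whereas NumPy would return inf and then produce NaN.

```python
import math

def softmax_naive(z):
    exps = [math.exp(v) for v in z]        # math.exp raises for v above ~709.78
    s = sum(exps)
    return [e / s for e in exps]

def softmax_stable(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]    # largest exponent is exp(0) = 1
    s = sum(exps)
    return [e / s for e in exps]

z = [1000.0, 999.0, 998.0]

try:
    softmax_naive(z)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

p = softmax_stable(z)
q = softmax_stable([2.0, 1.0, 0.0])        # same logits shifted by 998

print(naive_overflowed)                    # True
print([round(v, 3) for v in p])            # [0.665, 0.245, 0.09]
```

Because softmax only depends on differences between logits, `p` and `q` are the same distribution.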
When to reach for softmax + cross-entropy
- Single-label, multi-class classification — exactly one class is correct per example (digit recognition, ImageNet, intent classification). This is the canonical use.
- The output layer of any classifier trained by gradient descent, from logistic regression (the 2-class special case) to a 1000-way ImageNet head to a transformer's vocabulary projection.
- Language-model token prediction — the final layer scores every token in a 50,000-word vocabulary; softmax turns those into a next-token distribution and cross-entropy is the training loss.
- Attention weights — softmax over query·key scores produces the attention distribution inside every transformer.
Reach for something else when the classes aren't mutually exclusive (use independent sigmoids for multi-label), when you have a regression target (use MSE), or when the class count is astronomical — a 1-million-token vocabulary makes the normalization sum expensive, which is why hierarchical softmax and sampled-softmax variants exist.
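For the multi-label case, a rough sketch of independent sigmoids with binary cross-entropy looks like this. The logits, the class names in the comments, and the helper names `sigmoid` and `bce_with_logits` are all made up for illustration; the loss uses the standard stable form max(z, 0) − z·y + log(1 + exp(−|z|)).

```python
import math

def sigmoid(z):
    # Piecewise form so exp() never sees a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def bce_with_logits(logits, labels):
    """Mean binary cross-entropy over independent per-class sigmoids."""
    total = 0.0
    for z, y in zip(logits, labels):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

logits = [3.0, 2.5, -4.0]    # hypothetical scores for "cat", "outdoor", "car"
labels = [1, 1, 0]           # two classes are true at once

sig_probs = [sigmoid(z) for z in logits]
soft_probs = softmax(logits)
loss = bce_with_logits(logits, labels)

print([round(p, 3) for p in sig_probs])    # [0.953, 0.924, 0.018]
print([round(p, 3) for p in soft_probs])   # [0.622, 0.377, 0.001]
```

The sigmoid outputs can both be high because they do not share a budget; softmax forces the two true classes to split a total of 1.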
Softmax + cross-entropy vs the alternatives
| | Softmax + CE | Sigmoid + BCE | Softmax + MSE | Hinge (SVM) | Sampled softmax |
|---|---|---|---|---|---|
| Class assumption | Mutually exclusive | Independent (multi-label) | Mutually exclusive | Mutually exclusive | Mutually exclusive |
| Outputs sum to 1? | Yes | No | Yes | No (raw margins) | Yes (approx.) |
| Gradient w.r.t. logits | p − y (clean) | σ(z) − y per class | vanishes when wrong+saturated | 0 inside margin | p − y on sampled subset |
| Probabilistic output | Yes (calibrated-ish) | Yes, per class | Yes but ill-fit | No | Yes (approximate) |
| Cost per example | O(K) | O(K) | O(K) | O(K) | O(samples), K-independent |
| Handles huge K | Slow (full sum) | n/a | Slow | Slow | Built for it |
| Typical use | ImageNet, LM tokens, attention | Tagging, multi-label vision | Rarely — legacy | Classic SVMs | Word2Vec, huge-vocab LMs |
The headline contrast is softmax-CE versus softmax-MSE. Both produce a distribution, but only cross-entropy gives the non-vanishing p − y gradient. MSE on softmax outputs multiplies in the softmax Jacobian, whose entries pᵢ(δᵢⱼ − pⱼ) shrink toward zero once the outputs saturate near 0 or 1. A confidently wrong prediction therefore receives a tiny gradient and learning plateaus. This is the practical reason classification standardized on cross-entropy decades ago.
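The difference shows up numerically on a confidently wrong prediction. In the sketch below, the MSE loss is taken as L = Σᵢ (pᵢ − yᵢ)² (an assumption; other scalings only change a constant factor), and the function names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def ce_grad(z, target):
    """Gradient of cross-entropy w.r.t. logits: p - y."""
    p = softmax(z)
    return [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]

def mse_grad(z, target):
    """Gradient of L = sum_i (p_i - y_i)^2 w.r.t. logits.
    Chain rule: dL/dz_j = sum_i 2 (p_i - y_i) * p_i * (delta_ij - p_j)
                        = p_j * (g_j - sum_i g_i p_i), with g = 2 (p - y)."""
    p = softmax(z)
    y = [1.0 if i == target else 0.0 for i in range(len(z))]
    g = [2.0 * (pi - yi) for pi, yi in zip(p, y)]
    dot = sum(gi * pi for gi, pi in zip(g, p))
    return [pj * (gj - dot) for pj, gj in zip(p, g)]

z = [10.0, 0.0, 0.0]   # model is almost certain of class 0
target = 1             # but the true class is 1

ce = ce_grad(z, target)
mse = mse_grad(z, target)
ce_size = max(abs(v) for v in ce)
mse_size = max(abs(v) for v in mse)
print(ce_size > 0.99, mse_size < 1e-3)  # True True
```

With cross-entropy the logits receive a gradient of magnitude close to 1; with MSE the same mistake produces a gradient several thousand times smaller.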
What the numbers actually say
- A logit gap of 2 means ≈7.4× the probability. Softmax differences are exponential, not linear: logits [3, 1] map to probabilities [0.88, 0.12], because e² ≈ 7.39.
- Confidently wrong costs ≈4.6 in loss. A 1% probability on the true class gives −ln(0.01) = 4.605; a 0.1% probability gives 6.9. There's no ceiling, so a single mislabeled example with a confident model can dominate a batch.
- Random guessing over K classes has loss ln(K). A freshly initialized 1000-class model should report a loss near ln(1000) ≈ 6.91. If your initial loss is wildly off that, you have a bug; this is a standard sanity check.
- The max-subtraction trick is free. It adds one pass to find the max and one subtraction per element (O(K) extra on an already-O(K) op) and converts guaranteed overflow at logit ≈ 710 (float64's exp limit) into never overflowing.
- Vocabulary softmax dominates LM cost at scale. For a 50,000-token vocabulary and a 1024-dim hidden state, the final projection plus softmax is 50,000 × 1024 ≈ 51M multiply-adds per token, often the single largest layer, which is why sampled/hierarchical softmax exist.
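The figures in this list are easy to reproduce. The snippet below is a quick sketch; using all-equal logits as a stand-in for a freshly initialized model is an assumption, since real initializations only approximate it.

```python
import math
import sys

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# A logit gap of 2 gives a probability ratio of e^2.
p = softmax([3.0, 1.0])
ratio = p[0] / p[1]

# All-equal logits give a uniform distribution, so the loss is ln(K).
K = 1000
uniform_loss = -math.log(softmax([0.0] * K)[0])

# Largest x for which exp(x) is still finite in float64.
exp_limit = math.log(sys.float_info.max)

# Multiply-adds per token for a 1024 -> 50,000 projection.
madds = 50_000 * 1024

print(round(ratio, 2), round(uniform_loss, 2), round(exp_limit, 1), madds)
# 7.39 6.91 709.8 51200000
```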
JavaScript implementation
The numerically stable softmax, plus the fused softmax-cross-entropy that returns both the scalar loss and the p − y gradient in one pass:
// Stable softmax: subtract the max before exp() to avoid overflow.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map(z => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Fused softmax + cross-entropy for a one-hot label index `target`.
// Returns the scalar loss and dL/dz = p - y (the gradient on the logits).
function softmaxCrossEntropy(logits, target) {
  const max = Math.max(...logits);

  // log-sum-exp computed in the shifted domain for stability
  let sumExp = 0;
  for (const z of logits) sumExp += Math.exp(z - max);
  const logSumExp = max + Math.log(sumExp);

  // loss = -log p_target = -(z_target - logSumExp)
  const loss = logSumExp - logits[target];

  // gradient: p_i - y_i. y is one-hot, so subtract 1 only at the target.
  const grad = logits.map(z => Math.exp(z - logSumExp)); // this is p_i
  grad[target] -= 1;

  return { loss, grad };
}

// Example: 3 classes, true class = index 0
const { loss, grad } = softmaxCrossEntropy([2.0, 1.0, 0.1], 0);
// loss ≈ 0.417, grad ≈ [-0.341, 0.242, 0.099] (sums to 0)
Note that we never call softmax() and then log() separately — that's where log(0) bites. We compute log-sum-exp directly and read the loss as logSumExp − z_target. The gradient components always sum to zero, which is a handy invariant to assert in tests.
Python / NumPy implementation
import numpy as np

def softmax(z, axis=-1):
    z = z - np.max(z, axis=axis, keepdims=True)   # stability shift
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def log_softmax(z, axis=-1):
    # log p without ever forming p, which avoids log(0)
    m = np.max(z, axis=axis, keepdims=True)
    return z - m - np.log(np.sum(np.exp(z - m), axis=axis, keepdims=True))

def cross_entropy(logits, targets):
    """logits: (N, K) ; targets: (N,) integer class indices.
    Returns mean loss and the per-logit gradient dL/dz of shape (N, K)."""
    N = logits.shape[0]
    logp = log_softmax(logits, axis=1)
    loss = -logp[np.arange(N), targets].mean()    # mean NLL of true class
    p = np.exp(logp)                              # softmax probabilities
    grad = p.copy()
    grad[np.arange(N), targets] -= 1              # p - y
    grad /= N                                     # average over the batch
    return loss, grad

# temperature scaling: T < 1 sharpens, T > 1 flattens
def softmax_T(z, T=1.0, axis=-1):
    return softmax(z / T, axis=axis)

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
targets = np.array([0, 1])
loss, grad = cross_entropy(logits, targets)
print(loss)        # ~0.32
print(grad.sum())  # ~0.0 (each row's grad sums to zero)
Real frameworks pass logits straight into the loss — torch.nn.CrossEntropyLoss expects raw logits, not probabilities, and applies log_softmax internally. Feeding it post-softmax values is one of the most common beginner bugs: you softmax twice and training silently underperforms.
Variants worth knowing
Temperature scaling. Divide logits by a temperature T before softmax. T < 1 sharpens the distribution toward a hard argmax; T > 1 flattens it toward uniform. This is the knob behind LLM sampling diversity and behind Hinton's knowledge distillation, where a soft, high-temperature teacher distribution carries more information than a hard label.
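Label smoothing. Replace the one-hot target with 1 − ε on the true class and ε/(K−1) elsewhere (typically ε = 0.1). A common alternative, used in the Inception-v3 paper, mixes the one-hot target with a uniform distribution over all K classes, which gives 1 − ε + ε/K on the true class and ε/K elsewhere. Either way, it stops the model from chasing infinite logits to reach probability 1, improves calibration, and gave a measurable ImageNet accuracy bump in the Inception-v3 work.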
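As a rough sketch, label smoothing in the 1 − ε / ε/(K−1) form only changes the target vector; the loss is still a cross-entropy against log-softmax outputs. The helper names `smooth_targets` and `smoothed_ce` are hypothetical.

```python
import math

def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def smooth_targets(K, target, eps=0.1):
    """1 - eps on the true class, eps / (K - 1) on every other class."""
    off = eps / (K - 1)
    return [1.0 - eps if i == target else off for i in range(K)]

def smoothed_ce(z, target, eps=0.1):
    """Cross-entropy against the smoothed target distribution."""
    y = smooth_targets(len(z), target, eps)
    logp = log_softmax(z)
    return -sum(yi * lp for yi, lp in zip(y, logp))

z = [2.0, 1.0, 0.1]
hard = smoothed_ce(z, 0, eps=0.0)   # eps = 0 reduces to plain cross-entropy
soft = smoothed_ce(z, 0, eps=0.1)
print(round(hard, 3), round(soft, 3))  # 0.417 0.562
```

The smoothed loss cannot reach 0, because the target now asks for nonzero probability on every class; that is what removes the incentive to push logits toward infinity.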
Log-softmax + NLL loss. Mathematically identical to softmax + cross-entropy, but computing log p directly (without forming p) is more stable. PyTorch exposes this split as log_softmax followed by nll_loss; CrossEntropyLoss is just their composition.
Sampled / hierarchical softmax. When K is in the millions (word vocabularies), the normalizing sum dominates cost. Sampled softmax approximates the denominator using a handful of negative samples; hierarchical softmax replaces the flat sum with a binary tree, dropping the per-token cost from O(K) to O(log K).
Sparsemax and entmax. Drop-in replacements that can assign exactly zero probability to some classes (softmax never can). Useful when you want sparse, interpretable attention or output distributions.
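For intuition, sparsemax can be written as a sort-and-threshold projection of the logits onto the probability simplex. The following is a minimal sketch of that procedure for a single vector, not a reference implementation; the function name and example inputs are assumptions.

```python
def sparsemax(z):
    """Euclidean projection of z onto the probability simplex.
    Find the largest k with 1 + k * z_(k) > sum of the top-k sorted logits,
    set tau = (sum of top-k - 1) / k, then clip z - tau at zero."""
    zs = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, v in enumerate(zs, start=1):
        cumsum += v
        if 1.0 + j * v > cumsum:
            tau = (cumsum - 1.0) / j   # keep the threshold for the largest valid j
    return [max(v - tau, 0.0) for v in z]

p_wide = sparsemax([2.0, 1.0, 0.1])    # large gap: all mass on one class
p_close = sparsemax([1.0, 0.8, 0.1])   # small gap: two classes share the mass

print(p_wide)                            # [1.0, 0.0, 0.0]
print([round(v, 3) for v in p_close])    # [0.6, 0.4, 0.0]
```

Unlike softmax, the trailing class receives exactly zero probability in both cases.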
Common bugs and edge cases
- Applying softmax twice. Passing already-softmaxed probabilities into CrossEntropyLoss (which expects logits) re-normalizes a distribution and flattens the gradient. Feed raw logits.
- Forgetting the max-subtraction. A textbook exp(z)/sum(exp(z)) overflows to NaN for logits above ≈710 in float64 (≈88 in float32). Always shift by the max.
- log(0) from a separate softmax then log. If a probability underflows to exactly 0, log(0) = −∞. Use log_softmax or the fused loss instead of composing the two functions yourself.
- Wrong axis on batched inputs. Softmax must normalize over the class dimension, not the batch dimension. Reducing over the wrong axis produces silently wrong probabilities that still sum to 1 along the wrong direction.
- Using softmax for multi-label. If an image can be both "cat" and "outdoor," softmax forces them to compete for one budget. Use per-class sigmoids with binary cross-entropy.
- Initial loss far from ln(K). A correctly wired K-class model starts near ln(K). A starting loss of 0.01 or 50 signals a label, axis, or double-softmax bug; check before training for hours.
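The double-softmax bug is easy to miss because nothing crashes. A small sketch, with hypothetical helper names, shows its signature: the loss gets stuck above a floor even when the model is right and confident.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def ce_from_logits(z, target):
    """Fused loss that expects raw logits: logsumexp(z) - z[target]."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return lse - z[target]

logits = [8.0, 0.0, 0.0]   # very confident in class 0, which is correct
target = 0

correct = ce_from_logits(logits, target)
buggy = ce_from_logits(softmax(logits), target)   # bug: probabilities passed as logits

# Inputs confined to [0, 1] cannot push the loss below this value for K = 3.
floor = math.log(math.e + 2) - 1

print(round(correct, 4), round(buggy, 3), round(floor, 3))  # 0.0007 0.552 0.551
```

With raw logits the loss is near zero; with the extra softmax it cannot drop below about 0.55 for three classes, no matter how confident the model becomes.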
Frequently asked questions
Why subtract the max before exponentiating in softmax?
exp() overflows fast — exp(1000) is infinity in float64, and inf/inf is NaN. Subtracting the maximum logit from every element shifts the largest exponent to exp(0) = 1, which never overflows. The output is mathematically identical because the shared factor cancels in the numerator and denominator.
What is the gradient of softmax plus cross-entropy?
It collapses to prediction minus target: ∂L/∂zᵢ = pᵢ − yᵢ, where p is the softmax output and y is the one-hot label. The messy softmax Jacobian and the log in cross-entropy cancel, which is exactly why the two are fused into one operation in every deep-learning framework.
What's the difference between softmax and sigmoid?
Softmax produces a distribution over mutually exclusive classes: the outputs sum to 1, so raising one class lowers the others. Sigmoid squashes each logit independently into (0, 1) with no coupling, which is what you want for multi-label problems where several classes can be true at once.
Why use cross-entropy instead of mean squared error for classification?
Cross-entropy paired with softmax gives a clean (p − y) gradient that stays large when the prediction is confidently wrong, so learning doesn't stall. MSE on softmax outputs produces a gradient that vanishes when a wrong class is near 0 or 1, causing slow, plateau-prone training.
Is softmax temperature the same as the temperature in language models?
Yes — it's the same trick. Dividing logits by a temperature T before softmax controls sharpness: T < 1 makes the distribution peakier (more deterministic), T > 1 flattens it (more random), and T → ∞ approaches a uniform distribution. LLM sampling and knowledge distillation both rely on it.
Why does the cross-entropy loss only depend on the true class probability?
With a one-hot label, every term yᵢ·log(pᵢ) is zero except the one where yᵢ = 1. So the loss reduces to −log(p_correct): the negative log-likelihood of the correct class. Pushing p_correct toward 1 drives the loss toward 0; a confident wrong answer (p_correct → 0) sends the loss toward infinity.