Information Theory
Cross-Entropy
The bits you actually spend when your code is built for the wrong distribution
H(P, Q) = −Σ P(x) log Q(x): the average number of bits needed to code samples from P with a code built for Q. It equals H(P) plus the KL divergence, a floor plus an overhead, and it is the default loss for most classifiers.
- Definition: H(P, Q) = −Σ P(x) log Q(x)
- Decomposition: H(P, Q) = H(P) + D_KL(P‖Q)
- Lower bound: H(P, Q) ≥ H(P), equality iff Q = P
- Binary form: −y log ŷ − (1 − y) log(1 − ŷ)
- Softmax gradient: ∂L/∂z_k = ŷ_k − y_k (clean)
- Used in: Classifiers, GLMs, language models, RL, distillation
From entropy to cross-entropy
Shannon's source-coding theorem says: if symbols are drawn from a distribution P, the smallest average codeword length you can achieve is the entropy H(P) = −Σ P(x) log P(x). To hit that lower bound you assign about −log P(x) bits to each outcome — the rare ones get long codewords, the common ones get short ones. Now suppose you build a code for the wrong distribution Q. You still assign −log Q(x) bits to each outcome, but when those outcomes come from P, the average length is
H(P, Q) = − Σ_x P(x) · log Q(x).
This is the cross-entropy. It is what you actually spend per symbol; the entropy H(P) is what you would have spent if you knew P. The gap is exactly the KL divergence:
H(P, Q) = H(P) + D_KL(P ‖ Q).
Because D_KL ≥ 0, the cross-entropy is bounded below by the entropy — with equality if and only if Q = P. When P is the empirical distribution of training data and Q is the model output, minimising cross-entropy over the model parameters is identical to minimising KL — H(P) is a data-only constant that vanishes in the gradient.
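The decomposition above can be checked numerically. The following is a minimal sketch, not a library implementation: the function names and the two distributions are hypothetical, chosen only for illustration, and everything is measured in nats.

```javascript
// Entropy, cross-entropy and KL divergence in nats (natural log).
// Terms with P(x) = 0 contribute nothing, by the convention 0 * log 0 = 0.
function entropy(p) {
  return p.reduce((s, pi) => s - (pi > 0 ? pi * Math.log(pi) : 0), 0);
}

function crossEntropyDist(p, q) {
  return p.reduce((s, pi, i) => s - (pi > 0 ? pi * Math.log(q[i]) : 0), 0);
}

function klDivergence(p, q) {
  return p.reduce((s, pi, i) => s + (pi > 0 ? pi * Math.log(pi / q[i]) : 0), 0);
}

// Hypothetical distributions, for illustration only
const P = [0.5, 0.25, 0.25];
const Qwrong = [0.8, 0.1, 0.1];

const h = entropy(P);                    // the floor
const hpq = crossEntropyDist(P, Qwrong); // what you actually spend
const kl = klDivergence(P, Qwrong);      // the overhead

console.log(h, kl, hpq);
console.log(Math.abs(hpq - (h + kl)) < 1e-12);             // H(P, Q) = H(P) + KL
console.log(Math.abs(crossEntropyDist(P, P) - h) < 1e-12); // equality when Q = P
```

Swapping in any other Q with full support should leave the first check true and keep hpq at or above h.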
Worked example — a four-class classifier
You have a single training example whose true class is "cat". The one-hot label is P = (1, 0, 0, 0) for the four classes (cat, dog, fish, bird). The model produces logits z = (2.0, 1.0, 0.1, −1.0). Apply softmax:
exp(z) = (7.389, 2.718, 1.105, 0.368)
sum = 11.580
Q = (0.638, 0.235, 0.095, 0.032).
Cross-entropy under P = one-hot picks out only the true class term, since the other P values are zero:
H(P, Q) = − log Q(cat) = − log 0.638 ≈ 0.449 nats
= 0.449 / ln 2 ≈ 0.648 bits.
So this example costs 0.449 nats of loss. The gradient w.r.t. the logits is simply Q − P = (−0.362, 0.235, 0.095, 0.032): a gradient-descent step moves against it, pushing the cat logit up and the others down. That clean ŷ − y form is the gradient through softmax-plus-cross-entropy and is a large part of why this combination dominates the field.
What if the model became more confident? Suppose z = (5.0, 1.0, 0.1, −1.0). Then Q(cat) ≈ 0.9725 and H(P, Q) = −log 0.9725 ≈ 0.028 nats, roughly a 16× reduction in loss for a 3-unit logit increase. Confident-right is cheap; what gets punished is confident-wrong.
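One way to convince yourself of the ŷ − y gradient is to compare it against a finite-difference estimate on the worked example's logits. This is a sketch under stated assumptions: the helper names (lossFromLogits, analyticGrad, numericGrad) and the step size 1e-5 are illustrative choices, not part of any library.

```javascript
function softmax(z) {
  const m = Math.max(...z);
  const exps = z.map(zi => Math.exp(zi - m));
  const sum = exps.reduce((s, e) => s + e, 0);
  return exps.map(e => e / sum);
}

// Loss for a one-hot label at index `target`, as a function of the logits
function lossFromLogits(z, target) {
  return -Math.log(softmax(z)[target]);
}

// Analytic gradient: softmax(z) minus the one-hot label
function analyticGrad(z, target) {
  return softmax(z).map((q, k) => q - (k === target ? 1 : 0));
}

// Central finite differences as an independent check (step size is an assumption)
function numericGrad(z, target, h = 1e-5) {
  return z.map((_, k) => {
    const up = z.slice();
    up[k] += h;
    const down = z.slice();
    down[k] -= h;
    return (lossFromLogits(up, target) - lossFromLogits(down, target)) / (2 * h);
  });
}

const z = [2.0, 1.0, 0.1, -1.0];
const ga = analyticGrad(z, 0);
const gn = numericGrad(z, 0);

console.log(ga.map(g => g.toFixed(3))); // [ '-0.362', '0.235', '0.095', '0.032' ]
const maxDiff = Math.max(...ga.map((g, k) => Math.abs(g - gn[k])));
console.log(maxDiff); // tiny: the two gradients agree to many decimals
```

The components of the gradient sum to zero, which reflects that adding a constant to every logit leaves the softmax unchanged.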
Binary cross-entropy
The two-class special case is so common that it has its own name. With label y ∈ {0, 1} and prediction ŷ ∈ (0, 1) from a sigmoid:
BCE(y, ŷ) = − [ y · log ŷ + (1 − y) · log(1 − ŷ) ].
One term contributes when y = 1; the other when y = 0. With a sigmoid output, the gradient w.r.t. the logit z is ŷ − y, mirroring the multi-class case. PyTorch ships two implementations: BCELoss expects ŷ ∈ (0, 1), while BCEWithLogitsLoss takes raw logits and fuses the sigmoid with the log for numerical stability via the log-sum-exp trick. Prefer the latter: if you apply torch.sigmoid by hand and then pass the result to BCELoss, the sigmoid saturates to exactly 0 or 1 for large-magnitude logits and the subsequent log loses precision.
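To see why fusing matters, here is a sketch that compares a naive sigmoid-then-log BCE with one common algebraic rewrite of the same quantity, max(z, 0) − z·y + log(1 + exp(−|z|)). The function names are hypothetical, and this is an illustration of the idea rather than a claim about how any particular library implements it.

```javascript
// Naive: sigmoid first, then log. The sigmoid saturates for large |z|.
function bceNaive(y, z) {
  const yHat = 1 / (1 + Math.exp(-z));
  return -(y * Math.log(yHat) + (1 - y) * Math.log(1 - yHat));
}

// Fused: algebraically the same loss, rewritten so log(0) never appears
function bceWithLogits(y, z) {
  return Math.max(z, 0) - z * y + Math.log1p(Math.exp(-Math.abs(z)));
}

// Moderate logit: both forms agree
console.log(bceNaive(1, 2), bceWithLogits(1, 2));

// Confident-wrong with a large positive logit: sigmoid(40) rounds to 1
console.log(bceNaive(0, 40));      // Infinity
console.log(bceWithLogits(0, 40)); // 40

// Confident-wrong with a large negative logit: sigmoid(-800) underflows to 0
console.log(bceNaive(1, -800));      // Infinity
console.log(bceWithLogits(1, -800)); // 800
```

For a confident-wrong prediction the fused loss grows linearly in |z|, which is the behaviour the exact formula has.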
JavaScript — implementing cross-entropy
// Numerically stable softmax: shift by the max logit so Math.exp cannot overflow
function softmax(z) {
  const m = Math.max(...z);
  const exp = z.map(zi => Math.exp(zi - m));
  const sum = exp.reduce((s, e) => s + e, 0);
  return exp.map(e => e / sum);
}

// Cross-entropy with a one-hot label: just the negative log of the true-class prob
function crossEntropy(yTrueIndex, qProbs) {
  const p = qProbs[yTrueIndex];
  // Clamp so an underflowed probability gives a large finite loss, not Infinity.
  // Production code uses log_softmax on the logits and avoids the explicit log of softmax.
  return -Math.log(Math.max(p, 1e-12));
}

// Binary cross-entropy on a probability yHat, clamped away from 0 and 1
function bce(y, yHat) {
  const eps = 1e-12;
  yHat = Math.min(Math.max(yHat, eps), 1 - eps);
  return -(y * Math.log(yHat) + (1 - y) * Math.log(1 - yHat));
}

// Worked example
const logits = [2.0, 1.0, 0.1, -1.0];
const Q = softmax(logits);
console.log(Q); // ≈ [0.638, 0.235, 0.095, 0.032]
console.log(crossEntropy(0, Q)); // ≈ 0.449 nats
console.log(crossEntropy(0, Q) / Math.LN2); // ≈ 0.648 bits

// Confident-right vs confident-wrong
console.log(bce(1, 0.99)); // ≈ 0.01: confident right, tiny loss
console.log(bce(1, 0.01)); // ≈ 4.61: confident wrong, huge loss
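The comment about log_softmax can be made concrete. The sketch below computes the loss straight from the logits via log-sum-exp, so no probability is ever formed and nothing needs clamping. The function names are illustrative assumptions, not a library API.

```javascript
// log(sum_k exp(z_k)), shifted by the max logit so Math.exp never overflows
function logSumExp(z) {
  const m = Math.max(...z);
  return m + Math.log(z.reduce((s, zi) => s + Math.exp(zi - m), 0));
}

// log-softmax: z_k - logSumExp(z)
function logSoftmax(z) {
  const lse = logSumExp(z);
  return z.map(zi => zi - lse);
}

// Cross-entropy for an integer class index, computed from raw logits
function crossEntropyFromLogits(yTrueIndex, z) {
  return -logSoftmax(z)[yTrueIndex];
}

// Same four-class example as above
console.log(crossEntropyFromLogits(0, [2.0, 1.0, 0.1, -1.0]).toFixed(3)); // 0.449

// Extreme logits: softmax would round the true-class probability to 0,
// but the fused form still returns the exact finite loss
console.log(crossEntropyFromLogits(0, [-1000, 0])); // 1000
```

The design choice is to keep everything in log space until the very end, which is why the extreme case stays finite without an epsilon clamp.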
Why softmax + cross-entropy is canonical
Three properties explain why this combination has been the default for classification since the 2012 AlexNet ImageNet result:
- Probabilistic interpretation. The softmax output Q is a categorical distribution; the loss −log Q(true class) is the negative log-likelihood of that distribution at the observed label. Training is maximum likelihood estimation in disguise, so the classical consistency and asymptotic efficiency results for MLE apply whenever the model is well specified and the usual regularity conditions hold.
- Clean gradient. The chain rule through (−log ∘ softmax) cancels the exponentials: ∂L/∂z_k = Q_k − y_k, where y is the one-hot label. No vanishing gradient on confident wrong answers, no spurious flat regions, no complicated chain-rule derivative, just predicted-minus-true.
- Calibration. Cross-entropy is a strictly proper scoring rule: its expected value is minimised only by reporting the true class probabilities, so the loss itself rewards honest uncertainty. In practice large networks trained with cross-entropy can still be over-confident and usually benefit from temperature scaling. Squared error on probabilities (the Brier score) is also proper, but through a softmax its gradient flattens far from the target, so it is slow to correct confident mistakes.
The combination was formalised in modern form by Bridle's 1989 "Probabilistic interpretation of feedforward classification network outputs", popularised by Hinton's group in the 1990s, and became standard with the AlexNet 2012 ImageNet paper. Most classifiers of the past decade use softmax with cross-entropy by default: VGG, ResNet and ViT train on it directly, transformer language models apply it at every token, and methods such as CLIP and DINO apply a softmax cross-entropy to similarity scores or teacher outputs.
Where cross-entropy shows up
- Image classification. ResNet, EfficientNet and Vision Transformers all train on categorical cross-entropy against ImageNet labels. Top-1 accuracy is the standard headline number; the loss being minimised is cross-entropy.
- Language modelling. Next-token prediction in transformer LMs — GPT, LLaMA, Mistral — is categorical cross-entropy over the vocabulary at every position. Perplexity, the standard metric, is exp(average cross-entropy per token).
- Logistic regression. Binary cross-entropy is the negative log-likelihood of the Bernoulli GLM. R's glm(family = binomial) minimises it by IRLS (Fisher scoring), and scikit-learn's LogisticRegression minimises an L2-regularised version by default, using quasi-Newton or Newton-type solvers.
- Speech recognition (CTC, RNN-T). Connectionist Temporal Classification and recurrent transducer losses are negative log-likelihoods of the reference transcript, with the probability summed over all valid alignments of the predicted per-frame distributions.
- Reinforcement learning with policy gradients. The actor loss in REINFORCE and A2C, −A · log π(a | s), is the cross-entropy on the sampled action weighted by its return or advantage A. Because A can be negative this is not a cross-entropy against a proper distribution, but the gradient has the same log-likelihood form. PPO builds a clipped probability-ratio surrogate from the same log-probabilities.
- Knowledge distillation. A student model trains against soft targets from a teacher (a temperature-softened teacher softmax). The loss is cross-entropy between teacher and student distributions — Hinton's 2015 paper formalised it as a standard technique for model compression.
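The perplexity relation from the language-modelling item is short enough to sketch directly. The per-token probabilities below are hypothetical; in a real evaluation they would be the probabilities the model assigned to the tokens that actually occurred.

```javascript
// Average cross-entropy (nats per token) from the probability the model
// assigned to each token that actually occurred
function averageCrossEntropy(trueTokenProbs) {
  const total = trueTokenProbs.reduce((s, p) => s - Math.log(p), 0);
  return total / trueTokenProbs.length;
}

// Perplexity is exp of the average cross-entropy in nats
function perplexity(trueTokenProbs) {
  return Math.exp(averageCrossEntropy(trueTokenProbs));
}

// Hypothetical probabilities for a five-token sequence
const tokenProbs = [0.5, 0.25, 0.125, 0.5, 0.25];
console.log(averageCrossEntropy(tokenProbs) / Math.LN2); // bits per token, about 1.8
console.log(perplexity(tokenProbs));                     // about 2^1.8

// Sanity check: a model that is uniform over 4 choices has perplexity about 4
console.log(perplexity([0.25, 0.25, 0.25, 0.25]));
```

Perplexity can be read as the effective number of equally likely choices the model is hesitating between at each token.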
Common pitfalls
- Computing softmax then log separately. Each step can underflow or overflow. Always use the fused log_softmax or CrossEntropyLoss on raw logits. The library handles the log-sum-exp shift internally.
- Mixing label conventions. Some frameworks expect integer class indices (PyTorch's CrossEntropyLoss); others expect one-hot vectors (TensorFlow's categorical_crossentropy). Mismatched assumptions can silently train against the wrong target.
- Forgetting class imbalance. Cross-entropy gives every sample equal weight; with severe imbalance the model can collapse onto the majority class. Use class weights, focal loss, or resampling to compensate.
- Treating cross-entropy as accuracy. A model can lower its cross-entropy while accuracy stalls — by becoming more confident on already-correct predictions. Always track accuracy and calibration alongside the loss.
- BCE on multi-label tasks vs softmax cross-entropy. Multi-label classification (each image can have several tags) requires independent sigmoids and BCE per output — not a softmax. Using softmax forces the outputs to sum to 1, which is wrong for non-exclusive labels.
- Label smoothing without retuning. Replacing one-hot labels with (1−ε, ε/(K−1), …) prevents over-confidence but raises the achievable cross-entropy floor from zero to the entropy of the smoothed distribution. Comparing raw losses across smoothed vs unsmoothed runs is meaningless without correction.
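The label-smoothing pitfall is easy to quantify. The sketch below builds a smoothed target and computes its entropy, which is the smallest cross-entropy any model can reach against that target. K = 4 and ε = 0.1 are illustrative values, and the function names are hypothetical.

```javascript
// Smoothed target: 1 - eps on the true class, eps / (K - 1) on each other class
function smoothLabels(trueIndex, K, eps) {
  return Array.from({ length: K }, (_, k) => (k === trueIndex ? 1 - eps : eps / (K - 1)));
}

// Entropy in nats, with the convention 0 * log 0 = 0
function entropy(p) {
  return p.reduce((s, pi) => s - (pi > 0 ? pi * Math.log(pi) : 0), 0);
}

// Cross-entropy of model distribution q against target distribution p
function crossEntropyDist(p, q) {
  return p.reduce((s, pi, i) => s - (pi > 0 ? pi * Math.log(q[i]) : 0), 0);
}

const oneHot = smoothLabels(0, 4, 0);     // [1, 0, 0, 0]
const smoothed = smoothLabels(0, 4, 0.1); // about [0.9, 0.033, 0.033, 0.033]

console.log(entropy(oneHot));              // 0: a perfect model can reach zero loss
console.log(entropy(smoothed).toFixed(3)); // 0.435: the floor under smoothing

// Even a model that matches the smoothed target exactly sits at the floor, not at 0
console.log(crossEntropyDist(smoothed, smoothed) === entropy(smoothed)); // true
```

Subtracting this floor from the reported loss is one simple correction when comparing smoothed and unsmoothed runs.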
Cross-entropy vs alternative classification losses
| | Cross-entropy | MSE on probabilities | Hinge loss | Focal loss | Label smoothing CE |
|---|---|---|---|---|---|
| Probabilistic interpretation | Yes (NLL) | No | No | Modified NLL | NLL on smoothed target |
| Gradient on confident-wrong | Large (good) | Vanishes (bad) | Constant magnitude | Up-weighted | Slightly reduced |
| Calibration | Reasonable, often needs temperature scaling | Proper score (Brier), but slow to fit | Not probabilistic | Tunable | Often better than CE |
| Handles class imbalance natively | No | No | No | Yes (γ parameter) | No |
| Convex in last-layer weights | Yes | No (through softmax or sigmoid) | Yes | No | Yes |
| Best for | Standard classification | Don't — use CE | Margin-based, SVMs | Detection (RetinaNet) | Regularising over-confident models |
Cross-entropy is the default for a reason: it has a probabilistic interpretation, plays cleanly with softmax, calibrates reasonably and converges fast. The other entries in the table are specialisations — hinge for SVMs, focal for object detection where most samples are easy negatives, label smoothing for over-confident models with abundant data.
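To make the focal-loss column concrete, here is a sketch of the basic form −(1 − p)^γ · log p, where p is the probability assigned to the true class. The default γ = 2 is an assumption chosen for illustration, and the class-balancing α weight used in some formulations is left out.

```javascript
// Plain cross-entropy on the true-class probability
function ce(pTrue) {
  return -Math.log(pTrue);
}

// Focal loss: cross-entropy scaled by (1 - p)^gamma, so easy examples count less.
// gamma = 2 is an illustrative default; gamma = 0 recovers plain cross-entropy.
function focalLoss(pTrue, gamma = 2) {
  return -Math.pow(1 - pTrue, gamma) * Math.log(pTrue);
}

// Easy (p = 0.9), ambiguous (p = 0.5) and hard (p = 0.1) examples
for (const p of [0.9, 0.5, 0.1]) {
  console.log(p, ce(p).toFixed(4), focalLoss(p).toFixed(4));
}

console.log(focalLoss(0.3, 0) === ce(0.3)); // true: gamma = 0 is plain cross-entropy
```

With γ = 2 the easy example's loss shrinks by a factor of about 100 while the hard example keeps most of its loss, which is why the easy negatives stop dominating the total.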
Frequently asked questions
Why is cross-entropy the standard loss for classification?
Three reasons converge. First, it is exactly the negative log-likelihood of a softmax-parameterised categorical model, so minimising it is maximum likelihood estimation, with the usual consistency, asymptotic normality and efficiency guarantees when the model is well specified and the standard regularity conditions hold. Second, when paired with softmax, the gradient through the logits is just (ŷ − y), the predicted-minus-true vector: no vanishing gradient on confident mistakes and a direct error signal for backpropagation. Third, cross-entropy is a strictly proper scoring rule: the penalty −log q on the true class grows without bound as q → 0, and the expected loss is minimised only by reporting the true class probabilities. In practice the resulting probabilities are reasonably calibrated, though large networks often still need temperature scaling, whereas squared error through a softmax learns slowly from confident mistakes.
What is the difference between cross-entropy and KL divergence?
H(P, Q) = H(P) + D_KL(P‖Q). Cross-entropy is the full coding cost. KL divergence is the surcharge above the entropy of P — the part you can drive to zero by improving Q. When training a classifier, the entropy H(P) of the data is a constant (depends only on the labels), so minimising cross-entropy over the model parameters is exactly minimising KL divergence. The two losses differ only by an additive constant of the data, never by behaviour during optimisation.
What is binary cross-entropy?
The two-class special case. With one Bernoulli output ŷ ∈ (0, 1) predicting the probability of class 1, and a label y ∈ {0, 1}, BCE is −[y log ŷ + (1 − y) log(1 − ŷ)]. Equivalent to a sigmoid output plus log-loss. PyTorch's BCELoss and BCEWithLogitsLoss compute exactly this; in the latter, the sigmoid and log are fused for numerical stability (log-sum-exp trick) — never use BCELoss after a manual sigmoid for the same reason.
Why do labels need to be one-hot?
They don't, strictly. With one-hot labels, cross-entropy reduces to a single term −log Q(true class) — the negative log-likelihood — which is what cross-entropy loss is usually implemented as. With soft labels (a probability distribution over classes), all terms contribute and cross-entropy genuinely matches the soft target. Label smoothing, knowledge distillation and Mixup all use soft labels. The mathematical object is the same; the dense vs sparse implementation differs.
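The dense and sparse forms can be put side by side. In this sketch the function names are hypothetical and the soft target is an arbitrary example of the kind Mixup or label smoothing would produce; the logits are the ones from the worked example.

```javascript
// log-softmax via log-sum-exp, computed from raw logits
function logSoftmax(z) {
  const m = Math.max(...z);
  const lse = m + Math.log(z.reduce((s, zi) => s + Math.exp(zi - m), 0));
  return z.map(zi => zi - lse);
}

// Dense form: full sum over classes, valid for any target distribution
function softCrossEntropy(target, z) {
  const logQ = logSoftmax(z);
  return target.reduce((s, t, k) => s - (t > 0 ? t * logQ[k] : 0), 0);
}

// Sparse form: a single lookup, valid only for a hard label
function sparseCrossEntropy(trueIndex, z) {
  return -logSoftmax(z)[trueIndex];
}

const z = [2.0, 1.0, 0.1, -1.0];

// With a one-hot target the two forms give the same number
console.log(softCrossEntropy([1, 0, 0, 0], z) === sparseCrossEntropy(0, z)); // true

// With a soft target every class with non-zero weight contributes
console.log(softCrossEntropy([0.7, 0.3, 0, 0], z));
```

The soft-target loss is just the weighted average of the per-class sparse losses, which is why frameworks can offer both behind one interface.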
Why is cross-entropy combined with softmax in deep learning?
Because the gradient simplifies spectacularly. Softmax pushes logits z through exp/Σexp to get probabilities; cross-entropy then takes −log of the predicted probability of the true class. When you compute the gradient of (softmax + cross-entropy) w.r.t. the logits, all the exponentials cancel and you get ∂L/∂z_k = ŷ_k − y_k. No vanishing gradient on confident mistakes, no chain-rule mess. Numerical stability comes separately, from the implementation: PyTorch's nn.CrossEntropyLoss fuses softmax and log into log_softmax, so it expects raw logits, not probabilities.
What happens if Q assigns zero probability to the true class?
Cross-entropy blows up: −log 0 = ∞. The model is certain the observed data is impossible, and the loss reflects that. In exact arithmetic every softmax output is strictly positive (the exponentials are positive, the denominator is positive), so a literal zero never occurs, but small probabilities give large losses, which is exactly what you want during training. The hard case is finite precision, especially float16 or bfloat16 in mixed-precision training, where exp(−large) underflows to 0. Subtracting the maximum logit before exponentiating prevents overflow, but a logit far below the maximum can still underflow, so stable implementations compute the log-probability directly as z_k − logsumexp(z) and never take the log of a rounded probability.
How does cross-entropy compare to mean squared error for classification?
MSE trains slowly on exactly the examples that matter most. Take a sigmoid output with loss ½(ŷ − y)²: the gradient w.r.t. the logit is ŷ(1−ŷ)(ŷ − y), which vanishes when ŷ → 1 on the wrong class, leaving the model stuck (the softmax case has the same saturating Jacobian factor). Cross-entropy cancels that factor, so the gradient ŷ − y stays large for confident mistakes. In practice cross-entropy usually converges faster than MSE for classification; the size of the gap depends on the task and architecture. Use MSE for regression; use cross-entropy for classification.
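The difference in gradients can be tabulated for a single sigmoid output. This sketch assumes the ½(ŷ − y)² convention for squared error so that its logit gradient is ŷ(1−ŷ)(ŷ − y); the function names and the chosen logits are illustrative.

```javascript
const sigmoid = z => 1 / (1 + Math.exp(-z));

// d/dz of binary cross-entropy through a sigmoid: yHat - y
function ceGrad(y, z) {
  return sigmoid(z) - y;
}

// d/dz of 0.5 * (yHat - y)^2 through a sigmoid: yHat * (1 - yHat) * (yHat - y)
function mseGrad(y, z) {
  const s = sigmoid(z);
  return s * (1 - s) * (s - y);
}

// True label is 0; the logit grows, so the model is ever more confidently wrong
for (const z of [0, 2, 5, 10]) {
  console.log(z, ceGrad(0, z).toFixed(5), mseGrad(0, z).toFixed(5));
}
```

As the logit grows, the cross-entropy gradient approaches 1 while the squared-error gradient decays towards 0, so the most wrong examples produce the weakest updates under MSE.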