Information Theory

Shannon Entropy

Average information per symbol — the 1948 number that built the digital age

H(X) = −Σ p(x) log p(x) — average surprise of a random symbol. Fair coin: 1 bit. Biased p = 0.9: ~0.47 bits. Maximum at uniform, zero at deterministic.

  • DefinitionH(X) = − Σ p(x) · log p(x)
  • Maximumlog K at the uniform distribution over K outcomes
  • Minimum0 at any deterministic distribution
  • AdditiveH(X, Y) = H(X) + H(Y) if X ⊥ Y
  • CoinedClaude Shannon, 1948
  • Unitsbits (log₂), nats (ln), dits (log₁₀)

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The 1948 formula

Claude Shannon's A Mathematical Theory of Communication appeared in two parts in the Bell System Technical Journal in 1948 and singlehandedly founded information theory. For a discrete random variable X with possible values x₁, …, x_K and probabilities p₁, …, p_K, Shannon defined the entropy

H(X) = − Σ_{i=1..K} p_i · log p_i.

The convention 0 · log 0 := 0 makes the formula well-defined at deterministic events. The base of the logarithm sets the units: log base 2 gives bits (the natural unit of information theory), natural log gives nats (the natural unit of calculus and machine learning), log base 10 gives dits or hartleys (older engineering literature). Conversion: 1 nat = log₂ e ≈ 1.4427 bits.

Shannon showed in the same paper that any functional H(p₁, …, p_K) satisfying three axioms — continuity in the p_i, monotonicity in K under the uniform distribution, and a chain-rule consistency for compound experiments — must be H = −C · Σ p_i log p_i for some positive constant C. Choosing log base 2 fixes C = 1 and gives the bit. Shannon entropy is not a definition; it is the unique measure of information up to a scaling constant.

Worked examples

A fair coin

P(heads) = P(tails) = 1/2. Plug in:

H = − [ 0.5 · log₂(0.5) + 0.5 · log₂(0.5) ]
  = − [ 0.5 · (−1) + 0.5 · (−1) ]
  = 1 bit.

Exactly one bit per flip. You need exactly one binary digit to communicate the outcome to a friend. This is the definition of a bit — the entropy of a fair binary choice.

A biased coin (p = 0.9 heads)

H = − [ 0.9 · log₂(0.9) + 0.1 · log₂(0.1) ]
  = − [ 0.9 · (−0.152)  + 0.1 · (−3.322)  ]
  =     0.137            + 0.332
  =     0.469  bits.

The biased coin carries about 0.47 bits per flip — less than half the fair coin. Because heads is almost certain, learning that heads came up is barely surprising; the rare tails carries most of the information. As p → 0 or p → 1 the entropy goes to zero — a deterministic outcome carries no information.

The full binary-entropy curve

For a binary distribution with bias p, the entropy is H(p) = −p log₂ p − (1−p) log₂(1−p). This is the famous binary-entropy function — a concave curve over [0, 1] with maximum 1 at p = 1/2 and zeros at p = 0 and p = 1. Symmetric about p = 1/2. It is the workhorse of coding theory and the binary-classification analogue of variance.

A four-class uniform distribution

Four outcomes, each probability 1/4. H = −4 · (1/4) · log₂(1/4) = log₂ 4 = 2 bits. The uniform distribution over K outcomes has entropy exactly log₂ K. Two bits suffice to identify one of four equiprobable outcomes; eight bits suffice for one of 256. This is the byte.

Three properties that determine everything

Maximum at uniform. H(p) ≤ log K with equality iff p is uniform. Proof by Jensen's inequality applied to the concave function log: E[log(1/p)] ≤ log E[1/p] = log K when p is the distribution being averaged over. Equality requires 1/p to be constant — i.e. uniform p. Practically: the uniform prior carries the most uncertainty, and so the most information when revealed.

Minimum at deterministic. H(p) = 0 iff p is a point mass at a single outcome. There is no surprise; the outcome is known in advance. Every other distribution has strictly positive entropy.

Additive under independence. If X ⊥ Y, then H(X, Y) = H(X) + H(Y). Two independent fair coins carry 2 bits of joint entropy. Dependence reduces entropy below the sum: H(X, Y) ≤ H(X) + H(Y) in general, with equality iff X and Y are independent. The shortfall is exactly the mutual information I(X; Y) = H(X) + H(Y) − H(X, Y).

The source-coding theorem — why entropy is the lower bound

Shannon's source-coding theorem says: for any prefix code over a discrete memoryless source X with entropy H bits, the expected codeword length L satisfies L ≥ H. Conversely, Huffman codes and arithmetic codes achieve L < H + 1 bit, and block coding tightens this to L → H asymptotically as the block length grows. Entropy is the asymptotic lower bound on lossless compression.

The implications. English text has an empirical character-level entropy of about 1.2–1.5 bits per character (Shannon estimated this himself with a clever pencil-and-paper experiment in 1951). ASCII spends 8 bits per character — about 5–6 bits of redundancy. gzip on English plain text compresses to ~3 bits per character; modern neural-language-model compressors like NNCP and DeepZip get close to the Shannon limit of 1.2 bits. There is a hard wall at the entropy — no algorithm can compress English below 1.2 bits per character without losing information.

JavaScript — computing entropy

// Discrete Shannon entropy in bits (replace LN2 by 1 for nats)
function entropy(probs) {
  let h = 0;
  for (const p of probs) {
    if (p > 0) h -= p * Math.log2(p);   // 0 · log 0 := 0
  }
  return h;
}

// Binary entropy function H(p) = -p log p - (1-p) log (1-p)
function binaryEntropy(p) {
  if (p === 0 || p === 1) return 0;
  return -(p * Math.log2(p) + (1 - p) * Math.log2(1 - p));
}

// Worked examples
console.log(entropy([0.5, 0.5]));               // 1.000  bit  — fair coin
console.log(entropy([0.9, 0.1]));               // 0.469  bits — biased coin
console.log(entropy([0.25, 0.25, 0.25, 0.25])); // 2.000  bits — 4-class uniform
console.log(binaryEntropy(0.5));                // 1.000  bit
console.log(binaryEntropy(0.5).toFixed(3));

// Empirical entropy from a string of symbols
function empiricalEntropy(symbols) {
  const counts = new Map();
  for (const s of symbols) counts.set(s, (counts.get(s) || 0) + 1);
  const n = symbols.length;
  const probs = [...counts.values()].map(c => c / n);
  return entropy(probs);
}

console.log(empiricalEntropy('hello world'.split(''))); // ~2.85 bits/char

Where Shannon entropy shows up

  • Lossless compression. ZIP, gzip, bzip2, zstd, modern neural compressors. The size of a compressed file converges to its entropy in bits times the number of symbols. Any compressor is, in effect, an entropy estimator.
  • Channel coding. Shannon's noisy-channel theorem says you can transmit information at any rate strictly below the channel capacity C (a maximum mutual information) with arbitrarily low error probability. Modern error-correcting codes (LDPC, polar codes) approach C; 5G and Wi-Fi 6 are direct applications.
  • Decision trees. ID3, C4.5 and CART choose splits to maximise information gain — the reduction in conditional entropy of the label distribution. The XGBoost and LightGBM trees do the same in their split-finding routines.
  • Maximum-entropy modelling. When you know constraints (mean, variance, marginal distributions) but not the full distribution, pick the maximum-entropy distribution consistent with the constraints. This recovers Gaussian for fixed variance, exponential for fixed mean on the positive reals, and Boltzmann for fixed mean energy — the bridge to statistical mechanics.
  • Cryptography. A password's strength is measured in bits of entropy. A truly random 64-bit key has 64 bits of entropy; a user-chosen 12-character password may have only 20–40 bits because of correlations and predictable substrings. Diceware passphrases (6 random dice = 12.9 bits per word) are designed to be entropy-quantifiable.
  • Genetics and biology. Information content of DNA sequences, position weight matrices in motif discovery, entropy of phylogenetic trees. Mutual information between residues in a protein family reveals coevolving sites — used in modern contact-prediction and AlphaFold ancestry.

Common pitfalls

  • Confusing entropy with variance. Variance measures spread of numerical values; entropy measures uncertainty over labels (or values, in continuous form). A uniform distribution over {0, 1} and over {0, 1000} have the same entropy (1 bit) but very different variances. Entropy is invariant under relabelling — variance is not.
  • Estimating entropy from small samples. The naive plug-in estimator Ĥ = −Σ p̂ log p̂ from sample frequencies is biased downward by roughly (K − 1)/(2n) nats. For K = 100 categories and n = 1000 samples, the bias is ~0.05 nats — significant in many applications. Use the Miller-Madow correction, the Chao-Shen estimator, or Bayesian estimators.
  • Confusing differential entropy with discrete entropy. For continuous distributions, h(X) = −∫ f log f dx is not a direct generalisation — it can be negative and changes under unit changes. Treat it as a relative quantity. Use entropy rate for time series and discrete entropy for finite alphabets.
  • Picking the wrong log base. A "bit" in PyTorch's torch.distributions.Categorical(...).entropy() is actually a nat (natural log default). Conversion error of factor 1.44 is the most common entropy mistake in published code.
  • Ignoring the support assumption. Entropy is well-defined only over a fixed alphabet. Comparing entropies of distributions over different alphabets requires care — the maximum entropy log K depends on K.
  • Using entropy where mutual information is needed. H(X) measures the uncertainty of X alone. To measure how much information X carries about Y you need I(X; Y) = H(X) − H(X|Y). Many practitioners report H(X) when the meaningful quantity is mutual information.

Shannon entropy vs alternative entropies

ShannonRényiTsallisDifferentialMin-entropy
Definition−Σ p log p(1−α)⁻¹ log Σ p^α(α−1)⁻¹(1−Σ p^α)−∫ f log f dx−log max p
Additive over independent variablesYesYesNo (q-deformed)YesNo (max is not multiplicative)
Maximum at uniformYesYesYesDepends on supportYes
Recovers Shannon at α=1Yes (limit)Yes (limit)No
Used forCoding, ML, statisticsDiversity indices, cryptanalysisNon-extensive statistical mechanicsGaussian channels, VICryptography (worst-case)
Best whenAverage-case behaviour over many trialsSpecific moment-based diversityLong-range correlations or fractalsContinuous random variablesAdversarial security guarantees

Shannon's H is the unique entropy with the additivity + chain-rule properties Shannon required. The Rényi family interpolates between Shannon (α=1), Hartley (α=0, just log of support size) and min-entropy (α=∞). Tsallis allows non-additivity for systems with long-range correlations (popular in physics). Differential entropy is the continuous analogue. Min-entropy is the worst-case quantity used in cryptography. For most ML and statistics, Shannon is the right choice.

Frequently asked questions

Why is entropy of a fair coin exactly one bit?

Take p = 0.5 for both outcomes. H = −[0.5 log₂ 0.5 + 0.5 log₂ 0.5] = −[0.5 · (−1) + 0.5 · (−1)] = 1 bit. The choice of log base 2 makes the unit a "bit" — a binary digit's worth of information. Intuitively, you need exactly one binary digit to tell someone whether the coin came up heads or tails when both are equally likely. The bit is defined as the entropy of a fair binary choice — it is not a coincidence.

Why is entropy maximised at the uniform distribution?

Use Lagrange multipliers to maximise H = −Σ p_i log p_i subject to Σ p_i = 1. Setting ∂/∂p_i = 0 gives log p_i + 1 = λ for all i, so all p_i are equal, p_i = 1/K. Then H_max = log K. Intuitively, the uniform distribution gives no hint about which outcome will occur — every symbol is equally surprising, so the average surprise is highest. Any concentration of probability mass on a single outcome reduces the average surprise — once you know one outcome is much more likely, you're rarely surprised.

What's the difference between bits, nats, and dits?

They are the same quantity in different log bases. Log base 2 gives bits (information theory, coding). Natural log gives nats (calculus, machine learning, neuroscience). Log base 10 gives dits or hartleys (older engineering literature). Conversion: 1 nat = log₂ e ≈ 1.4427 bits, 1 dit ≈ 3.322 bits. PyTorch's torch.distributions.entropy() returns nats; scipy.stats.entropy() defaults to nats but accepts base=2; information-theory textbooks (Cover & Thomas, MacKay) use bits.

Why is entropy additive for independent variables?

If X and Y are independent, p(x, y) = p(x)p(y). The joint entropy: H(X, Y) = −Σ p(x)p(y)[log p(x) + log p(y)] = H(X) + H(Y). So two independent fair coins carry exactly 2 bits of entropy — one bit each. This additivity is one of the three axioms Shannon imposed in 1948, and (together with continuity and monotonicity) it uniquely determines the H = −Σ p log p form. Any other functional would either violate additivity, fail continuity, or contradict the chain rule for compound experiments.

What does Shannon's source-coding theorem actually say?

If you encode a sequence of n i.i.d. symbols drawn from a distribution with entropy H, the smallest expected total codeword length grows as n · H bits. No code can beat this rate asymptotically (the converse direction), and codes that achieve it exist (the achievability direction — Huffman codes, arithmetic codes). Practically — the entropy of English text is about 1.2 bits per character; gzip on plain ASCII achieves ~3 bits per character on English; modern neural-language-model compressors approach the true 1.2-bit Shannon limit. Entropy is the irreducible cost of representing the distribution.

How does Shannon entropy relate to thermodynamic entropy?

Shannon's H = −Σ p log p is the same functional form as Gibbs's entropy in statistical mechanics, S = −k_B Σ p_i log p_i, where the sum runs over microstates. Boltzmann constant k_B sets the units (joules per kelvin); Shannon used logarithm base 2 to count bits. Both quantify the average log-probability of states drawn from a probability distribution. The connection was deepened by Jaynes (1957), who reformulated statistical mechanics as inference under maximum-entropy priors — the equilibrium distribution is the one that maximises Shannon entropy subject to known constraints (mean energy, conservation laws). Information theory and thermodynamics speak the same language.

What is differential entropy for continuous distributions?

Replace the sum with an integral: h(X) = −∫ f(x) log f(x) dx for density f. Differential entropy is not a direct analogue of the discrete version — it can be negative, it changes under unit changes by an additive constant, and it is not invariant under reversible transformations. It still appears in continuous information theory (Gaussian channels, mutual information, variational inference), but treat it as a relative quantity, not an absolute count of bits. For a Gaussian, h(𝒩(μ, σ²)) = ½ log(2πeσ²); doubling σ adds 1 bit of differential entropy.