Jensen's Inequality

Convex function of a mean is at most the mean of the function — the chord-above-curve fact, applied to probability distributions

For convex f and any random variable X, f(E[X]) ≤ E[f(X)] — chord-above-curve, promoted to distributions. Concave flips the sign.

  • Statement: f(E[X]) ≤ E[f(X)] for convex f
  • Concave version: g(E[X]) ≥ E[g(X)] (flips the sign)
  • Equality: X constant a.s., or f affine on support
  • Log corollary: log E[X] ≥ E[log X] for X > 0
  • Proved by: Johan Jensen, 1906
  • Foundation of: Entropy bounds, EM algorithm, KL divergence

The inequality in three equivalent forms

Jensen's inequality has three faces, increasing in generality:

  1. Two-point form. For a convex function f : ℝ → ℝ and t ∈ [0, 1], f((1−t)a + tb) ≤ (1−t) f(a) + t f(b). This is just the definition of a convex function.
  2. Finite weighted form. For convex f, points x₁, …, x_n, weights λᵢ ≥ 0 with Σλᵢ = 1: f(Σ λᵢ xᵢ) ≤ Σ λᵢ f(xᵢ). Pure induction on n from the two-point form.
  3. Probabilistic form. For convex f and any integrable random variable X: f(E[X]) ≤ E[f(X)]. The natural extension to general distributions, proved by approximating X by discrete random variables.

The concave version replaces convex f with concave g and flips the inequality: g(E[X]) ≥ E[g(X)]. A function g is concave iff −g is convex, so the two cases are dual; most applications deploy whichever direction is convenient.
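The finite weighted form is easy to check numerically. The sketch below is only an illustration: the helper name jensen_sides and the sample points and weights are made up for this example, with e^x standing in for a convex function and log for a concave one.

```python
import math

def jensen_sides(f, points, weights):
    """Return (f(sum w_i x_i), sum w_i f(x_i)) for weights that sum to 1."""
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be non-negative and sum to 1")
    mean = sum(w * x for w, x in zip(weights, points))
    return f(mean), sum(w * f(x) for w, x in zip(weights, points))

# Illustrative data (not from the text above).
points = [0.5, 1.0, 4.0]
weights = [0.2, 0.5, 0.3]

convex_lhs, convex_rhs = jensen_sides(math.exp, points, weights)    # convex: lhs <= rhs
concave_lhs, concave_rhs = jensen_sides(math.log, points, weights)  # concave: lhs >= rhs
```

Swapping math.exp for math.log reverses which side is larger, which is the duality described above.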

A reminder: what 'convex' actually means

A function f : I → ℝ on an interval is convex if for all a, b ∈ I and t ∈ [0, 1], f((1−t)a + tb) ≤ (1−t)f(a) + t·f(b). Geometrically: the secant line connecting any two points on the graph lies above (or on) the graph between them. Strict convexity replaces ≤ by < for t ∈ (0, 1) and distinct a, b — the chord lies strictly above the curve except at the endpoints.

Differentiable f is convex iff f' is non-decreasing, iff f'' ≥ 0 wherever twice-differentiable. The chord-above-curve condition can be checked numerically by sampling pairs of points. Important convex functions to keep in mind:

  • x², x⁴, x⁶, … (even powers) — convex on ℝ.
  • e^x — convex on ℝ, strictly so.
  • −log x, 1/x — convex on (0, ∞).
  • |x|, max(0, x) — convex on ℝ, not differentiable at 0.
  • x log x — convex on (0, ∞), the entropy-style 'self-information' function.
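The numerical check mentioned above can be sketched as follows. The function name looks_convex, the tolerance, and the trial count are assumptions chosen for illustration. A passing result is evidence of convexity on the sampled interval, not a proof.

```python
import math
import random

def looks_convex(f, lo, hi, trials=10_000, tol=1e-9, seed=0):
    """Sample pairs (a, b) and mixing weights t; report False on any chord violation."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, t = rng.uniform(lo, hi), rng.uniform(lo, hi), rng.random()
        chord = (1 - t) * f(a) + t * f(b)      # point on the secant line
        curve = f((1 - t) * a + t * b)         # point on the graph
        if curve > chord + tol:                # the curve rose above the chord
            return False
    return True

square_ok = looks_convex(lambda x: x * x, -5.0, 5.0)
neg_log_ok = looks_convex(lambda x: -math.log(x), 0.1, 10.0)
x_log_x_ok = looks_convex(lambda x: x * math.log(x), 0.1, 10.0)
sine_ok = looks_convex(math.sin, 0.0, 2 * math.pi)   # sin is not convex on [0, 2*pi]
```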

Why it is true — geometric proof

For convex f, the chord-above-curve property has an equivalent form that looks at the graph from below: at any point x₀ in the interior of the domain there is a supporting line through (x₀, f(x₀)) that lies on or below the graph everywhere:

f(x) ≥ f(x₀) + s · (x − x₀)   for all x,

where s is any subgradient of f at x₀ (the slope of any supporting line). For differentiable f, s = f'(x₀); for non-differentiable convex f, any element of the subdifferential ∂f(x₀) will do.

Now pick x₀ = E[X] and take expectations of both sides:

E[f(X)] ≥ E[f(E[X]) + s · (X − E[X])]
        = f(E[X]) + s · (E[X] − E[X])
        = f(E[X]).

The supporting-line term has expectation zero because s is a constant (fixed by x₀, independent of the realisation of X) and E[X − E[X]] = 0. Two lines of arithmetic: the whole inequality reduces to placing a supporting line under the curve at the mean and taking expectations.

Worked examples

Variance is non-negative — Jensen with f(x) = x²

Take f(x) = x², a convex function. Jensen says (E[X])² ≤ E[X²], equivalently 0 ≤ E[X²] − (E[X])² = Var(X). Variance is non-negative. Equality iff X is almost surely constant — the classical degenerate case.
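As a small exact check, take a fair six-sided die as an assumed example (it is not part of the argument above). Exact fractions avoid rounding.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)                        # fair die: each face has probability 1/6

mean = sum(p * x for x in faces)          # E[X]
mean_sq = sum(p * x * x for x in faces)   # E[X^2]
variance = mean_sq - mean ** 2            # the Jensen gap for f(x) = x^2
```

Here the gap E[X²] − (E[X])² is the variance itself, so the inequality and the non-negativity of variance are the same statement.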

AM ≥ GM — Jensen with f(x) = −log x

For positive x₁, …, x_n and equal weights 1/n, take f(x) = −log x (convex on (0, ∞)). Jensen gives

−log(Σ xᵢ/n)  ≤  Σ (−log xᵢ)/n
log(Σ xᵢ/n)   ≥  Σ log xᵢ/n  =  log (Π xᵢ)^(1/n).

Exponentiating: arithmetic mean Σxᵢ/n ≥ geometric mean (Π xᵢ)^(1/n). Equality iff all xᵢ are equal — the strict-convexity case. AM ≥ GM, the most-used elementary inequality after the triangle inequality, falls out as a one-line corollary of Jensen.
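The derivation can be mirrored in a few lines. This is a sketch with illustrative inputs; the geometric mean is computed through logs, exactly as in the Jensen argument.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # exp of the average log, i.e. (product of xs)^(1/n) for positive xs
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

spread = [1.0, 2.0, 4.0, 8.0]    # unequal values: strict inequality expected
equal = [3.0, 3.0, 3.0]          # equal values: the two means coincide

am_spread, gm_spread = arithmetic_mean(spread), geometric_mean(spread)
am_equal, gm_equal = arithmetic_mean(equal), geometric_mean(equal)
```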

Entropy of a fair die

For a probability distribution p₁, …, p_n on n outcomes, the Shannon entropy is H(p) = −Σ pᵢ log pᵢ = E[−log p(X)]. The function f(x) = −log x is convex on (0, 1], so by Jensen H(p) = E[−log p(X)] ≥ −log E[p(X)] = −log Σ pᵢ². For the uniform distribution pᵢ = 1/n (a fair die has n = 6), Σ pᵢ² = 1/n, giving H ≥ log n. In the other direction, concave Jensen with log gives H(p) = E[log(1/p(X))] ≤ log E[1/p(X)] = log n when every pᵢ > 0, and at most log n otherwise. So the uniform distribution attains the maximum entropy log n: among all distributions on n outcomes, uniform is the most uncertain.
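Both Jensen bounds on entropy can be checked side by side. The function names and the skewed example distribution below are illustrative assumptions.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def collision_lower_bound(p):
    """The convex-Jensen lower bound -log(sum p_i^2)."""
    return -math.log(sum(pi * pi for pi in p))

fair_die = [1 / 6] * 6
skewed = [0.5, 0.25, 0.125, 0.125]

# For each distribution: lower bound <= H(p) <= log n.
fair_bounds = (collision_lower_bound(fair_die), entropy(fair_die), math.log(6))
skewed_bounds = (collision_lower_bound(skewed), entropy(skewed), math.log(4))
```

For the fair die all three numbers agree up to rounding; for the skewed distribution both inequalities are strict.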

Log E[X] ≥ E[log X] — concave Jensen

For a positive random variable X with finite mean, the concavity of log gives log E[X] ≥ E[log X]. In general the gap log E[X] − E[log X] is just a non-negative number, but in the variational setting, where X = p(x, Z)/q(Z) with Z drawn from q, it is exactly the Kullback-Leibler divergence KL(q ‖ p(Z | x)) between q and the posterior. This is the inequality behind the EM algorithm's monotone improvement, variational Bayes' evidence lower bound, Gibbs' inequality, and the second law of thermodynamics in its information-theoretic form.

Jensen vs related inequalities

Inequality | What it bounds | Convex/concave function used | Relation to Jensen
Markov | P(X ≥ a) for X ≥ 0 | Indicator argument; not Jensen | Not a corollary; supplies the tail step that Chernoff-type bounds combine with convexity of e^(sx)
Chebyshev | P(|X − μ| ≥ kσ) | Markov applied to (X − μ)² | Corollary of Markov on the squared variable
Cauchy-Schwarz | |⟨u, v⟩|² ≤ ⟨u, u⟩⟨v, v⟩ | Quadratic non-negativity | Same flavour; proved without Jensen, and its special case (E[X])² ≤ E[X²] is Jensen for ψ(x) = x²
Hölder | |Σ aᵢbᵢ| ≤ (Σ|aᵢ|ᵖ)^(1/p) (Σ|bᵢ|^q)^(1/q) | Power function via concavity of log | Generalises Cauchy-Schwarz and follows from concave Jensen on log
AM ≥ GM | Arithmetic mean ≥ geometric mean | f = −log convex on (0, ∞) | One-line corollary of Jensen with equal weights
Gibbs / KL ≥ 0 | KL(p ‖ q) ≥ 0 | f = −log convex | Direct application of Jensen to −log(q/p) under p
Hoeffding | Tail bound for bounded sums | Exponential moment generating bound, convex | Uses convexity of e^(sx) to bound the MGF inside Markov

Jensen sits at the centre of this family. Markov and Chebyshev are non-Jensen ancestors (they only need monotonicity and non-negativity); Hölder, AM ≥ GM and Gibbs' inequality (KL ≥ 0) follow from Jensen; Hoeffding and similar concentration bounds use convexity of the exponential inside Markov's exponential trick.

Where Jensen earns its keep

  • Information theory — Gibbs' inequality. The KL divergence KL(p ‖ q) = Σ pᵢ log(pᵢ/qᵢ) ≥ 0 for any two distributions p, q on the same support. Proof: KL(p ‖ q) = E_p[−log(q/p)]; apply concave Jensen with f = log to flip the sign and get KL ≥ −log E_p[q/p] = −log 1 = 0. The cornerstone of source coding theorems and statistical inference.
  • EM algorithm. The expectation-maximisation algorithm for mixture models constructs a lower bound on the log-likelihood using Jensen: log E_q[p(X, Z) / q(Z)] ≥ E_q[log(p(X, Z) / q(Z))]. Maximising the lower bound (E-step expected log-likelihood, M-step parameter update) monotonically increases the true log-likelihood. The Jensen step is exactly why EM never goes backwards.
  • Variational Bayes. The evidence lower bound (ELBO) used in modern variational inference and in variational autoencoders is the same Jensen-bounded log-marginal: log p(X) ≥ E_q[log p(X, Z) − log q(Z)]. Maximising over q is the variational optimisation; the gap log p(X) − ELBO = KL(q ‖ posterior).
  • Risk-averse decision making. Concave utility u(·) of wealth gives E[u(W)] ≤ u(E[W]) by Jensen — the certainty equivalent of an uncertain payoff is less than its expected value. The gap measures risk aversion (Pratt-Arrow coefficient). This is why people buy insurance even at premia exceeding expected loss.
  • Convex risk measures in finance. Coherent and convex risk measures in Artzner-Delbaen-Eber-Heath theory are essentially expectations of convex functions of P&L; Jensen-type bounds connect them to scenario-based risk constraints.
  • PAC-Bayes and learning theory. Generalisation bounds in PAC-Bayesian analysis are derived by applying Jensen to the moment generating function of the loss; the convexity of x ↦ e^x is the kernel that produces concentration bounds.
  • Statistical mechanics: the Gibbs-Bogoliubov bound. The Helmholtz free energy F = −kT log Z (Z the partition function) is a lower bound on the variational free energy of any trial distribution, by Jensen applied to the exponential inside Z. The variational principle for F is the physics version of the ELBO.
  • Reinforcement learning policy gradients. Trust region policy optimisation (TRPO) builds a surrogate objective whose monotone-improvement guarantee is stated in terms of the KL divergence between successive policies; proximal policy optimisation (PPO) is a practical approximation of the same idea. The non-negativity of that KL term is Gibbs' inequality, which is Jensen again.
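The Gibbs, EM and ELBO items share one computation, which can be sketched on a toy model. Everything numeric here is an assumption made for illustration: a single observation x, a latent Z with two values, and joint probabilities p(x, z) chosen by hand.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def elbo(joint, q):
    """E_q[log p(x, Z) - log q(Z)] for a fixed observation x."""
    return sum(qz * math.log(pxz / qz) for pxz, qz in zip(joint, q) if qz > 0)

joint = [0.12, 0.28]                         # hypothetical p(x, z) for z = 0, 1
evidence = sum(joint)                        # p(x)
posterior = [v / evidence for v in joint]    # p(z | x)

q = [0.5, 0.5]                               # an arbitrary variational distribution
gap = math.log(evidence) - elbo(joint, q)    # log p(x) - ELBO
gap_at_posterior = math.log(evidence) - elbo(joint, posterior)
```

In this toy model the gap equals KL(q ‖ posterior) and vanishes when q is the posterior, which is the identity quoted in the Variational Bayes item.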

Common mistakes

  • Reversing the inequality for concave functions silently. Jensen's direction depends on convexity vs concavity. log is concave, x² is convex, e^x is convex, |x| is convex, 1/x is convex on (0, ∞). Picking the wrong sign means you have proved the opposite of what you wanted — very easy to miss.
  • Using Jensen on non-convex functions. The inequality fails for general functions. f(x) = sin x on [0, 2π] is concave on [0, π] and convex on [π, 2π], so you cannot apply Jensen in either direction over the whole interval. Restrict to a domain where convexity (or concavity) holds.
  • Forgetting integrability. Jensen needs E[X] to exist and E[f(X)] to make sense. For heavy-tailed distributions where E[f(X)] is infinite, the inequality is technically vacuous (∞ ≥ anything). For random variables without finite mean (Cauchy distribution), the statement does not apply at all.
  • Using strict inequality without strict convexity. Jensen gives ≤, not <. Strict inequality is guaranteed when f is strictly convex AND X is non-degenerate. If X is constant, equality holds trivially; if f is affine on the support of X, equality holds trivially. Always check before claiming strict gain.
  • Applying Jensen to a function of two random variables coordinate-wise. For a jointly convex function f(x, y), f(E[X], E[Y]) ≤ E[f(X, Y)] holds, but convexity in each coordinate separately is not enough: you cannot prove the bound by applying univariate Jensen in X with Y fixed and then in Y. The multivariate statement needs joint convexity (for twice-differentiable f, a positive semidefinite Hessian).
  • Misreading the gap as a rate. The Jensen gap E[f(X)] − f(E[X]) is a single number tied to one fixed distribution. It does not predict how the gap scales when the distribution changes. For asymptotic-rate analyses you usually need second-order bounds involving Var(X) and the second derivative of f.
  • Confusing the moment requirements of plain and sharpened Jensen. Plain Jensen needs only a finite first moment E[X]; E[f(X)] may be +∞, in which case the bound is vacuous. Sharpened versions involving variance need finite second moments; subgaussian concentration bounds need exponential moments. Picking the right sharpening for your moment regime matters.
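The coordinate-wise pitfall has a two-line counterexample. The function f(x, y) = x·y and the two-outcome joint distribution are assumptions picked for illustration: f is affine (hence convex) in each argument separately, yet not jointly convex, and the Jensen bound fails.

```python
from fractions import Fraction

# Hypothetical joint law: (X, Y) is (1, -1) or (-1, 1), each with probability 1/2.
outcomes = [((1, -1), Fraction(1, 2)), ((-1, 1), Fraction(1, 2))]

def f(x, y):
    return x * y   # convex in x for fixed y and in y for fixed x, not jointly convex

mean_x = sum(p * x for (x, _), p in outcomes)
mean_y = sum(p * y for (_, y), p in outcomes)

f_of_means = f(mean_x, mean_y)                          # f(E[X], E[Y])
mean_of_f = sum(p * f(x, y) for (x, y), p in outcomes)  # E[f(X, Y)]
```

Here f(E[X], E[Y]) is 0 while E[f(X, Y)] is −1, so the inequality points the wrong way.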

Kelly betting and the geometric mean

The Kelly criterion (J. L. Kelly Jr., 1956) for sizing a sequence of independent bets maximises the expected log-wealth, by Jensen-derived logic. If you have a sequence of independent percentage returns r₁, r₂, …, your terminal wealth multiplies by (1 + r₁)(1 + r₂)·… and the long-run growth rate is the geometric mean. Maximising arithmetic-mean return (myopic Markowitz) is suboptimal for repeated bets because Jensen tells you log E[1+r] ≥ E[log(1+r)] — the gap is your 'volatility tax'.

Kelly's insight: maximise E[log(1+r)] instead. By the law of large numbers, the long-run growth rate of repeated independent bets is the expected log-return, so maximising it gives the largest asymptotic growth rate. For binary bets, the optimal Kelly fraction is f* = (p·b − q)/b, where p is the win probability, q = 1 − p, and b is the net odds (profit per unit staked on a win, with the stake lost otherwise). The Kelly criterion is the Jensen-driven answer to compound investing. It is used by professional poker players, sports bettors, and certain quantitative funds, although strict Kelly is psychologically painful and most practitioners use fractional Kelly (half or quarter) for variance control.
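A minimal sketch of the binary-bet formula, assuming a win pays b per unit staked and a loss forfeits the stake. The function names and the example numbers (p = 0.6, b = 1) are illustrative.

```python
import math

def kelly_fraction(p, b):
    """Optimal stake fraction f* = (p*b - q)/b for a binary bet with net odds b."""
    return (p * b - (1 - p)) / b

def expected_log_growth(f, p, b):
    """E[log(1 + f*r)]: win multiplies wealth by 1 + f*b, loss by 1 - f."""
    return p * math.log(1 + f * b) + (1 - p) * math.log(1 - f)

p, b = 0.6, 1.0
f_star = kelly_fraction(p, b)

growth_at_kelly = expected_log_growth(f_star, p, b)
growth_below = expected_log_growth(f_star / 2, p, b)     # half Kelly
growth_above = expected_log_growth(f_star * 1.5, p, b)   # over-betting

# Jensen gap ("volatility tax") at the Kelly stake: log E[wealth factor] - E[log wealth factor]
log_of_mean = math.log(p * (1 + f_star * b) + (1 - p) * (1 - f_star))
volatility_tax = log_of_mean - growth_at_kelly
```

Half Kelly gives up some growth in exchange for lower variance, which is the trade-off mentioned above.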

Maximum entropy and Jensen

One of Jensen's biggest uses in applied probability is the maximum entropy principle. For a distribution p on a finite set with constraints E_p[T_i(X)] = τ_i for i = 1, …, m, the maximum-entropy distribution subject to those constraints is found by maximising H(p) = −Σ pᵢ log pᵢ. The standard derivation uses Lagrange multipliers: the optimum has the exponential family form p(x) = exp(Σ λ_i T_i(x))/Z. The reason this is the right answer — rather than just one critical point — is Jensen: the entropy function p ↦ −Σ pᵢ log pᵢ is concave, so any KKT point of the constrained problem is a global maximum.

Concretely: on ℝ with a fixed mean and variance, the max-entropy distribution is the Gaussian. On [0, ∞) with a fixed mean, it is the exponential distribution. On a bounded support with no further constraint, it is uniform. Many of the 'natural' distributions in classical statistics are maximisers of entropy under appropriate constraints, and concavity of the entropy is what certifies that the critical point is the maximum.
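For a finite set with a single mean constraint, the exponential-family form can be computed directly. The sketch below assumes a six-sided die with a prescribed mean and finds the multiplier λ by bisection; the function names, the bracket [−50, 50] and the target mean 4.5 are choices made for illustration.

```python
import math

def tilted(lam, support):
    """Exponential-family distribution p(x) proportional to exp(lam * x)."""
    weights = [math.exp(lam * x) for x in support]
    z = sum(weights)
    return [w / z for w in weights]

def maxent_with_mean(target, support, lo=-50.0, hi=50.0, iters=200):
    """Bisect on lam; the mean of tilted(lam) increases with lam."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        mean = sum(p * x for p, x in zip(tilted(mid, support), support))
        if mean < target:
            lo = mid
        else:
            hi = mid
    return tilted((lo + hi) / 2, support)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

die = [1, 2, 3, 4, 5, 6]
loaded = maxent_with_mean(4.5, die)        # max-entropy die with mean 4.5
unconstrained = maxent_with_mean(3.5, die) # mean 3.5 recovers the uniform die
rival = [0, 0, 0.5, 0, 0, 0.5]             # another distribution with mean 4.5
```

Any other distribution with the same mean, such as rival, has lower entropy than loaded, which is the global-maximum claim made above.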

Beyond real-valued: Jensen in vector spaces and probability spaces

Jensen extends naturally to vector-valued random variables and to convex functions on more general domains. For X taking values in ℝⁿ and a convex function f : ℝⁿ → ℝ, the same inequality f(E[X]) ≤ E[f(X)] holds — equality iff X is concentrated on an affine subspace where f is affine. For X taking values in a Banach space and f : Banach → ℝ convex and lower-semicontinuous, the inequality still holds, with appropriate Pettis or Bochner integrability hypotheses.

An even more general setting: Choquet's theorem. Every point x in a metrizable compact convex set K (in a locally convex space) is the barycenter of a probability measure μ concentrated on the extreme points of K. Integrating a continuous convex function f against this measure gives an upper bound on its value at the barycenter: f(x) ≤ ∫ f dμ. This is the inequality's deepest geometric form and the route by which Jensen connects to the theory of barycentric subdivisions, Bauer's maximum principle, and the Krein-Milman theorem.

Conditional Jensen and tower bounds

Jensen extends to conditional expectation: for convex f and any sub-σ-algebra G, f(E[X | G]) ≤ E[f(X) | G] almost surely. Taking outer expectations gives f(E[X]) ≤ E[E[f(X) | G]] = E[f(X)], the unconditional version, but the conditional form is what powers martingale theory and information-theoretic data processing inequalities.
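On a finite probability space, conditioning on G amounts to averaging within groups, so the conditional inequality and the tower step can be verified exactly. The four-outcome distribution and the choice f(x) = x² are assumptions for illustration.

```python
from fractions import Fraction

# (group label, value of X, probability): a hypothetical four-outcome example.
outcomes = [
    ("a", 1, Fraction(1, 5)), ("a", 3, Fraction(3, 10)),
    ("b", 2, Fraction(1, 10)), ("b", 6, Fraction(2, 5)),
]

def f(x):
    return x * x   # a convex function

def conditional_table(rows, func):
    """Return {group: (P(group), E[func(X) | group])}."""
    mass, total = {}, {}
    for g, x, p in rows:
        mass[g] = mass.get(g, 0) + p
        total[g] = total.get(g, 0) + p * func(x)
    return {g: (mass[g], total[g] / mass[g]) for g in mass}

cond_mean = conditional_table(outcomes, lambda x: x)   # E[X | G]
cond_f = conditional_table(outcomes, f)                # E[f(X) | G]

f_of_mean = f(sum(p * x for _, x, p in outcomes))                # f(E[X])
mean_f_of_cond = sum(pg * f(m) for pg, m in cond_mean.values())  # E[f(E[X | G])]
mean_of_f = sum(p * f(x) for _, x, p in outcomes)                # E[f(X)]
```

The three numbers form the chain f(E[X]) ≤ E[f(E[X | G])] ≤ E[f(X)]: conditioning on more information moves the middle term toward the right-hand side.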

A useful corollary: if (X_n) is a sequence with X_n → X in L¹ and f is continuous convex, then E[f(X)] ≤ liminf E[f(X_n)], the Fatou-Jensen combination used throughout the theory of weak convergence. Another consequence of the same kind: because the entropy function is concave, H(Σ λᵢ pᵢ) ≥ Σ λᵢ H(pᵢ), so mixing distributions never lowers entropy below the average entropy of the components.

When the Jensen gap is tight

The Jensen gap E[f(X)] − f(E[X]) ≥ 0 measures how much the distribution of X 'spreads out' f's value. Two formal estimates make this precise. (i) If f is twice differentiable with f''(x) ≥ m on an interval containing the support of X, then E[f(X)] − f(E[X]) ≥ (m/2) Var(X). The minimum curvature times the variance is a free lower bound on the Jensen gap. (ii) Conversely, if f''(x) ≤ M on that interval, the Jensen gap is bounded above by (M/2) Var(X). Together: for f with curvature in [m, M], the Jensen gap lies between (m/2)·Var(X) and (M/2)·Var(X), so it is proportional to Var(X) up to constants. Information-geometric versions of this trade-off underpin the analysis of stochastic gradient descent's noise robustness.
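The two curvature bounds can be checked on a small example. The choices below are assumptions for illustration: f(x) = e^x on [0, 1], where f'' lies in [1, e], and X uniform on three points of that interval.

```python
import math

values = [0.0, 0.5, 1.0]            # hypothetical support inside [0, 1]
probs = [1 / 3, 1 / 3, 1 / 3]

mean = sum(p * x for p, x in zip(probs, values))
variance = sum(p * (x - mean) ** 2 for p, x in zip(probs, values))

# Jensen gap E[f(X)] - f(E[X]) for f = exp
gap = sum(p * math.exp(x) for p, x in zip(probs, values)) - math.exp(mean)

m, M = 1.0, math.e                  # bounds on f''(x) = e^x over [0, 1]
lower = 0.5 * m * variance          # (m/2) Var(X)
upper = 0.5 * M * variance          # (M/2) Var(X)
```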

Why bootstrapped statistics are biased — and how Jensen explains it

Bootstrap estimates of nonlinear statistics inherit a systematic bias from Jensen's inequality. If θ̂ is a sample estimator of a parameter θ and g is a nonlinear function, then in general E[g(θ̂)] ≠ g(E[θ̂]), with the direction of the bias fixed by Jensen when g is convex or concave. For smooth g the plug-in estimate g(θ̂) is biased by approximately (g''(θ̂)/2) · Var(θ̂), by the same Jensen-gap-equals-curvature-times-variance estimate.

This is why log-transformed estimators (variance, ratio statistics, geometric means) have nontrivial bootstrap bias even when the underlying parameter estimator is unbiased. Standard practice is to apply a bias-correction (the Efron BCa correction or the jackknife correction) that estimates the Jensen-gap term and subtracts it. The phenomenon is purely Jensen — apply any nonlinear function to a noisy estimator and the expected value of the function differs from the function of the expected value.
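The curvature-times-variance approximation can be compared with an exact calculation in a case small enough to enumerate. The setup is an assumption for illustration: θ̂ is the average of n fair draws from {1, 3}, g = log, and the curvature is evaluated at the true mean μ = 2 rather than at θ̂.

```python
import math

def exact_mean_of_log_average(n):
    """E[log(theta_hat)] where theta_hat averages n fair draws from {1, 3}."""
    total = 0.0
    for k in range(n + 1):                       # k = number of draws equal to 3
        prob = math.comb(n, k) / 2 ** n          # binomial(n, 1/2) probability
        total += prob * math.log(1 + 2 * k / n)  # theta_hat = (n + 2k) / n
    return total

n = 50
mu, sigma2 = 2.0, 1.0                            # mean and variance of one draw

exact_bias = exact_mean_of_log_average(n) - math.log(mu)
approx_bias = 0.5 * (-1 / mu ** 2) * (sigma2 / n)   # (g''(mu)/2) * Var(theta_hat)
```

The estimator θ̂ is unbiased for μ, yet log θ̂ is biased downward because log is concave; the second-order term captures almost all of that bias here.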

Moment generating functions and Hoeffding via Jensen

The moment generating function (MGF) M_X(s) = E[e^(sX)] is the building block of concentration inequalities. Because x ↦ e^(sx) is convex for any real s, Jensen gives the trivial bound M_X(s) ≥ e^(s·E[X]). The interesting estimate goes the other way: bounding M_X(s) from above produces tail bounds via Markov on e^(sX). Hoeffding's lemma is the canonical example: for X ∈ [a, b] with E[X] = 0, M_X(s) ≤ e^(s²(b−a)²/8). The proof bounds e^(sx) by its chord over [a, b] (convexity again) and then bounds the curvature of the resulting log-MGF.
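Both bounds on the MGF can be checked on a grid of s values. The two-point distribution on {−1, 3} with mean zero is an assumption chosen for illustration.

```python
import math

a, b = -1.0, 3.0
p_b = -a / (b - a)            # probability of b, chosen so that E[X] = 0
p_a = 1.0 - p_b

def mgf(s):
    """M_X(s) = E[exp(s X)] for the two-point variable."""
    return p_a * math.exp(s * a) + p_b * math.exp(s * b)

def hoeffding_bound(s):
    """Upper bound exp(s^2 (b - a)^2 / 8) from Hoeffding's lemma."""
    return math.exp(s * s * (b - a) ** 2 / 8)

grid = [k / 10 for k in range(-30, 31)]

# Jensen lower bound: exp(s * E[X]) = 1 because E[X] = 0.
jensen_ok = all(mgf(s) >= 1.0 - 1e-12 for s in grid)
hoeffding_ok = all(mgf(s) <= hoeffding_bound(s) + 1e-12 for s in grid)
```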

Chernoff bounds, sub-Gaussian and sub-exponential tail inequalities, McDiarmid's bounded-difference inequality — every one of them rests on a Jensen-flavoured estimate of the MGF. The recurring pattern is: identify a convex function (often the exponential), bound its expectation using Jensen or its sharpening, optimise the resulting Markov bound over the free parameter. This is the workflow that produces nearly every concentration inequality in modern statistical learning theory.

Ergodicity and time-averages

In ergodic theory, Jensen's inequality is the precise tool that distinguishes time averages from ensemble averages. For a stationary process X_n and a convex f, the long-run sample mean of f(X_n) converges (by the ergodic theorem) to E[f(X)], not to f(E[X]). The gap E[f(X)] − f(E[X]) is the systematic difference between 'average of the realised path' and 'function of the average'. This gap is the basis of the St. Petersburg paradox, of why naïve expected-value maximisation fails for multiplicative gambles, and of the time-average-vs-ensemble-average distinction emphasised by Ole Peters' ergodicity economics programme.
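For a multiplicative gamble the two averages can be computed side by side. The numbers below are an assumed example: wealth is multiplied by 1.5 or 0.6 each round with equal probability.

```python
import math

factors = [1.5, 0.6]     # hypothetical gamble: +50% or -40% per round
probs = [0.5, 0.5]

# Ensemble average: expected wealth factor per round, E[X].
ensemble_factor = sum(p * x for p, x in zip(probs, factors))

# Time average: per-round growth factor along one long path, exp(E[log X]).
time_factor = math.exp(sum(p * math.log(x) for p, x in zip(probs, factors)))
```

The ensemble factor is above 1 while the time factor is below 1: the expected value grows, yet a typical single path shrinks. The difference is the Jensen gap for the concave function log.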

Costed claims

Jensen: f(E[X]) ≤ E[f(X)] for convex f, with the concave version log E[X] ≥ E[log X] for X > 0, the engine of EM, ELBO and KL ≥ 0. Equality iff X is constant a.s. or f is affine on the support of X. Jensen gap is at least (m/2)·Var(X) when f'' ≥ m. Strong duality holds under Slater's condition (strictly feasible interior). KKT: stationarity (Lagrangian gradient = 0) plus complementary slackness (λ·g(x) = 0). Convex hull of N points in 2D and 3D: O(N log N). Cauchy-Schwarz: equality iff u, v collinear, the cousin inequality for inner-product spaces.

Frequently asked questions

What does Jensen's inequality say in one sentence?

For any convex function f and any random variable X with finite expectation, f(E[X]) ≤ E[f(X)]. Applied to a finite point set with weights it reads f(Σ λᵢ xᵢ) ≤ Σ λᵢ f(xᵢ) whenever λᵢ ≥ 0 and Σ λᵢ = 1 — i.e. the function of a convex combination is at most the convex combination of the function. The concave version flips ≤ to ≥.

Why does the chord-above-curve picture imply Jensen?

For a two-point random variable X taking value a with probability 1−t and b with probability t, E[X] = (1−t)a + tb and E[f(X)] = (1−t)f(a) + t f(b). The chord from (a, f(a)) to (b, f(b)) evaluated at x = E[X] is exactly E[f(X)], and the function value f(E[X]) is the curve at that x. Convexity says chord lies above curve, so f(E[X]) ≤ E[f(X)]. The general inequality is obtained by taking limits of finite distributions.

When does equality hold in Jensen?

Two cases. (i) X is almost surely constant, so both sides reduce to f of that constant. (ii) f is affine on the support of X — affine f means the chord IS the curve, so the inequality becomes an equation. Strict convexity (no straight segments in the graph) plus a non-degenerate X gives strict inequality.

How is Jensen used to prove entropy non-negativity?

Take f(x) = log(1/x) — convex on (0, ∞). For a probability distribution p with H(p) = Σ pᵢ log(1/pᵢ) = E_p[log(1/p(X))], Jensen gives H(p) = E_p[log(1/p(X))] ≥ log(1/E_p[p(X)]) = log(1/Σpᵢ²) ≥ log(1) = 0 for any distribution. The standard proof of H(p) ≤ log n uses the dual concave Jensen with f = log.

Does Jensen prove AM ≥ GM?

Yes. Take f(x) = −log x (convex on (0, ∞)). For positive reals x₁, …, x_n with equal weights 1/n, Jensen says −log(Σ xᵢ/n) ≤ Σ (−log xᵢ)/n, equivalently log(Σ xᵢ/n) ≥ Σ log xᵢ/n. Exponentiating: AM = Σ xᵢ/n ≥ exp(Σ log xᵢ/n) = (Π xᵢ)^(1/n) = GM. Equality iff all xᵢ are equal.

What's the concave version?

For a concave function g and a random variable X, g(E[X]) ≥ E[g(X)]. Concave functions sit above their chords (or equivalently, −g is convex), so the chord-below-curve picture reverses the inequality. In particular, log is concave: log E[X] ≥ E[log X] for any positive random variable X. This single fact is the engine of the EM algorithm and of variational lower bounds.

Who was Jensen and when did he prove it?

Johan Ludwig William Valdemar Jensen was a Danish mathematician and telecom engineer (at the Copenhagen Telephone Company). He proved the inequality in his 1906 paper 'Sur les fonctions convexes et les inégalités entre les valeurs moyennes' in Acta Mathematica — the same paper where he formalised the modern definition of convex function. Earlier two-point versions trace to Hölder (1889) and others; Jensen's contribution was the n-point and integral generalisation.