Probability

Moment-Generating Function

M_X(t) = E[e^(tX)] — its k-th derivative at 0 gives E[X^k]; uniquely identifies distribution

The moment-generating function (MGF) of a random variable X is M_X(t) = E[e^(tX)] = Σ_{k≥0} E[X^k] t^k / k!, provided this expectation is finite in some neighborhood of t = 0. Each derivative M_X^(k)(0) = E[X^k] gives the k-th moment, hence the name. Uniqueness theorem: if two random variables have MGFs that agree on an open interval containing 0, they have the same distribution. Examples: the standard normal Z ~ N(0,1) has M_Z(t) = e^(t²/2); the exponential with rate λ has M(t) = 1/(1 − t/λ) for t < λ; the Poisson with mean μ has M(t) = e^(μ(e^t − 1)). For independent X, Y: M_{X+Y}(t) = M_X(t) · M_Y(t), so convolution becomes multiplication. The closely related characteristic function φ_X(t) = E[e^(itX)] always exists (since |e^(itx)| = 1) and is the standard substitute when the MGF does not exist.

Definition: M_X(t) = E[e^(tX)]
Moments: M_X^(k)(0) = E[X^k]
Convolution: M_{X+Y} = M_X · M_Y (independent X, Y)
Standard normal: e^(t²/2)
Poisson(μ): e^(μ(e^t − 1))
Characteristic function: always exists

Why MGF matters

Distribution identification. Two random variables with matching MGFs on an open interval around 0 have the same distribution — encoding the entire distribution in a single function of t.
Moment computation. Differentiating M_X k times at 0 gives E[X^k]. Computing moments by direct integration is often harder than differentiating a known MGF.
Sums of independent variables. M_{X+Y} = M_X · M_Y for independent X, Y. Convolution of densities becomes multiplication of MGFs — turning hard integrals into easy products.
Central limit theorem proofs. For i.i.d. summands with mean 0 and variance 1, expanding M_X near 0 shows that M_X(t/√n)^n, the MGF of the standardized sum, converges to e^(t²/2), the MGF of N(0,1). One of the cleanest CLT proofs.
Large deviations. The Cramér–Chernoff bound P(X ≥ a) ≤ exp(−sup_{t > 0} (ta − log M_X(t))) yields exponential tail bounds. Foundation of concentration inequalities and information theory.
Cumulant generating function. K_X(t) = log M_X(t) has Taylor coefficients = cumulants κ_k. κ_1 = mean, κ_2 = variance, κ_3 / σ³ = skewness, κ_4 / σ⁴ = excess kurtosis.
Branching processes and queueing. Total offspring of a Galton–Watson tree, busy periods of M/G/1 queues, and ruin probabilities in insurance all admit closed-form analysis via MGFs.

Formal definition

For a random variable X with cumulative distribution function F_X, the moment-generating function is:

M_X(t) = E[e^(tX)] = ∫ e^(tx) dF_X(x)

defined wherever this expectation is finite. Whenever M_X is finite on an open interval (−h, h) for some h > 0, we say "the MGF exists." On that interval M_X is real-analytic, with Taylor expansion:

M_X(t) = ∑_{k=0}^∞ E[X^k] t^k / k!

and all derivatives can be obtained by differentiating term-by-term.

Library of MGFs

Distribution	M_X(t)	Domain
Bernoulli(p)	1 − p + pe^t	all t
Binomial(n, p)	(1 − p + pe^t)^n	all t
Poisson(μ)	e^(μ(e^t − 1))	all t
Geometric(p)	pe^t / (1 − (1 − p)e^t)	t < −log(1 − p)
Uniform(a, b)	(e^(tb) − e^(ta)) / (t(b − a))	all t (limit at 0 = 1)
Exponential(λ)	λ / (λ − t)	t < λ
Gamma(α, λ)	(λ / (λ − t))^α	t < λ
Normal N(μ, σ²)	e^(μt + σ²t²/2)	all t
Chi-squared(k)	(1 − 2t)^(−k/2)	t < 1/2
Cauchy	does not exist	—
Log-normal	does not exist for t > 0	t ≤ 0

Computing moments via differentiation

For X ~ Exponential(λ), M_X(t) = λ / (λ − t). Differentiating once: M_X'(t) = λ / (λ − t)². Setting t = 0: E[X] = 1/λ. Differentiating again: M_X''(t) = 2λ / (λ − t)³, so E[X²] = 2/λ². Variance: E[X²] − E[X]² = 2/λ² − 1/λ² = 1/λ². The MGF gives all moments at once via Taylor expansion:

λ / (λ − t) = 1/(1 − t/λ) = ∑ (t/λ)^k = ∑ k!(1/λ)^k · t^k / k!
∴ E[X^k] = k! / λ^k
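
The differentiation steps above can be checked numerically. The following is a minimal sketch, assuming an illustrative rate λ = 2 and a finite-difference step h = 1e-4 (both are choices made here, not values from the text). The helper names `mgf_exponential` and `first_two_moments` are hypothetical. Central differences approximate M_X'(0) and M_X''(0), which should land near 1/λ and 2/λ².

```python
def mgf_exponential(t: float, rate: float) -> float:
    """MGF of Exponential(rate); only finite for t < rate."""
    if t >= rate:
        raise ValueError("MGF of Exponential(rate) is infinite for t >= rate")
    return rate / (rate - t)


def first_two_moments(mgf, h: float = 1e-4):
    """Approximate M'(0) and M''(0) with central finite differences."""
    m1 = (mgf(h) - mgf(-h)) / (2 * h)              # ~ E[X]
    m2 = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h**2  # ~ E[X^2]
    return m1, m2


rate = 2.0  # illustrative value of lambda (assumption)
m1, m2 = first_two_moments(lambda t: mgf_exponential(t, rate))
variance = m2 - m1**2

print("E[X]   ~", m1, " (theory:", 1 / rate, ")")
print("E[X^2] ~", m2, " (theory:", 2 / rate**2, ")")
print("Var[X] ~", variance, " (theory:", 1 / rate**2, ")")
```

Finite differences are used here only to keep the sketch dependency-free; a computer algebra system would give the derivatives exactly.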

Convolution becomes multiplication

For independent X, Y:

M_{X+Y}(t) = E[e^(t(X+Y))] = E[e^(tX) e^(tY)] = E[e^(tX)] · E[e^(tY)] = M_X(t) M_Y(t)

The independence step uses E[fg] = E[f]E[g] for independent f(X), g(Y). Examples:

Sum of independent normals. X ~ N(μ_X, σ_X²), Y ~ N(μ_Y, σ_Y²) independent: M_{X+Y}(t) = e^((μ_X + μ_Y)t + (σ_X² + σ_Y²)t²/2) — again Gaussian, with mean and variance summing.
Sum of independent Poissons. X ~ Poisson(μ_X), Y ~ Poisson(μ_Y): M_{X+Y}(t) = e^((μ_X + μ_Y)(e^t − 1)) = MGF of Poisson(μ_X + μ_Y).
Sum of n i.i.d. exponentials. M_{S_n}(t) = (λ/(λ−t))^n = MGF of Gamma(n, λ).

CLT proof via MGFs

Let X_1, ..., X_n be i.i.d. with mean 0 and variance 1, with MGF M(t) finite near 0. The standardized sum Z_n = (X_1 + ... + X_n)/√n has:

M_{Z_n}(t) = M(t/√n)^n

Expand the cumulant generating function near 0: log M(t) = t²/2 + O(t³), since log M(0) = 0, the mean is 0, and the variance is 1. Then:

n log M(t/√n) = n · (t²/(2n) + O(n^(−3/2))) = t²/2 + O(n^(−1/2))

So M_{Z_n}(t) → e^(t²/2) — the MGF of N(0, 1). Pointwise convergence of MGFs (in a neighborhood of 0) implies convergence in distribution, completing the CLT.
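
The convergence can be watched numerically. Here is a minimal sketch, assuming a concrete choice of summand that the text does not specify: X_i = ±1 with probability 1/2 each, which has mean 0, variance 1, and MGF cosh(t). The function name `standardized_sum_mgf` and the evaluation point t = 1 are illustrative assumptions.

```python
import math


def standardized_sum_mgf(t: float, n: int) -> float:
    """MGF of Z_n = (X_1 + ... + X_n)/sqrt(n) for i.i.d. X_i = +1 or -1, each with probability 1/2."""
    # A single such X_i has M(t) = (e^t + e^-t)/2 = cosh(t).
    return math.cosh(t / math.sqrt(n)) ** n


t = 1.0                       # illustrative evaluation point (assumption)
limit = math.exp(t * t / 2)   # MGF of N(0, 1) at t

sample_sizes = (1, 10, 100, 10_000)
gaps = [abs(standardized_sum_mgf(t, n) - limit) for n in sample_sizes]

for n, gap in zip(sample_sizes, gaps):
    print(f"n = {n:>6}: |M_Zn(t) - e^(t^2/2)| = {gap:.3e}")
```

For this symmetric summand the third cumulant vanishes, so the gap shrinks faster than the general O(n^(−1/2)) rate in the derivation above.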

Cramér–Chernoff bound

From Markov: P(X ≥ a) = P(e^(tX) ≥ e^(ta)) ≤ E[e^(tX)] / e^(ta) = e^(−ta) M_X(t) for any t > 0. Optimize:

P(X ≥ a) ≤ exp(−sup_{t > 0} (ta − log M_X(t))) = e^(−I(a))

I(a) is the Cramér transform — the Legendre dual of the cumulant generating function K(t) = log M(t). For sums S_n of n i.i.d. variables, I scales: P(S_n/n ≥ a) ≤ e^(−n I(a)). This is the foundation of large-deviations theory.
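
As a concrete instance, the bound can be evaluated for a standard normal, where log M(t) = t²/2 and the supremum is attained at t = a, giving I(a) = a²/2. The sketch below assumes a threshold a = 2 and uses a simple ternary search over 0 ≤ t ≤ t_max; the search range, the iteration count, and the function name `cramer_transform` are all assumptions of this illustration. Ternary search is valid here because ta − log M(t) is concave in t.

```python
import math


def cramer_transform(a: float, log_mgf, t_max: float = 50.0, iters: int = 200) -> float:
    """Approximate I(a) = sup over 0 <= t <= t_max of (t*a - log_mgf(t)) by ternary search."""
    lo, hi = 0.0, t_max
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        # The objective is concave, so discard the third that cannot contain the maximum.
        if m1 * a - log_mgf(m1) < m2 * a - log_mgf(m2):
            lo = m1
        else:
            hi = m2
    t_star = 0.5 * (lo + hi)
    return t_star * a - log_mgf(t_star)


def log_mgf_standard_normal(t: float) -> float:
    return t * t / 2


a = 2.0  # illustrative threshold (assumption)
rate = cramer_transform(a, log_mgf_standard_normal)
chernoff_bound = math.exp(-rate)
exact_tail = 0.5 * math.erfc(a / math.sqrt(2))  # P(Z >= a) for Z ~ N(0, 1)

print("I(a)           ~", rate)
print("Chernoff bound ~", chernoff_bound)
print("exact tail     ~", exact_tail)
```

The bound is not tight in absolute terms, but it captures the correct exponential decay rate in a.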

Characteristic function: the safe substitute

The characteristic function φ_X(t) = E[e^(itX)] always exists since |e^(itx)| = 1. It's the Fourier transform of the density (when it exists). Key properties:

|φ_X(t)| ≤ 1, and φ_X(0) = 1.
φ_X is uniformly continuous on ℝ.
Lévy's continuity theorem: a sequence of distributions converges weakly to a limit distribution iff their characteristic functions converge pointwise to the limit's characteristic function.
Convolution becomes multiplication (same as MGF).
Inversion formula: when φ_X is integrable, X has a continuous density given by f_X(x) = (1/(2π)) ∫ e^(−itx) φ_X(t) dt.

For Cauchy: φ(t) = e^(−|t|), perfectly well-defined; the MGF is hopeless.
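
The Cauchy case can be illustrated by simulation. The following is a rough Monte Carlo sketch, assuming 100,000 standard Cauchy draws generated by the inverse-CDF formula tan(π(U − 1/2)) with a fixed seed; the sample size, seed, and function name `empirical_cf_real` are assumptions of this illustration. Because the distribution is symmetric, the characteristic function is real, so only the cosine part is averaged.

```python
import math
import random


def empirical_cf_real(samples, t: float) -> float:
    """Real part of the empirical characteristic function: the sample mean of cos(t * x)."""
    return sum(math.cos(t * x) for x in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the run is reproducible (assumption)
# Inverse-CDF sampling: if U is uniform on [0, 1), tan(pi * (U - 1/2)) is standard Cauchy.
samples = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(100_000)]

estimates = {t: empirical_cf_real(samples, t) for t in (0.5, 1.0, 2.0)}
for t, est in estimates.items():
    print(f"t = {t}: empirical {est:.4f}  vs  e^(-|t|) = {math.exp(-abs(t)):.4f}")
```

The averages are stable even though the samples themselves have no mean, because cos(tx) is bounded. The analogous average of e^(tx) would be dominated by a handful of extreme draws and would not settle.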

Common misconceptions

"MGF always exists." The Cauchy distribution has no MGF; the log-normal's MGF is infinite for any t > 0. Use the characteristic function when in doubt.
"MGF and PDF are the same." They are different transforms of the distribution; the MGF exponentiates the variable, the PDF is the density.
"Equal moments imply equal distributions." Not always — the moment problem can be ill-posed. The lognormal and a slightly perturbed lognormal share all moments. Equal MGFs in a neighborhood of 0 do imply equal distributions.
"You can take the n-th root of M_{S_n} to recover M_X." Correct only if you know the variables are i.i.d. — otherwise the factorization isn't unique.
"MGF of dependent variables factors." Independence is essential. Without it M_{X+Y} ≠ M_X M_Y in general.
"Characteristic function is more abstract, less useful." The opposite — it's preferred in modern probability because it always exists and uniqueness/continuity theorems are cleaner.

Frequently asked questions

Why is M_X^(k)(0) the k-th moment?

Differentiate M_X(t) = E[e^(tX)] with respect to t under the integral: M_X'(t) = E[X e^(tX)]. Evaluate at t = 0: M_X'(0) = E[X]. Continue: M_X''(t) = E[X² e^(tX)], so M_X''(0) = E[X²]. By induction, M_X^(k)(0) = E[X^k]. This works whenever the MGF exists in an open interval around 0, which justifies passing differentiation under the expectation by dominated convergence.

When does the MGF fail to exist?

The MGF exists at t when E[e^(tX)] is finite. For heavy-tailed distributions, this fails: the Cauchy distribution's tails decay only polynomially (its mean is not even defined), so M(t) = ∞ for all t ≠ 0. The log-normal's MGF is finite for t ≤ 0 but infinite for every t > 0. The Pareto distribution fails for any t > 0, whatever its finite shape parameter α. When the MGF fails, the characteristic function φ_X(t) = E[e^(itX)] still exists since |e^(itx)| = 1, and it is the standard substitute.
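
The log-normal failure can be seen numerically. This is a minimal sketch, assuming X = e^Z with Z ~ N(0, 1) and a plain trapezoid rule in the variable z = log x; the step size, lower integration limit, cutoffs, and the function name `truncated_lognormal_mgf` are assumptions of this illustration. It computes the truncated expectation E[e^(tX); X ≤ e^c] for growing c. For t > 0 the values blow up, while for t < 0 they stay below 1.

```python
import math


def truncated_lognormal_mgf(t: float, log_cutoff: float,
                            lower: float = -10.0, step: float = 1e-3) -> float:
    """Trapezoid estimate of E[exp(t*X); X <= exp(log_cutoff)] for X = exp(Z), Z ~ N(0, 1)."""
    def integrand(z: float) -> float:
        density = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        return math.exp(t * math.exp(z)) * density

    n = int(round((log_cutoff - lower) / step))
    h = (log_cutoff - lower) / n
    total = 0.5 * (integrand(lower) + integrand(log_cutoff))
    for i in range(1, n):
        total += integrand(lower + i * h)
    return total * h


cutoffs = (3.0, 5.0, 6.0)  # truncate at X <= e^3, e^5, e^6 (illustrative choices)
positive_t = [truncated_lognormal_mgf(0.1, c) for c in cutoffs]
negative_t = [truncated_lognormal_mgf(-0.1, c) for c in cutoffs]

print("t = +0.1:", positive_t)  # grows without bound as the cutoff increases
print("t = -0.1:", negative_t)  # settles to a finite value below 1
```

The cutoffs are kept small because the integrand for t > 0 grows doubly exponentially in z and overflows floating point soon after.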

What is the characteristic function and why is it preferred?

The characteristic function φ_X(t) = E[e^(itX)] exists for every random variable X because the integrand has modulus 1. It uniquely identifies the distribution (by the inversion theorem) and converts convolution to multiplication. Lévy's continuity theorem says a sequence of distributions converges weakly to a limit if and only if their characteristic functions converge pointwise to the limit's characteristic function. The MGF equals φ_X(−it) when it exists, but the characteristic function requires no integrability assumptions.

How does convolution become multiplication?

If X, Y are independent, then M_{X+Y}(t) = E[e^(t(X+Y))] = E[e^(tX) e^(tY)] = E[e^(tX)] E[e^(tY)] = M_X(t) M_Y(t). Independence is essential — without it, expectations don't factor. This converts the convolution of densities (which describes the distribution of X+Y) into pointwise multiplication of MGFs. Sums of n i.i.d. random variables: M_{S_n}(t) = M_X(t)^n — extracts large-n behavior easily.

How is the central limit theorem proved via MGFs?

For i.i.d. X_i with mean 0 and variance 1, expand M_X(t) = 1 + ½ t² + o(t²) near 0. The standardized sum Z_n = (X_1 + ... + X_n)/√n has MGF M_{Z_n}(t) = M_X(t/√n)^n = (1 + t²/(2n) + o(1/n))^n → e^(t²/2) as n → ∞. The limit is the MGF of N(0,1), and pointwise convergence of MGFs in a neighborhood of 0 implies weak convergence of distributions. This is the fastest CLT proof, but it requires the MGF to be finite near 0, which is a stronger condition than having all moments.

What is the Cramér-Chernoff bound (large deviations)?

For any t > 0, Markov's inequality applied to e^(tX) gives P(X ≥ a) ≤ e^(−ta) M_X(t). Optimizing over t > 0: P(X ≥ a) ≤ exp(−I(a)), where I(a) = sup_{t > 0} (ta − log M_X(t)) is the Cramér transform, the Legendre dual of the cumulant generating function. For sums of n i.i.d. variables this gives exponential tail bounds: P(S_n/n ≥ a) ≤ exp(−n I(a)). Foundation of large-deviations theory and concentration inequalities.