Probability
Conditional Expectation
E[X|Y] — the best mean-square predictor of X from Y, itself a random variable
E[X|Y] is the expected value of X given knowledge of Y. Unlike E[X], it is a random variable — a function of Y. Tower property: E[E[X|Y]] = E[X]. In L², it is the orthogonal projection of X onto functions of Y, i.e. the best mean-square predictor. Foundation of regression, martingales, and conditional probability itself.
- Tower property: E[E[X|Y]] = E[X]
- L² projection: E[X|Y] minimizes E[(X − g(Y))²]
- Variance decomp: Var(X) = E[Var(X|Y)] + Var(E[X|Y])
- Gaussian case: E[X|Y] is affine, μ_X + Σ_XY Σ_YY⁻¹(Y − μ_Y)
- Defines: martingale, E[X_{n+1} | F_n] = X_n
- Generalizes: Kolmogorov 1933, conditional on σ-algebra
How conditional expectation works
For a random variable X and an event B with P(B) > 0, the elementary conditional expectation is
E[X | B] = E[X · 1_B] / P(B).
This is a single number — the average of X restricted to the subset where B happened, renormalised by the probability of B. Now upgrade to conditioning on a random variable Y. For each value y in the support of Y, define
g(y) = E[X | Y = y].
This is a deterministic function of y. The random variable E[X|Y] is then defined as g(Y), that is, "evaluate g at the actual outcome of Y". So E[X|Y] is itself random: it changes with the realisation of Y. The defining identity, due to Kolmogorov, is
E[E[X|Y] · h(Y)] = E[X · h(Y)] for every bounded measurable h.
Taking h ≡ 1 gives the tower property: E[E[X|Y]] = E[X]. Taking h(Y) = 1_A(Y), the indicator of a Y-event {Y ∈ A}, gives the partial-averaging identity E[E[X|Y] · 1_A(Y)] = E[X · 1_A(Y)].
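The defining identity can be checked by hand on a small discrete example. The sketch below assumes a hypothetical four-point joint pmf (the numbers and the names `joint`, `condMeanGiven`, `expect` are illustrative, not from the text); it computes g(y) from the elementary formula and compares both sides of the identity for one test function h.

```javascript
// Hypothetical joint pmf of (X, Y): each entry is [x, y, probability].
const joint = [
  [1, 0, 0.10], [2, 0, 0.30],
  [1, 1, 0.20], [3, 1, 0.40],
];

// g(y) = E[X | Y = y] = E[X · 1{Y = y}] / P(Y = y)
function condMeanGiven(y0) {
  let num = 0, den = 0;
  for (const [x, y, p] of joint) {
    if (y === y0) { num += x * p; den += p; }
  }
  return num / den;
}

// Expectation of f(X, Y) under the joint pmf.
function expect(f) {
  return joint.reduce((acc, [x, y, p]) => acc + f(x, y) * p, 0);
}

const h = y => 2 * y + 1; // an arbitrary bounded test function of Y
const lhs = expect((x, y) => condMeanGiven(y) * h(y)); // E[E[X|Y] · h(Y)]
const rhs = expect((x, y) => x * h(y));                // E[X · h(Y)]
console.log(condMeanGiven(0), condMeanGiven(1)); // the two values of g
console.log(lhs, rhs); // agree up to floating-point rounding
```

Replacing `h` by the constant function 1 turns the same comparison into a check of the tower property.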
Worked example — height and weight
Suppose (X, Y) is jointly Normal with means μ_X = 170 cm, μ_Y = 70 kg, standard deviations σ_X = 10 cm, σ_Y = 15 kg, and correlation ρ = 0.6. Then the conditional expectation of X given Y is the regression line
E[X | Y = y] = μ_X + ρ (σ_X / σ_Y)(y − μ_Y)
= 170 + 0.6 · (10 / 15)(y − 70)
= 170 + 0.4 · (y − 70).
So for a person weighing y = 85 kg, the expected height is 170 + 0.4 · 15 = 176 cm. And the conditional variance is
Var(X | Y) = σ_X² (1 − ρ²) = 100 · (1 − 0.36) = 64,
so σ_X|Y = 8 cm — the conditional standard deviation is 80% of the marginal because Y explains 36% of the variance. Verify the tower property: averaging E[X|Y] over Y gives E[μ_X + 0.4(Y − μ_Y)] = μ_X + 0.4 · 0 = μ_X = 170. Check.
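The arithmetic of the worked example can be reproduced with a few lines. This is a minimal sketch: the function name `gaussianConditional` and its parameter object are assumptions made for illustration, and the formulas are the bivariate Normal ones quoted above.

```javascript
// Conditional law of X given Y = y for a bivariate Normal (X, Y).
function gaussianConditional({ muX, muY, sigmaX, sigmaY, rho }, y) {
  const slope = rho * sigmaX / sigmaY;            // regression slope
  const mean = muX + slope * (y - muY);           // E[X | Y = y]
  const variance = sigmaX ** 2 * (1 - rho ** 2);  // Var(X | Y), does not depend on y
  return { slope, mean, variance, sd: Math.sqrt(variance) };
}

// Height (cm) and weight (kg) parameters from the worked example.
const params = { muX: 170, muY: 70, sigmaX: 10, sigmaY: 15, rho: 0.6 };
const at85 = gaussianConditional(params, 85);
console.log(at85.mean.toFixed(2), at85.variance.toFixed(2), at85.sd.toFixed(2));
// about 176 cm, 64 cm², 8 cm
```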
Conditional expectation as orthogonal projection
In L²(P), the Hilbert space of square-integrable random variables, define the inner product ⟨U, V⟩ = E[UV]. The subspace of all square-integrable functions g(Y) of Y is closed. The conditional expectation E[X|Y] is the orthogonal projection of X onto this subspace. Equivalently, it is the unique function of Y minimising the mean-square error
E[(X − g(Y))²].
The residual X − E[X|Y] is orthogonal (in the L² sense) to every square-integrable function of Y, including Y itself, Y², log Y (whenever these are square-integrable), and indicator functions of arbitrary Y-events. This is the geometric content of conditional expectation, and it is why everything in least-squares regression flows from it.
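Orthogonality of the residual can be verified exactly on a toy model. The sketch below assumes a hypothetical discrete distribution in which E[X | Y = y] = y² by construction; the names `outcomes`, `condMean` and `innerWithResidual` are illustrative.

```javascript
// Hypothetical discrete model: Y uniform on {1, 2, 3}; given Y = y,
// X equals y*y - 1 or y*y + 1 with probability 1/2 each, so E[X | Y = y] = y*y.
const outcomes = [];
for (const y of [1, 2, 3]) {
  for (const noise of [-1, 1]) {
    outcomes.push({ x: y * y + noise, y, p: 1 / 6 });
  }
}
const condMean = y => y * y; // E[X | Y = y] for this model

// Inner product of the residual with f(Y): E[(X - E[X|Y]) · f(Y)].
function innerWithResidual(f) {
  return outcomes.reduce((acc, { x, y, p }) => acc + (x - condMean(y)) * f(y) * p, 0);
}

const testFunctions = {
  identity: y => y,
  square: y => y * y,
  log: y => Math.log(y),
  indicatorYis2: y => (y === 2 ? 1 : 0),
};
for (const [name, f] of Object.entries(testFunctions)) {
  console.log(name, innerWithResidual(f)); // each is 0 up to rounding
}
```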
The tower property — and why it solves problems
The identity E[X] = E[E[X|Y]] turns a hard expectation into an average of easier expectations. Worked use cases:
Random sums (Wald's identity). Let N be a non-negative integer random variable with finite mean and X₁, X₂, … be i.i.d. with mean μ, independent of N. Write S = X₁ + … + X_N. Then E[S] = E[N · μ] = μ · E[N]. Proof: condition on N. E[S | N = n] = nμ; the tower property gives E[S] = E[Nμ] = μE[N].
Gambler's ruin. Player starts with k chips, plays fair coin flips winning/losing one chip per round, until ruined (0 chips) or rich (N chips). Probability of reaching N: condition on first step. If p_k is the probability of winning from k, then p_k = ½ p_{k−1} + ½ p_{k+1}, p_0 = 0, p_N = 1. Solution p_k = k/N — a one-line argument via conditional expectation.
Branching processes. A particle has a random number of offspring with mean m. Population at generation n has expected size mⁿ. Proof: condition on generation 1's size; tower gives E[Z_n] = m · E[Z_{n−1}]. Recurrence yields mⁿ.
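The gambler's ruin recurrence can also be solved numerically as a sanity check on p_k = k/N. The sketch below repeatedly applies the first-step conditioning identity until the values settle; the function name `winProbabilities`, the choice N = 10 and the sweep count are assumptions for illustration.

```javascript
// Solve p_k = 0.5 * p_{k-1} + 0.5 * p_{k+1} with p_0 = 0, p_N = 1
// by sweeping the first-step identity over the interior states.
function winProbabilities(N, sweeps = 5000) {
  const p = new Array(N + 1).fill(0);
  p[N] = 1;
  for (let s = 0; s < sweeps; s++) {
    for (let k = 1; k < N; k++) {
      p[k] = 0.5 * p[k - 1] + 0.5 * p[k + 1];
    }
  }
  return p;
}

const N = 10;
const p = winProbabilities(N);
const maxError = Math.max(...p.map((pk, k) => Math.abs(pk - k / N)));
console.log(p.map(v => v.toFixed(3)).join(' '));
console.log(maxError); // small: the iteration agrees with p_k = k/N
```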
Conditional variance decomposition
The total variance of X splits cleanly into a within-Y component and a between-Y component:
Var(X) = E[Var(X|Y)] + Var(E[X|Y]).
The first term, the average within-group variance, measures residual noise that conditioning on Y cannot remove. The second term, the variance of conditional means across Y, measures how much Y discriminates X. In regression, R² is the second term divided by Var(X): the fraction of variance explained by Y. In ANOVA, the same identity is the within/between decomposition. In Monte Carlo, Rao-Blackwellisation replaces X by E[X|Y] to keep the mean and shrink the variance: the new estimator's variance is the second term alone, so the first term goes away.
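The decomposition can be confirmed exactly on a small joint pmf. The sketch below uses hypothetical numbers; `mean` and `variance` operate on lists of [value, probability] pairs and are helper names introduced here.

```javascript
// Hypothetical joint pmf of (X, Y): entries are [x, y, probability].
const joint = [
  [0, 'a', 0.2], [4, 'a', 0.2],
  [1, 'b', 0.3], [3, 'b', 0.1], [5, 'b', 0.2],
];

const mean = pairs => pairs.reduce((s, [v, p]) => s + v * p, 0);
const variance = pairs => {
  const m = mean(pairs);
  return mean(pairs.map(([v, p]) => [(v - m) ** 2, p]));
};

// Group by the value of Y, then compute P(Y = y), E[X|Y = y], Var(X|Y = y).
const groups = new Map();
for (const [x, y, p] of joint) {
  if (!groups.has(y)) groups.set(y, []);
  groups.get(y).push([x, p]);
}
const summary = [...groups.values()].map(rows => {
  const pY = rows.reduce((s, [, p]) => s + p, 0);
  const conditional = rows.map(([x, p]) => [x, p / pY]); // law of X given Y = y
  return { pY, m: mean(conditional), v: variance(conditional) };
});

const totalVar = variance(joint.map(([x, , p]) => [x, p]));
const within = mean(summary.map(g => [g.v, g.pY]));      // E[Var(X|Y)]
const between = variance(summary.map(g => [g.m, g.pY])); // Var(E[X|Y])
console.log(totalVar, within, between, within + between); // total equals within + between
```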
Variants and generalisations
- Conditional expectation given a σ-algebra. Kolmogorov's general definition replaces "given Y" with "given a sub-σ-algebra G". E[X|G] is the G-measurable random variable Z satisfying E[Z · 1_A] = E[X · 1_A] for all A ∈ G. Conditioning on a random variable Y is the special case G = σ(Y).
- Conditional probability. P(A | Y) = E[1_A | Y] — conditional probability is a special case of conditional expectation. Important: conditioning on a measure-zero event (like Y = y for continuous Y) is well-defined only as a function-of-Y; the value at a single y is only determined almost surely.
- Disintegration / regular conditional distribution. Under mild conditions there exists a Markov kernel K(y, dx) with K(y, ·) the conditional law of X given Y = y; integrating against K reproduces the joint distribution.
- Conditional independence. X ⊥ Y | Z means E[f(X) g(Y) | Z] = E[f(X)|Z] · E[g(Y)|Z]. Foundation of graphical models, Bayesian networks, and causal inference.
- Best linear predictor. Restricting the projection to affine functions of Y gives the best linear predictor (BLP) E[X] + β(Y − E[Y]), with slope β = Cov(X,Y)/Var(Y). It equals E[X|Y] exactly when the conditional expectation is affine in Y, which is guaranteed when (X, Y) is jointly Normal.
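The σ-algebra definition in the first bullet becomes concrete when G is generated by a finite partition: E[X|G] is then the average of X over each atom, and every A ∈ G is a union of atoms. A minimal sketch under that assumption, with a hypothetical six-point sample space and illustrative names:

```javascript
// Hypothetical finite sample space with uniform probability.
const omega = [0, 1, 2, 3, 4, 5];
const P = () => 1 / omega.length;
const X = w => w * w;                   // the random variable being conditioned
const atoms = [[0, 1], [2, 3, 4], [5]]; // partition generating the sub-σ-algebra G

// E[X | G] is constant on each atom: the P-weighted average of X over it.
function condExpGivenPartition(w) {
  const atom = atoms.find(a => a.includes(w));
  const mass = atom.reduce((s, v) => s + P(v), 0);
  return atom.reduce((s, v) => s + X(v) * P(v), 0) / mass;
}

// Check E[Z · 1_A] = E[X · 1_A] for every union of atoms A.
let worstGap = 0;
for (let mask = 0; mask < 2 ** atoms.length; mask++) {
  const A = atoms.filter((_, i) => mask & (1 << i)).flat();
  const lhs = A.reduce((s, w) => s + condExpGivenPartition(w) * P(w), 0);
  const rhs = A.reduce((s, w) => s + X(w) * P(w), 0);
  worstGap = Math.max(worstGap, Math.abs(lhs - rhs));
}
console.log(omega.map(condExpGivenPartition)); // constant on each atom
console.log(worstGap); // 0 up to rounding
```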
JavaScript — compute E[X|Y] for samples
```javascript
// Nadaraya-Watson kernel estimator for E[X|Y=y]
function conditionalMean(samples, y0, bandwidth = 0.5) {
  // samples: array of [x, y] pairs
  let num = 0, den = 0;
  for (const [x, y] of samples) {
    const u = (y - y0) / bandwidth;
    const w = Math.exp(-0.5 * u * u); // Gaussian kernel
    num += x * w;
    den += w;
  }
  return den === 0 ? NaN : num / den;
}

// Tower property check: E[E[X|Y]] should ≈ E[X]
function checkTower(samples) {
  const ys = samples.map(s => s[1]);
  const xs = samples.map(s => s[0]);
  const eX = xs.reduce((a, b) => a + b, 0) / xs.length;
  // Estimate E[X|Y] at each Y_i, then average
  const condMeans = ys.map(y => conditionalMean(samples, y));
  const eEXY = condMeans.reduce((a, b) => a + b, 0) / condMeans.length;
  return { eX, eEXY, diff: Math.abs(eX - eEXY) };
}

// Example: simulate (X, Y) jointly Normal with ρ=0.6
function gauss() {
  // Box-Muller transform; 1 - Math.random() keeps u away from 0
  const u = 1 - Math.random(), v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

const samples = [];
for (let i = 0; i < 5000; i++) {
  const z1 = gauss(), z2 = gauss();
  const y = 70 + 15 * z1;
  const x = 170 + 10 * (0.6 * z1 + Math.sqrt(1 - 0.36) * z2);
  samples.push([x, y]);
}

// At y=85, theoretical E[X|Y=85] = 170 + 0.4*15 = 176
console.log(conditionalMean(samples, 85)); // ≈ 176
console.log(checkTower(samples)); // diff small relative to E[X]
```
Proof sketch — best mean-square predictor
Claim: among all measurable functions g(Y), the function g*(Y) = E[X|Y] minimises E[(X − g(Y))²].
Add and subtract:
E[(X − g(Y))²] = E[(X − E[X|Y] + E[X|Y] − g(Y))²]
= E[(X − E[X|Y])²] + 2 E[(X − E[X|Y])(E[X|Y] − g(Y))] + E[(E[X|Y] − g(Y))²].
The cross term vanishes by the defining identity: E[X|Y] − g(Y) is a function of Y, and the residual X − E[X|Y] integrates to 0 against any function of Y. So
E[(X − g(Y))²] = E[(X − E[X|Y])²] + E[(E[X|Y] − g(Y))²] ≥ E[(X − E[X|Y])²].
Equality iff g(Y) = E[X|Y] almost surely. The first term is the irreducible noise (the conditional variance averaged over Y); the second is the predictor's distance from the optimum. This is the Hilbert-space Pythagoras theorem applied to the projection.
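The Pythagoras decomposition in the proof can be checked term by term on a toy model. This is a sketch under assumed numbers: Y is uniform on {0, 1, 2}, X = 2Y plus symmetric noise, and `other` is an arbitrary competing predictor chosen for illustration.

```javascript
// Hypothetical model: Y uniform on {0, 1, 2}; X = 2Y + noise, noise = ±1 equally likely.
const outcomes = [];
for (const y of [0, 1, 2]) {
  for (const noise of [-1, 1]) outcomes.push({ x: 2 * y + noise, y, p: 1 / 6 });
}
const E = f => outcomes.reduce((s, o) => s + f(o) * o.p, 0);

const best = y => 2 * y;            // E[X | Y = y] for this model
const other = y => 1 + 0.5 * y * y; // an arbitrary competing predictor g

const mseOther = E(o => (o.x - other(o.y)) ** 2);  // E[(X - g(Y))²]
const mseBest = E(o => (o.x - best(o.y)) ** 2);    // irreducible noise
const gap = E(o => (best(o.y) - other(o.y)) ** 2); // distance from the optimum
const cross = E(o => (o.x - best(o.y)) * (best(o.y) - other(o.y)));

console.log(mseOther, mseBest + gap); // equal up to rounding
console.log(cross);                   // 0 up to rounding
```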
Where conditional expectation shows up
- Regression analysis. The regression function is exactly the conditional expectation E[Y|X]. Linear regression estimates a parametric form; modern methods (random forests, kernel smoothing, neural nets) estimate it nonparametrically.
- Martingales and finance. No-arbitrage pricing represents the price of a derivative as the conditional expectation of its payoff under a risk-neutral measure. Black-Scholes is a conditional expectation calculation.
- Stochastic processes. Markov chains, Brownian motion, Lévy processes — all defined or analysed via conditional expectations on filtrations.
- Bayesian inference. The posterior mean E[θ | data] is the conditional expectation; it is the Bayes estimator under squared-error loss.
- Variance reduction in simulation. Rao-Blackwellisation replaces a Monte Carlo estimator X by E[X|Y] for an auxiliary Y. The replacement has the same mean and smaller variance.
- Causal inference. The Average Treatment Effect on the Treated is a conditional expectation: E[Y(1) − Y(0) | treated]. Propensity-score methods rely on conditional-expectation identities.
- Reinforcement learning. The value function V(s) is E[return | start in state s]; the Q-function is E[return | (s, a)]. Bellman equations are recursive identities on conditional expectations.
Common pitfalls
- Treating E[X|Y] as a number. It is a random variable. A statement like "E[X|Y] = 5" is a statement about that random variable and holds only almost surely (for example, when X is independent of Y and E[X] = 5, so the conditional mean is 5 for almost every value of Y).
- Confusing E[X|Y = y] with E[X|Y]. The first is a number for each y; the second is a random variable. Conditioning on the value vs conditioning on the random variable.
- Assuming linearity of E[X|Y] in Y. Guaranteed for jointly Normal (X, Y), but not in general: the regression function can be highly nonlinear.
- Forgetting the conditional variance. The mean-square error of predicting X from Y is E[Var(X|Y)], not zero. Even the best predictor has irreducible noise.
- Confusing zero correlation with mean independence. Cov(X, Y) = 0 does not imply E[X|Y] = E[X]. Counter-example: Y ~ Normal(0, 1), X = Y². Then Cov(X, Y) = 0, but E[X|Y] = Y² ≠ 1 = E[X].
- Conditioning on measure-zero events. P(Y = y) = 0 for continuous Y, so "given Y = y" only makes sense as a function of y. The Borel paradox shows that naive conditioning on different parameterisations of the same event can give different answers.
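The zero-correlation and linearity pitfalls can be seen together in a few lines. The sketch below swaps the Normal Y of the counter-example for a discrete stand-in (Y uniform on {-1, 0, 1}, an assumption made so the arithmetic is exact) and compares the best linear predictor with the true conditional mean.

```javascript
// Hypothetical discrete stand-in for the counter-example: Y uniform on {-1, 0, 1}, X = Y².
const ys = [-1, 0, 1];
const p = 1 / ys.length;
const E = f => ys.reduce((s, y) => s + f(y) * p, 0);

const EY = E(y => y);
const EX = E(y => y * y);
const cov = E(y => (y * y - EX) * (y - EY)); // Cov(X, Y)
const varY = E(y => (y - EY) ** 2);

// Best linear predictor a + bY versus the true conditional mean g(y) = y².
const b = cov / varY;
const a = EX - b * EY;
for (const y of ys) {
  console.log(y, 'linear:', a + b * y, 'conditional mean:', y * y);
}
console.log('Cov(X, Y) =', cov); // zero, yet the conditional mean varies with y
```

The linear predictor is flat at E[X] because the slope is zero, while the conditional mean still takes different values at y = 0 and y = ±1.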
Applications across math, stats, and engineering
Linear regression
The least-squares estimator β̂ in OLS estimates the conditional expectation function E[Y|X], assuming it is linear. Under the Gauss-Markov assumptions (linear conditional mean, uncorrelated errors with equal variance), OLS is BLUE: the best linear unbiased estimator of the coefficients. When the joint is Gaussian, the fitted line targets the true conditional expectation; otherwise it targets the best linear approximation to it.
Black-Scholes pricing
A European call's price is C = e^{−rT} E^Q[(S_T − K)_+], the discounted conditional expectation of the payoff under the risk-neutral measure Q given current price S_0. The conditional expectation under Q (not the physical measure) is what makes pricing arbitrage-free. Greeks (Delta, Gamma, Vega) are partial derivatives of this conditional expectation.
Kalman filter
The Kalman filter computes E[state_t | observations_{1:t}] recursively for linear-Gaussian state-space models. The update step is exactly the conditional-expectation formula for jointly Normal vectors. Extends to extended/unscented Kalman filters for nonlinear systems.
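For a scalar state observed with additive noise, the update step and the Gaussian conditioning formula can be compared directly. A minimal sketch, assuming a prior x ~ Normal(m, P) and an observation z = x + v with independent noise v ~ Normal(0, R); the function names and numbers are illustrative.

```javascript
// Scalar measurement update: prior x ~ Normal(m, P), observation z = x + v,
// v ~ Normal(0, R) independent of x. All numbers below are hypothetical.
function kalmanUpdate(m, P, z, R) {
  const K = P / (P + R); // Kalman gain
  return { mean: m + K * (z - m), variance: (1 - K) * P };
}

// The same posterior from the Gaussian conditioning formula
// E[X|Z] = mu_X + Sigma_XZ / Sigma_ZZ * (Z - mu_Z), with Sigma_XZ = P and Sigma_ZZ = P + R.
function gaussianConditioning(m, P, z, R) {
  const sigmaXZ = P, sigmaZZ = P + R;
  return {
    mean: m + (sigmaXZ / sigmaZZ) * (z - m),
    variance: P - (sigmaXZ * sigmaXZ) / sigmaZZ,
  };
}

const viaFilter = kalmanUpdate(0, 4, 3, 1);
const viaFormula = gaussianConditioning(0, 4, 3, 1);
console.log(viaFilter, viaFormula); // mean about 2.4 and variance about 0.8 in both
```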
Reinforcement learning
The state-value function V^π(s) = E^π[Σ_t γ^t r_t | S_0 = s] is a conditional expectation. The Bellman equation V(s) = E[r + γ V(S')| S = s] is a fixed-point identity using the tower property — average over the next state given the current.
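The fixed-point reading of the Bellman equation can be illustrated on a tiny Markov reward process. The sketch below assumes a hypothetical two-state chain under a fixed policy; `reward`, `transition` and `backup` are names introduced for the example.

```javascript
// Hypothetical two-state Markov reward process under a fixed policy.
const gamma = 0.9;
const reward = [1, 0];          // expected reward collected in each state
const transition = [[0.5, 0.5], // P(S' = j | S = i)
                    [0.2, 0.8]];

// One Bellman backup: V(s) <- r(s) + gamma * E[V(S') | S = s].
function backup(V) {
  return reward.map((r, s) =>
    r + gamma * transition[s].reduce((acc, pr, next) => acc + pr * V[next], 0));
}

let V = [0, 0];
for (let i = 0; i < 500; i++) V = backup(V);

const residual = Math.max(...backup(V).map((v, s) => Math.abs(v - V[s])));
console.log(V, residual); // residual near 0: V is a fixed point of the backup
```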
EM algorithm
Expectation-Maximisation iterates two steps: the E-step computes E[log p(X, Z | θ) | X, θ^{old}], a conditional expectation, and the M-step maximises it in θ. Used in mixture models, hidden Markov models, factor analysis. Each iteration never decreases the observed-data likelihood.
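A stripped-down EM loop shows where the conditional expectation enters. This sketch assumes a deliberately simple model, an equal-weight mixture of two unit-variance Normals with only the two means unknown, and hypothetical data; it is not a general mixture-model fitter.

```javascript
// Hypothetical 1-D data; model: equal-weight mixture of Normal(mu0, 1) and Normal(mu1, 1).
const data = [-2.1, -1.9, -2.4, -1.6, 1.8, 2.2, 2.5, 1.5, 2.0];
const normalPdf = (x, mu) => Math.exp(-0.5 * (x - mu) ** 2) / Math.sqrt(2 * Math.PI);

function logLikelihood([mu0, mu1]) {
  return data.reduce(
    (s, x) => s + Math.log(0.5 * normalPdf(x, mu0) + 0.5 * normalPdf(x, mu1)), 0);
}

function emStep([mu0, mu1]) {
  // E-step: responsibility r_i = E[1{Z_i = 1} | x_i, current means].
  const r = data.map(x => {
    const a = normalPdf(x, mu0), b = normalPdf(x, mu1);
    return b / (a + b);
  });
  // M-step: responsibility-weighted means maximise the expected complete-data log-likelihood.
  const sumR = r.reduce((s, v) => s + v, 0);
  const newMu1 = data.reduce((s, x, i) => s + r[i] * x, 0) / sumR;
  const newMu0 = data.reduce((s, x, i) => s + (1 - r[i]) * x, 0) / (data.length - sumR);
  return [newMu0, newMu1];
}

let theta = [-0.5, 0.5]; // initial guess for [mu0, mu1]
const trace = [logLikelihood(theta)];
for (let i = 0; i < 20; i++) {
  theta = emStep(theta);
  trace.push(logLikelihood(theta));
}
const monotone = trace.every((v, i) => i === 0 || v >= trace[i - 1] - 1e-12);
console.log(theta);    // means near the two clusters in the data
console.log(monotone); // whether the log-likelihood never decreased (up to rounding)
```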
Frequently asked questions
Why is E[X|Y] a random variable, not a number?
E[X|Y = y] is a number for each fixed y — the conditional mean given that specific value. But Y itself is random, so when we condition on Y as a random variable (not a fixed value), the resulting object E[X|Y] depends on the random outcome of Y. It is the function y ↦ E[X|Y = y] composed with Y. So E[X|Y](ω) = E[X|Y = Y(ω)]. By construction it is a function of Y, meaning measurable with respect to the σ-algebra generated by Y. This distinction matters: E[X|Y] = 0 is a statement about a random variable being almost surely zero, not about a single number.
What is the tower property and why is it useful?
The tower property — also called the law of total expectation, or iterated expectations — says E[E[X|Y]] = E[X]. Take a conditional expectation, then average over Y: you recover the unconditional expectation. This is the workhorse identity for computing E[X] indirectly. If X is hard to integrate but the conditional law of X given some auxiliary Y is tractable, compute E[X|Y = y] in closed form for each y, then average over the law of Y. Examples: gambler's ruin probabilities (condition on first step), branching process means (condition on number of offspring in generation one), random sums (Wald's identity).
How is conditional expectation related to regression?
E[X|Y] is the best mean-square predictor of X from Y — the function g(Y) that minimizes E[(X − g(Y))²]. Linear regression restricts g to affine functions a + bY, giving a + bY = E[X] + (Cov(X,Y)/Var(Y))(Y − E[Y]). When (X, Y) are jointly Normal, the best predictor is exactly affine, so linear regression equals conditional expectation in that case. For non-Gaussian data, conditional expectation can be highly nonlinear, and modern regression methods (kernel regression, neural networks, random forests) try to estimate it directly from data.
What's the conditional variance decomposition?
Var(X) = E[Var(X|Y)] + Var(E[X|Y]). Total variance equals the average within-Y variance plus the variance between Y groups. This is the law of total variance — the probabilistic analog of ANOVA's within-versus-between decomposition. Useful for designing experiments (decompose noise sources), for variance reduction in Monte Carlo (Rao-Blackwellization replaces X by E[X|Y] to reduce variance while preserving mean), and for understanding R² in regression (R² = Var(E[Y|X])/Var(Y) = fraction of variance explained by conditioning).
How does conditional expectation define a martingale?
A sequence (Xₙ) adapted to a filtration (Fₙ) is a martingale if E[X_{n+1} | Fₙ] = Xₙ for all n. Conditional expectation given the past history is exactly the current value: the "fair game" property. Submartingales (E[X_{n+1} | Fₙ] ≥ Xₙ) drift up; supermartingales drift down. The whole martingale toolkit (optional stopping theorem, martingale convergence, Doob's inequalities) is built on the algebra of conditional expectation. Brownian motion and mean-zero random walks are martingales, and many gambling/finance processes become martingales after suitable centring or discounting.
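The one-step condition is easy to test for the simple symmetric random walk, where conditioning on the past reduces to conditioning on the current position. The sketch below is illustrative (the helper `condExpNext` is a name introduced here): it checks that S_n and S_n² − n satisfy the martingale identity while S_n² drifts up by one per step.

```javascript
// Simple symmetric random walk: from position s the next position is s + 1 or s - 1,
// each with probability 1/2.
function condExpNext(f, s, n) {
  // E[f(S_{n+1}, n + 1) | S_n = s]
  return 0.5 * f(s + 1, n + 1) + 0.5 * f(s - 1, n + 1);
}

const walk = (s, n) => s;                // S_n
const squared = (s, n) => s * s;         // S_n²
const compensated = (s, n) => s * s - n; // S_n² − n

for (const [s, n] of [[0, 0], [3, 5], [-2, 4]]) {
  console.log(
    condExpNext(walk, s, n) - walk(s, n),              // 0: martingale
    condExpNext(squared, s, n) - squared(s, n),        // 1: submartingale, drifts up
    condExpNext(compensated, s, n) - compensated(s, n) // 0: martingale
  );
}
```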
How do you compute E[X|Y] in practice?
For discrete (X, Y), E[X|Y = y] = Σ_x x · P(X = x | Y = y) — a weighted average over the conditional pmf. For continuous (X, Y) with joint density f(x, y), E[X|Y = y] = ∫ x · f(x, y)/f_Y(y) dx, where f_Y is the marginal of Y. For Gaussians, the formula is closed-form affine: E[X|Y] = μ_X + (Σ_XY Σ_YY⁻¹)(Y − μ_Y). For arbitrary distributions where the joint is given by samples (e.g. data), estimate E[X|Y] by local averaging (k-NN, kernel smoothing) or by fitting a flexible regression model. In any case, the orthogonal projection characterization gives you a check: residuals X − E[X|Y] should be uncorrelated with every function of Y.
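Local averaging with k nearest neighbours is one of the sample-based options mentioned above. The sketch below is a minimal version on deterministic, hypothetical data where the true conditional mean is y²; the function name `knnConditionalMean` and the default k = 5 are assumptions.

```javascript
// k-nearest-neighbour estimate of E[X | Y = y0] from [x, y] pairs.
function knnConditionalMean(samples, y0, k = 5) {
  const nearest = [...samples]
    .sort((a, b) => Math.abs(a[1] - y0) - Math.abs(b[1] - y0))
    .slice(0, k);
  return nearest.reduce((s, [x]) => s + x, 0) / nearest.length;
}

// Deterministic hypothetical data: y on a grid, x = y² plus a small alternating offset.
const samples = [];
for (let i = 0; i <= 100; i++) {
  const y = i / 50 - 1; // grid on [-1, 1]
  const x = y * y + (i % 2 === 0 ? 0.05 : -0.05);
  samples.push([x, y]);
}
console.log(knnConditionalMean(samples, 0.5)); // close to 0.25
console.log(knnConditionalMean(samples, 0));   // close to 0
```

Larger k averages away more noise but blurs curvature in the regression function, the same trade-off the kernel bandwidth controls.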
Conditional vs marginal vs joint — and best predictors
Where conditional expectation fits in the family of probabilistic objects.
| Object | Definition | Type | Computation | Use case |
|---|---|---|---|---|
| Marginal E[X] | ∫ x f_X(x) dx | Single number | Average X over all outcomes | Long-run mean, baseline |
| Conditional mean E[X\|Y = y] | ∫ x f(x\|y) dx | Function of y | For each y, average X over conditional law | Predict X given specific y |
| E[X\|Y] (random variable) | g(Y) where g(y) = E[X\|Y=y] | Random variable, measurable wrt σ(Y) | Apply g to the realization of Y | Tower property, martingales, projections |
| Best linear predictor | a + bY minimising E[(X − a − bY)²] | Affine function of Y | b = Cov(X,Y)/Var(Y); a = E[X] − b E[Y] | OLS / linear regression |
| Joint distribution | f(x, y) | Density on ℝ² | Sum/integrate to recover marginals | Full dependence structure |
| Conditional variance Var(X\|Y) | E[(X − E[X\|Y])² \| Y] | Random variable | Apply variance to conditional law | Heteroscedasticity, GARCH |
| Conditional probability P(A\|Y) | E[1_A \| Y] | Random variable in [0, 1] | Special case of E[X\|Y] with X = 1_A | Bayesian classifiers, decision rules |