5  Conditional expectation for Gaussians

Updated: November 12, 2025

5.1 Review of Gaussian random variables

  1. A multivariate Gaussian random vector \(X = (X_1, \dots, X_n) \sim \mathcal{N}(\mu, \Sigma)\) has a PDF \[ f_X(x) = \frac{1}{(2\pi)^{n/2} \det(\Sigma)^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^\TRANS \Sigma^{-1} (x - \mu) \right), \quad x \in \reals^n \] and MGF \[ M_X(s) = \EXP\big[ e^{s^\TRANS X} \big] = \exp\left( s^\TRANS \mu + \frac{1}{2} s^\TRANS \Sigma s \right), \qquad s \in \reals^n. \]

  2. Thus, all moments of a Gaussian random variable can be expressed in terms of its mean and variance. In particular, if \(X \sim \mathcal{N}(0, \sigma^2)\), then all odd moments are zero; all even moments can be written in terms of the variance: \[ \EXP[X^{2k}] = \underbrace{(2k-1)\cdot(2k-3)\cdots3\cdot1}_{\eqqcolon (2k-1)!!}\, \sigma^{2k} \]

  3. Any linear combination of jointly Gaussian random variables is also Gaussian.

    More concretely, if \(X = (X_1, \dots, X_n) \sim \mathcal{N}(\mu, \Sigma)\) and \(A \in \reals^{m \times n}\) and \(b \in \reals^m\), then the affine transformation \(A X + b\) is also Gaussian: \[ A X + b \sim \mathcal{N}\big(A\mu + b,\, A\Sigma A^\TRANS\big). \]

  4. As a consequence, marginals of jointly Gaussian random variables are also Gaussian.

    More concretely, if \(X = (X_1, \dots, X_n) \sim \mathcal{N}(\mu, \Sigma)\), then any subvector formed by selecting a subset of the indices is itself multivariate Gaussian, with the corresponding mean vector and covariance matrix.

  5. Jointly Gaussian random variables are independent if and only if they are uncorrelated. In particular, if \(X\) and \(Y\) are jointly Gaussian random variables, then \(X\) and \(Y\) are independent if and only if \(\COV(X, Y) = 0\).

    For a jointly Gaussian random vector \((X_1, \dots, X_n)\), the components are independent if and only if the covariance matrix is diagonal.
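The properties above are easy to check numerically. Below is a minimal Monte Carlo sketch in Python/NumPy (the specific values of \(μ\), \(Σ\), \(A\), \(b\), and \(σ\) are arbitrary illustrative choices, not part of the notes): it compares the empirical mean and covariance of \(AX + b\) with \(Aμ + b\) and \(AΣA^\TRANS\), and checks the even-moment formula \(\EXP[X^{2k}] = (2k-1)!!\, σ^{2k}\) for a scalar Gaussian.

```python
# Monte Carlo sanity check of two facts from the review (illustrative numbers only):
#   (i)  if X ~ N(mu, Sigma), then A X + b ~ N(A mu + b, A Sigma A^T);
#   (ii) if X ~ N(0, sigma^2), then E[X^{2k}] = (2k-1)!! * sigma^{2k}.
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000  # number of Monte Carlo samples

# (i) Affine transformation of a Gaussian vector
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 2.0, -1.0],
              [0.0, 1.0,  1.0]])
b = np.array([0.5, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=N)   # samples of X, shape (N, 3)
Z = X @ A.T + b                                  # samples of A X + b, shape (N, 2)

print("empirical mean of AX+b:", Z.mean(axis=0))
print("A mu + b              :", A @ mu + b)
print("empirical cov of AX+b :\n", np.cov(Z, rowvar=False))
print("A Sigma A^T           :\n", A @ Sigma @ A.T)

# (ii) Even moments of a scalar zero-mean Gaussian
sigma = 1.7
W = rng.normal(0.0, sigma, size=N)
for k in (1, 2, 3):
    double_factorial = np.prod(np.arange(2 * k - 1, 0, -2))  # (2k-1)!!
    print(f"E[W^{2*k}] ≈ {np.mean(W ** (2 * k)):.3f}  vs  "
          f"(2k-1)!! sigma^{2*k} = {double_factorial * sigma ** (2 * k):.3f}")
```

The same pattern (sample, transform, compare empirical moments with the closed-form expressions) is used again at the end of the next section to check the conditional mean and covariance formulas.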

5.2 Conditional expectation for Gaussian random variables

  1. If \(X\) and \(Y\) are jointly Gaussian, then \(X|Y\) is also Gaussian. In particular, suppose

    • \(X \sim \mathcal{N}(\mu_X, \sigma_X^2)\)
    • \(Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)\)
    • \(\COV(X, Y) = \rho \sigma_X \sigma_Y\), where \(-1 \leq \rho \leq 1\)

    then \[ X \mid (Y = y) \sim \mathcal{N}(\mu_{X|Y},\, \sigma^2_{X|Y}) \] where the conditional mean is \[ \mu_{X|Y} = \mu_X + \rho \frac{\sigma_X}{\sigma_Y}(y - \mu_Y) \] and the conditional variance is \[ \sigma^2_{X|Y} = \sigma_X^2 (1 - \rho^2). \]

  2. It is possible to derive the above result from the definition of conditional probability by using some tedious algebra. However, we present a different approach here that is based on interpreting conditional expectations as orthogonal projections.

  3. Hilbert space of random variables: The key idea is to construct a Hilbert space of square-integrable random variables. For scalar random variables \(X\) and \(Y\) with \(\EXP[X^2] < ∞\) and \(\EXP[Y^2] < ∞\), we define the inner product as \[ \IP{X}{Y} = \EXP[XY]. \] The corresponding norm is \(\NORM{X}^2 = \EXP[X^2]\). It can be shown that this space is complete (i.e., all Cauchy sequences converge), and is therefore a Hilbert space.

    For \(\reals^n\)-valued square-integrable random vectors \(X_1\) and \(X_2\) (i.e., \(\EXP[\|X_1\|^2] < ∞\) and \(\EXP[\|X_2\|^2] < ∞\)), we define the inner product as \[ \IP{X_1}{X_2} = \EXP[X_1^\TRANS X_2]. \] The corresponding norm satisfies \(\NORM{X_1}^2 = \EXP[\|X_1\|^2]\), where the norm inside the expectation is the Euclidean norm on \(\reals^n\). As before, this space is complete and forms a Hilbert space.

  4. Orthogonal projection theorem: A fundamental property of Hilbert spaces is the orthogonal projection theorem.

    Theorem 5.1 (Orthogonal Projection Theorem) Let \(\ALPHABET M\) be a closed linear subspace of a Hilbert space \(\ALPHABET H\). Then any element \(x \in \ALPHABET H\) can be uniquely represented in the form \[ x = \hat x + e \] where \(\hat x \in \ALPHABET M\) and \(e \perp \ALPHABET M\) (i.e., for any \(w \in \ALPHABET M\), \(\IP{e}{w} = 0\)). This unique element \(\hat x\) is called the orthogonal projection of \(x\) onto \(\ALPHABET M\) and has the property that it minimizes the distance: \[ \NORM{x - \hat x} \le \NORM{x - w}, \quad \forall w \in \ALPHABET M. \]

  5. We also state a lemma that will be useful later on.

    Lemma 5.1 If \(u \in \reals^n\) and \(w \in \reals^m\) are vectors such that \(u^\TRANS K w = 0\) for all matrices \(K \in \reals^{n \times m}\), then \(u w^\TRANS = \mathbf{0}_{n \times m}\).

    Note that \[u^\TRANS K w = \sum_{i=1}^n \sum_{j=1}^m K_{ij} u_i w_j.\]

    For any \((i', j')\) with \(1 \le i' \le n\) and \(1 \le j' \le m\), choose \(K\) such that \(K_{ij} = δ_{ii'} δ_{jj'}\) (i.e., \(K\) has a \(1\) in position \((i', j')\) and zeros elsewhere). Then \(u^\TRANS K w = u_{i'} w_{j'} = 0\).

    Since \((i', j')\) was arbitrary, we have \(u_i w_j = 0\) for all \(i, j\). Therefore, \(u w^\TRANS = \mathbf{0}_{n \times m}\). The same argument applies when \(u\) and \(w\) are random vectors and the hypothesis is \(\EXP[u^\TRANS K w] = 0\) for all \(K\): the same choice of \(K\) gives \(\EXP[u_{i'} w_{j'}] = 0\) for every \((i', j')\), i.e., \(\EXP[u w^\TRANS] = \mathbf{0}_{n \times m}\). This is the form in which the lemma is used below.

  6. Application to Gaussian random vectors: Now let \(X \in \reals^n\) and \(Y \in \reals^m\) be jointly Gaussian random vectors, i.e., \[ \MATRIX{X \\ Y} \sim \mathcal{N}\left( \MATRIX{ μ_X \\ μ_Y}, \MATRIX{ Σ_{XX} & Σ_{XY} \\ Σ_{XY}^\TRANS & Σ_{YY}} \right). \] Let \(\ALPHABET M\) denote the linear subspace (finite-dimensional, and hence closed) of all affine functions of \(Y\), i.e., \[ \ALPHABET M = \{ K Y + c : K \in \reals^{n \times m}, c \in \reals^n \}. \] Let \(\hat{X}\) denote the orthogonal projection of \(X\) onto \(\ALPHABET M\). From Theorem 5.1, we know that \(\hat{X}\) minimizes the distance, i.e., for any \(W \in \ALPHABET M\), \[ \NORM{X - \hat{X}} \le \NORM{X - W} \iff \EXP[ \|X - \hat{X}\|^2] \le \EXP[ \|X - W\|^2]. \] Thus, \(\hat{X}\) minimizes the mean-squared error. Moreover, by the orthogonality property, the projection error is orthogonal to the subspace: \[ \IP{W}{X - \hat{X}} = 0, \quad \forall W \in \ALPHABET M \iff \EXP[ W^\TRANS (X - \hat{X})] = 0, \quad \forall W \in \ALPHABET M. \]

  7. An immediate implication of the above is that \(\hat X\) is an unbiased estimator of \(X\), i.e., \[ \EXP[\hat{X}] = \EXP[X]. \]

    Proof.

    Since \(\ALPHABET M\) contains all constant vectors, it contains \(e_i\), the unit vector in the \(i\)-th direction, for every \(i \in \{1, \dots, n\}\). The orthogonality property then gives \[ 0 = \IP{e_i}{X - \hat{X}} = \EXP[ e_i^\TRANS (X - \hat{X})] = \EXP[X_i - \hat{X}_i], \quad 1 \le i \le n, \] which is equivalent to \(\EXP[\hat{X}] = \EXP[X]\).

  8. The orthogonal projection is the conditional expectation, i.e., \[ \hat{X} = \EXP[X \mid Y]. \]

    Proof.

    Since \(\hat{X} \in \ALPHABET M\), it is an affine function of \(Y\). Hence \(X - \hat X\) and \(Y\) are jointly Gaussian. Moreover, since \(K(Y - μ_Y) \in \ALPHABET M\) for all \(K \in \reals^{n \times m}\), the orthogonality property gives \[ \EXP[ (X - \hat{X})^\TRANS K(Y - μ_Y)] = 0, \quad \forall K \in \reals^{n \times m}. \] By Lemma 5.1 (in its expectation form), this implies that \[ \EXP[(X - \hat{X})(Y - μ_Y)^\TRANS] = \mathbf{0}_{n \times m}. \] Since \(X - \hat X\) and \(Y\) are jointly Gaussian and uncorrelated (as shown above), they are independent. Therefore, for any measurable and square-integrable function \(g\), independence gives \[ \IP{X - \hat{X}}{g(Y)} = \EXP[(X - \hat{X})^\TRANS g(Y)] = \EXP[X - \hat{X}]^\TRANS \EXP[g(Y)] = 0, \] where the last equality follows from the fact that \(\hat X\) is an unbiased estimator of \(X\).

    Now, from the previous section, we know that the conditional expectation is the unique function \(h(Y)\) such that \(X - h(Y)\) is orthogonal to all (square-integrable) functions of \(Y\). Hence, \(\EXP[X \mid Y] = \hat{X}\).

  9. Deriving the formula: To find the explicit formula for \(\EXP[X \mid Y]\), we write \(\hat{X} = K_{\circ} Y + c_\circ\) and find \(K_\circ\) and \(c_\circ\) from properties of orthogonal projection. First note that from the smoothing property of conditional expectation, we have \[ \EXP[ \hat X] = \EXP[ \EXP[ X \mid Y] ] = \EXP[X] \implies K_\circ μ_Y + c_\circ = μ_X \implies c_\circ = μ_X - K_\circ μ_Y. \]

    By the orthogonal projection theorem, the error \(X - \hat{X} = (X - μ_X) - K_{\circ}(Y - μ_Y)\) must be orthogonal to all elements of \(\ALPHABET M\). In particular, it must be orthogonal to \(K(Y - μ_Y)\) for every \(K \in \reals^{n \times m}\) and to all constant vectors.

    The orthogonality condition with \(K(Y - μ_Y)\) gives: \[\begin{equation}\label{eq:orthogonal} \EXP[( (X - μ_X) - K_{\circ}(Y - μ_Y) )^\TRANS K(Y - μ_Y)] = 0, \quad \forall K \in \reals^{n \times m}. \end{equation}\] By Lemma 5.1, \(\eqref{eq:orthogonal}\) is equivalent to \[\begin{equation}\label{eq:claim} \EXP[( (X - μ_X) - K_{\circ}(Y - μ_Y) )(Y - μ_Y)^\TRANS ] = \mathbf{0}_{n \times m}. \end{equation}\]

    Now this implies that \[ Σ_{XY} = \EXP[ (X - μ_X) (Y - μ_Y)^\TRANS ] = K_{\circ} \EXP[ (Y - μ_Y) (Y - μ_Y)^\TRANS] = K_{\circ} Σ_{YY}. \] Hence, assuming \(Σ_{YY}\) is invertible, \[ K_{\circ} = Σ_{XY} Σ_{YY}^{-1} \quad\text{and}\quad c_{\circ} = μ_X - Σ_{XY} Σ_{YY}^{-1} μ_Y. \]

    Thus, \[ \bbox[5pt,border: 1px solid]{ \EXP[ X \mid Y] = μ_X + Σ_{XY} Σ_{YY}^{-1}(Y - μ_Y) } \]

    Since the conditional mean is an affine function of \(Y\), it is a Gaussian random vector.

  10. The conditional covariance \(\COV(X \mid Y)\) is defined as \[ \COV(X \mid Y) = \EXP[ (X - \EXP[X \mid Y])(X - \EXP[X \mid Y])^\TRANS \mid Y]. \] We have already shown (in the proof that \(\hat X = \EXP[X \mid Y]\)) that \(X - \hat X\) is independent of \(Y\). Hence, \[ \EXP[ (X - \hat X)(X - \hat X)^\TRANS \mid Y] = \EXP[ (X - \hat X) (X - \hat X)^\TRANS]. \] To simplify the calculation, assume without loss of generality that \(X\) and \(Y\) are zero mean (the general case follows by replacing \(X\) and \(Y\) with \(X - μ_X\) and \(Y - μ_Y\)). Then, \(\hat{X} = K_{\circ}Y\), and we have \[\begin{align*} \COV(X \mid Y) &= \EXP[ (X - \hat X) (X - \hat X)^\TRANS] \\ &= \EXP[ X X^\TRANS] - \EXP[X Y^\TRANS] K_{\circ}^\TRANS - K_{\circ} \EXP[Y X^\TRANS] + K_{\circ} \EXP[Y Y^\TRANS] K_{\circ}^\TRANS \\ &= Σ_{XX} - Σ_{XY} K_{\circ}^\TRANS - K_{\circ} Σ_{XY}^\TRANS + K_{\circ} Σ_{YY} K_{\circ}^\TRANS \\ &= Σ_{XX} - Σ_{XY} Σ_{YY}^{-1} Σ_{XY}^\TRANS, \end{align*}\] where the last step uses \(K_{\circ} = Σ_{XY} Σ_{YY}^{-1}\) and the symmetry of \(Σ_{YY}\). In particular, the conditional covariance does not depend on the realization of \(Y\).

  11. Summary: For jointly Gaussian random vectors \((X, Y)\) with the joint distribution given above, the conditional distribution is Gaussian and is given by: \[ \bbox[5pt,border: 1px solid]{ X \mid Y \sim \mathcal{N}\left(μ_X + Σ_{XY} Σ_{YY}^{-1} (Y - μ_Y), \, Σ_{XX} - Σ_{XY} Σ_{YY}^{-1} Σ_{XY}^\TRANS\right) } \]
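The boxed result can be checked numerically in the same way as the review facts. The following is a minimal NumPy sketch (the block covariance entries `S_XX`, `S_XY`, `S_YY` and the means are arbitrary illustrative choices standing in for \(Σ_{XX}, Σ_{XY}, Σ_{YY}\) and \(μ_X, μ_Y\)): it samples a jointly Gaussian pair \((X, Y)\), evaluates \(\hat X = μ_X + Σ_{XY} Σ_{YY}^{-1}(Y - μ_Y)\) at each sample, and verifies that the error \(X - \hat X\) is uncorrelated with \(Y\), that its covariance matches \(Σ_{XX} - Σ_{XY} Σ_{YY}^{-1} Σ_{XY}^\TRANS\), and that \(\hat X\) has a smaller mean-squared error than the prior mean \(μ_X\).

```python
# Monte Carlo sketch of the boxed formulas (illustrative numbers only):
# sample a jointly Gaussian (X, Y), form Xhat = mu_X + S_XY S_YY^{-1} (Y - mu_Y),
# and check that the error X - Xhat is uncorrelated with Y and has covariance
# S_XX - S_XY S_YY^{-1} S_XY^T.
import numpy as np

rng = np.random.default_rng(1)
N = 500_000  # number of Monte Carlo samples

# Joint distribution of (X, Y) with X in R^2 and Y in R^2 (arbitrary choices)
mu_X = np.array([1.0, -1.0])
mu_Y = np.array([0.0, 2.0])
S_XX = np.array([[2.0, 0.4], [0.4, 1.0]])
S_YY = np.array([[1.5, 0.2], [0.2, 1.0]])
S_XY = np.array([[0.8, 0.1], [0.3, -0.5]])

mu = np.concatenate([mu_X, mu_Y])
Sigma = np.block([[S_XX, S_XY], [S_XY.T, S_YY]])

XY = rng.multivariate_normal(mu, Sigma, size=N)
X, Y = XY[:, :2], XY[:, 2:]

K = S_XY @ np.linalg.inv(S_YY)          # K_circ = S_XY S_YY^{-1}
Xhat = mu_X + (Y - mu_Y) @ K.T          # E[X | Y] evaluated at each sample
err = X - Xhat

# (a) the error is (empirically) uncorrelated with Y
print("Cov(err, Y) ≈\n", (err - err.mean(axis=0)).T @ (Y - Y.mean(axis=0)) / N)

# (b) the error covariance matches S_XX - S_XY S_YY^{-1} S_XY^T
print("empirical Cov(err) ≈\n", np.cov(err, rowvar=False))
print("S_XX - S_XY S_YY^{-1} S_XY^T =\n", S_XX - K @ S_XY.T)

# (c) Xhat has smaller mean-squared error than the prior mean mu_X
print("MSE of E[X|Y]:", np.mean(np.sum(err ** 2, axis=1)))
print("MSE of mu_X  :", np.mean(np.sum((X - mu_X) ** 2, axis=1)))
```

In the scalar case (\(n = m = 1\)), substituting \(Σ_{XY} = ρ σ_X σ_Y\) and \(Σ_{YY} = σ_Y^2\) into the boxed formulas recovers the expressions \(μ_{X|Y} = μ_X + ρ \frac{σ_X}{σ_Y}(y - μ_Y)\) and \(σ^2_{X|Y} = σ_X^2(1 - ρ^2)\) stated at the beginning of this section.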

Exercises

Exercise 5.1 Let \(X \sim \mathcal{N}(μ_X, Σ_X)\) denote the state of nature. We make an observation \[ Y = X + W \] where \(W \sim \mathcal{N}(0, Σ_W)\) is independent of \(X\). Find \(\EXP[X \mid Y]\).

Exercise 5.2 A robot is trying to estimate its position \(X \sim \mathcal{N}(μ_X, Σ_X)\) using two independent sensors. Sensor 1 provides the measurement \[ Y_1 = C_1 X + W_1 \] where \(W_1 \sim \mathcal{N}(0, Σ_{W_1})\) is measurement noise, independent of \(X\), and \(C_1\) is a known measurement matrix. Sensor 2 provides the measurement \[ Y_2 = C_2 X + W_2 \] where \(W_2 \sim \mathcal{N}(0, Σ_{W_2})\) is independent of \((X, W_1)\) and \(C_2\) is another known measurement matrix.

Find \(\EXP[X \mid Y_1, Y_2]\) and show how the conditional covariance compares to using a single sensor.

Further reading

  1. Gubner, Chapter 8.5, 9