30 Linear filtering
30.1 Hilbert space of random variables
Let \((Ω, \ALPHABET F, \PR)\) be a probability space. Let \(L_2(Ω, \ALPHABET F, \PR)\) denote the family of all \(\reals\)-valued random variables on \((Ω, \ALPHABET F, \PR)\) that have finite second moments.
It can be easily seen that \(L_2(Ω, \ALPHABET F, \PR)\) is a vector space with addition and scalar multiplication defined in the usual way for functions.
We can make \(L_2(Ω, \ALPHABET F, \PR)\) an inner product space with \[ \IP{x}{y} \coloneqq \EXP[xy]. \] The associated \(L_2\)-norm is given by \[ \NORM{x}_{2} = \IP{x}{x}^{1/2} = \sqrt{ \EXP[x^2] }. \]
It can be shown that \(L_2(Ω, \ALPHABET F, \PR)\) is complete (e.g., from the completeness of \(L_p\) spaces) and is therefore a Hilbert space.
The above argument also works for random vectors. With a slight abuse of notation, we also use \(L_2(Ω, \ALPHABET F, \PR)\) to denote all \(\reals^n\)-valued random variables that have finite second moments. As before, we can show that \(L_2(Ω, \ALPHABET F, \PR)\) is a vector space.
Define the inner product \[ \IP{x}{y} \coloneqq \EXP[ x^\TRANS y ].\] The corresponding norm is \[ \NORM{x}_{2} = \IP{x}{x}^{1/2} = \sqrt{ \EXP[\NORM{x}^2] }. \] Again, we can show that the corresponding inner product space is complete and hence a Hilbert space.
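As a quick sanity check of these definitions, here is a short sketch assuming NumPy, with the expectations approximated by sample averages over an arbitrary pair of correlated Gaussian vectors (the specific model and sample size are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000                      # number of Monte Carlo samples

# Two correlated R^3-valued random vectors with finite second moments
x = rng.normal(size=(N, 3))
y = 0.5 * x + rng.normal(size=(N, 3))

# Inner product <x, y> = E[x^T y], approximated by a sample average
ip_xy = np.mean(np.sum(x * y, axis=1))

# Norm ||x||_2 = sqrt(E[||x||^2])
norm_x = np.sqrt(np.mean(np.sum(x * x, axis=1)))

print(ip_xy)   # ~ E[x^T y] = 0.5 * E[||x||^2] = 1.5 for this model
print(norm_x)  # ~ sqrt(3)
```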
30.2 Best linear unbiased estimator (BLUE)
Let \(x\), \(y_1\), \(y_2\), \(\dots\), \(y_t\) be \(\reals\)-valued random variables with finite second moments defined on a common probability space \((Ω, \ALPHABET F, \PR)\). Define \[ \ALPHABET L(y_{1:t}) = \{ a_0 + a_1 y_1 + \cdots + a_t y_t : a_0, a_1, \dots, a_t \in \reals \} \] to be the linear span of \(y_{1:t}\), which is a finite-dimensional Hilbert subspace of \(L_2(Ω, \ALPHABET F, \PR)\).
Define \(\hat x = \LEXP[ x \mid \ALPHABET L(y_{1:t}) ]\) to be the orthogonal projection of \(x\) onto \(\ALPHABET L(y_{1:t})\) (see Theorem 46.1). From the properties of orthogonal projection, we know that \(\hat x\) is the best linear estimate of \(x\) with respect to \(y_{1:t}\) in the sense that for any \(w \in \ALPHABET L(y_{1:t})\), \[ \NORM{ x - w } \ge \NORM{x - \hat x}, \quad \text{i.e.,} \quad \EXP[ \NORM{x - w}^2 ] \ge \EXP[ \NORM{ x - \hat x}^2 ]. \] For this reason, \(\hat x\) is also called the least squares estimate of \(x\) given \(y_{1:t}\).
Let \(\{1, v_1, v_2, \dots\}\) be an orthonormal basis of \(\ALPHABET L(y_{1:t})\) (which can be obtained via the Gram–Schmidt orthogonalization procedure). Note that when all vectors are zero mean, we do not need to include the constant \(1\) in the basis. Then, \[ \hat x = \IP{x}{1} 1 + \IP{x}{v_1} v_1 + \IP{x}{v_2} v_2 + \cdots. \] Since \((v_1, v_2, \dots)\) are orthogonal to \(1\), we have \(\IP{v_i}{1} = 0\). Thus, \[ \IP{\hat x}{1} = \IP{x}{1} \] or \[ \EXP[ \hat x ] = \EXP[x]. \] Hence, \(\hat x\) is an unbiased estimator.
Another implication is that the error is orthogonal to the estimate, i.e., \[ \IP{\hat x}{x - \hat x} = 0 \implies \EXP[ \hat x (x - \hat x) ] = 0. \] Thus, the error is uncorrelated with the estimate.
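A minimal numerical sketch of this construction, assuming NumPy, two scalar observations, and Monte Carlo approximation of all expectations (the data-generating model below is an arbitrary illustration): it builds an orthonormal basis of \(\ALPHABET L(y_{1:2})\) by Gram–Schmidt and checks both unbiasedness and the orthogonality of the error to the estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

# Illustrative model: x observed through two noisy scalar measurements
x = rng.normal(loc=1.0, size=N)
y1 = x + rng.normal(size=N)
y2 = 2.0 * x + rng.normal(size=N)

def ip(u, v):
    """Inner product <u, v> = E[u v], approximated by a sample average."""
    return np.mean(u * v)

# Gram-Schmidt on {1, y1, y2} under the inner product above
basis = []
for g in (np.ones(N), y1, y2):
    for e in basis:
        g = g - ip(g, e) * e             # remove components along earlier basis vectors
    basis.append(g / np.sqrt(ip(g, g)))  # normalize

# Projection of x onto L(y_{1:2}): sum of <x, e> e over the orthonormal basis
x_hat = sum(ip(x, e) * e for e in basis)

print(np.mean(x_hat), np.mean(x))        # unbiased: E[x_hat] ~ E[x]
print(ip(x_hat, x - x_hat))              # error orthogonal to estimate: ~ 0
```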
The above argument extends to the vector case in a natural manner. To develop a formula for the best linear estimate, we present a slightly different argument below.
Let \(x \in \reals^{d_x}\) and \(y \in \reals^{d_y}\) be zero-mean random variables defined on a common probability space that have finite second moments. The zero-mean assumption is imposed for simplicity.
Let \(\ALPHABET H\) be the Hilbert space of \(\reals^{d_x}\)-valued random variables with finite second moments equipped with the inner product \[ \IP{x_1}{x_2} = \EXP[ x_1^\TRANS x_2]. \] Let \(\ALPHABET L(y)\) be the subspace of linear functions of \(y\), i.e., \[ \ALPHABET L(y) = \{ K y : K \in \reals^{d_x × d_y} \}. \] Note that since the random variables are zero mean, we do not include the subspace spanned by constant vectors. Observe that \(\ALPHABET L(y)\) is a subspace of \(\ALPHABET H\).
Now consider the problem of finding the projection \(\hat x\) of \(x\) onto \(\ALPHABET L(y)\). Since \(\hat x \in \ALPHABET L(y)\), it must be of the form \(K_\circ y\) for some \(K_\circ \in \reals^{d_x × d_y}\). We will now show how to compute \(K_\circ\).
Let \[ Σ_{xx} = \VAR(x), \quad Σ_{yy} = \VAR(y), \quad Σ_{xy} = \COV(x,y). \] By the orthogonal projection theorem (Theorem 46.1), \(\hat x\) is unique and must satisfy \[ \IP{x - \hat x}{w} = 0, \quad \forall w \in \ALPHABET L(y). \] Any \(w \in \ALPHABET L(y)\) is of the form \(K y\) for some \(K \in \reals^{d_x × d_y}\). Thus, the above equation is equivalent to \[\begin{equation}\label{eq:op-1} \EXP[ (x - K_\circ y)^\TRANS K y ] = 0, \quad \forall K \in \reals^{d_x × d_y}. \end{equation}\]
We claim that \(\eqref{eq:op-1}\) is equivalent to \[\begin{equation}\label{eq:op-2} \EXP[ (x - K_\circ y) y^\TRANS ] = \mathbf{0}_{d_x × d_y}. \end{equation}\]
Observe that \[ (x - K_\circ y)^\TRANS K y = \sum_{i=1}^{d_x} [ x - K_\circ y ]_i [ K y]_i = \sum_{i=1}^{d_x} \sum_{j=1}^{d_y} K_{ij} [ x - K_\circ y ]_i y_j. \] Thus, \[\begin{equation}\label{eq:op-3} \EXP[ (x - K_\circ y)^\TRANS K y ] = \sum_{i=1}^{d_x} \sum_{j=1}^{d_y} K_{ij} \EXP\bigl[ [ x - K_\circ y ]_i y_j \bigr]. \end{equation}\]
We now show that \(\eqref{eq:op-1}\) implies \(\eqref{eq:op-2}\). Fix a pair \((i',j')\) and set \(K_{ij} = δ_{ii'} δ_{jj'}\). Then, \(\eqref{eq:op-1}\) and \(\eqref{eq:op-3}\) imply that \[ \EXP\bigl[ [ x - K_\circ y ]_{i'} y_{j'} \bigr] = 0. \] Since \((i',j')\) was arbitrary, the above equation is true for all \((i',j')\). Writing this in matrix form gives \(\eqref{eq:op-2}\).
We now show the other direction. If \(\eqref{eq:op-2}\) is true, then each term in the sum on the right-hand side of \(\eqref{eq:op-3}\) is zero, and hence \(\eqref{eq:op-1}\) is true.
Thus, by the claim, \(\eqref{eq:op-2}\) holds, which gives \[ Σ_{xy} = \EXP[ x y^\TRANS] = K_\circ \EXP[ y y^\TRANS ] = K_\circ Σ_{yy}. \]
Summarizing the above discussion, we get the following.
Theorem 30.1 (Best Linear Estimation) Let \(x \in \reals^{d_x}\) and \(y \in \reals^{d_y}\) be zero-mean random variables defined on a common probability space that have finite second moments. Let \[ Σ_{xx} = \VAR(x), \quad Σ_{yy} = \VAR(y), \quad Σ_{xy} = \COV(x,y). \]
Then, the best linear estimator \(\hat x\) of \(x\) given \(y\) is given by \[ \hat x = \LEXP[ x \mid \ALPHABET L(y) ] = K_\circ y \] where \(K_\circ\) satisfies the so-called normal equation \[ Σ_{xy} = K_\circ Σ_{yy}. \] The corresponding error covariance is given by \[ \EXP[ (x - \hat x) (x - \hat x)^\TRANS ] = Σ_{xx} - K_\circ Σ_{yx}. \]
When \(Σ_{yy} > 0\), the solution is unique and is given by \[ K_\circ = Σ_{xy} Σ_{yy}^{-1}. \] The corresponding error covariance can be simplified to \[ \EXP[ (x - \hat x) (x - \hat x)^\TRANS ] = Σ_{xx} - Σ_{xy} Σ_{yy}^{-1} Σ_{yx}, \] which is the Schur complement of \(Σ_{yy}\) in the joint covariance matrix \[ Σ = \MATRIX{ Σ_{xx} & Σ_{xy} \\ Σ_{yx} & Σ_{yy} }. \]
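The following sketch illustrates Theorem 30.1 numerically, assuming NumPy; the linear-plus-noise model and the dimensions below are arbitrary choices used only to generate a zero-mean pair \((x, y)\) with \(Σ_{yy} > 0\). It solves the normal equation and compares the empirical error covariance with the Schur complement formula.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500_000
d_x, d_y = 2, 3

# Zero-mean x and a linear-plus-noise observation y (arbitrary illustrative model)
A = rng.normal(size=(d_y, d_x))
x = rng.normal(size=(N, d_x))
y = x @ A.T + 0.5 * rng.normal(size=(N, d_y))

# Sample (second-moment) covariances; rows are samples
Sigma_xx = x.T @ x / N
Sigma_yy = y.T @ y / N
Sigma_xy = x.T @ y / N

# Normal equation K Sigma_yy = Sigma_xy; Sigma_yy > 0, so the solution is unique
K = Sigma_xy @ np.linalg.inv(Sigma_yy)

x_hat = y @ K.T                 # best linear estimate K y
err = x - x_hat

# Empirical error covariance vs. Schur complement Sigma_xx - Sigma_xy Sigma_yy^{-1} Sigma_yx
print(err.T @ err / N)
print(Sigma_xx - Sigma_xy @ np.linalg.inv(Sigma_yy) @ Sigma_xy.T)
```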
When \(Σ_{yy}\) is singular, the normal equation \(K_\circ Σ_{yy} = Σ_{xy}\) is still consistent but has multiple solutions. Nonetheless, the best linear estimate \(K_\circ y\) is the same for every solution \(K_\circ\), so the estimate itself is unique.
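A sketch of the singular case, assuming NumPy and a deliberately duplicated observation so that \(Σ_{yy}\) is rank deficient: the pseudo-inverse gives one solution of the normal equation, adding a row vector from the null space of \(Σ_{yy}\) gives another, and both produce the same estimate \(K_\circ y\).

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500_000

x = rng.normal(size=(N, 1))
y1 = x[:, 0] + rng.normal(size=N)
y = np.column_stack([y1, y1])          # duplicated observation: Sigma_yy is singular

Sigma_yy = y.T @ y / N
Sigma_xy = x.T @ y / N                 # 1 x 2

# One solution of K Sigma_yy = Sigma_xy via the pseudo-inverse
K1 = Sigma_xy @ np.linalg.pinv(Sigma_yy)

# Another solution: add a row vector in the null space of Sigma_yy
null_dir = np.array([[1.0, -1.0]])     # (1, -1) Sigma_yy = 0 since the two columns of y coincide
K2 = K1 + 5.0 * null_dir

# Both solutions give the same estimate K y
print(np.max(np.abs(y @ K1.T - y @ K2.T)))   # ~ 0
```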
30.3 Recursive least squares filtering
To be written