ECSE 506: Stochastic Control and Decision Theory

Note: To be consistent with the notation used in linear systems, we denote the state and action by lowercase $$x$$ and $$u$$, even for stochastic systems (unlike the notation used for other models where we use uppercase $$X$$ and $$U$$ for state and actions to emphasize the fact they are random variables).

We start by considering a determinisitc linear system with state $$x_t \in \reals^n$$ and control actions $$u_t \in \reals^m$$ and dynamics $x_{t+1} = A_t x_t + B_t u_t,$ where $$A_t \in \reals^{n \times n}$$ and $$B_t \in \reals^{n \times m}$$ are known matrices. The objective is to choose the control actions to minimize the finite horizon cost given by $\sum_{t=1}^{T-1} c_t(x_t, u_t) + c_T(x_T),$ where $$c_t(x_t, u_t)$$ is the per-step cost and $$c_T(x_T)$$ is the terminal cost. Depending on the cost, such models can be classified as follows:

• Regulation problem where the objective is to keep the state of the system close to origin. These are modeled by considering that there are sequence of positive semi-definite matrices $$\{Q_t\}_{t=1}^{T}$$ and positive definite matrices $$\{R_t\}_{t=1}^{T-1}$$ where $c_t(x_t, u_t) = x_t^\TRANS Q_t x_t + u_t^\TRANS R_t u_t \quad\text{and}\quad c_T(x_T) = x_T^\TRANS Q_T x_T.$ This is often referred to as the Linear Quadratic Regulartor (LQR).

• Tracking problem where it is assumed that the system has an output $$y_t = C_t x_t$$ and the objective is to keep the output of the system close to a pre-specified trajectory $$\{y^\circ_t\}_{t=1}^T$$. These are modeled by considering the per-step cost as \begin{align*} c_t(x_t, u_t) &= (C_tx_t - y^\circ_t)^\TRANS Q_t (C_tx_t - y^\circ_t) + u_t^\TRANS R_t u_t \\ \text{and}\quad c_T(x_T) &= (C_Tx_T - y^\circ_T)^\TRANS Q_T (C_Tx_T - y^\circ_T). \end{align*}

1 Optimal solution of the regulation problem

The LQR problem is one of the few instances where the solution of the Markov decision process can be obtained in closed form.

Theorem 1

The value function at time $$t$$ is $$$\label{eq:Vt} V_t(x_t) = x_t^\TRANS S_t x_t$$$ and the optimal control action is $$$\label{eq:gt} u_t = - L_t x_t$$$ where the gain matrices $$\{L_t\}_{t\ge 1}$$ are given by: $L_t = [R_t + B_t^\TRANS S_{t+1} B_t]^{-1} \Lambda_t$ where $\Lambda_t = B_t^\TRANS S_{t+1} A_t$ and $$\{S_t\}_{t \ge 1}$$ are determined by the backward Riccati difference equations: $$S_T = Q_T$$ and for $$t \in \{T-1, \dots, 1\}$$: $$$\label{eq:riccati} S_t = A_t^\TRANS S_{t+1} A_t + Q_t - \Lambda_t^\TRANS [ R_t + B_t^\TRANS S_{t+1} B_t ] ^{-1} \Lambda_t.$$$

Side remark

Riccati equations are named after Jacopo Riccati (1670–1754) who studied the differential equations of the form $\dot x = a x^2 + b t + c t^2$ and its variations. In modern control, such equations arise in the calculus of variations and optimal filtering. The discrete-time version of these equations are also named after Riccati.

Since the LQR model is really simple, there are multiple ways in which the above result can be proved. We first present a dynamic programming based proof. The proof relies on the following completion of squares lemma.

Lemma

(Completion of squares)

For any $$x \in \reals^n$$ and $$u \in \reals^m$$ and matrices $$A$$, $$B$$, $$S$$, and $$R$$ of appropriate dimensions, we have \begin{align*} &u^\TRANS R u + (Ax + Bu)^\TRANS S (Ax + Bu) \\ &\qquad = (u + L x)^\TRANS [R + B^\TRANS S B] (u + L x) + x^\TRANS K x \end{align*} where

• $$K = A^\TRANS S A - \Lambda^\TRANS[R + B^\TRANS S B]^{-1} \Lambda_t$$
• $$L = [R + B^\TRANS S B]^{-1} \Lambda$$
• $$\Lambda = B^\TRANS S A$$

Proof

The proof follows immediately by completing the square on the left hand side. In particular

\begin{align*} & u^\TRANS R u + (Ax+Bu)^\TRANS S (Ax + Bu) \\ & \quad = u^\TRANS (R + B^\TRANS S B) u + 2 u^\TRANS B^\TRANS S A x + x^\TRANS A^\TRANS S A x \\ & \quad = u^\TRANS [R + B^\TRANS S B] u + 2 u^\TRANS B^\TRANS S A x + x^\TRANS L^\TRANS [R + B^\TRANS S B] L x + x^\TRANS L x \\ & \quad = (u + L x)^\TRANS [R + B^\TRANS S B] (u + Kx) + x^\TRANS L x. \end{align*}

Proof of dynamic programming decomposition

The proof of the dynamic program now follows from backward induction.

For $$t=T$$, we have $V_T(x) = c_T(x) = x^\TRANS Q_T x$ which forms the basis of induction. Assume that \eqref{eq:Vt} holds for $$t+1$$ and consider the value function at time $$t$$. \begin{align*} V_{t}(x) &= \min_{u \in \reals^m} \Big\{ x^\TRANS Q_t x + u^\TRANS R_t u + V_{t+1}(Ax + Bu) \Big\} \\ &\stackrel{(a)}= \min_{u \in \reals^m} \Big\{ x^\TRANS Q_t x + u^\TRANS R_t u + (Ax + Bu)^\TRANS S_{t+1} (Ax + Bu) \Big\} \\ &\stackrel{(b)}= \min_{u \in \reals^m} \Big\{ x^\TRANS Q_t x + (u + L_t x)^\TRANS [B_t^\TRANS S_{t+1} B_t + R_t] (u+ L_tx) + x^\TRANS L_t x \Big\} \\ &\stackrel{(c)}=x^\TRANS (Q_t + L_t) x \end{align*} where $$(a)$$ follows from the induction hypothesis, $$(b)$$ follows from completion of squares lemma with $$L_t$$ defined similar to the lemma, and $$(c)$$ follows from minimizing over $$u$$, where the minima is achieved by choosing $$u = -L_t x$$. This proves the induction step.

Now consider the (simplified form of) LQR tracking problem where we want to ensure that the state signal $$\{x_t\}_{t \ge 1}$$ is close to a reference trajectory $$\{x^\circ_t\}_{t \ge 1}$$. The per step cost function is $c_t(x_t, u_t) = (x_t - x^\circ_t)^\TRANS Q_t (x_t - x^\circ_t) + u_t^\TRANS R_t u_t$ and the terminal cost is $c_T(x_T) = (x_T - x^\circ_T)^\TRANS Q_T (x_T - x^\circ_T).$

Theorem 2

The value function at time $$t$$ is $V_t(x) = x^\TRANS S_t x - 2 x^\TRANS r_t + ρ_t$ and the optimal control action is $u_t = - L_t x_t + L^\circ_t r_{t+1}$ where $$\{S_t\}_{t=1}^T$$ and $$\{L_t\}_{t=1}^T$$ follow the same recursions as before. The gain matrices $$\{L^\circ_t\}$$ are given by $L^\circ_t = [R_t + B_t^\TRANS S_{t+1} B_t]^{-1} B_t^\TRANS$ and the correction terms $$\{r_t\}$$ are given by \begin{align*} r_T &= Q_T x^\circ_T \\ r_t &= [A_t - B_t L_t]^\TRANS r_{t+1} + Q_t x^\circ_t \end{align*} and the tracking error $$\{ρ_t\}_{t=1}^T$$ is given by \begin{align*} ρ_T &= (x^\circ_T)^\TRANS Q_T x^\circ_T, \\ ρ_t &= (x^\circ_t)^\TRANS Q_t x^\circ_t - 2 r^\TRANS_{t+1} B_t [R_t + B_t^\TRANS S_{t+1} B_t]^{-1} B_t^\TRANS r_{t+1} + ρ_{t+1}. \end{align*}

The proof follows from backward induction and basic algebra and is left as an exercise.

Now consider a system with stochastic dynamics $x_{t+1} = A_t x_t + B_t u_t + w_t$ where $$\{x_1,w_1,\dots,w_T\}$$ are independent random variables with zero mean and finite variance given by $$\EXP[w_t^\TRANS w_t] = \Sigma^w_t$$.

Theorem 3

The value function at time $$t$$ is $$$\label{eq:Vt-s} V_t(x_t) = x_t^\TRANS S_t x_t + α_t$$$ and the optimal control action is $$$\label{eq:gt-s} u_t = - L_t x_t$$$ where the matrices $$\{L_t\}_{t\ge 1}$$ and $$\{S_t\}_{t \ge 1}$$ follow the same recursion as before and the sequence $$\{α_t\}_{t=1}^T$$ is computed in a backward manner as follows: $$α_T = 0$$ and $α_t = α_{t+1} + \TR(Σ^w_t S_{t+1}) = \sum_{τ=t+1}^{T-1} \TR(Σ^w_t S_{t+1}).$

The proof is similar to the deterministic case and relies on the following observation.

Lemma

For any deterministic value of $$x \in \reals^n$$ and a random zero mean variable $$w \in \reals^n$$ with finite variance given by $$\EXP[w^\TRANS w] = Σ^w$$, we have $\EXP[ (x + w)^\TRANS Q (x+w) ] = x^\TRANS Q x + \TR(Q Σ^w).$

Proof

Note that

\begin{align*} \EXP[ (x+w)^\TRANS Q (x+w) ] &= \EXP[ x^\TRANS Q x + 2 x^\TRANS Q w + w^\TRANS Q w ] \\ &= x^\TRANS Q x + \TR(Σ^w Q). \end{align*} where the second term is zero because $$w$$ is a zero mean random variable. For the third term, we are using \begin{align*} \EXP[ w^\TRANS Q w] &= \EXP\bigg[ \sum_{i=1}^n \sum_{j=1}^n w_i Q_{ij} w_j \bigg] \\ &= \sum_{i=1}^n \sum_{j=1}^n \EXP[w_i w_j] Q_{ij} \\ &= \sum_{i=1}^n \sum_{j=1}^n Σ^w_{ij} Q_{ij} \\ &= \TR(Σ^w Q) \end{align*}

Proof of stochastic LQR

Using the above lemma, we can prove the result for stochastic LQR using backward induction. The details are left as an exercise.

3.1 A remark on “white” noise

An implication of Theorem 3 is that the presence of “white” stochastic disturbance in the system dynamics does not change the optimal control rule (in closed-loop form) and increases the cost only by a term independent of the state or the policy.

Suppose the noise was not white (but still independent of the initial state $$x_1$$). Then, under assumptions on the linear/Gausian nature of the observations, the optimal control at time $$t$$ for the system will be the same as that for the deterministic system $x_{τ + 1} = A x_τ + B u_τ + w_{τ|t}, \qquad τ \ge t,$ where $$w_{τ|t}$$ is an appropriate estimate of $$w_τ$$ based on the information availalbe at time $$t$$. That is, at time $$t$$ one replaces future stochastic noise $$w_τ$$ ($$τ \ge t$$) by an ‘equivalent’ deterministic noise $$w_{τ|t}$$ and then applies the method of deterministic LQR to deduce the optimal feedback control in terms of the predicted noise. This is an instance of a general result known as the certainty equivalence principle, which also extends to the case when the state is not perfectly observed.

4 An alternative proof for stochastic LQR without using dynamic programming

We now present a proof of the stochastic LQR that does not use dynamic programming. This is based on a simple idea of transforming the total cost using the solution of the Riccati equation.

Prop. 1

For any control strategy, the total cost $\EXP\bigg[\sum_{t=1}^{T-1} \big[ x_t^\TRANS Q_t x_t + u_t^\TRANS R_t u_t \big] + x_T^\TRANS Q_T x_T \bigg]$ may be written as \begin{align} & \EXP\bigg[ \sum_{t=1}^{T-1} (u_t + L_t x_t)^\TRANS [R_t + B_t^\TRANS S_{t+1}B_t] (u_t + L_t x_t) \bigg] \nonumber \\ & \quad + \EXP\bigg[ x_1^\TRANS S_1 x_t + \sum_{t=1}^{T-1} w_t S_{t+1} w_t \bigg], \label{eq:astrom} \end{align} where the matrices $$\{L_t\}_{t\ge 1}$$ and $$\{S_t\}_{t \ge 1}$$ follow are given as in Theorem 1.

Proof

This result can be obtained by repeatedly applying the completion of squares lemma. In particular, note that $$S_T = Q_T$$ and using the telescopic sum, we can write $$$\label{eq:astrom-1} x_T^\TRANS Q_T x_T = x_1^\TRANS S_1 x_1 + \sum_{t=1}^{T-1} \big[ x_{t+1}^\TRANS S_{t+1} x_{t+1} - x_t^\TRANS S_t x_t \big].$$$

Furthermore, since $$w_t$$ is independet of $$x_t$$, we have \begin{align} \EXP[ x_{t+1}^\TRANS S_{t+1} x_{t+1}] &= \EXP[ (A_t x_t + B_t u_t + w_t)^\TRANS S_{t+1} (A_t x_t + B_t u_t + w_t) \nonumber \\ &= \EXP[ (A_t x_t + B_t u_t)^\TRANS S_{t+1} (A_t x_t + B_t u_t) ] + \EXP[ w_t^\TRANS S_{t+1} w_t ]. \label{eq:astrom-2} \end{align}

Using \eqref{eq:astrom-1} and \eqref{eq:astrom-2}, we get that the total cost can be written as \begin{align} & \EXP\bigg[ \sum_{t=1}^{T-1} \big[ x_t^\TRANS Q_t x_t + u_t^\TRANS R_t u_t + x_{t+1}^\TRANS S_{t+1} x_{t+1} - x_t^\TRANS S_t x_t \big] \nonumber \\ & \quad + \EXP\bigg[ x_1^\TRANS S_1 x_1 + \sum_{t=1}^{T-1} w_t S_{t+1} w_t \bigg]. \label{eq:astrom-3} \end{align}

Now, from the completion of squares lemma, we get that “Term 1” is equal to $(u_t + L_t x_t)^\TRANS [R_t + B_t^\TRANS S_{t+1} B_t] (u_t + L_t x_t).$ Substituting this back in \eqref{eq:astrom-3}, we get \eqref{eq:astrom}.  $$\Box$$

Now we can prove Theorem 3 using Prop. 1. Note that the second term in \eqref{eq:astrom} does not depend on the choice of control actions $$u_t$$. Thus, in order to minimimze the total expected cost, it suffices to minimize the first term of \eqref{eq:astrom}. Since $$R_t$$ is positive definite and $$S_{t+1}$$ is positive semi-definite, $$R_t + B_t^\TRANS S_{t+1} B_t$$ is positive definite. Thus, the first term of \eqref{eq:astrom} is sum of squares. Choosing $u_t = -L_t x_t$ sets this term to its minimum value of $$0$$. Hence, $$u_t = -L_t x_t$$ is the optimal control strategy.

5 Alternative forms of the Riccati equation

For a fixed $$A$$, $$B$$, $$Q$$, and $$R$$ matrices, define the operator $$\mathcal{R}$$ as $\mathcal{R}S = A^\TRANS S A + Q - A^\TRANS S B[ R + B^\TRANS S B]^{-1} B^\TRANS S A.$

With this notation, we can define the recursive computation of $$\{S_t\}_{t=1}^T$$ as follows: $$S_T = Q_T$$ and $$S_t = \mathcal R_t S_{t+1}$$. In this section, we derive alternative characterizations of the operator $$\mathcal{R}$$.

Lemma 1

The following form are equivalent to the Riccati operator:

1. $$A^\TRANS S(I + B R^{-1}B^\TRANS S)^{-1} A + Q$$.
2. $$A^\TRANS (S^{-1} + B R^{-1}B)^{-1} A + Q$$.

Proof

The proof of the first part relies on the simplified form of the Sherman-Morrison-Woodbudy formula:

$(I + UV)^{-1} = I - U(I + VU)^{-1} V.$

Taking $$U = BR^{-1}$$ and $$V = B^\TRANS S$$, we get

\begin{align} (I + BR^{-1}B^\TRANS S)^{-1} &= I - B R^{-1}(I + B^\TRANS S B R^{-1})^{-1} B^\TRANS S \notag \\ &= I - B(R + B^\TRANS S B)^{-1} B^\TRANS S. \label{eq:SMW} \end{align}

Now multiplying both sides by $$A^\TRANS(\cdots)A$$ and adding $$Q$$, we get the first formula. Multiplying $$S$$ inside the inverse gives us the second formula. $$\Box$$

References

See Athans (1971) for a general discussion of philosophical approach of approximating general stochastic control problem as linear quadratic models.

The term certainty equivalence is due to Simon (1956), who was looking at a static problem; a similar result had earlier been shown by Theil (1954). A result which is essentially equivalent to the stochastic LQR problem is proved by Theil (1957). The model for deterministic LQR is due to Kalman (1960), who proved the result for continuous time systems.

The alternative proof that does not use dynamic programming is due to Aström (1970).

Aström, K.J. 1970. Introduction to stochastic control theory. Dover.
Athans, M. 1971. The role and use of the stochastic linear-quadratic-gaussian problem in control system design. IEEE Transactions on Automatic Control 16, 6, 529–552. DOI: 10.1109/tac.1971.1099818.
Kalman, R.E. 1960. Contributions to the theory of optimal control. Boletin de la Sociedad Matematica Mexicana 5, 102–119.
Simon, H.A. 1956. Dynamic programming under uncertainty with a quadratic criterion function. Econometrica 24, 1, 74–81. DOI: 10.2307/1905261.
Theil, H. 1954. Econometric models and welfare maximization. Wirtschaftliches Archiv 72, 60–83. DOI: 10.1007/978-94-011-2410-2_1.
Theil, H. 1957. A note on certainty equivalence in dynamic planning. Econometrica, 346–349. DOI: 10.1007/978-94-011-2410-2_3.

This entry was last updated on 13 Dec 2020 and posted in MDP and tagged linear systems, riccati equation, lqr, optimal tracking.