28 Linear quadratic regulation
LQR, Riccati equation, completion of squares
To be consistent with the notation used in linear systems, we denote the state and action by lowercase \(x\) and \(u\), even for stochastic systems (unlike the notation used for other models where we used uppercase \(S\) and \(U\) for state and actions to emphasize the fact that they are random variables.
Moreover, we follow the control notation of denoting the policy by \(g\) rather than \(π\).
Consider a linear stochastic system with state \(x_t \in \reals^{d_x}\) and controls \(u_t \in \reals^{d_u}\) where the dynamics are given by \[\begin{equation}\label{eq:dynamics} x_{t+1} = A x_t + B u_t + w_t, \quad t \ge 1 \end{equation}\] where \(A \in \reals^{d_x × d_x}\) and \(B \in \reals^{d_x × d_u}\) are known matrices and \(\{w_t\}_{t \ge 1}\), \(w_t \in \reals^{d_x}\) is process noise. The objective is to choose the control policy \(g = (g_1, \dots, g_{T-1})\), where \(u_t = g_t(x_{1:t}, u_{1:t-1})\), to minimize \[ J(g) = \EXP\biggl[ \sum_{t=1}^{T-1} c(x_t, u_t) + c_T(x_T) \biggr] \] where \(c(x_t, u_t)\) is the per-step cost and \(c_T(x_T)\) is the terminal cost. Depending on the cost functions, such models can be classified as follows:
Regulation problem where the objective is to keep the state of the system close to origin. These are modeled by taking \[ c(x_t, u_t) = x_t^\TRANS Q x_t + u_t^\TRANS R u_t \quad\text{and}\quad c_T(x_T) = x_T^\TRANS Q x_T, \] where \(Q, Q_T\) are positive semi-definite matrices and \(R\) is a positive definite matrix. The regulation problem is often referred to as Linear Quadratic Regulator (LQR).
Tracking problem where it is assumed that the system has an output \(y_t = C x_t\) and the objective is to keep the output of the system close to a pre-specified trajectory \(\{r_t\}_{t=1}^T\). These are modeled by considering the per-step cost as \[ \begin{align*} c_t(x_t, u_t) &= (Cx_t - r_t)^\TRANS Q (Cx_t - r_t) + u_t^\TRANS R u_t \\ \text{and}\quad c_T(x_T) &= (C_Tx_T - r_T)^\TRANS Q_T (C_Tx_T - r_T). \end{align*} \]
There are various methods to obtain the optimal controller for LQR: dynamic programming, Lagrange multipliers (co-state), and others. Departing from the dynamic programming approach taken in the rest of these notes, in this section we present the solution of LQR using an elementary algebraic technique called completion of squares.
In the rest of these notes, we consider a slight generalization of LQ regulation where we consider the cost function \[\begin{equation}\label{eq:cost} J(g) = \EXP\left[ \sum_{t=1}^{T-1} \MATRIX{x_t \\ u_t}^\TRANS \MATRIX{Q & S \\ S^\TRANS & R } \MATRIX{ x_t \\ u_t} + x_{T} P_T x_{T} \right] \end{equation}\] where the matrix \(\MATRIX{Q & S \\ S^\TRANS & R}\) is symmetric and positive semi-definite and \(R\) is symmetric and positive definite.
28.1 Solution to linear quadratic regulation
Let \(x \in \reals^{d_x}\) be a random variable with mean \(μ\) and covariance \(Σ\). Then, \[ \EXP[ x^\TRANS P x ] = μ^\TRANS P μ + \TR(P Σ) \]
We start from a simple observation.
Lemma 28.1 (Completion of squares) Let \(x \in \reals^{d_x}\), \(u \in \reals^{d_u}\), and \(w \in \reals^{d_x}\) be random variables defined on a common probability space. Suppose \(w\) is zero mean with finite covariance and independent of \((x,u)\). Let \(x_{+} = Ax + Bu + w\), where \(A\) and \(B\) are matrices of appropriate dimensions. Then, given matrices \(P\), \(Q\), \(S\), and \(R\) of appropriate dimensions, \[ \EXP\left[ \MATRIX{x \\ u}^\TRANS \MATRIX{Q & S \\ S^\TRANS & R } \MATRIX{ x \\ u} + x_{+} P x_{+} \right] = \EXP\bigl[ x^\TRANS P_{+} x + (u + Kx)^\TRANS Δ (u + Kx) + w^\TRANS P w \bigr]. \] where
- \(Δ = R + B^\TRANS P B\)
- \(K = Δ^{-1}[ S^\TRANS + B^\TRANS P A ]\)
- \(P_{+} = Q + A^\TRANS P A - K^\TRANS Δ K\)
Since \(w\) is zero mean and independent of \((x,u)\), we have \[ \EXP[ x_{+}^\TRANS P x ] = \EXP\bigl[ (Ax + Bu)^\TRANS P (Ax + Bu) + w^\TRANS P w \bigr]. \]
The proof follows immediately by completing the square on the left hand side. In particular \[\begin{align*} & u^\TRANS R u + 2 x^\TRANS S u + (Ax+Bu)^\TRANS P (Ax + Bu) \\ & \quad = u^\TRANS (R + B^\TRANS P B) u + 2 u^\TRANS( S^\TRANS + B^\TRANS P A) x + x^\TRANS A^\TRANS P A x \\ & \quad = u^\TRANS Δ u + 2 u^\TRANS Δ K x + x^\TRANS A^\TRANS P A x \\ & \quad = (u + K x)^\TRANS Δ (u + Kx) - x^\TRANS K^\TRANS Δ K x + x^\TRANS A^\TRANS P A x \\ \end{align*}\]
Definition 28.1 Given the system model \((A,B)\) and per-step cost \((Q,S,R)\), we define the Riccati operator \(\RICCATI \colon \mathbb{S}^{d_x × d_x}_{\ge 0} \to \mathbb{S}^{d_x × d_x}_{\ge 0}\) as follows: \[ \RICCATI P = Q + A^\TRANS P A - (S^\TRANS + B^\TRANS P A)^\TRANS (R + B^\TRANS P B)^{-1} (S^\TRANS + B^\TRANS P A).\] Moreover, define the Gain operator \(\GAIN \colon \mathbb{S}^{d_x × d_x}_{\ge 0} \to \reals^{d_u × d_x}\) as \[ \GAIN P = - (R + B^\TRANS P B)^{-1}(S^\TRANS + B^\TRANS P A). \]
Note that with the above notation, the terms defined in Lemma 28.1 may be written as \[ P_{+} = \RICCATI P \quad\text{and}\quad K = \GAIN P. \]
Riccati equations are named after Jacopo Riccati (1670–1754) who studied the differential equations of the form \[\dot x = a x^2 + b t + c t^2\] and its variations. In continuous time, such equations arise in optimal control and filtering. The discrete-time version of these equations are also named after Riccati.
I am calling the updates in Lemma 28.1 as Riccati operators because they are similar to Bellman operators considered earlier.
For the special case when \(S = 0\) (i.e., no cross terms in the cost), the Riccati operator is given by: \[ \RICCATI P = Q + A^\TRANS P A - A^\TRANS P B (R + B^\TRANS P B)^{-1} B^\TRANS P A \] The following are equivalent to the Riccati operator:
- \(A^\TRANS P(I + B R^{-1} B^\TRANS P)^{-1}A + Q\).
- \(A^\TRANS(P^{-1} + B R^{-1} B^\TRANS)^{-1} A + Q\).
Here the first equality follows from the simplified Sherman-Morrison-Woodbudy formula for \((I + (B R^{-1})(B^\TRANS P))^{-1}\) and the second follows simple algebra.
Proposition 28.1 Recursively define the matrices \(\{P_t\}_{t \ge 1}\) in a backwards manners as follows: \(P_T = Q_T\) and then for \(t \in \{T-1, \dots, 1\}\): \[\begin{equation}\label{eq:riccati} P_t = \RICCATI P_{t+1}. \end{equation}\] Moreover, define the gains \(\{K_t\}_{t \ge 1}\) as: \[\begin{equation}\label{eq:gain} K_t = \GAIN P_t. \end{equation}\]
Then, for any control policy \(g\), the total cost \(J(g)\) given by \(\eqref{eq:cost}\) may be written as \[\begin{equation} J(g) = \bar J(g) + \tilde J \end{equation}\] where the controlled part of the cost is \[ \bar J(g) = \EXP\biggl[ \sum_{t=1}^{T-1} (u_t + K_t x_t)^\TRANS Δ_t (u_t + K_t x_t) \biggr] \] with \(Δ_t = R + B^\TRANS P_{t+1} B\) and the control-free part of the cost is \[ \tilde J = \EXP\biggl[ x_1^\TRANS P_1 x_1 + \sum_{t=1}^{T-1} w_t^\TRANS P_{t+1} w_t \biggr] \]
We will use the notation: \[ K_{1:T-1} = \LQR_T(A,B,Q,R;Q_T) \quad\hbox{or}\quad (K_{1:T-1}, P_{1:T}) = \LQR_T(A,B,Q,R;Q_T) \] to denote the LQR gains \(K_{1:T-1}\) and \(P_{1:T}\) computed via \(\eqref{eq:riccati}\) and \(\eqref{eq:gain}\).
The result follows by repeatedly applying Lemma 28.1 starting at time \(t = T-1\) and moving backwards.
Now, an immediate implication of Proposition 28.1 is the following:
Theorem 28.1 The optimal control policy for the LQR problem \(\eqref{eq:cost}\) is given by \[ u_t = - K_t x_t \] where the feedback gains \(\{K_t\}_{t \ge 1}\) are computed as described in Proposition 28.1. The performance of the optimal strategy is given by: \[ J^* = \tilde J = \TR(X_1 P_1) + \sum_{t=1}^{T-1} \TR(W_t P_{t+1}) \] where \(X_1\) is the covariance of the initial state and \(\{W_t\}_{t \ge 1}\) is the covariance of the noise process.
First observe that \(\tilde J\) does not depend on the control policy. So, minimizing \(\bar J(g)\) is the same as minimizing \(J(g)\).
Now recall that \(\{R_t\}_{t \ge 1}\) are positive definite. Therefore, \(Δ_t = [R_t + B^\TRANS P B]\) are positive definite. Hence, for any policy \(g\), \(\bar J(g) \ge 0\). The proposed policy achieves \(\bar J(g) = 0\) and is, therefore, optimal.
28.2 Salient features of the result
We derived the results for time-invariant dynamics and cost, but the argument trivially generalizes to time-varying dynamics and cost as well.
The optimal gains \(\{K_t\}_{t \ge 1}\) do not depend on the distribution of the noise \(\{W_t\}_{t \ge 1}\). Thus, the noise in the dynamics does not change the closed loop control policy but changes the optimal cost by a term that depends on the noise covariance (but does not depend on the policy).
A special case of the above observation is that the optimal control policy of the stochastic LQR problem is the same as the optimal control policy of the deterministic LQR problem (where the noise \(w_t ≡ 0\)). This result is sometimes called the certainty equivalence principle.
Suppose the noise was not white but Gaussian and correlated over time (and still independent of the initial state \(x_1\)). Then, the optimal control action at time \(t\) will be the same as that of the deterministic system \[ x_{τ + 1} = A x_{τ} + B u_{τ} + w_{τ|t}, \quad τ \ge t, \] where \(w_{τ|t}\) is \(\EXP[ w_{\tau} \mid w_{1:t} ]\). That is, at time \(t\), one replaces future stochastic noise \(w_τ\) (\(τ \ge t\)) by an ‘equivalent’ deterministic noise \(w_{τ|t}\) and then applies the method of deterministic LQR to deduce the optimal feedback control in terms of the predicted noise. This is also a special instance of the general certainty equivalence principle, which also extends to the case when the state is not perfectly observed.
The assumption that \(R\) is positive definite is not necessary. In order to get a unique control law, it is sufficient that \(R + B^\TRANS P_{t} B\) be positive definite for all \(t\), which is possible even when \(R\) is not positive definite.
A particular instance where \(R\) is not positive definite is when \(R = 0\)! This is called minimum variance control.
28.3 LQ tracking of a reference trajectory
Now we consider the tracking problem. We are given a pre-specified trajectory \(\{r_t\}_{t=1}^T\) and the objective is to minimize a per-step cost given by \[ \begin{align*} c_t(x_t, u_t) &= (C x_t - r_t)^\TRANS Q (x_t - r_t) + u_t^\TRANS R u_t \\ \text{and}\quad c_T(x_T) &= (x_T - r_T)^\TRANS Q_T (x_T - r_T). \end{align*} \]
The simplest way to derive a solution to the tracking problem is to convert it to a regulation problem by using state augmentation. In particular, define \(z_t = \VEC(x_t, 1)\). Then per-step cost is quadratic in \(z_t\): \[ c(z_t, u_t) = z^\TRANS \bar Q z + u^\TRANS R u , \quad\text{where } \bar Q = \MATRIX{C & -r_t}^\TRANS Q \MATRIX{I & -r_t}. \] Moreover, \(z_t\) has linear dynamics of the form: \[ z_{t+1} = \bar A z_t + \bar B u_t + \bar B w_t , \quad\hbox{where } \bar A = \MATRIX{A & 0 \\ 0 & 1} \hbox{ and } \bar B = \MATRIX{B \\ 0}. \] Thus, we can compute the optimal gains using the solution of the regulation problem (with time-varying cost): \[ \bar K_{1:T-1} = \LQR_T(\bar A, \bar B, \bar Q_{1:T}, R). \] It can be shown (Lewis et al. 2012) that the optimal control can be written as \[ u_t = \bar K_t z_t = - K_t x_t + K ^∘_t v_{t+1} \] where \[\begin{align*} (K_{1:T-1}, P_{1:T}) &= \LQR_T(A,B,C^\TRANS Q C,R; C^\TRANS Q_T C), \\ K ^∘_t &= (R + B^\TRANS P_{t+1} B)^{-1} B^\TRANS, \end{align*}\] and the offset process \(v_{1:T}\) is computed backwards by solving \[ v_t = (A - BK_t)^\TRANS v_{t+1} + C^\TRANS Q r_t, \quad v_T = C^\TRANS Q r_T. \]
28.4 An example: second order integrator
As an example, consider a discretized model of a second-order integrator, which models the dynamics of a point-mass in one-dimensional space under time-varying force. The continuous-time dynamics of a second-order integrator are given by \[ \dot x(t) = \MATRIX{0 & 1 \\ 0 & 0} x(t) + \MATRIX{0\\ \frac 1m}u(t) \] where \(x(t) \in \reals^2\) with \(x_1(t)\) indicating position and \(x_2(t)\) indicating velocity, \(u(t) \in \reals\) denotes force, and \(d_u\) is a parameter denoting mass. We discretize the dynamics using zero-order hold (Wittenmark et al. 2002) with a sampling time of \(Δt\). First observe that the matrix \(A_c = \left[ \begin{smallmatrix}0 & 1 \\ 0 & 0\end{smallmatrix}\right]\) is nilpotent because \(A_c^2 = 0\). Therefore, the matrix exponential simplifies to \[ e^{A_ct} = I + A_c t = \MATRIX{1 & t \\ 0 & 1}. \] Therefore, the discretized model is \[\begin{align*} A &= e^{A_c Δt} = \MATRIX{1 & Δt \\ 0 & 1}, & B &= \int_{0}^{Δt} e^{A_ct} B_c dt = \MATRIX{\tfrac12 Δt^2 \\ \frac{Δt}{d_u}} \end{align*}\]
We further assume that there is a disturbance with variance \(σ^2\) at the actuation, so the discretized model is \[ x_{t+1} = A x_t + B (u_t + w_t) \] where \(w_t \sim {\cal N}(0, σ^2)\)1.
1 This model may be viewed as a model of the form \(\eqref{eq:dynamics}\) by considering the noise covariance in \(\eqref{eq:dynamics}\) to be \(σ^2 B B^\TRANS\).
Suppose \[ Q = Q_T = \MATRIX{1 & 0 \\ 0 & 0} \quad\text{and}\quad R = ρ \]
Regulation
We first solve the regulation problem. Assume \(d_u = 1\,\text{kg}\), \(Δt = 0.1\,\text{s}\) and noise covariance \(σ^2 = 0.05\). The output (i.e., position) and the input (i.e., force) for a \(T = 10\,\text{s}\) simulation of the system are shown in Figure 28.1. Note that, as expected, as the cost of applying control increases, less force is applied and it takes longer for the position to become close to zero.
Tracking
We now solve the tracking problem. Conisder the same parameters as before. We consider tracking a “square-wave” signal. The output (i.e., position) and the input (i.e., force) for a \(T = 60\,\text{s}\) simulation of the system are shown in Figure 28.2.
Exercises
Exercise 28.1 What is the optimal solution to the LQR problem when \(Q = 0\)?
Exercise 28.2 What is the optimal solution to the LQR problem when \(Q \succ 0\) and \(R = 0\)?
Exercise 28.3 (Noise coupled subsystems) Consider a system with two subsystems: subsytem 1 with state \(x^1_t \in \reals^{d_x^1}\) and control \(u^1_t \in \reals^{d_u^1}\) and subsystem 2 with state \(x^2_t \in \reals^{d_x^2}\) and control \(u^2_t \in \reals^{d_u^2}\). The dynamics are coupled only though the noise, i.e., \[\begin{align*} x^1_{t+1} &= A^{11} x^1_t + B^1 u^1_t + w^1_t \\ x^2_{t+1} &= A^{22} x^2_t + B^2 u^2_t + w^2_t \end{align*}\] where the noise process \(\{ (w^1_t, w^2_t)\}_{t \ge 1}\) is correlated across subsystems but independent across time.
Let per-step cost is decoupled across the subsystems and of the form: \[ c(x_t, u_t) = (x^1_t)^\TRANS Q^{11} x^1_t + (u^1_t)^\TRANS R^{11} u^1_t + (x^2_t)^\TRANS Q^{22} x^2_t + (u^2_t)^\TRANS R^{22} u^2_t. \] The terminal cost has a similar structure. Show that the optimal control law is of the form: \[ u^1_t = - K^1_t x^1_t \quad\text{and}\quad u^2_t = - K^2_t x^2_t \] where the gains \(K^1_t\) and \(K^2_t\) are obtained by solving two separate Riccati equations.
Hint: There are two ways to solve this problem. An algebraic method where one can argue that the Riccati gain \(P\) is diagonal and a simpler method that uses certainty equivalence.
Notes
See Athans (1971) for a general discussion of the philosophical approach of approximating general stochastic control problems as linear quadratic models. See Dorato and Levis (1971) for a general overview of discrete-time LQR including a summary of how such models might arise and different (dated) approaches to numerically solve the Riccati equation. LQR for continuous time systems was proposed by Kalman (1960) for deterministic systems and by Wonham (1968) for stochastic systems. The proof idea of completion of squares is due to Åström (1970) but we loosely follow the proof outlines adapted from Afshari and Mahajan (2023).
Exercise 28.3 is modified from the proof idea used in Arabneydi and Mahajan (2016) and Gao and Mahajan (2022).
The term certainty equivalence is due to Simon (1956), who was looking at a static problem; a similar result had earlier been shown by Theil (1954). A result which is essentially equivalent to the stochastic LQR problem is proved by Theil (1957).