ECSE 506: Stochastic Control and Decision Theory

Aditya Mahajan
Winter 2022


Overview of adaptive control for linear systems

Adaptive control is an umbrella term often used to describe algorithms that can control an unknown system (i.e., a system with unknown dynamics). In this section, we focus on adaptive control of linear systems. The simplest setup is as follows.

Consider a standard linear quadratic regulator where the dynamics are \[ x_{t+1} = A_θ x_t + B_θ u_t + w_t \] where the matrices \(A_θ\) and \(B_θ\) are parameterized by a parameter \(θ\) which takes values in a set \(Θ\). The per-step cost is \(c(x_t, u_t) = x_t^\TRANS Q x_t + u_t^\TRANS R u_t\). For simplicity, we assume that the \(Q\) and \(R\) matrices are known but all we know about the \(A_θ\) and \(B_θ\) matrices is that \(θ\) lies in the set \(Θ\). The objective is to design a controller that minimizes the long-term average cost \[ \lim_{T \to ∞} \frac{1}{T} \EXP\Bigl[ \sum_{t=1}^T \bigl[ x_t^\TRANS Q x_t + u_t^\TRANS R u_t \bigr] \Bigr]. \] Had we known \(A_θ\) and \(B_θ\), we know that the optimal controller is given by \[ u_t = - L_θ x_t \] where \(L_θ = [R + B_θ^\TRANS S_θ B_θ]^{-1} B_θ^\TRANS S_θ A_θ\) and \(S_θ\) is the solution to the discrete-time algebraic Riccati equation: \[ S_θ = A_θ^\TRANS S_θ A_θ + Q - A_θ^\TRANS S_θ B_θ[ R + B_θ^\TRANS S_θ B_θ]^{-1} B_θ^\TRANS S_θ A_θ \] and the corresponding optimal performance is \(J_θ = \TR(S_θ Σ)\), where \(Σ\) denotes the covariance of the noise \(w_t\).
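To make the formulas concrete, here is a minimal Python/NumPy sketch that computes \(S_θ\) and \(L_θ\) for a known \((A_θ, B_θ, Q, R)\) by iterating the Riccati recursion to a fixed point. The helper name `dare_gain`, the tolerances, and the numerical values are my choices for illustration; a library routine such as `scipy.linalg.solve_discrete_are` could be used instead.

```python
import numpy as np

def dare_gain(A, B, Q, R, tol=1e-10, max_iter=10_000):
    """Iterate S <- Q + A' S A - A' S B (R + B' S B)^{-1} B' S A to a fixed
    point and return S together with the gain L = (R + B' S B)^{-1} B' S A."""
    S = Q.copy()
    for _ in range(max_iter):
        K = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
        S_next = Q + A.T @ S @ A - A.T @ S @ B @ K
        if np.max(np.abs(S_next - S)) < tol:
            S = S_next
            break
        S = S_next
    L = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
    return S, L

# Scalar example of Sec. 1.1 with the arbitrary choice (a, b) = (0.8, 1):
# Q = 1, R = 0 gives S = 1 and L = a/b.
S, L = dare_gain(np.array([[0.8]]), np.array([[1.0]]),
                 np.array([[1.0]]), np.array([[0.0]]))
print(S, L)   # [[1.]] [[0.8]]
```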

But what do we do if the system model is unknown? In this section, we overview the basic ideas for adaptive control (or reinforcement learning).

1 Certainty equivalence

The simplest idea in adaptive control is certainty equivalence: at each time, generate an estimate \(\hat θ_t\) of the unknown parameter \(θ\) based on the observed data and use the controller \[ u_t = -L_{\hat θ_t} x_t. \]

The idea is that if \(\hat θ_t\) converges to \(θ\) (or, more weakly, \(L_{\hat θ_t}\) converges to \(L_θ\)), then the performance of the certainty equivalence controller will converge to the optimal performance.

1.1 A simple example

We present a simple example to illustrate that it is possible to control a linear system with unknown dynamics. Consider a scalar linear system \[ x_{t+1} = a x_t + b u_t + w_t \] where \(x_t, u_t, w_t \in \reals\) and \(\{w_t\}_{t \ge 1}\) is an i.i.d. process with zero mean and variance \(σ^2\). Suppose the per-step cost is \(c(x_t, u_t) = x_t^2\) and our objective is to minimize the long-term average cost \[ \lim_{T \to ∞} \frac{1}{T} \EXP\Bigl[ \sum_{t=1}^T x_t^2 \Bigr]. \] It is easy to verify that the solution of the discrete-time algebraic Riccati equation is \(S = 1\) (substituting \(Q = 1\) and \(R = 0\) gives \(S = 1 + a^2 S - (abS)^2/(b^2 S) = 1\)). Therefore, the optimal control action is \[ u_t = - \frac{a}{b} x_t \] and the optimal average cost is \(σ^2\).

Now suppose the parameters \(θ = \VEC(a,b)\) are unknown. We assume that \(θ \in \reals^2\). Suppose we use linear least squares estimates to identify \(θ\). Let \(z_t\) denote \(\VEC(x_t, u_t)\). Note that we can write \[ x_{t+1} = z_t^\TRANS θ + w_t. \]

Combining observations until time \(t\), we have \[ \MATRIX{x_2 \\ \vdots \\ x_{t}} = \MATRIX{z_1^\TRANS \\ \vdots \\ z_{t-1}^\TRANS } θ + \MATRIX{w_1 \\ \vdots \\ w_{t-1}} \] which may be written in vector form as \[ X_t = Z_t θ + W_t, \] where \[ X_t = \MATRIX{x_2 \\ \vdots \\ x_{t}}, \quad Z_t = \MATRIX{z_1^\TRANS \\ \vdots \\ z_{t-1}^\TRANS }, \quad W_t = \MATRIX{w_1 \\ \vdots \\ w_{t-1}}. \]

Thus, the linear least squares estimator of \(θ\) is given by \[ \hat θ_t = (Z_t^\TRANS Z_t)^{-1} Z_t^\TRANS X_t = \bigl[ \sum_{τ = 1}^{t-1} z_τ z_τ^\TRANS \bigr]^{-1} \sum_{τ=1}^{t-1} z_τ x_{τ+1}. \]

The linear least squares estimate can be written in recursive form as: \(\hat θ_0 = \VEC(0, 1)\), \(Σ_0 = I\), and for \(t \ge 0\): \[\begin{align*} \hat θ_{t+1} &= \hat θ_t + \frac{ Σ_t z_t ( x_{t+1} - z_t^\TRANS \hat θ_t) }{σ^2 + z_t^\TRANS Σ_t z_t }, \\ Σ_{t+1} &= Σ_t - \frac{ Σ_t z_t z_t^\TRANS Σ_t }{ σ^2 + z_t^\TRANS Σ_t z_t }. \end{align*}\]

Note that it can also be shown that the covariance \(Σ_t\) satisfies the recursion \[ Σ_{t+1}^{-1} = Σ_t^{-1} + \frac{1}{σ^2} z_t z_t^\TRANS. \]
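As a sanity check, here is a short Python/NumPy sketch (mine, not part of the notes) that simulates the scalar system with an exploratory input and compares the recursive updates above with the batch least-squares formula. With \(\hat θ_0 = \VEC(0,1)\) and \(Σ_0 = I\) the recursion computes a regularized estimate, so the two agree only up to the prior term, which washes out as \(t\) grows.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.8, 1.0])      # true (a, b); arbitrary illustrative values
sigma = 1.0

# Simulate with an exploratory input u_t ~ N(0, 1) so that z_t = (x_t, u_t)
# spans both directions (no such excitation is guaranteed in closed loop).
x, Z, X = 0.0, [], []
for t in range(500):
    u = rng.normal()
    z = np.array([x, u])
    x = z @ theta + sigma * rng.normal()
    Z.append(z)
    X.append(x)
Z, X = np.array(Z), np.array(X)

# Batch estimate: (Z' Z)^{-1} Z' X.
theta_batch = np.linalg.lstsq(Z, X, rcond=None)[0]

# Recursive estimate with the stated initialization.
theta_hat, Sigma = np.array([0.0, 1.0]), np.eye(2)
for z, x_next in zip(Z, X):
    denom = sigma**2 + z @ Sigma @ z
    theta_hat = theta_hat + Sigma @ z * (x_next - z @ theta_hat) / denom
    Sigma = Sigma - np.outer(Sigma @ z, Sigma @ z) / denom

print(theta_batch, theta_hat)     # nearly identical for moderately large t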

Now, let \(\hat θ_t = (\hat a_t, \hat b_t)\). The certainty equivalent control law is given by \[ u_t = - \frac{\hat a_t}{\hat b_t} x_t. \]

It can be shown that under some conditions on the parameters of the model, the sequence \(\{ \hat a_t/ \hat b_t\}_{t \ge 1}\) converges to \(a/b\), so asymptotically we will use the right control law, even though \(\hat θ_t\) does not converge to \(θ\).

Let’s run a simple simulation to test this: fix values of \(a\), \(b\), and \(σ\), run the system in closed loop, and look at a single sample path.
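Since the interactive demo from the original page cannot be reproduced here, the following Python/NumPy sketch shows one way to simulate such a sample path (the parameter values, the seed, and the divide-by-zero guard are my choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, sigma, T = 0.5, 1.0, 1.0, 5_000    # arbitrary illustrative values

theta_hat, Sigma = np.array([0.0, 1.0]), np.eye(2)
x, total_cost = 0.0, 0.0
for t in range(T):
    a_hat, b_hat = theta_hat
    # Certainty-equivalent control, guarding against b_hat near zero.
    u = -(a_hat / b_hat) * x if abs(b_hat) > 1e-6 else 0.0
    x_next = a * x + b * u + sigma * rng.normal()
    total_cost += x_next**2
    # Recursive least-squares update from above.
    z = np.array([x, u])
    denom = sigma**2 + z @ Sigma @ z
    theta_hat = theta_hat + Sigma @ z * (x_next - z @ theta_hat) / denom
    Sigma = Sigma - np.outer(Sigma @ z, Sigma @ z) / denom
    x = x_next

# If a_hat/b_hat -> a/b, as claimed above, this approaches sigma**2 = 1.
print(total_cost / T)
```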

1.2 Certainty equivalence can fail

Now, we present a simple example to show that certainty equivalence can fail. Again, consider the simple scalar example from before \[ x_{t+1} = a x_t + b u_t + w_t \] where \(x_t, u_t, w_t \in \reals\) and \(\{w_t\}_{t \ge 1}\) is an i.i.d. process with zero mean and variance \(σ^2\). The per-step cost is \(c(x_t, u_t) = x_t^2 + 2 u_t^2\), so our objective is to minimize the long-term average cost \[ \lim_{T \to ∞} \frac{1}{T} \EXP\Bigl[ \sum_{t=1}^T \bigl[ x_t^2 + 2 u_t^2 \bigr] \Bigr]. \]

Now suppose that \(θ = (a,b)\) is unknown but we know that \(θ\) lies in the set \(Θ = \{ (0, -1), (1, 1) \}\). When \((a,b) = (0, -1)\), the Riccati equation gives \(S = 1\) and the optimal control law is \(u_t = 0\). When \((a,b) = (1,1)\), the Riccati equation gives \(S = 2\) and the optimal control law is \(u_t = -\tfrac 12 x_t\). Thus, the optimal control law is \[ u_t = \begin{cases} 0, & \text{if } (a, b) = (0, -1), \\ -\tfrac12 x_t, & \text{if } (a, b) = (1, 1). \end{cases} \]

Suppose, as before, we use a certainty equivalence control law where we estimate the parameter \(θ\) using the least-squares (or, equivalently in this case, maximum-likelihood) estimate based on the observed data up to time \(t\). The least-squares estimate is \[ \hat θ_t = \begin{cases} (0, -1), & \text{if } \sum_{τ=1}^{t-1} (x_{τ+1} + u_τ)^2 \le \sum_{τ=1}^{t-1}(x_{τ+1} - x_τ - u_τ)^2, \\ (1, 1), & \text{otherwise}. \end{cases} \]

Now suppose that at some time \(t\), the estimate is \((1,1)\), which means that \[ \sum_{τ=1}^{t-1} (x_{τ+1} + u_τ)^2 > \sum_{τ=1}^{t-1}(x_{τ+1} - x_τ - u_τ)^2, \] and, using the certainty equivalence controller, we choose \(u_t = -\tfrac12 x_t\). Since \(u_t = -\tfrac12 x_t\) implies \(x_{t+1} + u_t = x_{t+1} - x_t - u_t\), we have \[ (x_{t+1} + u_t)^2 = (x_{t+1} - x_t - u_t)^2. \] Thus, at time \(t+1\), we will have \[ \sum_{τ=1}^{t} (x_{τ+1} + u_τ)^2 > \sum_{τ=1}^{t}(x_{τ+1} - x_τ - u_τ)^2 \] and, therefore, the estimate \(\hat θ_{t+1} = (1,1)\). Repeating the argument, if at any time \(t\) the parameter is estimated to be \((1,1)\), then the estimate remains \((1,1)\) forever and the adaptive control law “sticks” at \(u_τ = -\tfrac12 x_τ\) for all \(τ \ge t\). This is clearly undesirable if the true value of the parameter is \((0, -1)\).

To see that this can indeed happen with positive probability, suppose that \((a,b) = (0, -1)\) is indeed the true system, \(w_t \sim {\cal N}(0,1)\) (i.e., \(σ = 1\)), and we start with \(x_1 = 1\) and \(u_1 = 0\). Then \(x_2 = w_1\) and \[ \PR( (x_2 + u_1)^2 > (x_2 - x_1 - u_1)^2 ) = \PR( w_1^2 > (w_1 - 1)^2 ) = \PR(w_1 > \tfrac12). \] The event \(\{ w_1 > \tfrac12\}\) occurs with probability approximately \(0.31\), so the adaptive control law will “stick” with probability at least \(0.31\) at the non-optimal control law (average cost 2 versus the optimal cost 1).
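A quick Monte Carlo check of this calculation (mine, not part of the notes): under the true system \((a,b) = (0,-1)\) with \(x_1 = 1\), \(u_1 = 0\), and \(w_1 \sim {\cal N}(0,1)\), we have \(x_2 = w_1\), and the estimate after one step is \((1,1)\) exactly when \(w_1 > \tfrac12\).

```python
import numpy as np

rng = np.random.default_rng(0)
w1 = rng.normal(size=1_000_000)           # w_1 ~ N(0, 1)
x2 = w1                                   # x_2 = 0*x_1 + (-1)*u_1 + w_1 = w_1
# The least-squares test picks (1, 1) when the (0, -1) residual is larger.
sticks = (x2 + 0.0)**2 > (x2 - 1.0 - 0.0)**2
print(sticks.mean())                      # ≈ 0.3085 = P(w_1 > 1/2)
```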

To circumvent “sticking” to the bad controller, we need to make sure that a condition known as “persistent excitation” is satisfied, which ensures that there is sufficient exploration. More on that later.

2 Thompson sampling

One of the oldest and simplest algorithms to ensure sufficient exploration is Thompson sampling, which is a Bayesian approach that relies on a very simple idea. Maintain a posterior belief \(π_t\) over the parameter \(θ\). At time \(t\), sample a value \(\tilde θ_t\) from the posterior \(π_t\) and choose the control law \[ u_t = - L_{\tilde θ_t} x_t. \] The intuition behind the approach is that if we take enough samples, \(π_t\) will converge to a Dirac delta distribution centered at the true \(θ\), and therefore \(\tilde θ_t\) will be the same as \(θ\).

2.1 A simple example

We present a simple example to illustrate how Thompson sampling works. As before, let’s consider a scalar system \[ x_{t+1} = a x_t + b u_t + w_t \] where \(x_t, u_t, w_t \in \reals\) and \(\{w_t\}_{t \ge 1}\) is an i.i.d. process with zero mean and variance \(σ^2\). The per-step cost is \(c(x_t, u_t) = x_t^2\) and our objective is to minimize the long-term average cost \[ \lim_{T \to ∞} \frac{1}{T} \EXP\Bigl[ \sum_{t=1}^T x_t^2 \Bigr]. \] As we have seen before, if we knew the model parameters, the optimal control law in this case is \[ u_t = -\frac{a}{b} x_t. \]

Now suppose we have an initial prior \(π_0 = {\cal N}(θ_0, Σ_0)\) over \(θ\), where \(θ_0 = \VEC(0,1)\) and \(Σ_0 = I\). Let \(z_t\) denote \(\VEC(x_t, u_t)\). As before, we can write \[ x_{t+1} = z_t^\TRANS θ + w_t. \] Thus, at each time we get a noisy observation of the unknown parameter \(θ\) through the “observation channel” \(z_t^\TRANS\). This is a standard estimation problem with Gaussian observations. From standard results in Kalman filtering, we know that the posterior \(π_t(dθ) = \PR( θ \in dθ | z_{1:t})\) is Gaussian, say \({\cal N}(θ_t, Σ_t)\), where \((θ_t, Σ_t)\) satisfy the recursion

\[\begin{align*} θ_{t+1} &= θ_t + \frac{ Σ_t z_t (x_{t+1} - z_t^\TRANS θ_t) } { σ^2 + z_t^\TRANS Σ_t z_t } \\ Σ_{t+1} &= Σ_t - \frac{ Σ_t z_t z_t^\TRANS Σ_t }{ σ^2 + z_t^\TRANS Σ_t z_t }. \end{align*}\]

Note that the update above is identical to the update for the recursive least squares estimate. Then, at time \(t\), we sample \[ \tilde θ_t \sim {\cal N}(θ_t, Σ_t). \] Now, let \(\tilde θ_t = (\tilde a_t, \tilde b_t)\). Then, the Thompson sampling control law is given by \[ u_t = - \frac{\tilde a_t}{\tilde b_t} x_t. \] It can be shown that under some conditions on the parameters of the model, the posterior mean \(θ_t\) converges to a \(θ^*\) such that \(L_{θ^*} = L_{θ}\) and the covariance \(Σ_t\) converges to \(0\).

Let’s run a simple simulation to test this: fix values of \(a\), \(b\), and \(σ\), run the system in closed loop, and look at a single sample path.
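As with the certainty-equivalence demo, here is a Python/NumPy sketch of one sample path of Thompson sampling (the parameter values, the seed, and the divide-by-zero guard are my choices; the posterior update is the Kalman-filter recursion above, and a fresh \(\tilde θ_t\) is drawn at every step):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, sigma, T = 0.5, 1.0, 1.0, 5_000    # arbitrary illustrative values

theta_bar, Sigma = np.array([0.0, 1.0]), np.eye(2)   # prior N(theta_0, Sigma_0)
x, total_cost = 0.0, 0.0
for t in range(T):
    # Sample theta_t ~ N(theta_bar, Sigma) and apply the sampled gain.
    a_s, b_s = rng.multivariate_normal(theta_bar, Sigma)
    u = -(a_s / b_s) * x if abs(b_s) > 1e-6 else 0.0
    x_next = a * x + b * u + sigma * rng.normal()
    total_cost += x_next**2
    # Posterior (Kalman filter) update, identical to recursive least squares.
    z = np.array([x, u])
    denom = sigma**2 + z @ Sigma @ z
    theta_bar = theta_bar + Sigma @ z * (x_next - z @ theta_bar) / denom
    Sigma = Sigma - np.outer(Sigma @ z, Sigma @ z) / denom
    x = x_next

# As Sigma_t -> 0 and the sampled gains approach a/b, this approaches sigma**2 = 1.
print(total_cost / T)
```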

References

The material for the example that certainty equivalence can fail is taken from Kumar (1983).

Kumar, P.R. 1983. Optimal adaptive control of linear-quadratic-Gaussian systems. SIAM Journal on Control and Optimization 21, 2, 163–178. DOI: 10.1137/0321009.
