# ECSE 506: Stochastic Control and Decision Theory

## Aditya Mahajan

Winter 2022


Adaptive control is an umbrella term for algorithms that can control an unknown system (i.e., a system with unknown dynamics). In this section, we focus on adaptive control of linear systems. The simplest setup is as follows.

Consider a standard linear quadratic regulator where the dynamics are \[ x_{t+1} = A_θ x_t + B_θ u_t + w_t \] where the matrices \(A_θ\) and \(B_θ\) are parameterized by a parameter \(θ\) which takes values in a set \(Θ\). The per-step cost is \(c(x_t, u_t) = x_t^\TRANS Q x_t + u_t^\TRANS R u_t\). For simplicity, we assume that the \(Q\) and \(R\) matrices are known but all we know about the \(A_θ\) and \(B_θ\) matrices is that \(θ\) lies in the set \(Θ\). The objective is to design a controller which will minimize the long-term average cost \[ \lim_{T \to ∞} \frac{1}{T} \EXP\Bigl[ \sum_{t=1}^T \bigl[ x_t^\TRANS Q x_t + u_t^\TRANS R u_t \bigr] \Bigr]. \] Had we known \(A_θ\) and \(B_θ\), we know that the optimal controller is given by \[ u_t = - L_θ x_t \] where \(L_θ = [R + B_θ^\TRANS S_θ B_θ]^{-1} B_θ^\TRANS S_θ A_θ\) and \(S_θ\) is the solution to the discrete-time algebraic Riccati equation: \[ S_θ = A_θ^\TRANS S_θ A_θ + Q - A_θ^\TRANS S_θ B_θ[ R + B_θ^\TRANS S_θ B_θ]^{-1} B_θ^\TRANS S_θ A_θ \] and the corresponding optimal performance is \(J_θ = \TR(S_θ Σ)\), where \(Σ\) is the covariance of the noise \(w_t\).
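When the model is known, both \(S_θ\) and \(L_θ\) are easy to compute numerically. Here is a minimal Python sketch (the system matrices below are illustrative assumptions, not part of the setup above) using `scipy.linalg.solve_discrete_are`:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Illustrative (assumed) system matrices; any stabilizable (A, B) will do.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.eye(1)

# Solve the discrete-time algebraic Riccati equation for S.
S = solve_discrete_are(A, B, Q, R)

# Optimal gain L = (R + B' S B)^{-1} B' S A, so that u_t = -L x_t.
L = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)

# Optimal average cost J = tr(S Σ), with Σ the noise covariance (assumed I here).
Sigma_w = np.eye(2)
J = np.trace(S @ Sigma_w)
print("gain L =", L, "\naverage cost J =", J)
```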

But what do we do if the system model is unknown? In this section, we give an overview of the basic ideas of adaptive control (or reinforcement learning) for such systems.

# 1 Certainty equivalence

The simplest idea in adaptive control is **certainty equivalence**: at each time, generate an estimate \(\hat θ_t\) of the unknown parameter \(θ\) of the model based on the observed data and use the controller \[
u_t = -L_{\hat θ_t} x_t.
\]

The idea is that if \(\hat θ_t\) converges to \(θ\) (or, more weakly, \(L_{\hat θ_t}\) converges to \(L_θ\)), then the performance of the certainty equivalence controller will converge to the optimal performance.

## 1.1 A simple example

We present a simple example to illustrate that it is possible to control a linear system with unknown dynamics. Consider a scalar linear system \[ x_{t+1} = a x_t + b u_t + w_t \] where \(x_t, u_t, w_t \in \reals\) and \(\{w_t\}_{t \ge 1}\) is an i.i.d. process with zero mean and variance \(σ^2\). Suppose the per-step cost is \(c(x_t, u_t) = x_t^2\) and our objective is to minimize the long-term average cost \[ \lim_{T \to ∞} \frac{1}{T} \EXP\Bigl[ \sum_{t=1}^T x_t^2 \Bigr]. \] It is easy to verify that the solution of the discrete-time algebraic Riccati equation is given by \(S = 1\) and therefore the optimal control action is \[ u_t = - \frac{a}{b} x_t \] and the optimal average cost is \(σ^2\).
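Indeed, substituting \(Q = 1\) and \(R = 0\) into the Riccati equation and the gain formula gives \[ S = a^2 S + 1 - \frac{a^2 S^2 b^2}{b^2 S} = a^2 S + 1 - a^2 S = 1, \qquad L = \frac{b S a}{R + b^2 S} = \frac{a}{b}. \]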

Now suppose the parameters \(θ = \VEC(a,b)\) are unknown. We assume that \(θ \in \reals^2\). Suppose we use linear least squares estimates to identify \(θ\). Let \(z_t\) denote \(\VEC(x_t, u_t)\). Note that we can write \[ x_{t+1} = z_t^\TRANS θ + w_t. \]

Combining observations until time \(t\), we have \[ \MATRIX{x_2 \\ \vdots \\ x_{t}} = \MATRIX{z_1^\TRANS \\ \vdots \\ z_{t-1}^\TRANS } θ + \MATRIX{w_1 \\ \vdots \\ w_{t-1}} \] which may be written in vector form as \[ X_t = Z_t θ + W_t, \] where \[ X_t = \MATRIX{x_2 \\ \vdots \\ x_{t}}, \quad Z_t = \MATRIX{z_1^\TRANS \\ \vdots \\ z_{t-1}^\TRANS }, \quad W_t = \MATRIX{w_1 \\ \vdots \\ w_{t-1}}. \]

Thus, the linear least squares estimator of \(θ\) is given by \[ \hat θ_t = (Z_t^\TRANS Z_t)^{-1} Z_t^\TRANS X_t = \bigl[ \sum_{τ = 1}^{t-1} z_τ z_τ^\TRANS \bigr]^{-1} \sum_{τ=1}^{t-1} z_τ x_{τ+1}. \]

The linear least squares estimate can be written in recursive form as: \(\hat θ_0 = \VEC(0, 1)\), \(Σ_0 = I\), and for \(t \ge 0\): \[\begin{align*} \hat θ_{t+1} &= \hat θ_t + \frac{ Σ_t z_t ( x_{t+1} - z_t^\TRANS \hat θ_t) }{σ^2 + z_t^\TRANS Σ_t z_t }, \\ Σ_{t+1} &= Σ_t - \frac{ Σ_t z_t z_t^\TRANS Σ_t }{ σ^2 + z_t^\TRANS Σ_t z_t }. \end{align*}\]

Note that it can also be shown (via the matrix inversion lemma) that the covariance \(Σ_t\) satisfies the recursion \[ Σ_{t+1}^{-1} = Σ_t^{-1} + \frac{1}{σ^2} z_t z_t^\TRANS. \]
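For concreteness, here is a minimal Python sketch of this recursive update (the function name and interface are ours, purely for illustration):

```python
import numpy as np

def rls_update(theta, Sigma, z, x_next, sigma2):
    """One step of recursive least squares for x_{t+1} = z_t' θ + w_t.

    theta  : current estimate of θ = (a, b)
    Sigma  : current covariance matrix Σ_t
    z      : regressor z_t = (x_t, u_t)
    x_next : observed next state x_{t+1}
    sigma2 : noise variance σ²
    """
    denom = sigma2 + z @ Sigma @ z
    theta = theta + (Sigma @ z) * (x_next - z @ theta) / denom
    Sigma = Sigma - np.outer(Sigma @ z, Sigma @ z) / denom
    return theta, Sigma
```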

Now, let \(\hat θ_t = (\hat a_t, \hat b_t)\). The certainty equivalent control law is given by \[ u_t = - \frac{\hat a_t}{\hat b_t} x_t. \]

It can be shown that under some conditions on the parameters of the model, the sequence \(\{ \hat a_t/ \hat b_t\}_{t \ge 1}\) converges to \(a/b\), so asymptotically we will use the right control law, even though \(\hat θ_t\) does not converge to \(θ\).

Let’s run a simple simulation to test this for a single sample path. Below is a minimal sketch (with illustrative values \(a = 1\), \(b = 0.5\), and \(σ = 1\); any other values work the same way) that tracks how the certainty-equivalent gain \(\hat a_t/\hat b_t\) evolves.
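```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) true parameters, unknown to the controller.
a, b, sigma = 1.0, 0.5, 1.0
T = 1000

theta = np.array([0.0, 1.0])   # initial estimate (as above): θ̂_0 = (0, 1)
Sigma = np.eye(2)              # initial covariance: Σ_0 = I
x = 0.0

for t in range(T):
    a_hat, b_hat = theta
    u = -(a_hat / b_hat) * x                    # certainty-equivalent control
    x_next = a * x + b * u + sigma * rng.standard_normal()

    # Recursive least-squares update with regressor z_t = (x_t, u_t).
    z = np.array([x, u])
    denom = sigma**2 + z @ Sigma @ z
    theta = theta + (Sigma @ z) * (x_next - z @ theta) / denom
    Sigma = Sigma - np.outer(Sigma @ z, Sigma @ z) / denom
    x = x_next

print("estimated gain:", theta[0] / theta[1], "true gain:", a / b)
```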

## 1.2 Certainty equivalence can fail

Now, we present a simple example to show that certainty equivalence can fail. Again, consider the simple scalar example from before \[ x_{t+1} = a x_t + b u_t + w_t \] where \(x_t, u_t, w_t \in \reals\) and \(\{w_t\}_{t \ge 1}\) is an i.i.d. process with zero mean and variance \(σ^2\). The per-step cost is \(c(x_t, u_t) = x_t^2 + 2 u_t^2\). So, our objective is to minimize the long-term average cost \[ \lim_{T \to ∞} \frac{1}{T} \EXP\Bigl[ \sum_{t=1}^T \bigl[ x_t^2 + 2 u_t^2 \bigr] \Bigr]. \]

Now suppose that \(θ = (a,b)\) are unknown but we know that \(θ\) lies in the set \(Θ = \{ (0, -1), (1, 1) \}\). When \((a,b) = (0, -1)\), \(S = 1\) and the optimal control law is \(u_t = 0\). When \((a,b) = (1,1)\), \(S = 2\) and the optimal control law is \(u_t = -\tfrac 12 x_t\). Thus, the optimal control law is \[ u_t = \begin{cases} 0, & \text{if } (a, b) = (0, -1), \\ -\tfrac12 x_t, & \text{if } (a, b) = (1, 1). \end{cases} \]
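For example, for \((a,b) = (1,1)\) the Riccati equation and the gain reduce to \[ S = S + 1 - \frac{S^2}{2 + S} \implies S^2 = S + 2 \implies S = 2, \qquad L = \frac{b S a}{R + b^2 S} = \frac{2}{4} = \frac12, \] where we take the positive root of the quadratic.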

Suppose, as before, we use a certainty equivalence control law where we estimate the parameters \(θ\) using the least-squares (or, equivalently in this case, maximum-likelihood) estimate based on the observed data up to time \(t\). The least-squares estimate is \[ \hat θ_t = \begin{cases} (0, -1), & \text{if } \sum_{τ=1}^{t-1} (x_{τ+1} + u_τ)^2 \le \sum_{τ=1}^{t-1}(x_{τ+1} - x_τ - u_τ)^2, \\ (1, 1), & \text{otherwise}. \end{cases} \]

Now suppose that at some point in time the estimate is \((1,1)\), which means that \[
\sum_{τ=1}^{t-1} (x_{τ+1} + u_τ)^2
> \sum_{τ=1}^{t-1}(x_{τ+1} - x_τ - u_τ)^2,
\] and, by using the certainty equivalence controller, we choose \(u_t = -\tfrac12 x_t\). This implies that \[ (x_{t+1} + u_t)^2 = (x_{t+1} - x_t - u_t)^2 \] because both sides equal \((x_{t+1} - \tfrac12 x_t)^2\). Thus, at time \(t+1\), we will have \[
\sum_{τ=1}^{t} (x_{τ+1} + u_τ)^2
> \sum_{τ=1}^{t}(x_{τ+1} - x_τ - u_τ)^2
\] and, therefore, the estimate \(\hat θ_{t+1} = (1,1)\). Repeating the above argument, we get that if at *any* time \(t\) the parameters \(θ\) are estimated to be \((1,1)\), then the parameter estimates will thereafter remain *unchanged* and the adaptive control law will “stick” at \(u_τ = -\tfrac12 x_τ\) for all \(τ \ge t\). This is clearly undesirable if the true value of the parameters is \((0, -1)\).

To see that this can indeed happen with positive probability, suppose that \((a,b) = (0, -1)\) is indeed the true system, that \(w_t \sim {\cal N}(0,1)\), and that we start initially with \(x_1 = 1\) and \(u_1 = 0\). Then, \(x_2 = w_1\) and \[ \PR( (x_2 + u_1)^2 > (x_2 - x_1 - u_1)^2 ) =
\PR( w_1^2 > (w_1 - 1)^2 ) = \PR(w_1 > \tfrac12).
\] The event \(\{ w_1 > \tfrac12\}\) occurs with probability approximately \(0.31\), so the adaptive control law will “stick” with probability *at least* 0.31 at the *non-optimal* (cost = 2 versus cost = 1) control law.
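A quick Monte Carlo sketch (under the same assumptions) confirms this lower bound; the fraction of sample paths that end up stuck at the bad controller should come out above \(0.31\):

```python
import numpy as np

rng = np.random.default_rng(0)

# True system: (a, b) = (0, -1); the controller does not know this.
a, b = 0.0, -1.0
T, n_runs = 200, 5000
stuck = 0

for _ in range(n_runs):
    x, u = 1.0, 0.0          # x_1 = 1, u_1 = 0
    s0 = s1 = 0.0            # residual sums under (0, -1) and (1, 1)
    for _ in range(T):
        x_next = a * x + b * u + rng.standard_normal()
        s0 += (x_next + u) ** 2        # residual if (a, b) = (0, -1)
        s1 += (x_next - x - u) ** 2    # residual if (a, b) = (1, 1)
        # Certainty-equivalent control under the least-squares estimate.
        u = 0.0 if s0 <= s1 else -0.5 * x_next
        x = x_next
    if s0 > s1:              # final estimate is (1, 1): stuck at the bad law
        stuck += 1

print("fraction stuck at the non-optimal controller:", stuck / n_runs)
```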

To circumvent “sticking” to the bad controller, we need to make sure that a condition known as “persistent excitation” is satisfied, which ensures that there is sufficient exploration. More on that later.

# 2 Thompson sampling

One of the oldest and simplest algorithms to ensure sufficient exploration is Thompson sampling, which is a Bayesian approach that relies on a very simple idea. Maintain a posterior belief \(π_t\) over the parameter \(θ\). At time \(t\), sample a value \(\tilde θ_t\) from the posterior \(π_t\) and choose the control law \[ u_t = - L_{\tilde θ_t} x_t. \] The intuition behind the approach is that if we take enough samples, \(π_t\) will converge to a Dirac delta distribution centered at the true \(θ\) and therefore \(\tilde θ_t\) will be the same as \(θ\).

## 2.1 A simple example

We present a simple example to illustrate how Thompson sampling works. As before, let’s consider a scalar system \[ x_{t+1} = a x_t + b u_t + w_t \] where \(x_t, u_t, w_t \in \reals\) and \(\{w_t\}_{t \ge 1}\) is an i.i.d. process with zero mean and variance \(σ^2\). The per-step cost is \(c(x_t, u_t) = x_t^2\) and our objective is to minimize the long-term average cost \[ \lim_{T \to ∞} \frac{1}{T} \EXP\Bigl[ \sum_{t=1}^T x_t^2 \Bigr]. \] As we have seen before, if we knew the model parameters, the optimal control law in this case is \[ u_t = -\frac{a}{b} x_t. \]

Now suppose we had an initial prior \(π_0 \sim {\cal N}(θ_0, Σ_0)\) over \(θ\), where \(θ_0 = \VEC(0,1)\) and \(Σ_0 = I\). Let \(z_t\) denote \(\VEC(x_t, u_t)\). As before, we can write \[ x_{t+1} = z_t^\TRANS θ + w_t. \] Thus, at each time we get a noisy observation of the unknown parameter \(θ\) through the “observation channel” \(z_t^\TRANS\). This is the standard estimation problem with Gaussian observations. From standard results in Kalman filtering, we know that the posterior \(π_t(dθ) = \PR( θ \in dθ | z_{1:t})\) is Gaussian, say \({\cal N}(θ_t, Σ_t)\), where \((θ_t, Σ_t)\) satisfy the recursion

\[\begin{align*} θ_{t+1} &= θ_t + \frac{ Σ_t z_t (x_{t+1} - z_t^\TRANS θ_t) } { σ^2 + z_t^\TRANS Σ_t z_t } \\ Σ_{t+1} &= Σ_t - \frac{ Σ_t z_t z_t^\TRANS Σ_t }{ σ^2 + z_t^\TRANS Σ_t z_t }. \end{align*}\]

Note that the update above is identical to the update for the recursive least squares estimates. Then, at time \(t\), we sample \[ \tilde θ_t \sim {\cal N}(θ_t, Σ_t). \] Now, let \(\tilde θ_t = (\tilde a_t, \tilde b_t)\). Then, the Thompson sampling control law is given by \[ u_t = - \frac{\tilde a_t}{\tilde b_t} x_t. \] It can be shown that under some conditions on the parameters of the model, the posterior mean \(θ_t\) converges to a \(θ^*\) such that \(L_{θ^*} = L_{θ}\) and the covariance \(Σ_t\) converges to \(0\).

Let’s run a simple simulation to test this for a single sample path. Below is a minimal sketch (again with illustrative values \(a = 1\), \(b = 0.5\), and \(σ = 1\); in a careful implementation one would also guard against sampling \(\tilde b_t\) close to \(0\)).
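```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) true parameters, unknown to the controller.
a, b, sigma = 1.0, 0.5, 1.0
T = 1000

theta = np.array([0.0, 1.0])   # prior mean θ_0 = (0, 1)
Sigma = np.eye(2)              # prior covariance Σ_0 = I
x = 0.0

for t in range(T):
    # Sample a model from the posterior and act as if it were the truth.
    a_s, b_s = rng.multivariate_normal(theta, Sigma)
    u = -(a_s / b_s) * x
    x_next = a * x + b * u + sigma * rng.standard_normal()

    # Conjugate (Kalman-filter) posterior update, identical to the RLS recursion.
    z = np.array([x, u])
    denom = sigma**2 + z @ Sigma @ z
    theta = theta + (Sigma @ z) * (x_next - z @ theta) / denom
    Sigma = Sigma - np.outer(Sigma @ z, Sigma @ z) / denom
    x = x_next

print("posterior mean:", theta, "true parameters:", (a, b))
```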

# References

The material for the example that certainty equivalence can fail is taken from Kumar (1983).

Kumar, P. R. (1983). Optimal adaptive control of linear-quadratic-Gaussian systems. *SIAM Journal on Control and Optimization*, 21(2), 163–178. DOI: 10.1137/0321009.
