# ECSE 506: Stochastic Control and Decision Theory

Overview of adaptive control for linear systems

Adaptive control is a umbrella term which is often used to describe algorithms which are able to control an unknown system (i.e., a system with unknown dynamics). In this section, we focus on adaptive control of linear systems. The simplest setup is as follows.

Consider a standard linear quadratic regulator where the dynamics are $x_{t+1} = A_θ x_t + B_θ u_t + w_t$ where the matrices $$A_θ$$ and $$B_θ$$ are parameterized by a parameter $$θ$$ which takes values in a set $$Θ$$. The per-step cost is $$c(x_t, u_t) = x_t^\TRANS Q x_t + u_t^\TRANS R u_t$$. For simplicity, we assume that the $$Q$$ and $$R$$ matrices are known but all we know about the $$A_θ$$ and $$B_θ$$ matrices is that $$θ$$ lies in the set $$Θ$$. The objective is the design a controller which will minimize the long term average cost $\lim_{T \to ∞} \frac{1}{T} \EXP\Bigl[ \sum_{t=1}^T \bigl[ x_t^\TRANS Q x_t + u_t^\TRANS R u_t \bigr] \Bigr].$ Had we known $$A_θ$$ and $$B_θ$$, we know that the optimal controller is given by $u_t = - L_θ x_t$ where $$L_θ = [R + B_θ^\TRANS S_θ B_θ]^{-1} B_θ^\TRANS S_θ A_θ$$ and $$S_θ$$ is the solution to the discrete-time algebric Riccati equation: $S_θ = A_θ^\TRANS S_θ A_θ + Q - A_θ^\TRANS S_θ^\TRANS B_θ[ R + B_θ^\TRANS S_θ B_θ]^{-1} B_θ^\TRANS S_θ A_θ$ and the corresponding optimal performance is $$J_θ = \TR(S_θΣ)$$.

But what do we do if the system model is unknown? In this section, we overview the basic ideas for adaptive control (or reinforcement learning).

# 1 Certainty equivalence

The simplest idea in adaptive control is certainty equivalence: at each time generate an estimate $$\hat θ_t$$ the unknown parameters $$θ$$ of the model based on the observed data and use the controller $u_t = -L_{\hat θ_t} x_t.$

The idea is that if $$\hat θ_t$$ converges to $$θ$$ (or more weakly $$L_{\hat θ_t}$$ converges to $$L_θ$$, then the performance of the certainty equivalence controller will converge to the optimal performance.

## 1.1 A simple example

We present a simple example to illustrate that it is possible to control a linear system with unknown dymamics. Consider a scalar linear system $x_{t+1} = a x_t + b u_t + w_t$ where $$x_t, u_t, w_t \in \reals$$ and $$\{w_t\}_{t \ge 1}$$ is an i.i.d. process with zero mean and variance $$σ^2$$. Suppse the per-step cost is $$c(x_t, u_t) = x_t^2$$ and our objective is to minimize the long-term average cost $\lim_{T \to ∞} \frac{1}{T} \EXP\Bigl[ \sum_{t=1}^T x_t^2 \Bigr].$ It is easy to verify that the solution of the discrete-time algebric Riccati equation is given by $$S = 1$$ and therefore the optimal control action is $u_t = - \frac{a}{b} x_t$ and the optimal average cost is $$σ^2$$.

Now suppose the parameters $$θ = \VEC(a,b)$$ are unknown. We assume that $$θ \in \reals^2$$. Suppose we use linear least squares estimates to identify $$θ$$. Let $$z_t$$ denote $$\VEC(x_t, u_t)$$. Note that we can write $x_{t+1} = z_t^\TRANS θ + w_t.$

Combining oberrvations until time $$t$$, we have $\MATRIX{x_2 \\ \vdots \\ x_{t}} = \MATRIX{z_1^\TRANS \\ \vdots \\ z_{t-1}^\TRANS } θ + \MATRIX{w_1 \\ \vdots \\ w_{t-1}}$ which may be written in vector form as $X_t = Z_t θ + W_t,$ where $X_t = \MATRIX{x_2 \\ \vdots \\ x_{t}}, \quad Z_t = \MATRIX{z_1^\TRANS \\ \vdots \\ z_{t-1}^\TRANS }, \quad W_t = \MATRIX{w_1 \\ \vdots \\ w_{t-1}}.$

Thus, the linear least squares estimator of $$θ$$ is given by $\hat θ_t = (Z_t^\TRANS Z_t)^{-1} Z_t^\TRANS X_t = \bigl[ \sum_{τ = 1}^{t-1} z_τ z_τ^\TRANS \bigr]^{-1} \sum_{τ=1}^{t-1} z_τ x_{τ+1}.$

The linear least squares estimate can be written in recursive form as: $$\hat θ_0 = \VEC(0, 1)$$, $$Σ_0 = I$$, and for $$t \ge 0$$: \begin{align*} \hat θ_{t+1} &= \hat θ_t + \frac{ Σ_t z_t ( x_{t+1} - z_t^\TRANS \hat θ_t) }{σ^2 + z_t^\TRANS Σ_t z_t }, \\ Σ_{t+1} &= Σ_t - \frac{ Σ_t z_t z_t^\TRANS Σ_t }{ σ^2 + z_t^\TRANS Σ_t z_t }. \end{align*}

Note that it can also be shown that the covariance $$Σ_t$$ satisfies the recusion $Σ_{t+1}^{-1} = Σ_t^{-1} + \frac{1}{σ^2} z_t z_t^\TRANS.$

Now, let $$\hat θ_t = (\hat a_t, \hat b_t)$$. The certainty equivalent control law is given by $u_t = - \frac{\hat a_t}{\hat b_t} x_t.$

It can be shown that under some conditions on the parameters of the model, the sequence $$\{ \hat a_t/ \hat b_t\}_{t \ge 1}$$ converges to $$a/b$$, so asymptotically we will use the right control law, even though $$\hat θ_t$$ does not converge to $$θ$$.

Let’s run a simple simulation to test this. Suppose $a =$ , $b =$ , and $σ =$ . The resulting plot for a single sample path is shown below. You can generate another sample path by clicking

## 1.2 Certainty equivalence can fail

Now, we present a simple example to show that certainty equivalence can fail. Again, consider the simple scalar example from before $x_{t+1} = a x_t + b u_t + w_t$ where $$x_t, u_t, w_t \in \reals$$ and $$\{w_t\}_{t \ge 1}$$ is an i.i.d. process with zero mean and variance $$σ^2$$. The per-step cost is $$c(x_t, u_t) = x_t^2 + 2 u_t^2$$. So, our objective is to minimize the long-term average cost $\lim_{T \to ∞} \frac{1}{T} \EXP\Bigl[ \sum_{t=1}^T x_t^2 + 2 u_t^2 \Bigr].$

Now suppose that $$θ = (a,b)$$ are unknown but we know that $$θ$$ lies in the set $$Θ = \{ (0, -1), (1, 1) \}$$. When $$(a,b) = (0, -1)$$, $$S = 1$$ and the optimal control law is $$u_t = 0$$. When $$(a,b) = (1,1)$$, $$S = 2$$ and the optimal control law is $$u_t = -\tfrac 12 x_t)$$. Thus, the optimal control law is $u_t = \begin{cases} 0, & \text{if } (a, b) = (0, -1), \\ -\tfrac12 x_t, & \text{if } (a, b) = (1, 1). \end{cases}$

Suppose, as before, we use a certainty equivalence control law where we estimate the parameters $$θ$$ using least-squares (or equivalently, in this case, a maximum-likelihood) estimate based on the observed data up to time $$t$$. The least-squares estimate is $\hat θ_t = \begin{cases} (0, -1), & \text{if } \sum_{τ=1}^{t-1} (x_{τ+1} + u_τ)^2 \le \sum_{τ=1}^{t-1}(x_{τ+1} - x_τ - u_τ)^2, \\ (1, 1), & \text{otherwise}. \end{cases}$

Now suppose at some point of time, the estimate is $$(1,1)$$. Which means that $\sum_{τ=1}^{t-1} (x_{τ+1} + u_τ)^2 > \sum_{τ=1}^{t-1}(x_{τ+1} - x_τ - u_τ)^2,$ and by using the certainty equivalence controller, we chose $$u_t = -\tfrac12 x_t$$. This implies that $(x_{t+1} + u_t)^2 = (x_{t+1} - x_t - u_t)^2.$ Thus, at time $$t+1$$, we will have $\sum_{τ=1}^{t} (x_{τ+1} + u_τ)^2 > \sum_{τ=1}^{t}(x_{τ+1} - x_τ - u_τ)^2$ and, therefore, the estimate $$\hat θ_{t+1} = (1,1)$$. Repeating the above argument at time $$t+1$$, we get that if at any time the parameters $$θ$$ are estimated to be $$(1,1)$$, then the parameter estimates will thereafter remain unchanged and the adaptive control law will “stick” at $$u_τ = -\tfrac12 x_τ$$ for all $$τ \ge t$$. This is clearly undesirable if the true value of the parameters is $$(0, -1)$$.

To see that this can indeed happen with positive probability, suppose that $$(a,b) = (0, -1)$$ is indeed the true system and we start initially with $$x_1 = 1$$ and $$u_1 = 0$$. Then, $\PR( (x_2 + u_1)^2 > (x_2 - x_1 - u_1)^2 ) = \PR( w_1^2 > (w_1 - 1)^2 ) = \PR(w_1 > \tfrac12).$ Since $$w_1 \sim {\cal N}(0,1)$$, the event $$\{ w_1 > \tfrac12\}$$ occurs with probability $$0.31$$ and the adaptive control law will “stick” with probability at least 0.31 at the non-optimal (cost = 2 versus cost = 1) control law.

To circumvent “sticking” to the bad controller, we need to make sure that a condition known as “persistent excitation” is satisfied, which ensures that there is sufficient exploration. More on that later.

# 2 Thompson sampling

One of the oldest and simplest algorithms to ensure sufficient exploration is Thompson sampling, which is Bayesian approach that relies on a very simple idea. Maintain a posterior belief $$π_t$$ over the parameter $$θ$$. At time $$t$$, sample a value $$\tilde θ_t$$ from the posterior $$π_t$$ and choose the control law $u_t = - L_{\tilde θ_t} x_t.$ The intuition behind the approach is that if we take enough samples, $$π_t$$ will converge to a dirac delta distribution centered at the true $$θ$$ and therefore $$θ_t$$ will be the same as $$θ$$.

## 2.1 A simple example

We present a simple example to illustrate how Thompson sampling works. As before, let’s consider a scalar system $x_{t+1} = a x_t + b u_t + w_t$ where $$x_t, u_t, w_t \in \reals$$ and $$\{w_t\}_{t \ge 1}$$ is an i.i.d. process with zero mean and variance $$σ^2$$. The per-step cost is $$c(x_t, u_t) = x_t^2$$ and our objective is to minimize the long-term average cost $\lim_{T \to ∞} \frac{1}{T} \EXP\Bigl[ \sum_{t=1}^T x_t^2 \Bigr].$ As we have seen before, if we knew the model parameters, the optimal control law in this case is $u_t = -\frac{a}{b} x_t.$

Now suppose we had an initial prior $$π_0 \sim {\cal N}(θ_0, Σ_0)$$ over $$θ$$. where $$θ_0 = \VEC(0,1)$$ and $$Σ_0 = I$$. Let $$z_t$$ denote $$\VEC(x_t, u_t)$$. As before, we can write $x_{t+1} = z_t^\TRANS θ + w_t.$ Thus, at each time we get a noisy observation of the unknown parameter $$θ$$ though the “observation channel” $$z_t^\TRANS$$. This is the standard estimation problem with Gaussian observations. From standard results in Kalman filtering, we know that the posterior $$π_t(dθ) = \PR( θ \in dθ | z_{1:t})$$ is Gaussian, say $${\cal N}(θ_t, Σ_t)$$, where $$(θ_t, Σ_t)$$ satisfy the recursion

\begin{align*} θ_{t+1} &= θ_t + \frac{ Σ_t z_t (x_{t+1} - z_t^\TRANS θ_t) } { σ^2 + z_t^\TRANS Σ_t z_t } \\ Σ_{t+1} &= Σ_t - \frac{ Σ_t z_t z_t^\TRANS Σ_t }{ σ^2 + z_t^\TRANS Σ_t z_t }. \end{align*}

Note that the update above is identical to the update for the recursive least squares estimates. Then, at time $$t$$, we sample $\tilde θ_t \sim {\cal N}(θ_t, Σ_t).$ Now, let $$\tilde θ_t = (\tilde a_t, \tilde b_t)$$. Then, the Thompson sampling control law is given by $u_t = - \frac{\tilde a_t}{\tilde b_t} x_t.$ It can be shown that under some conditions on the parameters of the model, the posterior $$θ_t$$ converges to a $$θ^*$$ such that $$L_{θ^*} = L_{θ}$$ and the covariance $$Σ_t$$ converges to $$0$$.

Let’s run a simple simulation to test this. Suppose $a =$ , $b =$ , and $σ =$ . The resulting plot for a single sample path is shown below. You can generate another sample path by clicking

# References

The material for the example that certainty equivalence can fail is taken from Kumar (1983).

Kumar, P.R. 1983. Optimal adaptive control of linear-quadratic-gaussian systems. SIAM Journal on Control and Optimization 21, 2, 163–178. DOI: 10.1137/0321009.

This entry was last updated on 01 Aug 2021 and posted in RL and tagged linear systems, adaptive control, lqr.