# ECSE 506: Stochastic Control and Decision Theory

Theory: Approximate dynamic programming

The value and policy iteration algorithms for discounted cost MDPs rely on exact computation of the Bellman update $$W = \mathcal B V$$ and the corresponding optimal policy $$g$$ such that $$\mathcal B V = \mathcal B_π V$$. Suppose we cannot compute these updates exactly, but can find approximate solutions $$W$$ and $$g$$ such that $\NORM{W - \mathcal B V} \le δ \quad\text{and}\quad \NORM{\mathcal B_π V - \mathcal B V} \le ε$ where $$δ$$ and $$ε$$ are positive constants. (In general, these constants are unknown, so the results are quantitative in nature).

The error $$δ$$ may be non-zero due to state aggregation in large state spaces, or using simulation to compute the Bellman update, or using least square fit to approximate the value function.

The error $$ε$$ may be non-zero due to large or infinite action space.

Often $$δ > 0$$ and $$ε = 0$$. We may also first pick a policy $$g$$ such that $$\NORM{\mathcal B_π V - \mathcal B V} \le ε$$ and then set $$W = \mathcal B_π V$$, in which case $$δ = ε$$.

# 1 Approximate value iteration

Theorem 1

Generate $$\{V_k\}_{k \ge 0}$$ and $$\{π_k\}_{k \ge 0}$$ such that $\NORM{V_{k+1} - \mathcal B V_k} \le δ \quad\text{and}\quad \NORM{\mathcal B_{π_k} V_k - \mathcal B V_k} \le ε.$ Then,

1. $$\displaystyle \lim_{k \to ∞} \NORM{V_k - V^*} \le \frac {δ}{(1-γ)}.$$
2. $$\displaystyle \lim_{k \to ∞} \NORM{V_{π_k} - V^*} \le \frac {ε}{(1-γ)} + \frac{2γδ}{(1-γ)^2}.$$

If we use a periodic policy with period $$M$$, then the above bound can be improved by a factor of $$1/(1-γ)$$.

#### Proof

To prove the first part, note that repeatedly combining the contraction property of the Bellman operator with the fact that $$\NORM{V_{k+1} - \mathcal B V_k} \le δ$$, we get that $$$\label{eq:B1} \NORM{\mathcal B^m V_{k+1} - \mathcal B^{m+1} V_k} \le γ^m δ.$$$

Now, from the triangle inequality, we have that \begin{align*} \NORM{V_k - \mathcal B^k V_0} &\le \NORM{V_k - \mathcal B V_{k-1}} + \NORM{\mathcal B V_{k-1} - \mathcal B^2 V_{k-2}} + \cdots + \NORM{B^{k-1} V_1 - \mathcal B^k V_0} \\ &\stackrel{(a)}\le δ + γδ + \dots + γ^{k-1}δ \\ &= \left(\frac{1 - γ^k}{1-γ}\right) δ, \end{align*} where $$(a)$$ follows from \eqref{eq:B1}. Taking the limit as $$k \to ∞$$ gives the first result.

Now, to prove the second part, we again apply the triangle inequality \begin{align*} \NORM{\mathcal B_{π_k} V^* - V^*} &\le \NORM{\mathcal B_{π_k} V^* - \mathcal B_{π_k} V_k} + \NORM{\mathcal B_{π_k} V_k - \mathcal B V_k} + \NORM{\mathcal B V_k - V^*} \\ &\stackrel{(b)}\le γ \NORM{V^* - V_k} + ε + γ \NORM{V_k - V^*} \\ &\stackrel{(c)}\le ε + \frac{2γδ}{(1-γ)} =: m, \end{align*} where the first term in $$(b)$$ uses the contraction property, the second term uses the fact that $$π_k$$ is an $$ε$$-optimal policy and the third term uses the fact that $$V^*$$ is the fixed point of $$\mathcal B$$ and thus $$V^* = \mathcal B V^*$$ and then uses the contraction property. The inequality in $$(c)$$ use the result from the first part.

Now, from the discounting property of the Bellman operator, $$\NORM{\mathcal B_{π_k} V^* - V^*} \le m$$ implies $\NORM{V_{π_k} - V^*} \le \frac{m}{(1-γ)}$ which proves the second part.

# 2 Approximate policy iteration

Before stating the approximate policy iteration algorithm, we state a preliminary result that serves as the main step in proving the error bounds for approximate policy iteration.

Prop. 1

Suppose $$V$$, $$g$$, and $$h$$ satisfy $\NORM{V - V_π} \le δ \quad\text{and}\quad \NORM{\mathcal B_h V - \mathcal B V} \le ε.$ Then, $\NORM{V_h - V^*} \le γ \NORM{V_π - V^*} + \frac{ε + 2γδ}{(1-γ)}.$

#### Proof

From the bounds on $$V$$, $$g$$, and $$h$$ and the discounting property of the Bellman operator, we have that $$$\label{eq:P1} \mathcal B_h V_π \le \mathcal B_h V + γδ \le \mathcal B V + ε + γ δ.$$$

Again, from the bounds on $$V$$ and $$g$$ and the discounting property of the Bellman operator, we have $$\mathcal B V \le \mathcal B V_π + γδ$$. Thus, $$$\label{eq:P2} \mathcal B_h V_π \le \mathcal B V_π + ε + 2γδ$$$ For ease of notation, let $$m := ε + 2γδ$$.

Moreover, from the definition of the Bellman operator $\mathcal B V_π \le \mathcal B_π V_π = V_π.$ Substituting the above in \eqref{eq:P2}, we get that $\mathcal B_h V_π \le V_π + m.$ Therefore, by the discounting property of Bellman operator, we get $$$\label{eq:P3} V_h \le V_π + \frac{m}{(1-γ)}.$$$

Using \eqref{eq:P3} and the discounting property, we get that $V_h = \mathcal B_h V_h = \mathcal B_h V_π + \big( \mathcal B_h V_h - \mathcal B_h V_π \big) \le \mathcal B_h V_π + γ \frac{m}{(1-γ)}.$

Subtracting $$V^*$$ from both sides we get \begin{align*} V_h - V^* &\le \mathcal B_h V_π - V^* + \frac{mγ}{(1-γ)} \\ &\stackrel{(a)}\le \mathcal B V_π + m - V^* + \frac{mγ}{(1-γ)} \\ &\stackrel{(b)}= \mathcal B V_π - \mathcal B V^* + \frac{m}{(1-γ)} \\ &\stackrel{(c)}{\le} γ \NORM{V_π - V^*} + \frac{m}{(1-γ)}, \end{align*} where $$(a)$$ follows from \eqref{eq:P1}, $$(b)$$ uses the fact that $$V^*$$ is the fixed point of $$\mathcal B$$ and $$(c)$$ uses the contraction property. Substituting the value of $$m$$ in the above equation gives the result.

Theorem 2

Generate a sequence $$\{π_k\}_{k \ge 0}$$ and $$\{V_k\}_{k \ge 0}$$ such that $\NORM{V_k - V_{π_k}} \le δ \quad\text{and}\quad \NORM{\mathcal B_{π_k} V_k - \mathcal B V_k} \le ε.$ Then, $\limsup_{k\to ∞} \NORM{V_{π_k} - V^*} \le \frac{ε+2γδ}{(1-γ)^2}.$

Remark
• Both approximate VI and approximate PI have similar error bounds (proportional to $$1/(1-γ)^2$$.)
• When $$ε = δ = 0$$, then Prop. 1 implies that $$\NORM{V_{π_{k+1}} - V^*} \le γ \NORM{V_{π_k} - V^*}$$. Thus, standard policy iteration has a geometeric rate of convergence (similar to value iteration), though in practice it converges much faster.

#### Proof

From Prop. 1, we have

$\NORM{V_{π_{k+1}} - V^*} \le γ \NORM{V_{π_k} - V^*} + \frac{ε + 2γδ}{(1-γ)}.$

The result follows from taking the limit $$k \to ∞$$.

Prop. 2

If the successive policies in approximate policy iteration converge1, i.e.  $π_{k+1} = π_k = g, \quad \text{for some k}.$ Then, $\NORM{V_π - V^*} \le \frac{ε + 2γδ}{(1-γ)}.$

#### Proof

Let $$V$$ be the approximate value function obtained at iteration $$k$$, i.e., $\NORM{V - V_π} \le δ \quad\text{and}\quad \NORM{\mathcal B_π V - \mathcal V} \le ε.$

Then, from the triangle inequality, we have \begin{align*} \NORM{\mathcal B V_π - V_π } &\le \NORM{\mathcal B V_π - \mathcal B V} + \NORM{\mathcal B V - \mathcal B_π V} + \NORM{\mathcal B_π V - \mathcal B_π V_π} \\ &\stackrel{(a)}\le γ\NORM{V_π - V} + ε + γ \NORM{V - V_π} \\ &\stackrel{(b)}\le ε + 2γδ, \end{align*} where $$(a)$$ follows from the fact that $$V_π = \mathcal B_π V_π$$ and the contraction property and $$(b)$$ follows from the assumption on $$V$$. Now, from the discounting property, we get the result.

Prop. 3

Suppose the successive value functions obtained by approximate policy iteration are “not too different”, i.e., $\NORM{V - V_π} \le δ, \quad \NORM{B_h V - \mathcal B V} \le ε, \quad\text{and}\quad \NORM{B_π V - \mathcal B_h V} \le ζ.$ Then, $\NORM{V_π - V^*} \le \frac{ε + ζ + 2γδ}{(1-γ)}.$

#### Proof

The result follows by replacing $$ε$$ in $$(a)$$ above by $$ε+ζ$$.

# References

The results presented in this section are taken from Bertsekas (2013).

Bertsekas, D.P. 2013. Abstract dynamic programming. Athena Scientific Belmont. Available at: https://web.mit.edu/dimitrib/www/abstractdp_MIT.html.

1. In general the policies may not converge.↩︎

This entry was last updated on 16 Dec 2021 and posted in MDP and tagged infinite horizon, discounted cost, value iteration, policy iteration, approximation bounds, adp.