23  Approximate dynamic programming

Updated

June 8, 2023

Keywords

infinite horizon, discounted cost, value iteration, policy iteration, approximation bounds, ADP

The value and policy iteration algorithms for discounted cost MDPs rely on exact computation of the Bellman update \(W = \mathcal B V\) and the corresponding greedy policy \(π\) such that \(\mathcal B V = \mathcal B_π V\). Suppose we cannot compute these updates exactly, but can find approximate solutions \(W\) and \(π\) such that \[ \NORM{W - \mathcal B V} \le δ \quad\text{and}\quad \NORM{\mathcal B_π V - \mathcal B V} \le ε\] where \(δ\) and \(ε\) are positive constants. (In general, these constants are unknown, so the results are qualitative in nature.)

The error \(δ\) may be non-zero because of state aggregation in large state spaces, the use of simulation to compute the Bellman update, or a least-squares fit used to approximate the value function.

The error \(ε\) may be non-zero because of a large or infinite action space, which makes the exact minimization in the Bellman update intractable.

Often \(δ > 0\) and \(ε = 0\). Alternatively, we may first pick a policy \(π\) such that \(\NORM{\mathcal B_π V - \mathcal B V} \le ε\) and then set \(W = \mathcal B_π V\), in which case we may take \(δ = ε\).
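As a concrete illustration of how non-zero \(δ\) and \(ε\) can arise, the following minimal sketch (in Python, with purely illustrative names and a randomly generated finite MDP) compares the exact Bellman update with a simulation-based approximation and measures the resulting errors. It is a sketch under these assumptions, not a definitive implementation.

```python
import numpy as np

# Minimal sketch (illustrative names): a randomly generated finite MDP with
# n states and m actions; P[a][s] is the next-state distribution under
# action a, and c[s, a] is the per-stage cost.
rng = np.random.default_rng(0)
n, m, gamma = 10, 3, 0.9
P = rng.dirichlet(np.ones(n), size=(m, n))
c = rng.uniform(0, 1, size=(n, m))

def bellman_exact(V):
    """Exact Bellman update: (B V)(s) = min_a [ c(s,a) + gamma * E[V(s') | s, a] ]."""
    Q = c + gamma * np.stack([P[a] @ V for a in range(m)], axis=1)  # shape (n, m)
    return Q.min(axis=1), Q.argmin(axis=1)

def bellman_sampled(V, num_samples=100):
    """Approximate Bellman update: replace the expectation by a Monte Carlo average."""
    Q = np.empty((n, m))
    for s in range(n):
        for a in range(m):
            next_states = rng.choice(n, size=num_samples, p=P[a][s])
            Q[s, a] = c[s, a] + gamma * V[next_states].mean()
    return Q.min(axis=1), Q.argmin(axis=1)

V = rng.uniform(0, 1 / (1 - gamma), size=n)
BV, _ = bellman_exact(V)
W, pi = bellman_sampled(V)

# delta: error in the computed update W; eps: how far the greedy policy pi
# (computed from the sampled Q-values) is from achieving the exact minimum.
B_pi_V = c[np.arange(n), pi] + gamma * np.array([P[pi[s]][s] @ V for s in range(n)])
print("delta =", np.max(np.abs(W - BV)), " eps =", np.max(np.abs(B_pi_V - BV)))
```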

23.1 Approximate value iteration

Theorem 23.1 Generate \(\{V_k\}_{k \ge 0}\) and \(\{π_k\}_{k \ge 0}\) such that \[\NORM{V_{k+1} - \mathcal B V_k} \le δ \quad\text{and}\quad \NORM{\mathcal B_{π_k} V_k - \mathcal B V_k} \le ε. \] Then,

  1. \(\displaystyle \limsup_{k \to ∞} \NORM{V_k - V^*} \le \frac {δ}{(1-γ)}.\)
  2. \(\displaystyle \limsup_{k \to ∞} \NORM{V_{π_k} - V^*} \le \frac {ε}{(1-γ)} + \frac{2γδ}{(1-γ)^2}.\)

If we use a periodic policy with period \(M\), then the above bound can be improved by a factor of \(1/(1-γ)\).

To prove the first part, note that applying the contraction property of the Bellman operator (Proposition 12.2) \(m\) times to the inequality \(\NORM{V_{k+1} - \mathcal B V_k} \le δ\) gives \[\begin{equation}\label{eq:B1} \NORM{\mathcal B^m V_{k+1} - \mathcal B^{m+1} V_k} \le γ^m δ. \end{equation}\]

Now, from the triangle inequality, we have that \[\begin{align*} \NORM{V_k - \mathcal B^k V_0} &\le \NORM{V_k - \mathcal B V_{k-1}} + \NORM{\mathcal B V_{k-1} - \mathcal B^2 V_{k-2}} + \cdots + \NORM{\mathcal B^{k-1} V_1 - \mathcal B^k V_0} \\ &\stackrel{(a)}\le δ + γδ + \dots + γ^{k-1}δ \\ &= \left(\frac{1 - γ^k}{1-γ}\right) δ, \end{align*}\] where \((a)\) follows from \eqref{eq:B1}. Moreover, \(\NORM{V_k - V^*} \le \NORM{V_k - \mathcal B^k V_0} + \NORM{\mathcal B^k V_0 - V^*}\), and the second term goes to zero because value iteration converges, i.e., \(\mathcal B^k V_0 \to V^*\). Taking the limit superior as \(k \to ∞\) gives the first result.

Now, to prove the second part, we again apply the triangle inequality \[\begin{align*} \NORM{\mathcal B_{π_k} V^* - V^*} &\le \NORM{\mathcal B_{π_k} V^* - \mathcal B_{π_k} V_k} + \NORM{\mathcal B_{π_k} V_k - \mathcal B V_k} + \NORM{\mathcal B V_k - V^*} \\ &\stackrel{(b)}\le γ \NORM{V^* - V_k} + ε + γ \NORM{V_k - V^*} \\ &\stackrel{(c)}\le ε + \frac{2γδ}{(1-γ)} =: m, \end{align*}\] where the first term in \((b)\) uses the contraction property, the second term uses the defining property of \(π_k\), namely \(\NORM{\mathcal B_{π_k} V_k - \mathcal B V_k} \le ε\), and the third term uses the fact that \(V^*\) is the fixed point of \(\mathcal B\), so \(\NORM{\mathcal B V_k - V^*} = \NORM{\mathcal B V_k - \mathcal B V^*} \le γ \NORM{V_k - V^*}\). The inequality in \((c)\) uses the bound on \(\NORM{V_k - V^*}\) from the first part, and therefore holds asymptotically as \(k \to ∞\).

Now, from the discounting property of the Bellman operator (Proposition 12.4), \(\NORM{\mathcal B_{π_k} V^* - V^*} \le m\) implies \[ \NORM{V_{π_k} - V^*} \le \frac{m}{(1-γ)} = \frac{ε}{(1-γ)} + \frac{2γδ}{(1-γ)^2}, \] which proves the second part.
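The iteration analyzed in Theorem 23.1 can be summarized by the following minimal sketch. It assumes a user-supplied routine `approx_bellman(V)` (for instance, the simulation-based update sketched earlier) that returns a pair \((W, π)\) satisfying the stated \(δ\) and \(ε\) bounds; the function name and signature are illustrative.

```python
def approximate_value_iteration(approx_bellman, V0, num_iterations=200):
    """Approximate value iteration as in Theorem 23.1.

    `approx_bellman(V)` is assumed to return a pair (W, pi) with
    ||W - B V|| <= delta and ||B_pi V - B V|| <= eps (sup norm).
    """
    V, policies = V0, []
    for _ in range(num_iterations):
        V, pi = approx_bellman(V)   # V_{k+1} and pi_k
        policies.append(pi)
    return V, policies

# Theorem 23.1 then guarantees, asymptotically in k,
#   ||V_k - V*||      <= delta / (1 - gamma)
#   ||V_{pi_k} - V*|| <= eps / (1 - gamma) + 2 * gamma * delta / (1 - gamma)**2
```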

23.2 Approximate policy iteration

Before stating the approximate policy iteration algorithm, we state a preliminary result that serves as the main step in proving the error bounds for approximate policy iteration.

Proposition 23.1 Suppose \(V\), \(π\), and \(h\) satisfy \[ \NORM{V - V_π} \le δ \quad\text{and}\quad \NORM{\mathcal B_h V - \mathcal B V} \le ε. \] Then, \[ \NORM{V_h - V^*} \le γ \NORM{V_π - V^*} + \frac{ε + 2γδ}{(1-γ)}. \]

From the bounds on \(V\), \(π\), and \(h\) and the discounting property of the Bellman operator (Proposition 12.4), we have that \[\begin{equation}\label{eq:P1} \mathcal B_h V_π \le \mathcal B_h V + γδ \le \mathcal B V + ε + γ δ. \end{equation}\]

Again, from the bounds on \(V\) and \(π\) and the discounting property of the Bellman operator, we have \(\mathcal B V \le \mathcal B V_π + γδ\). Thus, \[\begin{equation}\label{eq:P2} \mathcal B_h V_π \le \mathcal B V_π + ε + 2γδ \end{equation}\] For ease of notation, let \(m := ε + 2γδ\).

Moreover, from the definition of the Bellman operator \[ \mathcal B V_π \le \mathcal B_π V_π = V_π.\] Substituting the above in \eqref{eq:P2}, we get that \[ \mathcal B_h V_π \le V_π + m. \] Therefore, by the discounting property of the Bellman operator, we get \[\begin{equation}\label{eq:P3} V_h \le V_π + \frac{m}{(1-γ)}. \end{equation}\]

Using \eqref{eq:P3} and the discounting property, we get that \[V_h = \mathcal B_h V_h = \mathcal B_h V_π + \big( \mathcal B_h V_h - \mathcal B_h V_π \big) \le \mathcal B_h V_π + γ \frac{m}{(1-γ)}. \]

Subtracting \(V^*\) from both sides we get \[\begin{align*} V_h - V^* &\le \mathcal B_h V_π - V^* + \frac{mγ}{(1-γ)} \\ &\stackrel{(a)}\le \mathcal B V_π + m - V^* + \frac{mγ}{(1-γ)} \\ &\stackrel{(b)}= \mathcal B V_π - \mathcal B V^* + \frac{m}{(1-γ)} \\ &\stackrel{(c)}{\le} γ \NORM{V_π - V^*} + \frac{m}{(1-γ)}, \end{align*}\] where \((a)\) follows from \eqref{eq:P2}, \((b)\) uses the fact that \(V^*\) is the fixed point of \(\mathcal B\), and \((c)\) uses the contraction property. Since \(V^*\) is the optimal value function, \(V_h \ge V^*\), so the above one-sided bound implies the bound on \(\NORM{V_h - V^*}\). Substituting the value of \(m\) gives the result.

Theorem 23.2 Generate a sequence \(\{π_k\}_{k \ge 0}\) and \(\{V_k\}_{k \ge 0}\) such that \[ \NORM{V_k - V_{π_k}} \le δ \quad\text{and}\quad \NORM{\mathcal B_{π_{k+1}} V_k - \mathcal B V_k} \le ε.\] Then, \[ \limsup_{k\to ∞} \NORM{V_{π_k} - V^*} \le \frac{ε+2γδ}{(1-γ)^2}. \]

Remark
  • Both approximate VI and approximate PI have similar error bounds, proportional to \(1/(1-γ)^2\) (see the numerical illustration below).
  • When \(ε = δ = 0\), Proposition 23.1 implies that \(\NORM{V_{π_{k+1}} - V^*} \le γ \NORM{V_{π_k} - V^*}\). Thus, standard policy iteration has a geometric rate of convergence (similar to value iteration), though in practice it converges much faster.
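To get a sense of the magnitude of these bounds, consider an illustrative choice of \(γ = 0.9\) and \(ε = δ = 0.01\). The bound of Theorem 23.2 then evaluates to \[ \frac{ε + 2γδ}{(1-γ)^2} = \frac{0.01 + 2 \times 0.9 \times 0.01}{(0.1)^2} = 2.8, \] so per-step errors of the order of \(10^{-2}\) may translate into a value error of order one; the guarantee degrades rapidly as \(γ \to 1\).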

To prove Theorem 23.2, apply Proposition 23.1 with \(V = V_k\), \(π = π_k\), and \(h = π_{k+1}\) to get

\[ \NORM{V_{π_{k+1}} - V^*} \le γ \NORM{V_{π_k} - V^*} + \frac{ε + 2γδ}{(1-γ)}.\]

Letting \(e_k := \NORM{V_{π_k} - V^*}\) and unrolling the recursion \(e_{k+1} \le γ e_k + \frac{ε + 2γδ}{(1-γ)}\) gives \[ e_k \le γ^k e_0 + \left(\frac{1 - γ^k}{1-γ}\right) \frac{ε + 2γδ}{(1-γ)}. \] The result follows from taking the limit superior as \(k \to ∞\).
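The iteration analyzed in Theorem 23.2 alternates approximate policy evaluation and approximate policy improvement. The following minimal sketch assumes two user-supplied routines, `approx_policy_evaluation(pi)` returning a \(V_k\) with \(\NORM{V_k - V_{π_k}} \le δ\) (e.g., via simulation or a least-squares fit) and `approx_greedy(V)` returning a \(π_{k+1}\) with \(\NORM{\mathcal B_{π_{k+1}} V_k - \mathcal B V_k} \le ε\); both names are illustrative, not part of any library.

```python
def approximate_policy_iteration(approx_policy_evaluation, approx_greedy,
                                 pi0, num_iterations=50):
    """Approximate policy iteration as in Theorem 23.2.

    `approx_policy_evaluation(pi)` is assumed to return V_k with
    ||V_k - V_{pi_k}|| <= delta; `approx_greedy(V)` is assumed to return
    pi_{k+1} with ||B_{pi_{k+1}} V_k - B V_k|| <= eps (sup norm).
    """
    pi = pi0
    for _ in range(num_iterations):
        V = approx_policy_evaluation(pi)   # approximate evaluation of pi_k
        pi = approx_greedy(V)              # approximate improvement: pi_{k+1}
    return pi

# Theorem 23.2 then guarantees
#   limsup_k ||V_{pi_k} - V*|| <= (eps + 2 * gamma * delta) / (1 - gamma)**2
```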

Proposition 23.2 If the successive policies in approximate policy iteration converge (in general, they may not), i.e., \[ π_{k+1} = π_k = π \quad \text{for some $k$}, \] then \[ \NORM{V_π - V^*} \le \frac{ε + 2γδ}{(1-γ)}. \]

Let \(V\) be the approximate value function obtained at iteration \(k\), i.e., \[\NORM{V - V_π} \le δ \quad\text{and}\quad \NORM{\mathcal B_π V - \mathcal B V} \le ε.\]

Then, from the triangle inequality, we have \[\begin{align*} \NORM{\mathcal B V_π - V_π } &\le \NORM{\mathcal B V_π - \mathcal B V} + \NORM{\mathcal B V - \mathcal B_π V} + \NORM{\mathcal B_π V - \mathcal B_π V_π} \\ &\stackrel{(a)}\le γ\NORM{V_π - V} + ε + γ \NORM{V - V_π} \\ &\stackrel{(b)}\le ε + 2γδ, \end{align*}\] where the decomposition uses the fact that \(V_π = \mathcal B_π V_π\), the inequality in \((a)\) uses the contraction property for the first and third terms and the assumption \(\NORM{\mathcal B V - \mathcal B_π V} \le ε\) for the second term, and \((b)\) follows from the assumption \(\NORM{V - V_π} \le δ\). Now, from the discounting property, we get the result.

Proposition 23.3 Suppose the successive value functions obtained by approximate policy iteration are “not too different”, i.e., \[ \NORM{V - V_π} \le δ, \quad \NORM{\mathcal B_h V - \mathcal B V} \le ε, \quad\text{and}\quad \NORM{\mathcal B_π V - \mathcal B_h V} \le ζ.\] Then, \[ \NORM{V_π - V^*} \le \frac{ε + ζ + 2γδ}{(1-γ)}. \]

The result follows from the proof of Proposition 23.2 by replacing \(ε\) in step \((a)\) with \(ε + ζ\), since \(\NORM{\mathcal B V - \mathcal B_π V} \le \NORM{\mathcal B V - \mathcal B_h V} + \NORM{\mathcal B_h V - \mathcal B_π V} \le ε + ζ\).

Notes

The results presented in this section are taken from Bertsekas (2013).