16 MDP algorithms
There are three main algorithms for solving infinite horizon MDPs: value iteration, policy iteration, and linear programming. In this section, we present value iteration and policy iteration (and their combination, called optimistic policy iteration). The linear programming approach is described in a later section.
16.1 Value Iteration Algorithm
By construction, \[ \begin{align*} \BELLMAN^* V^{(k)} - V^{(k)} &= \BELLMAN^* V^{(k)} - \BELLMAN^* V^{(k-1)} \\ & \le \BELLMAN^{π^{(k-1)}} V^{(k)} - \BELLMAN^{π^{(k-1)}} V^{(k-1)}\\ & = γ P_{π^{(k-1)}}[ V^{(k)} - V^{(k-1)} ] \\ &\le (1-γ) \bar δ^{(k)} \ONES, \end{align*} \] where the first inequality uses \(\BELLMAN^* V^{(k)} \le \BELLMAN^{π^{(k-1)}} V^{(k)}\) and \(\BELLMAN^* V^{(k-1)} = \BELLMAN^{π^{(k-1)}} V^{(k-1)}\). Thus, by Proposition 15.7, we have \[\begin{equation} \label{eq:VI-1} V^\star \le V^{(k)} + \bar δ^{(k)} \ONES. \end{equation}\] Note that \(\BELLMAN^* V^{(k)} = \BELLMAN^{π^{(k)}} V^{(k)}\). So, we have also shown that \(\BELLMAN^{π^{(k)}} V^{(k)} - V^{(k)} \le (1-γ) \bar δ^{(k)} \ONES\). Thus, again by Proposition 15.7, we have \[\begin{equation}\label{eq:VI-2} V^{π^{(k)}} \le V^{(k)} + \bar δ^{(k)} \ONES. \end{equation}\]
By a similar argument, we can show \[\begin{equation}\label{eq:VI-3} V^\star \ge V^{(k)} + \underline δ^{(k)} \ONES \quad\text{and}\quad V^{π^{(k)}} \ge V^{(k)} + \underline δ^{(k)} \ONES. \end{equation}\]
Eq. \eqref{eq:VI-1} and \eqref{eq:VI-3} imply the first relationship of the result. To establish the second relationship, note that by the triangle inequality \[ \max\{ V^{π^{(k)}} - V^\star \} \le \max\{ V^{π^{(k)}} - V^{(k)} \} + \max\{ V^{(k)} - V^\star \} \le \bar δ^{(k)} - \underline δ^{(k)}. \] Similarly, \[ \max\{ V^\star - V^{π^{(k)}} \} \le \max \{ V^\star - V^{(k)} \} + \max\{ V^{(k)} - V^{π^{(k)}} \} \le \bar δ^{(k)} - \underline δ^{(k)}. \] Combining the above two equations, we get the second relationship of the result.
Some remarks
An implication of Proposition 15.5 is that if the initial estimate \(V^{(0)} < V^\star\), then \(\{V^{(k)}\}_{k \ge 0}\) is a monotone increasing sequence that converges to \(V^\star\). Similarly, if \(V^{(0)} > V^\star\), then \(\{V^{(k)}\}_{k \ge 0}\) is a monotone decreasing sequence that converges to \(V^\star\).
By the same argument, if at some stage \(V^{(k)} \ge V^{(k+1)}\), then for all \(\ell \ge k\), we have \(V^{(\ell)} \ge V^{(\ell+1)}\). Similar behavior will be observed if the inequalities are reversed.
16.2 Policy Iteration Algorithm
Lemma 16.1 (Policy improvement) Suppose \(V^π\) is the fixed point of \(\BELLMAN^π\) and \(μ = \GREEDY(V^{π})\). Then, \[V^{μ}(s) \le V^π(s), \quad \forall s \in \ALPHABET S. \] Moreover, if \(π\) is not optimal, then at least one inequality is strict.
Since \(μ = \GREEDY(V^π)\), we have \[ V^π = \BELLMAN^π V^π \ge \BELLMAN^* V^π = \BELLMAN^{μ} V^π.\] By monotonicity of \(\BELLMAN^μ\), applying it repeatedly gives \[ V^π \ge \BELLMAN^μ V^π \ge (\BELLMAN^μ)^2 V^π \ge \cdots \ge \lim_{\ell \to \infty} (\BELLMAN^μ)^\ell V^π = V^μ. \] Thus, \[ V^π \ge V^{μ}. \]
Finally, suppose \(V^μ = V^π\). Then, \[ V^μ = \BELLMAN^μ V^μ = \BELLMAN^μ V^π = \BELLMAN^* V^π = \BELLMAN^* V^μ. \] Thus, \(V^μ\) (and \(V^π\)) is the unique fixed point of \(\BELLMAN^*\). Hence \(μ\) and \(π\) are optimal.
The policy improvement lemma (Lemma 16.1) implies that \(V^{(k-1)} \ge V^{(k)}\). Since the state and action spaces are finite, there are only a finite number of policies, and the value function improves at each step; so the process must converge in a finite number of iterations. At convergence, \(V^{(k)} = V^{(k-1)}\), and the policy improvement lemma implies that the corresponding policies \(π^{(k)}\) and \(π^{(k-1)}\) are optimal.
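The loop just described, alternate policy evaluation and policy improvement until the policy stops changing, is short to implement. A minimal sketch, assuming (for illustration) a finite cost-minimizing MDP encoded as an `(A, S, S)` transition array `P` and an `(S, A)` cost array `c`; policy evaluation solves the linear system \((I - γ P_π) V^π = c_π\) exactly:

```python
import numpy as np

def policy_iteration(P, c, gamma):
    """Policy iteration for a finite, cost-minimizing MDP (illustrative layout).

    P: (A, S, S) array of transition matrices; c: (S, A) array of costs.
    """
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)           # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = c_pi exactly.
        P_pi = P[pi, np.arange(S), :]     # row s is P(. | s, pi(s))
        c_pi = c[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
        # Policy improvement: greedy (minimizing) policy w.r.t. V.
        Q = c + gamma * np.stack([P[a] @ V for a in range(A)], axis=1)
        pi_new = Q.argmin(axis=1)
        if np.array_equal(pi_new, pi):    # no change: pi is optimal
            return V, pi
        pi = pi_new
```

Since `argmin` breaks ties by index, the greedy policy is a deterministic function of \(V\); combined with the finite number of policies and the improvement at each step, the termination test cannot cycle.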
16.2.1 Policy iteration as Newton-Raphson algorithm
Recall that the main idea behind Newton-Raphson is as follows. Suppose we want to solve a fixed point equation \(V = \BELLMAN^* V\) and we have an approximate solution \(V^{(k)}\). Then we can search for an improved solution \(V^{(k+1)} = V^{(k)} + Δ^{(k)}\) by setting \[\begin{equation} \label{eq:NR} V^{(k)} + Δ^{(k)} = \BELLMAN^*( V^{(k)} + Δ^{(k)} ), \end{equation} \] expanding the right-hand side to first order in \(Δ^{(k)}\), and solving the resulting linear equation for \(Δ^{(k)}\).
Now, let’s apply this idea to find the fixed point of the Bellman equation. Suppose we have a guess \(V^{(k)}\) and \(\BELLMAN^* V^{(k)} = \BELLMAN^{π^{(k+1)}} V^{(k)}\). Because the choice of control action \(a\) is optimized out in \(\BELLMAN^*\), the variation of \(a\) induced by the variation \(Δ^{(k)}\) of \(V^{(k)}\) has no first-order effect on the value of \(\BELLMAN^*(V^{(k)} + Δ^{(k)})\). Therefore, \[ \BELLMAN^*(V^{(k)} + Δ^{(k)}) = \BELLMAN^{π^{(k+1)}}(V^{(k)} + Δ^{(k)}) + o(Δ^{(k)}). \] It follows that the linearized version of \eqref{eq:NR} is just \[ V^{(k+1)} = \BELLMAN^{π^{(k+1)}} V^{(k+1)}. \] That is, \(V^{(k+1)}\) is the value function of the policy \(π^{(k+1)}\), where \(π^{(k+1)}\) was deduced from the value function \(V^{(k)}\) exactly by the policy improvement procedure. Therefore, we can conclude the following.
Theorem 16.1 The policy improvement algorithm is equivalent to the application of Newton-Raphson algorithm to the fixed point equation \(V = \BELLMAN^* V\) of dynamic programming.
The equivalence between policy iteration and Newton-Raphson partially explains why policy iteration converges in a few iterations.
16.3 Optimistic Policy Iteration
Note that if \(\ell_k = 1\), optimistic policy iteration is equivalent to value iteration, and if \(\ell_k = \infty\), optimistic policy iteration is equivalent to policy iteration.
In the remainder of this section, we state the modifications of the main results to establish the convergence bounds for optimistic policy iteration.
Proposition 16.1 For any \(V \in \reals^n\) and \(m \in \reals_{\ge 0}\):
- If \(V + m \ONES \ge \BELLMAN^* V = \BELLMAN^π V\), then for any \(\ell \in \integers_{> 0}\), \[ \BELLMAN^* V + \frac{γ}{1 - γ} m \ONES \ge (\BELLMAN^π)^\ell V \] and \[ (\BELLMAN^π)^\ell V + γ^\ell m \ONES \ge \BELLMAN^*( (\BELLMAN^π)^\ell V). \]
The proof is left as an exercise.
Proposition 16.2 Let \(\{(V^{(k)}, π^{(k)})\}_{k \ge 0}\) be generated as per the optimistic policy iteration algorithm. Define \[ \alpha_k = \begin{cases} 1, & \text{if } k = 0 \\ γ^{\ell_0 + \ell_1 + \dots + \ell_{k-1}}, & \text{if } k > 0. \end{cases}\] Suppose there exists an \(m \in \reals\) such that \[ \| V^{(0)} - \BELLMAN^* V^{(0)} \| \le m. \] Then, for all \(k \ge 0\), \[ \BELLMAN^* V^{(k+1)} - \alpha_{k+1} m \ONES \le V^{(k+1)} \le \BELLMAN^* V^{(k)} + \frac{γ}{1-γ} \alpha_k m \ONES. \] Moreover, \[ V^{(k)} - \frac{(k+1) γ^k}{1-γ} m \ONES \le V^\star \le V^{(k)} + \frac{\alpha_k}{1 - γ} m \ONES \le V^{(k)} + \frac{γ^k}{1 - γ} m \ONES. \]
Exercises
Exercise 16.1 Show that the error bound for value iteration is monotone with the number of iterations, i.e., \[ V^{(k)} + \underline δ^{(k)} \ONES \le V^{(k+1)} + \underline δ^{(k+1)} \ONES \le V^\star \le V^{(k+1)} + \bar δ^{(k+1)} \ONES \le V^{(k)} + \bar δ^{(k)} \ONES. \]
Exercise 16.2 Consider the infinite horizon discounted version of the inventory management problem with real valued state space. Assume that the demand is exponential with rate \(1/μ\). Assume that the parameters of the model are: \[ c_h = 1, \quad c_s = 4, \quad p = 1, \quad γ = 0.9, \quad μ = 5. \]
To numerically solve the problem, we truncate the state space to \([-B, B]\), the post-decision state space to \([-B, 2B]\), and the action space to \([0, 2B]\). Thus, we consider the truncated dynamics \[ Z_t = [S_t + A_t]_{-B}^{2B}, \quad S_{t+1} = [Z_t - W_t]_{-B}^B. \] We then discretize all the spaces using a uniform grid with spacing \(δ\). Pick \[ B = 10, \quad δ = 0.1. \]
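The truncation and discretization can be set up as follows. The grids and clipped dynamics come from the exercise; the choices to truncate the demand at \(4μ\) and to assign each grid cell the probability mass of \([w, w+δ)\) are our own illustrative choices, and the cost function itself is left to the reader.

```python
import numpy as np

# Parameters from the exercise (the cost function is not constructed here).
B, delta = 10, 0.1
gamma, mu = 0.9, 5.0
c_h, c_s, p = 1.0, 4.0, 1.0

# Uniform grids for the truncated spaces.
S_grid = np.arange(-B, B + delta / 2, delta)       # states in [-B, B]
Z_grid = np.arange(-B, 2 * B + delta / 2, delta)   # post-decision in [-B, 2B]
A_grid = np.arange(0, 2 * B + delta / 2, delta)    # actions in [0, 2B]

# Discretized exponential demand (mean mu): probability of each cell
# [w, w + delta), with the tail beyond 4*mu lumped into the last cell.
W_grid = np.arange(0, 4 * mu, delta)
cdf = 1 - np.exp(-W_grid / mu)
pmf = np.diff(np.append(cdf, 1.0))                 # sums to 1 by construction

def step(s, a, w):
    """Truncated dynamics: clip the post-decision state and the next state."""
    z = np.clip(s + a, -B, 2 * B)
    return np.clip(z - w, -B, B)
```

With these grids and the demand pmf in hand, the transition matrices of the discretized model can be assembled and passed to any of the algorithms in this section.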
Solve the truncated and discretized model using value iteration until the span seminorm of the difference between successive iterates is less than \(10^{-6}\). Benchmark the number of iterations, memory usage, and the time needed to compute the solution.
Solve the truncated and discretized model using policy iteration. Benchmark the number of iterations, memory usage, and the time needed to compute the solution.
Solve the truncated and discretized model using optimistic policy iteration. Pick \(\ell = 20\) and stop the algorithm when the span seminorm of \(V^{(k+1)} - V^{(k)}\) is less than \(10^{-6}\). Benchmark the number of iterations, memory usage, and the time needed to compute the solution.
Solve the truncated and discretized model using linear programming. Benchmark the memory usage and time needed to compute the solution.
Notes
The value iteration algorithm is due to Bellman (1957). The policy improvement and policy iteration algorithms are due to Howard (1960). The equivalence of policy improvement and the Newton-Raphson algorithm was demonstrated in the LQ case by Whittle and Komarova (1988), where it holds in an even tighter sense.