11 MDP algorithms
11.1 Value Iteration Algorithm
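As a concrete illustration, the following is a minimal sketch of value iteration on a small finite MDP. The numbers and the three-state, two-action model are hypothetical; we follow the cost-minimization convention used in this chapter (so the Bellman operator takes a minimum over actions).

```python
import numpy as np

# A small hypothetical finite MDP: 3 states, 2 actions, discount gamma.
gamma = 0.9
costs = np.array([[1.0, 2.0],      # c(s, a): rows are states, columns are actions
                  [0.5, 1.5],
                  [2.0, 0.2]])
# P[a][s, s'] = probability of moving from s to s' under action a
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],
    [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]],
])

def bellman(V):
    """Bellman operator B: (BV)(s) = min_a [ c(s,a) + gamma * E[V(s') | s, a] ]."""
    Q = costs + gamma * np.einsum('asn,n->sa', P, V)
    return Q.min(axis=1)

# Value iteration: repeatedly apply B until the sup-norm change is negligible.
V = np.zeros(3)
for _ in range(500):
    V_new = bellman(V)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new
print(V)   # approximate fixed point of B, i.e. the optimal value function
```

Since \(\mathcal B\) is a \(γ\)-contraction in the sup norm, the iterates converge geometrically to the unique fixed point \(V^*\) from any initial guess.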
11.2 Policy Iteration Algorithm
Lemma 11.1 (Policy improvement) Suppose \(V_π\) is the fixed point of \(\mathcal B_π\) and \[\mathcal B_{μ} V_π = \mathcal B V_π. \] Then, \[V_{μ}(s) \le V_π(s), \quad \forall s \in \ALPHABET S. \] Moreover, if \(π\) is not optimal, then at least one inequality is strict.
The policy improvement lemma (Lemma 11.1) implies that \(V_{k-1} \ge V_k\). Since the state and action spaces are finite, there are only finitely many policies. The value function improves at each step, so the process must converge in a finite number of iterations. At convergence, \(V_k = V_{k-1}\), and the policy improvement lemma implies that the corresponding policies \(π_k\) and \(π_{k-1}\) are optimal.
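The evaluate-then-improve loop can be sketched as follows on a small hypothetical MDP (same illustrative model as in the value iteration sketch; all numbers are made up). Policy evaluation is done exactly by solving the linear system \(V = \mathcal B_π V\).

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP with cost minimization.
gamma = 0.9
costs = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.2]])
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],
    [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]],
])

def evaluate(pi):
    """Policy evaluation: solve V = B_pi V, i.e. (I - gamma P_pi) V = c_pi."""
    P_pi = P[pi, np.arange(3)]          # P_pi[s, s'] = P[pi(s)][s, s']
    c_pi = costs[np.arange(3), pi]      # c_pi[s] = c(s, pi(s))
    return np.linalg.solve(np.eye(3) - gamma * P_pi, c_pi)

def improve(V):
    """Policy improvement: greedy policy pi(s) = argmin_a [ c(s,a) + gamma E[V(s')] ]."""
    Q = costs + gamma * np.einsum('asn,n->sa', P, V)
    return Q.argmin(axis=1)

# Alternate evaluation and improvement until the policy stops changing.
pi = np.zeros(3, dtype=int)
for _ in range(100):                    # finitely many policies, so this terminates
    V = evaluate(pi)
    pi_new = improve(V)
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new
print(pi, V)
```

At termination the greedy policy with respect to \(V\) is \(π\) itself, so \(V\) satisfies \(\mathcal B V = \mathcal B_π V = V\) and is therefore optimal.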
11.2.1 Policy iteration as Newton-Raphson algorithm
Recall that the main idea behind Newton-Raphson is as follows. Suppose we want to solve a fixed point equation \(V = \mathcal B V\) and we have an approximate solution \(V_k\). Then we can search for an improved solution \(V_{k+1} = V_k + Δ_k\) by setting \[\begin{equation} \label{eq:NR} V_k + Δ_k = \mathcal{B}( V_k + Δ_k ), \end{equation} \] expanding the right-hand side up to first-order terms in \(Δ_k\), and solving the resulting linear equation for \(Δ_k\).
Now, let’s try to apply this idea to find the fixed point of the Bellman equation. Suppose we have identified a guess \(V_k\) and \(\mathcal B V_k = \mathcal B_{π_{k+1}} V_k\). Because the choice of control action \(a\) is optimized out in \(\mathcal B\), the variation of \(a\) induced by the variation \(Δ_k\) of \(V_k\) has no first-order effect on the value of \(\mathcal B(V_k + Δ_k)\). Therefore, \[ \mathcal{B}(V_k + Δ_k) = \mathcal B_{π_{k+1}}(V_k + Δ_k) + o(Δ_k). \] It follows that the linearized version of \eqref{eq:NR} is just \[ V_{k+1} = \mathcal B_{π_{k+1}} V_{k+1}. \] That is, \(V_{k+1}\) is just the value function for the policy \(π_{k+1}\), where \(π_{k+1}\) was deduced from the value function \(V_k\) exactly by the policy improvement procedure. Therefore, we can conclude the following.
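This equivalence can be checked numerically. Writing the fixed point equation as \(F(V) = V - \mathcal B V = 0\), the Jacobian of \(\mathcal B_{π_{k+1}}\) is \(γ P_{π_{k+1}}\) (and the greedy action is locally constant), so the Newton step solves \((I - γ P_{π_{k+1}}) Δ_k = \mathcal B V_k - V_k\). The sketch below, on a randomly generated hypothetical MDP, verifies that the Newton update coincides with exact policy evaluation of the greedy policy \(π_{k+1}\).

```python
import numpy as np

gamma, n = 0.9, 3
rng = np.random.default_rng(0)
costs = rng.random((n, 2))                 # random hypothetical costs c(s, a)
P = rng.random((2, n, n))
P /= P.sum(axis=2, keepdims=True)          # normalize rows into probabilities

def bellman(V):
    return (costs + gamma * np.einsum('asn,n->sa', P, V)).min(axis=1)

V_k = np.zeros(n)
# Greedy policy pi_{k+1} at V_k, i.e. B V_k = B_pi V_k
pi = (costs + gamma * np.einsum('asn,n->sa', P, V_k)).argmin(axis=1)
P_pi = P[pi, np.arange(n)]
c_pi = costs[np.arange(n), pi]

# Newton step for F(V) = V - BV: (I - gamma P_pi) Delta = B V_k - V_k
Delta = np.linalg.solve(np.eye(n) - gamma * P_pi, bellman(V_k) - V_k)
V_newton = V_k + Delta

# Exact policy evaluation of pi_{k+1}: solve V = B_pi V
V_eval = np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)

print(np.max(np.abs(V_newton - V_eval)))   # the two updates coincide
```

Algebraically, \(V_k + (I - γP_π)^{-1}(c_π + γP_π V_k - V_k) = (I - γP_π)^{-1} c_π\), which is exactly the policy evaluation of \(π_{k+1}\).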
Theorem 11.1 The policy improvement algorithm is equivalent to the application of Newton-Raphson algorithm to the fixed point equation \(V = \mathcal B V\) of dynamic programming.
The equivalence between policy iteration and Newton-Raphson partly explains why policy iteration typically converges in a few iterations.
11.3 Optimistic Policy Iteration
Note that if \(\ell_k = 1\), then optimistic policy iteration is equivalent to value iteration, and if \(\ell_k = \infty\), then optimistic policy iteration is equivalent to policy iteration.
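A minimal sketch of optimistic policy iteration, on the same hypothetical MDP used earlier: each outer iteration computes the greedy policy and then applies \(\ell\) steps of \(\mathcal B_π\) (partial policy evaluation) instead of solving \(V = \mathcal B_π V\) exactly.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP; ell controls the partial-evaluation depth.
gamma = 0.9
costs = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.2]])
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],
    [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]],
])

def optimistic_pi(V0, ell, iters):
    V = V0.copy()
    for _ in range(iters):
        Q = costs + gamma * np.einsum('asn,n->sa', P, V)
        pi = Q.argmin(axis=1)                         # greedy policy: B_pi V = B V
        P_pi = P[pi, np.arange(3)]
        c_pi = costs[np.arange(3), pi]
        for _ in range(ell):                          # ell applications of B_pi
            V = c_pi + gamma * P_pi @ V
    return V

# ell = 1 recovers value iteration; larger ell approaches policy iteration.
V1 = optimistic_pi(np.zeros(3), ell=1, iters=500)
V5 = optimistic_pi(np.zeros(3), ell=5, iters=300)
print(np.max(np.abs(V1 - V5)))   # both converge to the same fixed point V*
```

Intermediate values of \(\ell\) trade the cost of an exact linear solve in policy iteration against the slower geometric convergence of value iteration.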
In the remainder of this section, we state the modifications of the main results to establish the convergence bounds for optimistic policy iteration.
Proposition 11.1 For any \(V \in \reals^n\) and \(m \in \reals_{\ge 0}\):
- If \(V + m \ONES \ge \mathcal B V = \mathcal B_π V\), then for any \(\ell \in \integers_{> 0}\), \[ \mathcal B V + \frac{γ}{1 - γ} m \ONES \ge \mathcal B_π^\ell V \] and \[ \mathcal B_π^\ell V + γ^\ell m \ONES \ge \mathcal B( \mathcal B_π^\ell V). \]
The proof is left as an exercise.
Proposition 11.2 Let \(\{(V_k, π_k)\}_{k \ge 0}\) be generated as per the optimistic policy iteration algorithm. Define \[ \alpha_k = \begin{cases} 1, & \text{if } k = 0 \\ γ^{\ell_0 + \ell_1 + \dots + \ell_{k-1}}, & \text{if } k > 0. \end{cases}\] Suppose there exists an \(m \in \reals\) such that \[ \| V_0 - \mathcal B V_0 \| \le m. \] Then, for all \(k \ge 0\), \[ \mathcal B V_{k+1} - \alpha_{k+1} m \le V_{k+1} \le \mathcal B V_k + \frac{γ}{1-γ} \alpha_k m. \] Moreover, \[ V_{k} - \frac{(k+1) γ^k}{1-γ} m \le V^* \le V_k + \frac{\alpha_k}{1 - γ} m \le V_k + \frac{γ^k}{1 - γ} m. \]
Exercises
Exercise 11.1 Show that the error bound for value iteration is monotone in the number of iterations, i.e., \[ V_k + \underline δ_k \ONES \le V_{k+1} + \underline δ_{k+1} \ONES \le V^* \le V_{k+1} + \bar δ_{k+1} \ONES \le V_k + \bar δ_k \ONES. \]
References
The techniques of value iteration and policy improvement were formalized by Howard (1960). The equivalence of policy improvement and the Newton-Raphson algorithm was demonstrated in the LQ case by Whittle and Komarova (1988), where it holds in a tighter sense.