16 MDP algorithms
There are three main algorithms for solving infinite horizon MDPs: value iteration, policy iteration, and linear programming. In this section, we present value iteration and policy iteration (and their combination, called optimistic policy iteration). The linear programming approach is described in a later section.
16.1 Value Iteration Algorithm
By construction, \[ \begin{align*} \BELLMAN^* V^{(k)} - V^{(k)} &= \BELLMAN^* V^{(k)} - \BELLMAN^* V^{(k-1)} \\ & \le \BELLMAN^{π^{(k-1)}} V^{(k)} - \BELLMAN^{π^{(k-1)}} V^{(k-1)}\\ & = γ P_{π^{(k-1)}}[ V^{(k)} - V^{(k-1)} ] \\ &\le (1-γ) \bar δ^{(k)} \ONES, \end{align*} \] where the first inequality uses \(\BELLMAN^* V^{(k)} \le \BELLMAN^{π^{(k-1)}} V^{(k)}\) and \(\BELLMAN^* V^{(k-1)} = \BELLMAN^{π^{(k-1)}} V^{(k-1)}\). Thus, by Proposition 15.7, we have \[\begin{equation} \label{eq:VI-1} V^\star \le V^{(k)} + \bar δ^{(k)} \ONES. \end{equation}\] Note that \(\BELLMAN^* V^{(k)} = \BELLMAN^{π^{(k)}} V^{(k)}\). So, we have also shown that \(\BELLMAN^{π^{(k)}} V^{(k)} - V^{(k)} \le (1-γ) \bar δ^{(k)} \ONES\). Thus, again by Proposition 15.7, we have \[\begin{equation}\label{eq:VI-2} V^{π^{(k)}} \le V^{(k)} + \bar δ^{(k)} \ONES. \end{equation}\]
By a similar argument, we can show \[\begin{equation}\label{eq:VI-3} V^\star \ge V^{(k)} + \underline δ^{(k)} \ONES \quad\text{and}\quad V^{π^{(k)}} \ge V^{(k)} + \underline δ^{(k)} \ONES. \end{equation}\]
Eq. \eqref{eq:VI-1} and \eqref{eq:VI-3} imply the first relationship of the result. To establish the second relationship, note that by the triangle inequality \[ \max\{ V^{π^{(k)}} - V^\star \} \le \max\{ V^{π^{(k)}} - V^{(k)} \} + \max\{ V^{(k)} - V^\star \} \le \bar δ^{(k)} - \underline δ^{(k)}. \] Similarly, \[ \max\{ V^\star - V^{π^{(k)}} \} \le \max \{ V^\star - V^{(k)} \} + \max\{ V^{(k)} - V^{π^{(k)}} \} \le \bar δ^{(k)} - \underline δ^{(k)}. \] Combining the above two equations, we get the second relationship of the result.
Some remarks
An implication of Proposition 15.5 is that if the initial estimate \(V^{(0)} < V^\star\), then \(\{V^{(k)}\}_{k \ge 0}\) is a monotone increasing sequence that converges to \(V^\star\). Similarly, if \(V^{(0)} > V^\star\), then \(\{V^{(k)}\}_{k \ge 0}\) is a monotone decreasing sequence that converges to \(V^\star\).
By the same argument, if at some stage \(V^{(k)} \ge V^{(k+1)}\), then for all \(\ell \ge k\), we have \(V^{(\ell)} \ge V^{(\ell+1)}\). Similar behavior will be observed if the inequalities are reversed.
16.2 Policy Iteration Algorithm
Lemma 16.1 (Policy improvement) Suppose \(V^π\) is the fixed point of \(\BELLMAN^π\) and \(μ = \GREEDY(V^{π})\). Then, \[V^{μ}(s) \le V^π(s), \quad \forall s \in \ALPHABET S. \] Moreover, if \(π\) is not optimal, then at least one inequality is strict.
Since \(μ = \GREEDY(V^π)\), we have \[ V^π = \BELLMAN^π V^π \ge \BELLMAN^* V^π = \BELLMAN^{μ} V^π.\] By monotonicity of \(\BELLMAN^μ\), applying it repeatedly gives \[ V^π \ge \BELLMAN^μ V^π \ge (\BELLMAN^μ)^2 V^π \ge \cdots \ge \lim_{\ell \to \infty} (\BELLMAN^μ)^\ell V^π = V^μ. \] Thus, \[ V^π \ge V^{μ}. \]
Finally, suppose \(V^μ = V^π\). Then, \[ V^μ = \BELLMAN^μ V^μ = \BELLMAN^μ V^π = \BELLMAN^* V^π = \BELLMAN^* V^μ. \] Thus, \(V^μ\) (and \(V^π\)) is the unique fixed point of \(\BELLMAN^*\). Hence \(μ\) and \(π\) are optimal.
The policy improvement lemma (Lemma 16.1) implies that \(V^{(k-1)} \ge V^{(k)}\). Since the state and action spaces are finite, there are only a finite number of policies, and the value function improves at each step; so the process must converge in a finite number of iterations. At convergence, \(V^{(k)} = V^{(k-1)}\), and the policy improvement lemma implies that the corresponding policies \(π^{(k)}\) and \(π^{(k-1)}\) are optimal.
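The loop just described, alternate policy evaluation and policy improvement until the policy stops changing, is short to implement. A minimal sketch, assuming (for illustration) a finite cost-minimizing MDP encoded as an `(A, S, S)` transition array `P` and an `(S, A)` cost array `c`; policy evaluation solves the linear system \((I - γ P_π) V^π = c_π\) exactly:

```python
import numpy as np

def policy_iteration(P, c, gamma):
    """Policy iteration for a finite, cost-minimizing MDP (illustrative layout).

    P: (A, S, S) array of transition matrices; c: (S, A) array of costs.
    """
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)           # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = c_pi exactly.
        P_pi = P[pi, np.arange(S), :]     # row s is P(. | s, pi(s))
        c_pi = c[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
        # Policy improvement: greedy (minimizing) policy w.r.t. V.
        Q = c + gamma * np.stack([P[a] @ V for a in range(A)], axis=1)
        pi_new = Q.argmin(axis=1)
        if np.array_equal(pi_new, pi):    # no change: pi is optimal
            return V, pi
        pi = pi_new
```

Since `argmin` breaks ties by index, the greedy policy is a deterministic function of \(V\); combined with the finite number of policies and the improvement at each step, the termination test cannot cycle.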
16.2.1 Policy iteration as Newton-Raphson algorithm
Recall that the main idea behind Newton-Raphson is as follows. Suppose we want to solve a fixed point equation \(V = \BELLMAN^* V\) and we have an approximate solution \(V^{(k)}\). Then we can search for an improved solution \(V^{(k+1)} = V^{(k)} + Δ^{(k)}\) by setting \[\begin{equation} \label{eq:NR} V^{(k)} + Δ^{(k)} = \BELLMAN^*( V^{(k)} + Δ^{(k)} ), \end{equation} \] expanding the right-hand side to first order in \(Δ^{(k)}\), and solving the resulting linear equation for \(Δ^{(k)}\).
Now, let’s apply this idea to find the fixed point of the Bellman equation. Suppose we have a guess \(V^{(k)}\) and \(\BELLMAN^* V^{(k)} = \BELLMAN^{π^{(k+1)}} V^{(k)}\). Because the choice of control action \(a\) is optimized out in \(\BELLMAN^*\), the variation of \(a\) induced by the variation \(Δ^{(k)}\) of \(V^{(k)}\) has no first-order effect on the value of \(\BELLMAN^*(V^{(k)} + Δ^{(k)})\). Therefore, \[ \BELLMAN^*(V^{(k)} + Δ^{(k)}) = \BELLMAN^{π^{(k+1)}}(V^{(k)} + Δ^{(k)}) + o(Δ^{(k)}). \] It follows that the linearized version of \eqref{eq:NR} is just \[ V^{(k+1)} = \BELLMAN^{π^{(k+1)}} V^{(k+1)}. \] That is, \(V^{(k+1)}\) is the value function of the policy \(π^{(k+1)}\), where \(π^{(k+1)}\) was deduced from the value function \(V^{(k)}\) exactly by the policy improvement procedure. Therefore, we can conclude the following.
Theorem 16.1 The policy improvement algorithm is equivalent to the application of Newton-Raphson algorithm to the fixed point equation \(V = \BELLMAN^* V\) of dynamic programming.
The equivalence between policy iteration and Newton-Raphson partially explains why policy iteration converges in a few iterations.
16.3 Optimistic Policy Iteration
Note that if \(\ell_k = 1\), optimistic policy iteration is equivalent to value iteration, and if \(\ell_k = \infty\), optimistic policy iteration is equivalent to policy iteration.
In the remainder of this section, we state the modifications of the main results to establish the convergence bounds for optimistic policy iteration.
Proposition 16.1 For any \(V \in \reals^n\) and \(m \in \reals_{\ge 0}\):
- If \(V + m \ONES \ge \BELLMAN^* V = \BELLMAN^π V\), then for any \(\ell \in \integers_{> 0}\), \[ \BELLMAN^* V + \frac{γ}{1 - γ} m \ONES \ge (\BELLMAN^π)^\ell V \] and \[ (\BELLMAN^π)^\ell V + γ^\ell m \ONES \ge \BELLMAN^*( (\BELLMAN^π)^\ell V). \]
The proof is left as an exercise.
Proposition 16.2 Let \(\{(V^{(k)}, π^{(k)})\}_{k \ge 0}\) be generated as per the optimistic policy iteration algorithm. Define \[ \alpha_k = \begin{cases} 1, & \text{if } k = 0 \\ γ^{\ell_0 + \ell_1 + \dots + \ell_{k-1}}, & \text{if } k > 0. \end{cases}\] Suppose there exists an \(m \in \reals\) such that \[ \| V^{(0)} - \BELLMAN^* V^{(0)} \| \le m. \] Then, for all \(k \ge 0\), \[ \BELLMAN^* V^{(k+1)} - \alpha_{k+1} m \ONES \le V^{(k+1)} \le \BELLMAN^* V^{(k)} + \frac{γ}{1-γ} \alpha_k m \ONES. \] Moreover, \[ V^{(k)} - \frac{(k+1) γ^k}{1-γ} m \ONES \le V^\star \le V^{(k)} + \frac{\alpha_k}{1 - γ} m \ONES \le V^{(k)} + \frac{γ^k}{1 - γ} m \ONES. \]
Exercises
Exercise 16.1 Show that the error bound for value iteration is monotone with the number of iterations, i.e., \[ V^{(k)} + \underline δ^{(k)} \ONES \le V^{(k+1)} + \underline δ^{(k+1)} \ONES \le V^\star \le V^{(k+1)} + \bar δ^{(k+1)} \ONES \le V^{(k)} + \bar δ^{(k)} \ONES. \]
Exercise 16.2 Consider the infinite horizon discounted version of the inventory management problem with real valued state space. Assume that the demand is exponential with rate \(1/μ\). Assume that the parameters of the model are: \[ c_h = 1, \quad c_s = 4, \quad p = 1, \quad γ = 0.9, \quad μ = 5. \]
To numerically solve the problem, we truncate the state space to \([-B, B]\), the post-decision state space to \([-B, 2B]\), and the action space to \([0, 2B]\). Thus, we consider the truncated dynamics \[ Z_t = [S_t + A_t]_{-B}^{2B}, \quad S_{t+1} = [Z_t - W_t]_{-B}^B. \] We then discretize all the spaces using a uniform grid with spacing \(δ\). Pick \[ B = 10, \quad δ = 0.1. \]
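The truncation and discretization can be set up as follows. The grids and clipped dynamics come from the exercise; the choices to truncate the demand at \(4μ\) and to assign each grid cell the probability mass of \([w, w+δ)\) are our own illustrative choices, and the cost function itself is left to the reader.

```python
import numpy as np

# Parameters from the exercise (the cost function is not constructed here).
B, delta = 10, 0.1
gamma, mu = 0.9, 5.0
c_h, c_s, p = 1.0, 4.0, 1.0

# Uniform grids for the truncated spaces.
S_grid = np.arange(-B, B + delta / 2, delta)       # states in [-B, B]
Z_grid = np.arange(-B, 2 * B + delta / 2, delta)   # post-decision in [-B, 2B]
A_grid = np.arange(0, 2 * B + delta / 2, delta)    # actions in [0, 2B]

# Discretized exponential demand (mean mu): probability of each cell
# [w, w + delta), with the tail beyond 4*mu lumped into the last cell.
W_grid = np.arange(0, 4 * mu, delta)
cdf = 1 - np.exp(-W_grid / mu)
pmf = np.diff(np.append(cdf, 1.0))                 # sums to 1 by construction

def step(s, a, w):
    """Truncated dynamics: clip the post-decision state and the next state."""
    z = np.clip(s + a, -B, 2 * B)
    return np.clip(z - w, -B, B)
```

With these grids and the demand pmf in hand, the transition matrices of the discretized model can be assembled and passed to any of the algorithms in this section.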
Solve the truncated and discretized model using value iteration until the span seminorm of the difference between successive iterates is less than \(10^{-6}\). Benchmark the number of iterations, memory usage, and the time needed to compute the solution.
Solve the truncated and discretized model using policy iteration. Benchmark the number of iterations, memory usage, and the time needed to compute the solution.
Solve the truncated and discretized model using optimistic policy iteration. Pick \(\ell = 20\) and stop the algorithm when the span seminorm of \(V^{(k+1)} - V^{(k)}\) is less than \(10^{-6}\). Benchmark the number of iterations, memory usage, and the time needed to compute the solution.
Solve the truncated and discretized model using linear programming. Benchmark the memory usage and time needed to compute the solution.
Notes
The value iteration algorithm is due to Bellman (1957). The policy improvement and policy iteration algorithms are due to Howard (1960). The equivalence of policy improvement and the Newton-Raphson algorithm was demonstrated in the LQ case by Whittle and Komarova (1988), where it holds in an even tighter sense.