14 Computational complexity of value iteration
How many computations are needed for the value or policy iteration algorithm to obtain a policy that is within \(ε\) of optimal? In this section, we answer this question for the value iteration algorithm.
Consider an MDP with a finite state space \(\ALPHABET S = \{1, \dots, n \}\) and a finite non-empty set of actions \(\ALPHABET A(s)\) available at each \(s \in \ALPHABET S\). Each action set consists of \(M_s\) actions, for a total of \(M = \sum_{s=1}^n M_s\) actions; this number can be interpreted as the total number of state-action pairs. Let \(γ \in (0, 1)\) be the discount factor. We present the value iteration algorithm from the notes on MDP algorithms to achieve a pre-specified accuracy \(ε\).
As established in the notes on MDP algorithms, the algorithm stops in a finite number of iterations and the resulting policy is \(ε\)-optimal.
14.1 Span norm contraction
Let \(\SPAN(v) = \max(v) - \min(v)\) denote the span semi-norm. We start by stating some basic properties of the span semi-norm:
1. \(\SPAN(v) \ge 0\) for all \(v \in \reals^n\).
2. \(\SPAN(v + w) \le \SPAN(v) + \SPAN(w)\) for all \(v, w \in \reals^n\).
3. \(\SPAN(m v) \le |m| \SPAN(v)\) for all \(v \in \reals^n\) and \(m \in \reals\).
4. \(\SPAN(v + m \mathbf{1}) = \SPAN(v)\) for all \(m \in \reals\).
5. \(\SPAN(v) = \SPAN(-v)\).
6. \(\SPAN(v) \le 2\|v\|\).
Properties 1–3 imply that \(\SPAN(v)\) is a semi-norm. However, it is not a norm because of property 4; that is, \(\SPAN(v) = 0\) does not imply that \(v = 0\). If \(\SPAN(v) = 0\), then \(v = m \mathbf{1}\) for some scalar \(m\).
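These properties are straightforward to verify numerically. Below is a minimal sketch in Python (assuming NumPy; the helper `span` is ours, not part of any library):

```python
import numpy as np

def span(v):
    """Span semi-norm: max(v) - min(v)."""
    return np.max(v) - np.min(v)

rng = np.random.default_rng(0)
v, w = rng.normal(size=4), rng.normal(size=4)
m = rng.normal()

assert span(v) >= 0                                    # property 1
assert span(v + w) <= span(v) + span(w) + 1e-12        # property 2
assert span(m * v) <= abs(m) * span(v) + 1e-12         # property 3
assert np.isclose(span(v + m * np.ones(4)), span(v))   # property 4
assert np.isclose(span(v), span(-v))                   # property 5
assert span(v) <= 2 * np.max(np.abs(v))                # property 6 (sup norm)
assert np.isclose(span(3.7 * np.ones(4)), 0.0)         # span vanishes on constants
```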
A basic result for our analysis is the following:
Proposition 14.1 Let \(v \in \reals^n\) and let \(P\) be a stochastic matrix of compatible dimensions (i.e., each row of \(P\) is a probability distribution on \(\ALPHABET S\)). Then, \[ \SPAN(P v) \le β_P \SPAN(v), \] where \[\begin{equation} \label{eq:span-matrix} β_P = 1 - \min_{s, s' \in \ALPHABET S} \sum_{z \in \ALPHABET S} \min\{ P_{sz}, P_{s'z} \}. \end{equation}\] Furthermore, \(β_P \in [0, 1]\) and there exists a \(v \in \reals^n\) such that \(\SPAN(Pv) = β_P \SPAN(v)\).
Proposition 14.1 illustrates the “averaging” property of a transition matrix: multiplying a vector by a transition matrix yields a vector whose components are more nearly equal. When \(P\) is a square matrix, the quantity \(β_P\) is called the ergodicity coefficient. It is often written in an alternative form by using the relation \(|a - b| = (a + b) - 2 \min(a,b)\): \[ β_P = \frac12 \max_{s,s' \in \ALPHABET S} \sum_{z \in \ALPHABET S} \bigl| P_{sz} - P_{s'z} \bigr|. \] The ergodicity coefficient is an upper bound on the second largest eigenvalue of \(P\). Moreover, \(β_P\) equals \(0\) if all rows of \(P\) are equal and equals \(1\) if at least two rows of \(P\) are orthogonal. From a different perspective, \(β_P < 1\) if for each pair of states there exists at least one state which they both can reach with positive probability in one step.
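The equivalence of the two expressions for \(β_P\) is easy to check numerically. A small sketch (the function names are ours, chosen for illustration):

```python
import numpy as np
from itertools import combinations

def beta_min_form(P):
    """β_P = 1 - min over row pairs of the overlap Σ_z min(P[s,z], P[s',z])."""
    return 1 - min(np.minimum(p, q).sum() for p, q in combinations(P, 2))

def beta_tv_form(P):
    """Equivalent form: half the largest total-variation distance between rows."""
    return 0.5 * max(np.abs(p - q).sum() for p, q in combinations(P, 2))

P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.5, 0.5]])
assert np.isclose(beta_min_form(P), beta_tv_form(P))
print(beta_min_form(P))   # 0.5: rows 1 and 3 overlap only in state 2
```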
Define the contraction factor \[\begin{equation}\label{eq:contraction} β = \max_{\substack{ s,s' \in \ALPHABET S \\ a \in \ALPHABET A(s), w \in \ALPHABET A(s')}} \biggl[ 1 - \sum_{z \in \ALPHABET S} \min\{ P(z | s,a), P(z | s', w) \} \biggr]. \end{equation}\] Note that \(β \in [0, 1]\).
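In code, \(β\) can be computed by treating every state-action pair as a row of the transition kernel. A sketch under the simplifying assumption that all actions are feasible at every state, so that the kernel can be stored as an \((|\ALPHABET A|, n, n)\) array:

```python
import numpy as np

def contraction_factor(P):
    """β from \eqref{eq:contraction}: worst case over all pairs of
    state-action rows of the kernel, stored as P[a, s, z]."""
    rows = P.reshape(-1, P.shape[-1])      # one row per (s, a) pair
    return 1 - min(np.minimum(p, q).sum() for p in rows for q in rows)

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(2, 4))   # 2 actions, 4 states
print(contraction_factor(P))
```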
Theorem 14.1 For any \(V_1, V_2 \in \reals^n\), \[ \SPAN(\mathcal B V_1 - \mathcal B V_2) \le γ β\, \SPAN(V_1 - V_2). \]
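Theorem 14.1 can be sanity-checked on random instances. A sketch assuming a reward-maximizing Bellman operator \((\mathcal B V)(s) = \max_{a} \bigl\{ r(s,a) + γ \sum_z P(z|s,a) V(z) \bigr\}\) with a common action set at every state (replace `max` with `min` for cost-minimizing models):

```python
import numpy as np

def span(v):
    return np.max(v) - np.min(v)

def bellman(V, P, r, gamma):
    """(BV)(s) = max_a [ r[s, a] + gamma * sum_z P[a, s, z] * V[z] ]."""
    return np.max(r + gamma * np.einsum('asz,z->sa', P, V), axis=1)

rng = np.random.default_rng(1)
n, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(n), size=(A, n))   # P[a, s] is a distribution over z
r = rng.normal(size=(n, A))

rows = P.reshape(-1, n)
beta = 1 - min(np.minimum(p, q).sum() for p in rows for q in rows)

for _ in range(1000):
    V1, V2 = rng.normal(size=n), rng.normal(size=n)
    lhs = span(bellman(V1, P, r, gamma) - bellman(V2, P, r, gamma))
    assert lhs <= gamma * beta * span(V1 - V2) + 1e-10
```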
14.2 Computational complexity of value iteration
We first rule out the case \(β = 0\). If \(β = 0\), then \(P(z | s, a) = P(z | s', w)\) for all \(s, s', z \in \ALPHABET S\), \(a \in \ALPHABET A(s)\), and \(w \in \ALPHABET A(s')\), which implies that all deterministic policies have the same transition probabilities. Therefore, a deterministic policy \(π\) is optimal if and only if it minimizes the one-step cost. Thus, the case \(β = 0\) is trivial, and in the rest of the analysis we assume that \(β \in (0, 1)\).
Theorem 14.2 Start with an arbitrary \(V_0\) and let \(Δ_1 = \SPAN(V_1 - V_0)\). If \(Δ_1 = 0\), then we obtain an optimal policy in iteration 1. Otherwise, for any \(ε > 0\), value iteration finds an \(ε\)-optimal policy in no more than \(K^*(γ)\) iterations, where \[\begin{equation}\label{eq:K*} K^*(γ) = \left\lceil \frac{ \log \frac{(1-γ) ε β}{Δ_1} } {\log(γβ)} \right\rceil. \end{equation}\] In addition, each iteration uses at most \(O(nM)\) operations.
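The bound \eqref{eq:K*} is simple to evaluate. A sketch, with the parameter values chosen purely for illustration:

```python
import math

def K_star(gamma, beta, eps, delta1):
    """Iteration bound from \eqref{eq:K*}."""
    return math.ceil(math.log((1 - gamma) * eps * beta / delta1)
                     / math.log(gamma * beta))

# For γ = 0.9, β = 1, ε = 1e-3, and Δ₁ = 1, about 88 iterations suffice.
print(K_star(gamma=0.9, beta=1.0, eps=1e-3, delta1=1.0))   # 88
```

Note that a smaller \(β\) reduces the bound through both the numerator and the contraction rate \(γβ\) in the denominator.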
14.3 The bound may be exact
We now present an example, due to Feinberg and He (2020), to show that the bound in Theorem 14.2 may be exact. Consider an MDP with \(\ALPHABET S = \{1, 2, 3\}\), \(\ALPHABET A = \{1,2 \}\), with \(\ALPHABET A(1) = \{1, 2\}\) and \(\ALPHABET A(2) = \ALPHABET A(3) = \{1\}\). The per-step transitions are \[ P(1) = \MATRIX{0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 }, \quad\text{and}\quad P(2) = \MATRIX{0 & 1 & 0 \\ * & * & * \\ * & * & *}, \] where \(*\) indicates that the corresponding action is infeasible. The reward matrix is \[ r = \MATRIX{1 & 0 \\ 1 & * \\ -1 & *}. \]
Suppose we start with an initial \(V_0 = \VEC(1, 2, -2)\). Then elementary calculations show that \[ V_k = \MATRIX{ γ^k \\ γ^k \\ -γ^k} + \MATRIX{ \sum_{\ell = {\color{red} 1}}^{k} γ^\ell \\ \sum_{\ell = 0}^{k} γ^\ell \\ \sum_{\ell = 0}^{k} γ^\ell }. \] Thus, \[ V_k - V_{k-1} = \MATRIX{ 2 γ^k - γ^{k-1} \\ 2 γ^k - γ^{k-1} \\ - 2 γ^k + γ^{k-1} }. \] Hence, \[ \SPAN(V_k - V_{k-1}) = 2γ^{k-1} |2γ - 1| = γ^{k-1} \SPAN(V_1 - V_0). \]
Thus, for this model, the expression \eqref{eq:K*} is exact.
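As a sanity check, the example can be simulated directly. Note that \(β = 1\) here, since the rows of \(P(1)\) for states 2 and 3 are orthogonal, so the contraction factor \(γβ\) equals \(γ\). A sketch (with \(γ = 0.9\) chosen for illustration) that prints the ratio \(\SPAN(V_k - V_{k-1}) / \SPAN(V_{k-1} - V_{k-2})\), which stays at \(γ\) exactly as derived above:

```python
import numpy as np

gamma = 0.9                                      # illustrative choice
V = np.array([1.0, 2.0, -2.0])                   # V_0 from the example

def update(V):
    """One value-iteration step for the three-state example."""
    return np.array([
        max(1 + gamma * V[2], 0 + gamma * V[1]),   # state 1: actions 1 and 2
        1 + gamma * V[1],                          # state 2: only action 1
        -1 + gamma * V[2],                         # state 3: only action 1
    ])

def span(v):
    return np.max(v) - np.min(v)

prev = None
for k in range(1, 8):
    V_next = update(V)
    s = span(V_next - V)
    if prev is not None:
        print(k, s / prev)                       # ratio is exactly γ = 0.9
    prev, V = s, V_next
```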
Notes
The discussion on span semi-norm and Theorem 14.1 is from Puterman (2014). Theorem 14.2 is from Feinberg and He (2020).