# ECSE 506: Stochastic Control and Decision Theory

Theory: Computational complexity of value iteration

How many computations are needed to run the value or policy iteration algorithm to obtain a policy that in within $$ε$$ of the optimal? In this section, we provide an answer for this question for the value iteration algorithm.

Conisder an MDP with a finite state space $$\ALPHABET S = \{1, \dots, n \}$$ with a finite non-empty set of actions $$\ALPHABET A(s)$$ available at each $$s \in \ALPHABET S$$. Each action set consists of $$M_s$$ actions with a total number of $$M = \sum_{s=1}^n M_s$$ actions. This number can be interpreted as the total number of state-action pairs. Let $$γ \in (0, 1)$$ be the discount factor. We present the value iteration algorithm from the notes on discounted MDP to achieve a pre-specified accuracy $$ε$$.

Value Iteration Algorithm

For given MDP, discount factor $$γ \in (0,1)$$, and constant $$ε > 0$$

1. Start with any $$V_0 \in \reals^n$$.
2. Recursively compute $$V_{k} = \mathcal B V_{k-1} = \mathcal B_{π_{k-1}} V_{k-1}.$$
3. Define $Δ_k = \max\{ V_k(s) - V_{k-1}(s) \} - \min \{ V_k(s) - V_{k-1}(s) \}.$
4. If $$Δ_k > \frac{1 - γ}{γ} ε$$, continue to step 2. Else carry out one more step of value update and choose the policy $$π_k$$.

As established in the notes on discounted MDP, the algorithm stops in a finite number of iterations and the resulting policy is $$ε$$-optimal.

# 1 Span norm contraction

Let $$\SPAN(v) = \max(v) - \min(v)$$ denotes the span semi-norm. We start by stating some basic properties of span semi-norm.

1. $$\SPAN(v) \ge 0$$ for all $$v \in \reals^n$$
2. $$\SPAN(v + w) \le \SPAN(v) + \SPAN(w)$$ for all $$v, w \in \reals^n$$.
3. $$\SPAN(m v) \le |m| \SPAN(v)$$ for all $$v \in \reals^n$$ and $$m \in \reals$$.
4. $$\SPAN(v + m \mathbf{1}) = \SPAN(v)$$ for all $$m \in \reals$$.
5. $$\SPAN(v) = \SPAN(-v)$$.
6. $$\SPAN(v) \le 2\|v\|$$.

Properties 1–3 imply that $$\SPAN(v)$$ is a semi-norm. However, it is not a norm because of property 4; that is, $$\SPAN(v) = 0$$ does not imply that $$v = 0$$. If $$\SPAN(v) = 0$$, then $$v = m \mathbf{1}$$ for some scalar $$m$$.

A basic result for our analysis is the following:

Prop. 1

Let $$v \in \reals^n$$ and $$P$$ be any matrix of compatible dimensions. Then, $\SPAN(P v) \le β_P \SPAN(v),$ where $$$\label{eq:span-matrix} β_P = 1 - \min_{s, s' \in \ALPHABET S} \sum_{z \in \ALPHABET S} \min\{ P_{sz}, P_{s'z} \}.$$$ Furhermore, $$β_P \in [0, 1]$$ and there exists a $$v \in \reals^n$$ such that $$\SPAN(Pv) = β_P \SPAN(v)$$.

Prop. 1 illustrates the “averaging” property of a transition matrix. By multiplying a vector by a transition matrix, the resulting vector has components which are more nearly equal. When $$P$$ is a square matrix, the quantity $$β_P$$ is called the ergodicity coefficient, which is often written in an alternative form by using the relation $$|a - b| = (a + b) - 2 \min(a,b)$$: $β_P = \frac12 \max_{s,s' \in \ALPHABET S} \sum_{z \in \ALPHABET S} \bigl| P_{sz} - P_{s'z} \bigr|.$ The ergodicity coefficient is an upper bound on the second largest eigenvalue of $$P$$. $$β_P$$ equals $$0$$ if all rows of $$P$$ are equal and equals $$1$$ if at least two rows of $$P_d$$ are orthogonal. From a different perspective, $$β_P < 1$$ if for each pair of states there exists at least one state which they both can reach with positive probability in one step.

Remark 1

Note that $β_P \le 1 - \sum_{z \in \ALPHABET S} \min_{s \in \ALPHABET S} P_{sz} =: β_P'$ which is easier to compute.

#### Proof

Note that for any $$v \in \reals^n$$ \begin{align*} \SPAN(Pv) &= \max_{s \in \ALPHABET S} \sum_{z \in \ALPHABET S} P_{sz} v_z - \min_{s' \in \ALPHABET S} \sum_{z \in \ALPHABET S} P_{s'z} v_z \\ &= \max_{s, s' \in \ALPHABET S} \sum_{z \in \ALPHABET S} P_{sz} v_z - \sum_{z \in \ALPHABET S} P_{s'z} v_z \end{align*} Let $$B(z; s,s') = \min\{ P_{sz}, P_{s'z} \}$$. Then consider \begin{align*} \sum_{z \in \ALPHABET S} P_{sz} v_z - \sum_{z \in \ALPHABET S} P_{s'z} v_z &= \sum_{z \in \ALPHABET S} [ P_{sz} - B(z; s,s') ] v_z - \sum_{z \in \ALPHABET S} [ P_{s'z} - B(z; s,s') ] v_z \\ &\le \sum_{z \in \ALPHABET S} [ P_{sz} - B(z; s,s') ] \max(v) - \sum_{z \in \ALPHABET S} [ P_{s'z} - B(z; s,s') ] \min(v) \\ &= \biggl[ 1 - \sum_{z \in \ALPHABET S} B(z; s, s') \biggr] \SPAN(v). \end{align*} Hence, $\SPAN(Pv) \le \max_{s, s' \in \ALPHABET S} \biggl[ 1 - \sum_{z \in \ALPHABET S} B(z; s, ) \biggr] \SPAN(v).$

Now we show that there exists a $$v$$ such that \eqref{eq:span-matrix} holds with equality. If $$β_P = 0$$, then $$P$$ has equal rows, so that $$\SPAN(Pv) = 0 = 0 \cdot \SPAN(v)$$ for all $$v \in \reals^n$$. Suppose $$β_P > 0$$. Using the identity $$[ a - b]^{+} = a - \min(a,b)$$, we can write $β_P = \max_{s,s' \in \ALPHABET S} \sum_{z \in \ALPHABET S} \bigl[ P_{sz} - P_{s'z} \bigr]^{+}.$ Let $$s^*$$ and $$s'^*$$ be such that $β_P = \sum_{z \in \ALPHABET S} \sum_{z \in \ALPHABET S} \bigl[ P_{s^*z} - P_{s'^*z} \bigr]^{+}.$ Define $$v$$ by $v_z = \IND\{ P_{s^*z} > P_{s'^*z} \}.$ Then, note that $$\SPAN(v) = 1$$ and \begin{align*} \SPAN(Pv) &\ge \sum_{z \in \ALPHABET S} P_{s^* z} v_z - \sum_{z \in \ALPHABET S} P_{s'^* z} v_z \\ &= \sum_{z \in \ALPHABET S} \bigl[ P_{s^*z} - P_{s'^*z} \bigr]^{+} \\ &= β_P \SPAN(v). \end{align*} Combining with \eqref{eq:span-matrix}, we get $$\SPAN(Pv) = β_P \SPAN(v)$$$$\Box$$

Define the contraction factor $$$\label{eq:contraction} β = \max_{\substack{ s,s' \in \ALPHABET S \\ a \in \ALPHABET A(s), w \in \ALPHABET A(s')}} \biggl[ 1 - \sum_{z \in \ALPHABET S} \min\{ P(z | s,a), P(z | s', w) \biggr].$$$ Note that $$β \in [0, 1]$$.

Theorem 1

For any $$V_1, V_2 \in \reals^n$$, $\SPAN(\mathcal B V_1 - \mathcal B V_2) \le γ β\, \SPAN(V_1 - V_2).$

#### Proof

Let $$π_i$$ be such that $$\mathcal B V_i = \mathcal B_{π_i}V_i$$. Let \begin{align*} s^* &= \arg \max_{s \in \ALPHABET S}(\mathcal B V_1(s) - \mathcal B V_2(s)), \\ s_* &= \arg \min_{s \in \ALPHABET S}(\mathcal B V_1(s) - \mathcal B V_2(s)), \end{align*} Then, $\mathcal B V_1(s^*) - \mathcal B V_2(s^*) \le \mathcal B_{π_2} V_1(s^*) - \mathcal B_{π_2} V_2(s^*) = γ P_{π_2}(V_1 - V_2)(s^*)$ and $\mathcal B V_1(s_*) - \mathcal B V_2(s_*) \ge \mathcal B_{π_1} V_1(s^*) - \mathcal B_{π_1} V_2(s^*) = γ P_{π_1}(V_1 - V_2)(s^*).$ Therefore, \begin{align*} \SPAN(\mathcal B V_1 - \mathcal B V_2) &\le γ P_{π_2}(V_1 - V_2)(s^*) - γ P_{π_1}(V_1 - V_2)(s_*) \\ &\le \max_{s \in \ALPHABET S} γ P_{π_2} (V_1 - V_2)(s) - \min_{s \in \ALPHABET S} γ P_{π_1}(V_1 - V_2)(s) \\ &\le \SPAN(γ\, \ROWS(P_{π_2}, P_{π_1})(V_1 - V_2). \end{align*} By applying Prop. 1, we get $\SPAN(\mathcal B V_1 - \mathcal B V_2) \le γβ_{\bar P} \SPAN(V_1 - V_2),$ where $$β_{\bar P}$$ is given by \eqref{eq:span-matrix} with $$\bar P = \ROWS(P_{π_2}, P_{π_1})$$. The result follows by noting that $$β_{\bar P}$$ is at most $$β$$$$\Box$$

# 2 Computational complexity of value iteration

Note that $$β \in [0, 1]$$. We will first rule out the case $$β = 0$$. If $$β = 0$$, then $$P(z | s, a) = P(z | s', w)$$ for all $$s, s', z \in \ALPHABET S$$ and $$a \in \ALPHABET A(s)$$ and $$w \in \ALPHABET A(s')$$, which implies that all deterministic policies have the same transition probabilities. Therefore, a deterministic policy $$π$$ is optimal if and only if it minimize the one-step cost. Thus, the case with $$β = 0$$ is trivial. So, in the rest of the analysis, we assume that $$β \in (0, 1)$$.

Theorem 2

Start with an abritrary $$V_0$$. If $$Δ_1 = 0$$, then we obtain an optimal policy in iteration 1. Otherwise, for any $$ε > 0$$, value iteration finds an $$ε$$-optimal policy in no more than $$K^*(γ)$$ iterations, where $$$\label{eq:K*} K^*(γ) = \left\lceil \frac{ \log \frac{(1-γ) ε β}{Δ_1} } {\log(γβ)} \right\rceil.$$$ In addition, each iteration uses at most $$O(nM)$$ operations.

#### Proof

If follows from the definition of Bellman operator that each iteration uses at most $$O(nM)$$ iterations (to compute the $$Q$$ function, for each state-action pair, we need to compute a sum over $$n$$ terms).

From Theorem 1, we get that $$Δ_k \le (γβ)^{k-1} Δ_1$$. Therefore, the minimum number of iterations required to achieve $$Δ_k \le \frac{1-γ}{γ}ε$$ is given $$K^*(γ)$$$$\Box$$

Remark 2

Note that finding the value of $$β$$ requires computing the sum in \eqref{eq:contraction} for all couples $$\{ (s,a), (s',w) \}$$ of state action pairs such that $$(s,a) \neq (s',w)$$. The total number of such pairs are $$M(M-1)/2 = O(M^2)$$. Therefore, the number of arithematic operators in \eqref{eq:contraction}, which are additions, is $$n$$ for each couple. Therefore, computation of $$β$$ requires $$O(nM^2)$$ operations, which can be significantly larger than the complexity of computing an $$ε$$-optimal policy which is given by Theorem 2! Based on Remark 1, we can replace $$β$$ by in \eqref{eq:K*} by $β' = 1 - \sum_{z \in \ALPHABET S} \min_{s \in \ALPHABET S, a \in \ALPHABET A} P(z | s,a)$ which requies $$O(nM)$$ operations.

## 2.1 The bound may be exact

We now present an example (due to Feinberg and He (2020)) to show that the bound in Theorem 2 may be exact. Consider an MDP with $$\ALPHABET S = \{1, 2, 3\}$$, $$\ALPHABET A = \{1,2 \}$$, with $$\ALPHABET A(1) = \{1, 2\}$$ and $$\ALPHABET A(2) = \ALPHABET A(3) = \{1\}$$. The per-step transitions are $P(1) = \MATRIX{0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 }, \quad\text{and}\quad P(2) = \MATRIX{0 & 1 & 0 \\ * & * & * \\ * & * & *},$ where $$*$$ indicates that the corresponding action is infeasible. The reward matrix is $r = \MATRIX{1 & 0 \\ 1 & * \\ -1 & *}.$

Suppose we start with an initial $$V_0 = \VEC(1, 2, -2)$$. Then elementary calculations show that $V_k = \MATRIX{ γ^k \\ γ^k \\ -γ^k} + \MATRIX{ \sum_{\ell = {\color{red} 1}}^{k} γ^\ell \\ \sum_{\ell = 0}^{k} γ^\ell \\ \sum_{\ell = 0}^{k} γ^\ell }.$ Thus, $V_k - V_{k-1} = \MATRIX{ 2 γ^k - γ^{k-1} \\ 2 γ^k - γ^{k-1} \\ - 2 γ^k + γ^{k-1} }.$ Hence, $\SPAN(V_k - V_{k-1}) = 2γ^{k-1} |2γ - 1| = γ^{k-1} \SPAN(V_1 - V_0).$

Thus, for this model, the expression \eqref{eq:K*} is exact.

Remark 3

The exact number of iterations need not be monotone in $$γ$$! In the above example, let $$ε = 0.02$$, then $K^*(0.24) = 3, \quad K^*(0.47) = 4, \quad K^*(0.48) = 3.$

Thus, the number of iterations is not monotone in $$γ$$.

# References

The discussion on span semi-norm and Theorem 1 is from Puterman (2014). Theorem 2 is from Feinberg and He (2020).

Feinberg, E.A. and He, G. 2020. Complexity bounds for approximately solving discounted MDPs by value iterations. Operations Research Letters. DOI: 10.1016/j.orl.2020.07.001.
Puterman, M.L. 2014. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons. DOI: 10.1002/9780470316887.

This entry was last updated on 12 Mar 2022 and posted in MDP and tagged infinite horizon, discounted cost, computational complexity, value iteration.