ECSE 506: Stochastic Control and Decision Theory
Aditya Mahajan
Winter 2022
About  Lectures  Notes  Coursework
\(\def\ONES{\mathbb{1}}\)
A common way to approximate systems that run for a very large horizon is to assume that they run for an infinite horizon. There is an inherent homogeneity over time for infinite horizon system: the future depends only on the current state and not on the current time. Due to this homogeneity over time, we expect that the optimal policy should also be timehomogeneous. Therefore, the optimal policy for an infinitehorizon system should be easier to implement than the optimal policy for a finite horizon system, especially so when the horizon is large. This is one of the motivations for studying infinite horizon models.
The most common formulation for infinite horizon models is the discounted setup, where the cost function is assumed to be \[ J(π) = \EXP\Bigl[ \sum_{t=1}^\infty γ^{t1} c_t(S_t, A_t) \Bigr] \] where \(γ \in (0,1)\) is called the discount factor.
There are two interpretations of the discount factor \(γ\). The first interpretation is an economic interpretation to determine the present value of a utility that will be received in the future. For example, suppose a decision maker is indifferent between receiving 1 dollar today or \(s\) dollars tomorrow. This means that the decision maker discounts the future at a rate \(1/s\), so \(γ = 1/s\).
The second interpretation is that of an absorbing state. Suppose we are operating a machine that generates a value of $1 each day. However, there is a probability \(p\) that the machine will break down at the end of the day. Thus, the expected return for today is $1 while the expected return for tomorrow is \((1p)\) (which is the probability that the machine is still working tomorrow). In this case, the discount factor is defined as \((1p)\). See Shwartz (2001) for a detailed discussion of this alternative.
In the remainder of this section, we will study how to obtain a solution for such infinite horizon discounted cost models.
Note: Throughout this section, we assume that \(\ALPHABET S\) and \(\ALPHABET A\) are finite and \(\ALPHABET S= n\) and \(\ALPHABET A = m\).
1 Performance of a timehomogeneous Markov policy
For any \(π \colon \ALPHABET S \to \ALPHABET A\), consider the time homogeneous policy \((π, π, \dots)\). For ease of notation, we denote this policy simply by \(π\). The expected discounted cost under this policy is given by \[ V_π(s) = \EXP^π\bigg[ \sum_{t=1}^∞ γ^{t1} c(S_t, A_t) \biggm S_1 = s \bigg].\]
To get a compact expression for this, define a \(n × 1\) vector \(c_π\) and a \(n × n\) matrix \(P_π\) as follows: \[ [c_π]_s = c(s, π(s)) \quad\text{and}\quad [P_π]_{ss'} = P_{ss'}(π(s)). \] Then the dynamics under policy \(π\) are Markovian with transition probability matrix \(P_π\) and a cost function \(c_π\). Then \[ \begin{align*} \EXP^π\big[ c(S_t, π(S_t)) \bigm S_1 = s \big] &= \sum_{s' \in \ALPHABET S} \PR^π(S_t = s'  S_1 = s) c(s', π(s')) \\ &= \sum_{s' \in \ALPHABET S} [P_π^{t1}]_{ss'} [c_π]_y \\ &= δ_s P_π^{t1} c_π. \end{align*} \]
Let \(V_π\) denote the \(n × 1\) vector given by \([V_π]_s = V_π(s)\). Then, \[ \begin{align*} V_π &= c_π + γ P_π c_π + γ^2 P_π^2 c_π + \cdots \\ &= c_π + γ P_π \big( c_π + γ P_π c_π + \cdots \big) \\ &= c_π + γ P_π V_π, \end{align*} \] which can be rewritten as \[ (I  γ P_π) V_π = c_π. \]
The spectral radius \(ρ(γ P_d)\) of a matrix is upper bounded by its spectral norm \(\lVert γ P_d \rVert = γ < 1\). Therefore, the matrix \((I  γ P_π)\) has an inverse and by left multiplying both sides by \((I  γ P_π)^{1}\), we get \[ V_π = (I  γP_π)^{1} c_π. \]
The equation \[ V_π = c_π + γ P_π V_π \] is sometimes also written as \[ V_π = \mathcal B_π V_π \] where the operator \(\mathcal B_π\), which is called the Bellman operator, is an operator from \(\reals^n\) to \(\reals^n\) given by \[ \mathcal B_π v = c_π + γ P_π v.\]
2 Bellman operators
 Definition

Define the Bellman operator \(\mathcal B : \reals^n \to \reals^n\) as follows: for any \(v \in \reals^n\) \[ [\mathcal B v]_s = \min_{a \in \ALPHABET A} \Big\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) v_y \Big\}. \]
Note that the above may also be written as^{1} \[ \mathcal B v = \min_{π \in \Pi} \mathcal B_π v, \] where \(\Pi\) denotes the set of all deterministic Markov policies.
 Prop. 1

For any \(v \in \reals^n\), define the norm \(\NORM{V} := \sup_{s \in \ALPHABET S} \ABS{V_s}\). Then, the Bellman operator is a contraction, i.e., for any \(v, w \in \reals^n\), \[ \NORM{\mathcal B v  \mathcal B w} \le γ \NORM{v  w}. \]
Proof
Fix a state \(s \in \ALPHABET S\) and consider \([\mathcal B v](s)  [\mathcal B w](s)\). In particular, let \(a^*\) be the optimal action in the right hand side of \([\mathcal B w](s)\). Then, \[\begin{align*} [\mathcal B v  \mathcal B w](s) &= \min_{a \in \ALPHABET A}\bigl\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) v(s') \bigr\}  \min_{a \in \ALPHABET A}\bigl\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) w(s') \bigr\} \\ &\le c(s,a^*) + γ \sum_{s'\in \ALPHABET S} P_{ss'}(a^*) v(s')  c(s,a^*)  γ \sum_{s'\in \ALPHABET S} P_{ss'}(a^*) w(s') \\ &\le γ \sum_{s' \in \ALPHABET S} P_{ss'}(a^*) \ v  w \ \\ &= γ \ v  w \. \end{align*} \]
By a similar argument, we can show that \([\mathcal B w  \mathcal B v](s) \le γ \ v  w \\), which proves the other side of the inequality.
An immediate consequence of the contraction property is the following.
 Theorem 1

There is a unique bounded \(V^* \in \reals^n\) that satisfies the Bellman equation \[ V = \mathcal B V \]
Moreover, if we start from any \(V_0 \in \reals^n\) and recursively define \[ V_n = \mathcal B V_{n1} \] then \[ \lim_{n \to ∞} V_n = V^*. \]
Proof
This follows immediately from the Banach fixed point theorem.
3 Optimal timehomogeneous policy
 Prop. 2

Define \[ V^{opt}_∞(s) := \min_{π} \EXP^π \bigg[ \sum_{t=1}^∞ γ^{t1} c(S_t, A_t) \biggm S_1 = s \bigg], \] where the minimum is over all (possibly randomized) history dependent policies. Then, \[ V^{opt}_∞ = V^*, \] where \(V^*\) is the solution of the Bellman equation.
Proof
Since the state and action space are finite, without loss of optimality, we can assume that \(0 \le c(s,a) \le M\).
Consider the finite horizon truncation \[ V^{opt}_T(s) = \min_{π} \EXP^π\bigg[ \sum_{t=1}^T γ^{t1} c(S_t, A_t)  S_1 = s \bigg]. \] From the results for finite horizon MDP, we have that \[ V^{opt}_T = \mathcal B^T V_0 \] where \(V_0\) is the all zeros vector.
Now by construction, \[V^{opt}_∞(s) \ge V^{opt}_T(s) = [\mathcal B^T V_0](s). \] Taking limit as \(T \to ∞\), we get that \[\begin{equation}\label{eq:1} V^{opt}_∞(s) \ge \lim_{T \to ∞} [\mathcal B^T V_0](s) = V^*(s). \end{equation}\]
Since \(0 \le c(s,a) \le M\), for any \(T\), \[ \begin{align*} V^{opt}_∞(s) &\le \min_π \EXP^π \bigg[ \sum_{t=1}^T γ^{t1} c(S_t, A_t) \biggm S_1 = s \bigg] + \sum_{t=T+1}^∞ γ^{t1} M \\ &= V^{opt}_T(s) + γ^T M / (1  γ) \\ &= [\mathcal B^T V_0](s) + γ^T M / (1γ). \end{align*} \] Taking limit as \(T \to ∞\), we get that \[\begin{equation}\label{eq:2} V^{opt}_∞(s) \le \lim_{T \to ∞} \big\{ [\mathcal B^T V_0](s) + γ^T M / (1γ) \big\} = V^*(s). \end{equation}\]
From \eqref{eq:1} and \eqref{eq:2}, we get that \(V^{opt}_∞ = V^*\).
4 Properties of Bellman operator
 Prop. 3

The Bellman operator satisfies the following properties
 Monotonicity. For any \(v, w \in \reals^n\), if \(v \le w\), then \(\mathcal B_π v \le \mathcal B_π w\) and \(\mathcal B v \le \mathcal B w\).
 Discounting. For any \(v \in \reals^n\) and \(m \in \reals\), \(\mathcal B_π (v + m \ONES) = \mathcal B_π v + γ m \ONES\) and \(\mathcal B (v + m \ONES) = \mathcal B v + γ m \ONES\).
Proof of monotonicity property
Recall that \[ \mathcal B_π v = c_π + γ P_π v. \] So, monotonicity of \(\mathcal B_π\) follows immediately from monotonicity of matrix multiplication for positive matrices.
Let \(μ\) be such that \(\mathcal B w = \mathcal B_μ w\). Then, \[ \mathcal B v \le \mathcal B_μ v \stackrel{(a)} \le \mathcal B_μ w = \mathcal B w, \] where \((a)\) uses the monotonicity of \(\mathcal B_μ\).
Proof of discounting property
Recall that \[ \mathcal B_π v = c_π + γ P_π v. \] Thus, \[ \mathcal B_π(v+m \ONES) = c_π + γ P_π (v+m \ONES) = c_π + γ P_π v + γ m \ONES = \mathcal B_π v + γ m \ONES.\] Thus, \(\mathcal B_π\) is discounting. Now consider \[ \mathcal B (v + m \ONES ) = \min_{π} \mathcal B_π (v+m \ONES) = \min_π \mathcal (B_π v + γ m \ONES) = \mathcal B v + γ m \ONES.\] Thus, \(\mathcal B\) is discounting.
 Prop. 4

For any \(V \in \reals^n\),
 If \(V \ge \mathcal B V\), then \(V \ge V^*\);
 If \(V \le \mathcal B V\), then \(V \le V^*\);
 If \(V = \mathcal B V\), then \(V\) is the only vector with this property and \(V = V^*\).
The same bounds are true when \((\mathcal B, V^*)\) is replaced with \((\mathcal B_π, V_π)\).
Proof
We prove the first part. The proof of the other parts is similar.
We are given that \[V \ge \mathcal B V.\] Then, by monotonicity of the Bellman operator, \[ \mathcal B V \ge \mathcal B^2 V.\] Continuing this way, we get \[ \mathcal B^k V \ge \mathcal B^{k+1} V.\] Adding the above equations, we get \[ V \ge \mathcal B^{k+1} V.\] Taking limit as \(k \to ∞\), we get \[V \ge V^*.\]
 Prop. 5

For any \(V \in \reals^n\) and \(m \in \reals\),
 If \(V + m \ONES \ge \mathcal B V\), then \(V + m \ONES/(1γ) \ge V^*\);
 If \(V + m \ONES \le \mathcal B V\), then \(V + m \ONES/(1γ) \le V^*\);
The same bounds are true when \((\mathcal B, V^*)\) is replaced with \((\mathcal B_π, V_π)\).
 Remark

The above result can also be stated as follows:
 \(\displaystyle \ V_π  V \ \le \frac{1}{1γ}\ \mathcal B_π V  V \\).
 \(\displaystyle \ V^*  V \ \le \frac{1}{1γ}\ \mathcal B V  V \\).
Proof
Again, we only prove the first part. The proof of the second part is the same. We have that \[ V + m \ONES \ge \mathcal B V. \] From discounting and monotonicity properties, we get \[ \mathcal B V + γ m \ONES \ge \mathcal B^2 V. \] Again, from discounting and monotonitiy properties, we get \[ \mathcal B^2 V + γ^2 m \ONES \ge \mathcal B^3 V. \] Continuing this way, we get \[ \mathcal B^k V + γ^k m \ONES \ge \mathcal B^{k+1} V. \] Adding all the above equations, we get \[ V + \sum_{\ell = 0}^k γ^\ell m \ONES \ge \mathcal B^{k+1} V. \] Taking the limit as \(k \to ∞\), we get \[ V + m \ONES/(1γ) \ge V^*. \]
5 Value Iteration Algorithm
Value Iteration Algorithm
 Start with any \(V_0 \in \reals^n\).
 Recursively compute \(V_{k+1} = \mathcal B V_k = \mathcal B_{π_k} V_k.\)
 Define \[ \begin{align*} \underline δ_k &= \frac{γ}{1γ} \min_s \{ V_k(s)  V_{k1}(s) \}, \\ \bar δ_k &= \frac{γ}{1γ} \max_s \{ V_k(s)  V_{k1}(s) \}. \end{align*} \]
Then, for all \(k\)
 \(V_k + \underline δ_k \ONES \le V^* \le V_k + \bar δ_k \ONES\).
 \((\underline δ_k  \bar δ_k) \ONES \le V_{π_k}  V^* \le (\bar δ_k  \underline δ_k) \ONES.\)
Proof
By construction, \[ \begin{align*} \mathcal B V_k  V_k &= \mathcal B V_k  \mathcal B V_{k1} \\ & \le \mathcal B_{π_{k1}} V_k  \mathcal B_{π_{k1}} V_{k1}\\ & \le γ P_{π_{k1}}[ V_k  V_{k1} ] \\ &= (1γ) \bar δ_k \ONES. \end{align*} \] Thus, by Prop. 5, we have \[\begin{equation} \label{eq:VI1} V^* \le V_k + \bar δ_k \ONES. \end{equation}\] Note that \(\mathcal B V_k = \mathcal B_{π_k} V_k\). So, we have also show that \(\mathcal B_{π_k} V_k  V_k \le (1γ) \bar δ_k \ONES\). Thus, again by Prop. 5, we have \[\begin{equation}\label{eq:VI2} V_{π_k} \le V_k + \bar δ_k \ONES. \end{equation}\]
By a similar argument, we can show \[\begin{equation}\label{eq:VI3} V^* \ge V_k + \underline δ_k \ONES \quad\text{and}\quad V_{π_k} \ge V_k + \underline δ_k \ONES. \end{equation}\]
Eq. \eqref{eq:VI1} and \eqref{eq:VI3} imply the first relationship of the result. To establish the second relationship, note that the triangle inequality \[ \max\{ V_{π_k}  V^* \} \le \max\{ V_{π_k}  V_k \} + \max\{ V_{k}  V^* \} \le (\bar δ_k  \underline δ_k) \ONES. \] Similarly, \[ \max\{ V^*  V_{π_k} \} \le \max \{ V^*  V_k \} + \max\{ V_k  V_{π_k} \} \le (\bar δ_k  \underline δ_k) \ONES. \] Combining the above two equation, we get the second relationship of the result. \(\Box\)
6 Policy Iteration Algorithm
 Lemma 1

Policy Improvement Suppose \(V_π\) is the fixed point of \(\mathcal B_π\) and \[\mathcal B_{μ} V_π = \mathcal B V_π. \] Then, \[V_{μ}(s) \le V_π(s), \quad \forall s \in \ALPHABET S. \] Moreover, if \(π\) is not optimal, then at least one inequality is strict.
Proof
\[ V_π = \mathcal B_π V_π \ge \mathcal B V_π = \mathcal B_{μ} V_π.\] Thus, \[ V_π \ge V_{μ}. \]
Finally, suppose \(V_μ = V_π\). Then, \[ V_μ = \mathcal B_μ V_μ = \mathcal B_μ V_π = \mathcal B V_π = \mathcal B V_μ. \] Thus, \(V_μ\) (and \(V_π\)) is the unique fixed point of \(\mathcal B\). Hence \(μ\) and \(π\) are optimal. \(\Box\)
Policy Iteration Algorithm
Start with an arbitrary policy \(π_0\). Compute \(V_0 = \mathcal B_{π_0} V_0\).
Recursively compute a policy \(π_k\) such that \[\mathcal B V_{k1} = \mathcal B_{π_k} V_{k1}\] and compute the performance of the policy using \[ V_k = \mathcal B_{π_k} V_k.\]
Stop if \(V_k = V_{k1}\) (or \(π_k = π_{k1}\)).
The policy improvement lemma (Lemma 1) implies that \(V_{k1} \ge V_k\). Since the state and action spaces are finite, there are only a finite number of policies. The value function improves at each step. So the process must converge in finite number of iterations. At convergence, \(V_k = V_{k1}\) and the policy improvement lemma implies that the corresponding policies \(π_k\) or \(π_{k1}\) are optimal.
6.1 Policy iteration as NewtonRaphson algoritm
Recall that the main idea behind NewtonRaphson is as follows. Suppose we want to solve a fixed point equation \(V = \mathcal B V\) and we have an approximate solution \(V_k\). Then we can search for an improved soluiton \(V_{k+1} = V_k + Δ_k\) by setting \[\begin{equation} \label{eq:NR} V_k + Δ_k = \mathcal{B}( V_k + Δ_k ), \end{equation} \] expanding the righthand side as far as firstorder terms in \(Δ_k\) and solving the consequent linear equation for \(Δ_k\).
Now, let’s try to apply this idea to find the fixed point of the Bellman equation. Suppose we have identified a guess \(V_k\) and \(\mathcal B V_k = \mathcal B_{π_{k+1}} V_k\). Because the choice of control action \(a\) is optimization out in \(\mathcal B\), the varation of \(a\) induced by the variation \(Δ_k\) of \(V_k\) has no firstorder effect on the value of \(\mathcal B(V_k + Δ_k)\). Therefore, \[ \mathcal{B}(V_k + Δ_k) = \mathcal B_{π_{k+1}}(V_k + Δ_k) + o(Δ_k). \] It follows that the linearized version of \eqref{eq:NR} is just \[ V_{k+1} = \mathcal B_{π_{k+1}} V_{k+1}. \] That is, \(V_{k+1}\) is just the value function for the policy \(π_{k+1}\), where \(π_{k+1}\) was deduced from the value function \(V_k\) exactly by the policy improvement procedure. Therefore, we can conclude the following.
 Theorem 1

The policy improvement algorithm is equivalent to the application of NewtonRaphson algorithm to the fixed point equation \(V = \mathcal B V\) of dynamic programming.
The equivalence between policy iteration and NewtonRaphson partily explains why policy iteration approaches converge in few iterations.
7 Optimistic Policy Iteration
Optimistic Policy Iteration Algorithm
Fix a sequence of integers \(\{\ell_k\}_{k \in \integers_{\ge 0}}\).
Start with an initial guess \(V_0 \in \reals^n\).
For \(k=0, 1, 2, \dots\), recursively compute a policy \(π_k\) such that \[ \mathcal B_{π_k} V_k = \mathcal B V_k \] and then update the value function \[ V_{k+1} = \mathcal B_{π_k}^{\ell_k} V_k. \]
Note that if \(\ell_k = 1\), the optimistic policy iteration is equivalent to value iteration and if \(\ell_k = \infty\), then optimistic policy iteration is equal to policy iteration.
In the remainder of this section, we state the modifications of the main results to establish the convergence bounds for optimistic policy iteration.
 Prop. 6

For any \(V \in \reals^n\) and \(m \ONES \in \reals_{\ge 0}\)
 If \(V + m \ONES \ge \mathcal B V = \mathcal B_π V\), then for any \(\ell \in \integers_{> 0}\), \[ \mathcal B V + \frac{γ}{1  γ} m \ONES \ge \mathcal B_π^\ell V \] and \[ \mathcal B_π^\ell V + γ^\ell m \ONES \ge \mathcal B( \mathcal B_π^\ell V). \]
The proof is left as an exercise.
 Prop. 7

Let \(\{(V_k, π_k)\}_{k \ge 0}\) be generated as per the optimistic policy iteration algorithm. Define \[ \alpha_k = \begin{cases} 1, & \text{if } k = 0 \\ γ^{\ell_0 + \ell_1 + \dots + \ell_{k1}}, & \text{if } k > 0. \end{cases}\] Suppose there exists an \(m \in \reals\) such that \[ \ V_0  \mathcal B V_0 \ \le m. \] Then, for all \(k \ge 0\), \[ \mathcal B V_{k+1}  \alpha_{k+1} m \le V_{k+1} \le \mathcal B V_k + \frac{γ}{1γ} \alpha_k m. \] Moreover, \[ V_{k}  \frac{(k+1) γ^k}{1γ} m \le V^* \le V_k + \frac{\alpha_k}{1  γ} m \le V_k + \frac{γ^k}{1  γ} m. \]
Exercises
Show that the error bound for value iteration is monotone with the number of iterations, i.e, \[ V_k + \underline δ_k \ONES \le V_{k+1} + \underline δ_{k+1} \ONES \le V^* \le V_{k+1} + \bar δ_{k+1} \ONES \le V_k + \bar δ_k \ONES. \]
Onestep lookahead error bounds.
Given any \(V \in \reals^n\), let \(π\) be such that \(\mathcal B V = \mathcal B_π V\). Moreover, let \(V^*\) denote the unique fixed point of \(\mathcal B\) and \(V_π\) denote the unique fixed point of \(\mathcal B_π\). Then, show that
 \[ \ V^*  V \ \le \frac{1}{1γ} \ \mathcal B V  V \. \]
 \[ \ V^*  \mathcal B V \ \le \frac{γ}{1γ} \ \mathcal B V  V \. \]
 \[ \ V_π  V \ \le \frac{1}{1γ} \ \mathcal B_π V  V \. \]
 \[ \ V_π  \mathcal B_π V \ \le \frac{γ}{1γ} \ \mathcal B_π V  V \. \]
 \[ \ V_π  V^* \ \le \frac{2}{1γ} \ \mathcal B V  V \. \]
 \[ \ V_π  V^* \ \le \frac{2γ}{1  γ} \ V  V^* \. \]
References
The material included here is referenced from different sources. Perhaps the best sources to study this material are the books by Puterman (2014), Whittle (1982), and Bertsekas (2011).
The techniques for value iteration and policy improvement were formalized by Howard (1960). The equivalence of policy improvement and the NewtonRaphson algorithm was demonstrated in the LQ case by Whittle and Komarova (1988), for which it holds in a tighter sense.
This is true for general models only when the arg min at each state exists.↩︎
This entry was last updated on 10 Nov 2022 and posted in MDP and tagged infinite horizon, discounted cost, bellman operator, value iteration, policy iteration.