ECSE 506: Stochastic Control and Decision Theory

Aditya Mahajan
Winter 2022

About  |  Lectures  |  Notes  |  Coursework

Theory: Infinite horizon discounted MDP


A common way to approximate systems that run for a very large horizon is to assume that they run for an infinite horizon. There is an inherent homogeneity over time for infinite horizon system: the future depends only on the current state and not on the current time. Due to this homogeneity over time, we expect that the optimal policy should also be time-homogeneous. Therefore, the optimal policy for an infinite-horizon system should be easier to implement than the optimal policy for a finite horizon system, especially so when the horizon is large. This is one of the motivations for studying infinite horizon models.

The most common formulation for infinite horizon models is the discounted setup, where the cost function is assumed to be \[ J(π) = \EXP\Bigl[ \sum_{t=1}^\infty γ^{t-1} c_t(S_t, A_t) \Bigr] \] where \(γ \in (0,1)\) is called the discount factor.

There are two interpretations of the discount factor \(γ\). The first interpretation is an economic interpretation to determine the present value of a utility that will be received in the future. For example, suppose a decision maker is indifferent between receiving 1 dollar today or \(s\) dollars tomorrow. This means that the decision maker discounts the future at a rate \(1/s\), so \(γ = 1/s\).

The second interpretation is that of an absorbing state. Suppose we are operating a machine that generates a value of $1 each day. However, there is a probability \(p\) that the machine will break down at the end of the day. Thus, the expected return for today is $1 while the expected return for tomorrow is \((1-p)\) (which is the probability that the machine is still working tomorrow). In this case, the discount factor is defined as \((1-p)\). See Shwartz (2001) for a detailed discussion of this alternative.

In the remainder of this section, we will study how to obtain a solution for such infinite horizon discounted cost models.

Note: Throughout this section, we assume that \(\ALPHABET S\) and \(\ALPHABET A\) are finite and \(|\ALPHABET S|= n\) and \(|\ALPHABET A| = m\).

1 Performance of a time-homogeneous Markov policy

For any \(π \colon \ALPHABET S \to \ALPHABET A\), consider the time homogeneous policy \((π, π, \dots)\). For ease of notation, we denote this policy simply by \(π\). The expected discounted cost under this policy is given by \[ V_π(s) = \EXP^π\bigg[ \sum_{t=1}^∞ γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg].\]

To get a compact expression for this, define a \(n × 1\) vector \(c_π\) and a \(n × n\) matrix \(P_π\) as follows: \[ [c_π]_s = c(s, π(s)) \quad\text{and}\quad [P_π]_{ss'} = P_{ss'}(π(s)). \] Then the dynamics under policy \(π\) are Markovian with transition probability matrix \(P_π\) and a cost function \(c_π\). Then \[ \begin{align*} \EXP^π\big[ c(S_t, π(S_t)) \bigm| S_1 = s \big] &= \sum_{s' \in \ALPHABET S} \PR^π(S_t = s' | S_1 = s) c(s', π(s')) \\ &= \sum_{s' \in \ALPHABET S} [P_π^{t-1}]_{ss'} [c_π]_y \\ &= δ_s P_π^{t-1} c_π. \end{align*} \]

Let \(V_π\) denote the \(n × 1\) vector given by \([V_π]_s = V_π(s)\). Then, \[ \begin{align*} V_π &= c_π + γ P_π c_π + γ^2 P_π^2 c_π + \cdots \\ &= c_π + γ P_π \big( c_π + γ P_π c_π + \cdots \big) \\ &= c_π + γ P_π V_π, \end{align*} \] which can be rewritten as \[ (I - γ P_π) V_π = c_π. \]

The spectral radius \(ρ(γ P_d)\) of a matrix is upper bounded by its spectral norm \(\lVert γ P_d \rVert = γ < 1\). Therefore, the matrix \((I - γ P_π)\) has an inverse and by left multiplying both sides by \((I - γ P_π)^{-1}\), we get \[ V_π = (I - γP_π)^{-1} c_π. \]

The equation \[ V_π = c_π + γ P_π V_π \] is sometimes also written as \[ V_π = \mathcal B_π V_π \] where the operator \(\mathcal B_π\), which is called the Bellman operator, is an operator from \(\reals^n\) to \(\reals^n\) given by \[ \mathcal B_π v = c_π + γ P_π v.\]

2 Bellman operators


Define the Bellman operator \(\mathcal B : \reals^n \to \reals^n\) as follows: for any \(v \in \reals^n\) \[ [\mathcal B v]_s = \min_{a \in \ALPHABET A} \Big\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) v_y \Big\}. \]

Note that the above may also be written as1 \[ \mathcal B v = \min_{π \in \Pi} \mathcal B_π v, \] where \(\Pi\) denotes the set of all deterministic Markov policies.

Prop. 1

For any \(v \in \reals^n\), define the norm \(\NORM{V} := \sup_{s \in \ALPHABET S} \ABS{V_s}\). Then, the Bellman operator is a contraction, i.e., for any \(v, w \in \reals^n\), \[ \NORM{\mathcal B v - \mathcal B w} \le γ \NORM{v - w}. \]


Fix a state \(s \in \ALPHABET S\) and consider \([\mathcal B v](s) - [\mathcal B w](s)\). In particular, let \(a^*\) be the optimal action in the right hand side of \([\mathcal B w](s)\). Then, \[\begin{align*} [\mathcal B v - \mathcal B w](s) &= \min_{a \in \ALPHABET A}\bigl\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) v(s') \bigr\} - \min_{a \in \ALPHABET A}\bigl\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) w(s') \bigr\} \\ &\le c(s,a^*) + γ \sum_{s'\in \ALPHABET S} P_{ss'}(a^*) v(s') - c(s,a^*) - γ \sum_{s'\in \ALPHABET S} P_{ss'}(a^*) w(s') \\ &\le γ \sum_{s' \in \ALPHABET S} P_{ss'}(a^*) \| v - w \| \\ &= γ \| v - w \|. \end{align*} \]

By a similar argument, we can show that \([\mathcal B w - \mathcal B v](s) \le γ \| v - w \|\), which proves the other side of the inequality.

An immediate consequence of the contraction property is the following.

Theorem 1

There is a unique bounded \(V^* \in \reals^n\) that satisfies the Bellman equation \[ V = \mathcal B V \]

Moreover, if we start from any \(V_0 \in \reals^n\) and recursively define \[ V_n = \mathcal B V_{n-1} \] then \[ \lim_{n \to ∞} V_n = V^*. \]


This follows immediately from the Banach fixed point theorem.

3 Optimal time-homogeneous policy

Prop. 2

Define \[ V^{opt}_∞(s) := \min_{π} \EXP^π \bigg[ \sum_{t=1}^∞ γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg], \] where the minimum is over all (possibly randomized) history dependent policies. Then, \[ V^{opt}_∞ = V^*, \] where \(V^*\) is the solution of the Bellman equation.


Since the state and action space are finite, without loss of optimality, we can assume that \(0 \le c(s,a) \le M\).

Consider the finite horizon truncation \[ V^{opt}_T(s) = \min_{π} \EXP^π\bigg[ \sum_{t=1}^T γ^{t-1} c(S_t, A_t) | S_1 = s \bigg]. \] From the results for finite horizon MDP, we have that \[ V^{opt}_T = \mathcal B^T V_0 \] where \(V_0\) is the all zeros vector.

Now by construction, \[V^{opt}_∞(s) \ge V^{opt}_T(s) = [\mathcal B^T V_0](s). \] Taking limit as \(T \to ∞\), we get that \[\begin{equation}\label{eq:1} V^{opt}_∞(s) \ge \lim_{T \to ∞} [\mathcal B^T V_0](s) = V^*(s). \end{equation}\]

Since \(0 \le c(s,a) \le M\), for any \(T\), \[ \begin{align*} V^{opt}_∞(s) &\le \min_π \EXP^π \bigg[ \sum_{t=1}^T γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg] + \sum_{t=T+1}^∞ γ^{t-1} M \\ &= V^{opt}_T(s) + γ^T M / (1 - γ) \\ &= [\mathcal B^T V_0](s) + γ^T M / (1-γ). \end{align*} \] Taking limit as \(T \to ∞\), we get that \[\begin{equation}\label{eq:2} V^{opt}_∞(s) \le \lim_{T \to ∞} \big\{ [\mathcal B^T V_0](s) + γ^T M / (1-γ) \big\} = V^*(s). \end{equation}\]

From \eqref{eq:1} and \eqref{eq:2}, we get that \(V^{opt}_∞ = V^*\).

4 Properties of Bellman operator

Prop. 3

The Bellman operator satisfies the following properties

  • Monotonicity. For any \(v, w \in \reals^n\), if \(v \le w\), then \(\mathcal B_π v \le \mathcal B_π w\) and \(\mathcal B v \le \mathcal B w\).
  • Discounting. For any \(v \in \reals^n\) and \(m \in \reals\), \(\mathcal B_π (v + m \ONES) = \mathcal B_π v + γ m \ONES\) and \(\mathcal B (v + m \ONES) = \mathcal B v + γ m \ONES\).

Proof of monotonicity property

Recall that \[ \mathcal B_π v = c_π + γ P_π v. \] So, monotonicity of \(\mathcal B_π\) follows immediately from monotonicity of matrix multiplication for positive matrices.

Let \(μ\) be such that \(\mathcal B w = \mathcal B_μ w\). Then, \[ \mathcal B v \le \mathcal B_μ v \stackrel{(a)} \le \mathcal B_μ w = \mathcal B w, \] where \((a)\) uses the monotonicity of \(\mathcal B_μ\).

Proof of discounting property

Recall that \[ \mathcal B_π v = c_π + γ P_π v. \] Thus, \[ \mathcal B_π(v+m \ONES) = c_π + γ P_π (v+m \ONES) = c_π + γ P_π v + γ m \ONES = \mathcal B_π v + γ m \ONES.\] Thus, \(\mathcal B_π\) is discounting. Now consider \[ \mathcal B (v + m \ONES ) = \min_{π} \mathcal B_π (v+m \ONES) = \min_π \mathcal (B_π v + γ m \ONES) = \mathcal B v + γ m \ONES.\] Thus, \(\mathcal B\) is discounting.

Prop. 4

For any \(V \in \reals^n\),

  • If \(V \ge \mathcal B V\), then \(V \ge V^*\);
  • If \(V \le \mathcal B V\), then \(V \le V^*\);
  • If \(V = \mathcal B V\), then \(V\) is the only vector with this property and \(V = V^*\).

The same bounds are true when \((\mathcal B, V^*)\) is replaced with \((\mathcal B_π, V_π)\).


We prove the first part. The proof of the other parts is similar.

We are given that \[V \ge \mathcal B V.\] Then, by monotonicity of the Bellman operator, \[ \mathcal B V \ge \mathcal B^2 V.\] Continuing this way, we get \[ \mathcal B^k V \ge \mathcal B^{k+1} V.\] Adding the above equations, we get \[ V \ge \mathcal B^{k+1} V.\] Taking limit as \(k \to ∞\), we get \[V \ge V^*.\]

Prop. 5

For any \(V \in \reals^n\) and \(m \in \reals\),

  • If \(V + m \ONES \ge \mathcal B V\), then \(V + m \ONES/(1-γ) \ge V^*\);
  • If \(V + m \ONES \le \mathcal B V\), then \(V + m \ONES/(1-γ) \le V^*\);

The same bounds are true when \((\mathcal B, V^*)\) is replaced with \((\mathcal B_π, V_π)\).


The above result can also be stated as follows:

  • \(\displaystyle \| V_π - V \| \le \frac{1}{1-γ}\| \mathcal B_π V - V \|\).
  • \(\displaystyle \| V^* - V \| \le \frac{1}{1-γ}\| \mathcal B V - V \|\).


Again, we only prove the first part. The proof of the second part is the same. We have that \[ V + m \ONES \ge \mathcal B V. \] From discounting and monotonicity properties, we get \[ \mathcal B V + γ m \ONES \ge \mathcal B^2 V. \] Again, from discounting and monotonitiy properties, we get \[ \mathcal B^2 V + γ^2 m \ONES \ge \mathcal B^3 V. \] Continuing this way, we get \[ \mathcal B^k V + γ^k m \ONES \ge \mathcal B^{k+1} V. \] Adding all the above equations, we get \[ V + \sum_{\ell = 0}^k γ^\ell m \ONES \ge \mathcal B^{k+1} V. \] Taking the limit as \(k \to ∞\), we get \[ V + m \ONES/(1-γ) \ge V^*. \]

5 Value Iteration Algorithm

Value Iteration Algorithm

  1. Start with any \(V_0 \in \reals^n\).
  2. Recursively compute \(V_{k+1} = \mathcal B V_k = \mathcal B_{π_k} V_k.\)
  3. Define \[ \begin{align*} \underline δ_k &= \frac{γ}{1-γ} \min_s \{ V_k(s) - V_{k-1}(s) \}, \\ \bar δ_k &= \frac{γ}{1-γ} \max_s \{ V_k(s) - V_{k-1}(s) \}. \end{align*} \]

Then, for all \(k\)

  1. \(V_k + \underline δ_k \ONES \le V^* \le V_k + \bar δ_k \ONES\).
  2. \((\underline δ_k - \bar δ_k) \ONES \le V_{π_k} - V^* \le (\bar δ_k - \underline δ_k) \ONES.\)


By construction, \[ \begin{align*} \mathcal B V_k - V_k &= \mathcal B V_k - \mathcal B V_{k-1} \\ & \le \mathcal B_{π_{k-1}} V_k - \mathcal B_{π_{k-1}} V_{k-1}\\ & \le γ P_{π_{k-1}}[ V_k - V_{k-1} ] \\ &= (1-γ) \bar δ_k \ONES. \end{align*} \] Thus, by Prop. 5, we have \[\begin{equation} \label{eq:VI-1} V^* \le V_k + \bar δ_k \ONES. \end{equation}\] Note that \(\mathcal B V_k = \mathcal B_{π_k} V_k\). So, we have also show that \(\mathcal B_{π_k} V_k - V_k \le (1-γ) \bar δ_k \ONES\). Thus, again by Prop. 5, we have \[\begin{equation}\label{eq:VI-2} V_{π_k} \le V_k + \bar δ_k \ONES. \end{equation}\]

By a similar argument, we can show \[\begin{equation}\label{eq:VI-3} V^* \ge V_k + \underline δ_k \ONES \quad\text{and}\quad V_{π_k} \ge V_k + \underline δ_k \ONES. \end{equation}\]

Eq. \eqref{eq:VI-1} and \eqref{eq:VI-3} imply the first relationship of the result. To establish the second relationship, note that the triangle inequality \[ \max\{ V_{π_k} - V^* \} \le \max\{ V_{π_k} - V_k \} + \max\{ V_{k} - V^* \} \le (\bar δ_k - \underline δ_k) \ONES. \] Similarly, \[ \max\{ V^* - V_{π_k} \} \le \max \{ V^* - V_k \} + \max\{ V_k - V_{π_k} \} \le (\bar δ_k - \underline δ_k) \ONES. \] Combining the above two equation, we get the second relationship of the result. \(\Box\)

6 Policy Iteration Algorithm

Lemma 1

Policy Improvement Suppose \(V_π\) is the fixed point of \(\mathcal B_π\) and \[\mathcal B_{μ} V_π = \mathcal B V_π. \] Then, \[V_{μ}(s) \le V_π(s), \quad \forall s \in \ALPHABET S. \] Moreover, if \(π\) is not optimal, then at least one inequality is strict.


\[ V_π = \mathcal B_π V_π \ge \mathcal B V_π = \mathcal B_{μ} V_π.\] Thus, \[ V_π \ge V_{μ}. \]

Finally, suppose \(V_μ = V_π\). Then, \[ V_μ = \mathcal B_μ V_μ = \mathcal B_μ V_π = \mathcal B V_π = \mathcal B V_μ. \] Thus, \(V_μ\) (and \(V_π\)) is the unique fixed point of \(\mathcal B\). Hence \(μ\) and \(π\) are optimal.  \(\Box\)

Policy Iteration Algorithm

  1. Start with an arbitrary policy \(π_0\). Compute \(V_0 = \mathcal B_{π_0} V_0\).

  2. Recursively compute a policy \(π_k\) such that \[\mathcal B V_{k-1} = \mathcal B_{π_k} V_{k-1}\] and compute the performance of the policy using \[ V_k = \mathcal B_{π_k} V_k.\]

  3. Stop if \(V_k = V_{k-1}\) (or \(π_k = π_{k-1}\)).

The policy improvement lemma (Lemma 1) implies that \(V_{k-1} \ge V_k\). Since the state and action spaces are finite, there are only a finite number of policies. The value function improves at each step. So the process must converge in finite number of iterations. At convergence, \(V_k = V_{k-1}\) and the policy improvement lemma implies that the corresponding policies \(π_k\) or \(π_{k-1}\) are optimal.

6.1 Policy iteration as Newton-Raphson algoritm

Recall that the main idea behind Newton-Raphson is as follows. Suppose we want to solve a fixed point equation \(V = \mathcal B V\) and we have an approximate solution \(V_k\). Then we can search for an improved soluiton \(V_{k+1} = V_k + Δ_k\) by setting \[\begin{equation} \label{eq:NR} V_k + Δ_k = \mathcal{B}( V_k + Δ_k ), \end{equation} \] expanding the right-hand side as far as first-order terms in \(Δ_k\) and solving the consequent linear equation for \(Δ_k\).

Now, let’s try to apply this idea to find the fixed point of the Bellman equation. Suppose we have identified a guess \(V_k\) and \(\mathcal B V_k = \mathcal B_{π_{k+1}} V_k\). Because the choice of control action \(a\) is optimization out in \(\mathcal B\), the varation of \(a\) induced by the variation \(Δ_k\) of \(V_k\) has no first-order effect on the value of \(\mathcal B(V_k + Δ_k)\). Therefore, \[ \mathcal{B}(V_k + Δ_k) = \mathcal B_{π_{k+1}}(V_k + Δ_k) + o(Δ_k). \] It follows that the linearized version of \eqref{eq:NR} is just \[ V_{k+1} = \mathcal B_{π_{k+1}} V_{k+1}. \] That is, \(V_{k+1}\) is just the value function for the policy \(π_{k+1}\), where \(π_{k+1}\) was deduced from the value function \(V_k\) exactly by the policy improvement procedure. Therefore, we can conclude the following.

Theorem 1

The policy improvement algorithm is equivalent to the application of Newton-Raphson algorithm to the fixed point equation \(V = \mathcal B V\) of dynamic programming.

The equivalence between policy iteration and Newton-Raphson partily explains why policy iteration approaches converge in few iterations.

7 Optimistic Policy Iteration

Optimistic Policy Iteration Algorithm

  1. Fix a sequence of integers \(\{\ell_k\}_{k \in \integers_{\ge 0}}\).

  2. Start with an initial guess \(V_0 \in \reals^n\).

  3. For \(k=0, 1, 2, \dots\), recursively compute a policy \(π_k\) such that \[ \mathcal B_{π_k} V_k = \mathcal B V_k \] and then update the value function \[ V_{k+1} = \mathcal B_{π_k}^{\ell_k} V_k. \]

Note that if \(\ell_k = 1\), the optimistic policy iteration is equivalent to value iteration and if \(\ell_k = \infty\), then optimistic policy iteration is equal to policy iteration.

In the remainder of this section, we state the modifications of the main results to establish the convergence bounds for optimistic policy iteration.

Prop. 6

For any \(V \in \reals^n\) and \(m \ONES \in \reals_{\ge 0}\)

  • If \(V + m \ONES \ge \mathcal B V = \mathcal B_π V\), then for any \(\ell \in \integers_{> 0}\), \[ \mathcal B V + \frac{γ}{1 - γ} m \ONES \ge \mathcal B_π^\ell V \] and \[ \mathcal B_π^\ell V + γ^\ell m \ONES \ge \mathcal B( \mathcal B_π^\ell V). \]

The proof is left as an exercise.

Prop. 7

Let \(\{(V_k, π_k)\}_{k \ge 0}\) be generated as per the optimistic policy iteration algorithm. Define \[ \alpha_k = \begin{cases} 1, & \text{if } k = 0 \\ γ^{\ell_0 + \ell_1 + \dots + \ell_{k-1}}, & \text{if } k > 0. \end{cases}\] Suppose there exists an \(m \in \reals\) such that \[ \| V_0 - \mathcal B V_0 \| \le m. \] Then, for all \(k \ge 0\), \[ \mathcal B V_{k+1} - \alpha_{k+1} m \le V_{k+1} \le \mathcal B V_k + \frac{γ}{1-γ} \alpha_k m. \] Moreover, \[ V_{k} - \frac{(k+1) γ^k}{1-γ} m \le V^* \le V_k + \frac{\alpha_k}{1 - γ} m \le V_k + \frac{γ^k}{1 - γ} m. \]


  1. Show that the error bound for value iteration is monotone with the number of iterations, i.e, \[ V_k + \underline δ_k \ONES \le V_{k+1} + \underline δ_{k+1} \ONES \le V^* \le V_{k+1} + \bar δ_{k+1} \ONES \le V_k + \bar δ_k \ONES. \]

  2. One-step look-ahead error bounds.

    Given any \(V \in \reals^n\), let \(π\) be such that \(\mathcal B V = \mathcal B_π V\). Moreover, let \(V^*\) denote the unique fixed point of \(\mathcal B\) and \(V_π\) denote the unique fixed point of \(\mathcal B_π\). Then, show that

    1. \[ \| V^* - V \| \le \frac{1}{1-γ} \| \mathcal B V - V \|. \]
    2. \[ \| V^* - \mathcal B V \| \le \frac{γ}{1-γ} \| \mathcal B V - V \|. \]
    3. \[ \| V_π - V \| \le \frac{1}{1-γ} \| \mathcal B_π V - V \|. \]
    4. \[ \| V_π - \mathcal B_π V \| \le \frac{γ}{1-γ} \| \mathcal B_π V - V \|. \]
    5. \[ \| V_π - V^* \| \le \frac{2}{1-γ} \| \mathcal B V - V \|. \]
    6. \[ \| V_π - V^* \| \le \frac{2γ}{1 - γ} \| V - V^* \|. \]


The material included here is referenced from different sources. Perhaps the best sources to study this material are the books by Puterman (2014), Whittle (1982), and Bertsekas (2011).

The techniques for value iteration and policy improvement were formalized by Howard (1960). The equivalence of policy improvement and the Newton-Raphson algorithm was demonstrated in the LQ case by Whittle and Komarova (1988), for which it holds in a tighter sense.

Bertsekas, D.P. 2011. Dynamic programming and optimal control. Athena Scientific. Available at:
Howard, R.A. 1960. Dynamic programming and markov processes. The M.I.T. Press.
Puterman, M.L. 2014. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons. DOI: 10.1002/9780470316887.
Shwartz, A. 2001. Death and discounting. IEEE Transactions on Automatic Control 46, 4, 644–647. DOI: 10.1109/9.917668.
Whittle, P. 1982. Optimization over time: Dynamic programming and stochastic control. Vol. 1 and 2. Wiley.
Whittle, P. and Komarova, N. 1988. Policy improvement and the newton-raphson algorithm. Probability in the Engineering and Informational Sciences 2, 2, 249–255. DOI: 10.1017/s0269964800000760.

  1. This is true for general models only when the arg min at each state exists.↩︎

This entry was last updated on 10 Nov 2022 and posted in MDP and tagged infinite horizon, discounted cost, bellman operator, value iteration, policy iteration.