15  Infinite horizon MDPs

Updated: February 23, 2026

Keywords: infinite horizon, discounted cost, Bellman operator

A common way to approximate systems that run for a very large horizon is to assume that they run for an infinite horizon. There is an inherent homogeneity over time in an infinite-horizon system: the future depends only on the current state and not on the current time. Due to this homogeneity over time, we expect that the optimal policy should also be time-homogeneous. Therefore, the optimal policy for an infinite-horizon system should be easier to implement than the optimal policy for a finite-horizon system, especially when the horizon is large. This is one of the motivations for studying infinite horizon models.

The most common formulation for infinite horizon models is the discounted setup, where the cost function is assumed to be \[ J(π) = \EXP^{π}\Bigl[ \sum_{t=1}^\infty γ^{t-1} c_t(S_t, A_t) \Bigr] \] where \(γ \in (0,1)\) is called the discount factor.

Tip: Discount Factor

The idea of using discounting in MDPs is due to Blackwell (1965).

There are two interpretations of the discount factor \(γ\). The first interpretation is an economic interpretation to determine the present value of a utility that will be received in the future. For example, suppose a decision maker is indifferent between receiving 1 dollar today or \(s\) dollars tomorrow. This means that the decision maker discounts the future at a rate \(1/s\), so \(γ = 1/s\).

The second interpretation is that of an absorbing state. Suppose we are operating a machine that generates a value of $1 each day. However, there is a probability \(p\) that the machine will break down at the end of the day. Thus, the expected return for today is $1 while the expected return for tomorrow is \((1-p)\) (which is the probability that the machine is still working tomorrow). In this case, the discount factor is defined as \((1-p)\). See Shwartz (2001) for a detailed discussion of this alternative. See Sobel (2012) for an axiomatic rationale for discounting in stochastic framework and its relationship to risk sensitivity.

In the remainder of this section, we will study how to obtain a solution for such infinite horizon discounted cost models.

Note: Throughout this section, we assume that \(\ALPHABET S\) and \(\ALPHABET A\) are finite and \(|\ALPHABET S|= n\) and \(|\ALPHABET A| = m\).

Note: Summary of the main result

The main result for infinite horizon MDPs is that there is no loss of optimality in restricting attention to time-homogeneous deterministic Markov policies. In other words, for any time-varying, possibly history-dependent, and randomized policy, there is a time-homogeneous deterministic Markov policy that performs at least as well.

For finite horizon models, we have already established that there is no loss of optimality in restricting attention to deterministic Markov policies. So, the only new bit here is that for infinite horizon discounted cost models, we can restrict attention to time-homogeneous policies as well.

Intuitively, this seems obvious. The system from time 1 onwards looks the same as the system from time 2 onwards. So, the decision rule which is optimal at time 1 should also be optimal at time 2. This intuition is correct as long as the future behavior is stable, which is always the case for discounted models with bounded per-step cost. We formally establish this result below. Formally establishing this intuition is a bit more subtle in other models such as the infinite-horizon total cost, infinite-horizon long-term average cost, and even discounted cost models with unbounded per-step cost.

As an illustration, we solve Example 14.3 (call options problem) for a large horizon \(T = 100\) (with parameters \(p = 10\), \(\sigma = 0.5\), \(γ = 0.9\)). We plot the value function and policy for \(t = 1, \ldots, 10\).

(a) Value function
(b) Optimal policy
Figure 15.1: Finite-horizon call option with T = 100: value and policy for t = 1, …, 10 (the value function changes little near the beginning)

The plot shows that \(V_t(s)\) and the optimal policy change very little over these first time steps, showing that the solution is already close to time-homogeneous near the beginning of the horizon. This motivates us to restrict attention to time-homogeneous policies.

An infinite horizon Markov policy \((π_1, π_2, \ldots)\) is called time-homogeneous1 if \(π_1 = π_2 = π_3 = \cdots\). With a slight abuse of notation, we will denote a time-homogeneous policy \((π,π,\ldots)\) simply by \(π\).

1 Time-homogeneous policies are sometimes also called stationary policies in the literature. I personally feel that the term stationary can be a bit confusing because a stationary process has a precise but different meaning in probability theory.

The roadmap of the rest of this section is as follows:

  1. We first show that the performance of a time-homogeneous Markov policy can be computed by solving a system of linear equations.

  2. To shorten the notation and streamline the analysis, we introduce the Bellman operators and show that they are contractions under the sup-norm.

  3. We then use the Banach fixed point theorem to show that the Bellman operators have unique fixed points.

  4. Finally, we show that the time-homogeneous policy that is greedy with respect to the fixed point of the Bellman operator is optimal.

15.1 Performance of a time-homogeneous Markov policy

How do we evaluate the performance of a time-homogeneous policy? Unlike the finite horizon setting, we cannot simply iterate backwards from a terminal cost. Instead, we look for a self-consistent value.

The expected discounted cost of a time-homogeneous Markov policy is given by \[ V^π(s) = \EXP^π\bigg[ \sum_{t=1}^∞ γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg].\]

We now introduce a vector notation. Let \(V^π\) denote the \(n × 1\) vector given by \([V^π]_s = V^π(s)\). Furthermore, define the \(n × 1\) vector \(c_π\) and the \(n × n\) matrix \(P_π\) as follows: \[ [c_π]_s = c(s, π(s)) \quad\text{and}\quad [P_π]_{ss'} = P_{ss'}(π(s)). \]

Proposition 15.1 The performance of a time-homogeneous Markov policy may be computed by solving a system of linear equations given by: \[ V^π = (I - γ P_π)^{-1} c_π. \]

We first observe that the dynamics under policy \(π\) are Markovian with transition probability matrix \(P_π\) and a cost function \(c_π\). Therefore, \[ \begin{align*} \EXP^π\big[ c(S_t, π(S_t)) \bigm| S_1 = s \big] &= \sum_{s' \in \ALPHABET S} \PR^π(S_t = s' | S_1 = s) c(s', π(s')) \\ &= \sum_{s' \in \ALPHABET S} [P_π^{t-1}]_{ss'} [c_π]_{s'} \\ &= δ_s P_π^{t-1} c_π. \end{align*} \]

Now consider the performance of policy \(π\), which is given by \[ \begin{align*} V^π &= c_π + γ P_π c_π + γ^2 P_π^2 c_π + \cdots \\ &= c_π + γ P_π \big( c_π + γ P_π c_π + \cdots \big) \\ &= c_π + γ P_π V^π. \end{align*} \] Rearranging terms, we have \[ (I - γ P_π) V^π = c_π. \]

The spectral radius \(ρ(γ P_π)\) of a matrix is upper bounded by any induced norm; in particular, since \(P_π\) is a stochastic matrix, \(\lVert γ P_π \rVert_{∞} = γ < 1\). Therefore, the matrix \((I - γ P_π)\) is invertible and by left multiplying both sides by \((I - γ P_π)^{-1}\), we get \[ V^π = (I - γP_π)^{-1} c_π. \]
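Proposition 15.1 translates directly into code. The sketch below (a hypothetical 2-state MDP with made-up numbers, using NumPy) computes \(V^π\) by the linear solve and cross-checks it against a truncation of the discounted series.

```python
import numpy as np

# A small illustrative 2-state MDP (hypothetical numbers) under a fixed
# time-homogeneous policy pi, evaluated two ways.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],    # transition matrix under pi
                 [0.3, 0.7]])
c_pi = np.array([1.0, 2.0])     # per-step cost under pi

# Closed form from Proposition 15.1: V^pi = (I - gamma P_pi)^{-1} c_pi
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, c_pi)

# Sanity check: truncate the series sum_{t=1}^{T} gamma^{t-1} P_pi^{t-1} c_pi
V_series = np.zeros(2)
Pt = np.eye(2)
for t in range(200):
    V_series += (gamma ** t) * Pt @ c_pi
    Pt = Pt @ P_pi

print(np.allclose(V_pi, V_series, atol=1e-6))  # → True
```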

15.2 Bellman operators

To succinctly describe the analysis, we define two operators:

  1. The Bellman operator for policy \(π\) denoted by \(\BELLMAN^π\): For any \(v \in \reals^n\), \[ \BELLMAN^π v = c_{π} + γ P_{π} v. \]

  2. The Bellman optimality operator denoted by \(\BELLMAN^*\): For any \(v \in \reals^n\): \[ [\BELLMAN^* v]_s = \min_{a \in \ALPHABET A} \Big\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) v(s') \Big\}.\] We say that a policy \(π\) is a greedy policy with respect to \(v\), written as \(π = \GREEDY(v)\), if \[ \BELLMAN^* v = \BELLMAN^π v. \]

Note that the Bellman optimality operator may also be written as2 \[ [\BELLMAN^* v](s) = \min_{π \in \Pi} [\BELLMAN^π v](s), \quad \forall s \in \ALPHABET S \] or more succinctly, \[ \BELLMAN^* v = \min_{π \in \Pi} \BELLMAN^π v, \] where \(\Pi\) denotes the set of all deterministic Markov policies.

2 This is true for general models only when the arg min at each state exists.
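The two operators can be implemented in a few lines. The sketch below (a hypothetical 2-state, 2-action MDP with made-up numbers) implements \(\BELLMAN^π\), \(\BELLMAN^*\), and the greedy map, and checks the defining identity \(\BELLMAN^* v = \BELLMAN^π v\) for \(π = \GREEDY(v)\).

```python
import numpy as np

gamma = 0.9
# P[a, s, s'] = P_{ss'}(a); c[s, a] is the per-step cost.
# All numbers are made up for illustration.
P = np.array([[[0.8, 0.2],
               [0.3, 0.7]],
              [[0.5, 0.5],
               [0.9, 0.1]]])
c = np.array([[1.0, 0.5],
              [2.0, 3.0]])
n, m = 2, 2

def bellman_pi(v, pi):
    """Apply B^pi for a deterministic policy pi (array of actions)."""
    return np.array([c[s, pi[s]] + gamma * P[pi[s], s] @ v for s in range(n)])

def q_values(v):
    """Q(s, a) = c(s, a) + gamma * sum_{s'} P_{ss'}(a) v(s')."""
    return np.array([[c[s, a] + gamma * P[a, s] @ v for a in range(m)]
                     for s in range(n)])

def bellman_opt(v):
    """Apply the Bellman optimality operator B*."""
    return q_values(v).min(axis=1)

def greedy(v):
    """A greedy policy with respect to v."""
    return q_values(v).argmin(axis=1)

# By definition, B* v = B^pi v when pi is greedy with respect to v.
v = np.zeros(2)
pi = greedy(v)
print(np.allclose(bellman_opt(v), bellman_pi(v, pi)))  # → True
```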

For any \(v \in \reals^n\), let \(\NORM{v}_{∞}\) denote the sup-norm of \(v\), i.e., \(\NORM{v}_{∞} := \sup_{s \in \ALPHABET S} \ABS{v(s)}\). Recall that \((\reals^n, \NORM{\cdot}_{∞})\) is a Banach space.

Proposition 15.2 The operators \(\BELLMAN^π\) and \(\BELLMAN^*\) are contractions under the sup-norm, i.e., for any \(v, w \in \reals^n\), \[ \NORM{\BELLMAN v - \BELLMAN w}_{∞} \le γ \NORM{v - w}_{∞}, \] where \(\BELLMAN \in \{\BELLMAN^π, \BELLMAN^*\}\).

First consider the case of \(\BELLMAN^π\). Note that \[ \BELLMAN^π v - \BELLMAN^π w = (c_{π} + γ P_{π} v) - (c_{π} + γ P_{π} w) = γ P_{π} (v - w). \] Therefore, \[ \NORM{\BELLMAN^π v - \BELLMAN^π w}_{∞} = \NORM{γ P_{π} (v - w)}_{∞} \le γ \NORM{v - w}_{∞} \NORM{P_{π} \ONES}_{∞} = γ \NORM{v - w}_{∞}. \]

Now consider the case of \(\BELLMAN^*\). Fix a state \(s \in \ALPHABET S\) and consider \([\BELLMAN^* v](s) - [\BELLMAN^* w](s)\). In particular, let \(a^*\) be the optimal action in the right hand side of \([\BELLMAN^* w](s)\). Then, \[\begin{align*} [\BELLMAN^* v - \BELLMAN^* w](s) &= \min_{a \in \ALPHABET A}\bigl\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) v(s') \bigr\} - \min_{a \in \ALPHABET A}\bigl\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) w(s') \bigr\} \\ &\le c(s,a^*) + γ \sum_{s'\in \ALPHABET S} P_{ss'}(a^*) v(s') - c(s,a^*) - γ \sum_{s'\in \ALPHABET S} P_{ss'}(a^*) w(s') \\ &\le γ \sum_{s' \in \ALPHABET S} P_{ss'}(a^*) \NORM{v - w}_{∞} \\ &= γ \NORM{v - w}_{∞}. \end{align*} \]

By a similar argument, we can show that \([\BELLMAN^* w - \BELLMAN^* v](s) \le γ \NORM{v - w}_{∞}\), which proves the other side of the inequality.

Let’s revisit the result of Proposition 15.1, which may be written as \[ V^π = c_π + γ P_π V^π. \] We can write the above using Bellman operators as: \[\begin{equation}\label{eq:policy-evaluation} V^π = \BELLMAN^π V^π. \end{equation}\]

Thus, we can equivalently state the result of Proposition 15.1 by saying that \(V^π\) is the unique bounded solution of \(\eqref{eq:policy-evaluation}\), i.e., the unique bounded fixed point of \(\BELLMAN^π\).

What about the Bellman optimality operator? Does the equation \(V^* = \BELLMAN^* V^*\) have a solution? The contraction property (Proposition 15.2) and the Banach fixed point theorem imply the following.

Theorem 15.1 There is a unique bounded \(V^* \in \reals^n\) that satisfies the Bellman equation \[ V^* = \BELLMAN^* V^* \]

Moreover, if we start from any \(V^0 \in \reals^n\) and recursively define \(V^n = \BELLMAN^* V^{n-1}\) then \[ \lim_{n \to ∞} V^n = V^*. \]
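The iteration in Theorem 15.1 is exactly value iteration: repeatedly apply \(\BELLMAN^*\) from an arbitrary starting point. A minimal sketch on a made-up 2-state, 2-action MDP (hypothetical numbers):

```python
import numpy as np

gamma = 0.9
# P[a, s, s'] = P_{ss'}(a); c[s, a] is the per-step cost (made-up numbers).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
c = np.array([[1.0, 0.5],
              [2.0, 3.0]])

def bellman_opt(v):
    Q = np.array([[c[s, a] + gamma * P[a, s] @ v for a in range(2)]
                  for s in range(2)])
    return Q.min(axis=1)

# Start from an arbitrary V^0 and iterate V^k = B* V^{k-1}.
V = np.array([100.0, -50.0])
for k in range(1, 1001):
    V_next = bellman_opt(V)
    if np.max(np.abs(V_next - V)) < 1e-12:
        break
    V = V_next

# The limit is (numerically) the fixed point of B*.
print(np.allclose(V, bellman_opt(V), atol=1e-10))  # → True
```

Since \(\BELLMAN^*\) is a \(γ\)-contraction, the error shrinks geometrically: roughly \(\log(1/ε)/\log(1/γ)\) iterations suffice for accuracy \(ε\).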

15.3 Optimal time-homogeneous policy

What does \(V^*\) signify? What about a policy \(π^* \in \GREEDY(V^*)\)? We answer these questions below.

Proposition 15.3 Define \[ V^{\text{opt}}_∞(s) := \min_{π} \EXP^π \bigg[ \sum_{t=1}^∞ γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg], \] where the minimum is over all (possibly randomized) history dependent policies. Then, \[ V^{\text{opt}}_∞ = V^*, \] where \(V^*\) is the solution of the Bellman equation.

Since the state and action spaces are finite, without loss of generality, we can assume that \(0 \le c(s,a) \le M\) (shifting the cost by a constant shifts all value functions by the same constant times \(1/(1-γ)\) and does not affect the comparison of policies).

Consider the finite horizon truncation \[ V^{\text{opt}}_T(s) = \min_{π} \EXP^π\bigg[ \sum_{t=1}^T γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg]. \] From the results for finite horizon MDPs, we have that \[ V^{\text{opt}}_T = (\BELLMAN^*)^T V^0 \] where \(V^0\) is the all zeros vector.

Since the per-step cost is nonnegative, \[V^{\text{opt}}_∞(s) \ge V^{\text{opt}}_T(s) = [(\BELLMAN^*)^T V^0](s). \] Taking the limit as \(T \to ∞\), we get that \[\begin{equation}\label{eq:1} V^{\text{opt}}_∞(s) \ge \lim_{T \to ∞} [(\BELLMAN^*)^T V^0](s) = V^*(s). \end{equation}\]

Since \(0 \le c(s,a) \le M\), for any \(T\), \[\begin{align*} V^{\text{opt}}_∞(s) &\le \min_π \EXP^π \bigg[ \sum_{t=1}^T γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg] + \sum_{t=T+1}^∞ γ^{t-1} M \\ &= V^{\text{opt}}_T(s) + γ^T M / (1 - γ) \\ &= [(\BELLMAN^*)^T V^0](s) + γ^T M / (1-γ). \end{align*}\] Taking limit as \(T \to ∞\), we get that \[\begin{equation}\label{eq:2} V^{\text{opt}}_∞(s) \le \lim_{T \to ∞} \big\{ [(\BELLMAN^*)^T V^0](s) + γ^T M / (1-γ) \big\} = V^*(s). \end{equation}\]

From \eqref{eq:1} and \eqref{eq:2}, we get that \(V^{\text{opt}}_∞ = V^*\).

Thus, \(V^*\) is the optimal performance. Now we can formally state the main result.

Theorem 15.2 A time-homogeneous Markov policy \(π^*\) is optimal if and only if it satisfies \(π^* \in \GREEDY(V^*)\), where \(V^*\) is the unique uniformly bounded solution of the following fixed point equation: \[ V^* = \BELLMAN^* V^*. \]

Note: Proof

From Theorem 15.1, we know that the equation \(V^* = \BELLMAN^* V^*\) has a unique bounded solution \(V^*\). From Proposition 15.3, we know that \(V^*\) is the optimal value function.

Take a policy \(π^*\) such that \(π^* \in \GREEDY(V^*)\). Note that \[ V^* = \BELLMAN^* V^* = \BELLMAN^{π^*} V^*. \] Thus, \(V^*\) is a fixed point of \(\BELLMAN^{π^*}\) and, by uniqueness of the fixed point (Proposition 15.1), equal to \(V^{π^*}\). Thus, any policy that is greedy with respect to \(V^*\) has performance \(V^*\) and (by Proposition 15.3) is optimal. Conversely, if a time-homogeneous Markov policy \(π^*\) is optimal, then \(V^{π^*} = V^*\); therefore \(\BELLMAN^{π^*} V^* = \BELLMAN^{π^*} V^{π^*} = V^{π^*} = V^* = \BELLMAN^* V^*\), so \(π^* \in \GREEDY(V^*)\).
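The proof suggests a complete numerical recipe: compute \(V^*\) by iterating \(\BELLMAN^*\), extract a greedy policy \(π^*\), and, as a check, evaluate \(π^*\) exactly via Proposition 15.1. A sketch on a hypothetical 2-state, 2-action MDP (made-up numbers):

```python
import numpy as np

gamma = 0.9
# P[a, s, s'] = P_{ss'}(a); c[s, a] is the per-step cost (made-up numbers).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
c = np.array([[1.0, 0.5],
              [2.0, 3.0]])

def bellman_opt(v):
    Q = np.array([[c[s, a] + gamma * P[a, s] @ v for a in range(2)]
                  for s in range(2)])
    return Q.min(axis=1)

# Step 1: V* by value iteration.
V_star = np.zeros(2)
for _ in range(2000):
    V_star = bellman_opt(V_star)

# Step 2: extract a greedy policy pi* with respect to V*.
Q = np.array([[c[s, a] + gamma * P[a, s] @ V_star for a in range(2)]
              for s in range(2)])
pi = Q.argmin(axis=1)

# Step 3: evaluate pi* via the linear system of Proposition 15.1
# and confirm that V^{pi*} = V* (Theorem 15.2).
P_pi = np.array([P[pi[s], s] for s in range(2)])
c_pi = np.array([c[s, pi[s]] for s in range(2)])
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, c_pi)

print(np.allclose(V_pi, V_star, atol=1e-8))  # → True
```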

15.4 An example

As was the case in the finite horizon setting, in some instances it is possible to solve the Bellman equation explicitly.

Example 15.1 (Optimal Consumption) Consider an investor with wealth \(S_t \in \reals_{\ge 0}\). At each time step, the investor consumes an amount \(A_t \in (0, S_t]\) and invests the remainder in a risky asset with an i.i.d. random rate of return \(W_t > 0\). The wealth dynamics are given by: \[S_{t+1} = (S_t - A_t) W_t.\]

The investor wishes to maximize the expected discounted utility \[\EXP\biggl[\sum_{t=1}^\infty γ^{t-1} \log(A_t)\biggr].\] Find the optimal value function and the optimal consumption policy.

The Bellman optimality equation for maximizing rewards is: \[V(s) = \max_{0 < a \le s} \Big\{ \log(a) + γ \EXP\big[ V( (s-a)W ) \big] \Big\}.\]

We guess that the value function has the form \(V(s) = C \log s + D\). Substituting this into the Bellman equation: \[\begin{align*} C \log s + D &= \max_{0 < a \le s} \Big\{ \log a + γ \EXP[ C \log((s-a)W) + D ] \Big\} \\ &= \max_{0 < a \le s} \Big\{ \log a + γ C \log(s-a) + γ C \EXP[\log W] + γ D \Big\}. \end{align*}\] To find the optimal \(a\), we differentiate the term inside the max with respect to \(a\): \[\frac{1}{a} - \frac{γ C}{s-a} = 0 \implies s - a = γ C a \implies a = \frac{1}{1 + γ C} s.\] Thus, the optimal policy is linear in wealth: \(A^*(s) = \lambda s\). Substituting \(a = \lambda s\) back into the equation and matching coefficients for \(\log s\): \[\log(\lambda s) + γ C \log((1-\lambda)s) = (1 + γ C) \log s + \text{constants}.\] For the LHS to match \(C \log s\) on the RHS, we must have: \[1 + γ C = C \implies C = \frac{1}{1-γ}.\]

Substituting \(C\) back into the expression for \(a\): \[a = \frac{1}{1 + \frac{γ}{1-γ}} s = (1-γ)s.\]

To find the constant \(D\), we substitute the optimal policy \(a = (1-γ)s\) (so that \(s - a = γs\)) back into the Bellman equation. \[ C \log s + D = \log((1-γ)s) + γ C \log(γ s) + γ C \EXP[\log W] + γ D \] We now equate the constant terms (those not attached to \(\log s\)): \[ D = \log(1-γ) + γ C \log γ + γ C \EXP[\log W] + γ D \] Solving for \(D\) (with \(C = 1/(1-γ)\)): \[D = \frac{\log(1-γ)}{1-γ} + \frac{γ \log γ }{(1-γ)^2} + \frac{γ \EXP[\log W]}{(1-γ)^2}\]

Result: The value function is of the form \(V(s) = C \log s + D\), with the values of \(C\) and \(D\) as given above. The optimal policy is \[π^*(s) = (1-γ)s.\] Thus, it is optimal to consume a constant fraction of wealth at every step, regardless of the distribution of \(W\).
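As a numerical sanity check of the algebra (not part of the original example), one can verify that the guess \(V(s) = C\log s + D\) satisfies the Bellman equation and that the maximizer is near \((1-γ)s\). For illustration we assume \(\EXP[\log W] = μ\) (e.g., a log-normal return) and maximize over a fine grid of consumption levels:

```python
import numpy as np

# Assumed parameters for illustration; mu = E[log W].
gamma, mu = 0.9, 0.05
C = 1 / (1 - gamma)
D = (np.log(1 - gamma) / (1 - gamma)
     + gamma * np.log(gamma) / (1 - gamma) ** 2
     + gamma * mu / (1 - gamma) ** 2)

def V(s):
    return C * np.log(s) + D

s = 2.0
a_grid = np.linspace(1e-6, s - 1e-6, 200001)
# Since E[log W] = mu:  E[V((s - a) W)] = C log(s - a) + C mu + D.
rhs = np.log(a_grid) + gamma * (C * np.log(s - a_grid) + C * mu + D)
a_star = a_grid[np.argmax(rhs)]

print(abs(rhs.max() - V(s)) < 1e-4)           # Bellman equation holds
print(abs(a_star - (1 - gamma) * s) < 1e-3)   # maximizer near (1 - gamma) s
```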

Explicit closed-form solutions for infinite-horizon MDPs are rare; instead, we typically rely on the numerical algorithms discussed in the next section.

15.5 Properties of Bellman operator

Proposition 15.4 The Bellman operator satisfies the following properties

  • Monotonicity. For any \(v, w \in \reals^n\), if \(v \le w\), then \(\BELLMAN^π v \le \BELLMAN^π w\) and \(\BELLMAN^* v \le \BELLMAN^* w\).
  • Discounting. For any \(v \in \reals^n\) and \(m \in \reals\), \(\BELLMAN^π (v + m \ONES) = \BELLMAN^π v + γ m \ONES\) and \(\BELLMAN^* (v + m \ONES) = \BELLMAN^* v + γ m \ONES\).

Recall that \[ \BELLMAN^π v = c_π + γ P_π v. \] So, monotonicity of \(\BELLMAN^π\) follows immediately from the monotonicity of multiplication by a nonnegative matrix.

Let \(μ\) be such that \(\BELLMAN^* w = \BELLMAN^μ w\). Then, \[ \BELLMAN^* v \le \BELLMAN^μ v \stackrel{(a)} \le \BELLMAN^μ w = \BELLMAN^* w, \] where \((a)\) uses the monotonicity of \(\BELLMAN^μ\).

Recall that \[ \BELLMAN^π v = c_π + γ P_π v. \] Thus, \[ \BELLMAN^π(v+m \ONES) = c_π + γ P_π (v+m \ONES) = c_π + γ P_π v + γ m \ONES = \BELLMAN^π v + γ m \ONES.\] Thus, \(\BELLMAN^π\) is discounting. Now consider \[ \BELLMAN^* (v + m \ONES ) = \min_{π} \BELLMAN^π (v+m \ONES) = \min_π (\BELLMAN^π v + γ m \ONES) = \BELLMAN^* v + γ m \ONES.\] Thus, \(\BELLMAN^*\) is discounting.

Proposition 15.5 For any \(V \in \reals^n\),

  • If \(V \ge \BELLMAN^* V\), then \(V \ge V^*\);
  • If \(V \le \BELLMAN^* V\), then \(V \le V^*\);
  • If \(V = \BELLMAN^* V\), then \(V\) is the only vector with this property and \(V = V^*\).

The same bounds are true when \((\BELLMAN^*, V^*)\) is replaced with \((\BELLMAN^π, V^π)\).

We prove the first part. The proof of the other parts is similar.

We are given that \[V \ge \BELLMAN^* V.\] Then, by monotonicity of the Bellman operator, \[ \BELLMAN^* V \ge (\BELLMAN^*)^2 V.\] Continuing this way, we get \[ (\BELLMAN^*)^k V \ge (\BELLMAN^*)^{k+1} V.\] Chaining the above inequalities, we get \[ V \ge (\BELLMAN^*)^{k+1} V.\] Taking the limit as \(k \to ∞\) and using \((\BELLMAN^*)^{k+1} V \to V^*\) (Theorem 15.1), we get \[V \ge V^*.\]

Proposition 15.6 For any \(V \in \reals^n\) and \(m \in \reals\),

  • If \(V + m \ONES \ge \BELLMAN^* V\), then \(V + m \ONES/(1-γ) \ge V^*\);
  • If \(V + m \ONES \le \BELLMAN^* V\), then \(V + m \ONES/(1-γ) \le V^*\);

The same bounds are true when \((\BELLMAN^*, V^*)\) is replaced with \((\BELLMAN^π, V^π)\).

Tip: Remark

The above result can also be stated as follows:

  • \(\displaystyle \NORM{V^π - V}_{∞} \le \frac{1}{1-γ}\NORM{\BELLMAN^π V - V}_{∞}\).
  • \(\displaystyle \NORM{V^* - V}_{∞} \le \frac{1}{1-γ}\NORM{\BELLMAN^* V - V}_{∞}\).

Again, we only prove the first part. The proof of the second part is the same. We have that \[ V + m \ONES \ge \BELLMAN^* V. \] From discounting and monotonicity properties, we get \[ \BELLMAN^* V + γ m \ONES \ge (\BELLMAN^*)^2 V. \] Again, from discounting and monotonicity properties, we get \[ (\BELLMAN^*)^2 V + γ^2 m \ONES \ge (\BELLMAN^*)^3 V. \] Continuing this way, we get \[ (\BELLMAN^*)^k V + γ^k m \ONES \ge (\BELLMAN^*)^{k+1} V. \] Chaining the above inequalities, we get \[ V + \sum_{\ell = 0}^k γ^\ell m \ONES \ge (\BELLMAN^*)^{k+1} V. \] Taking the limit as \(k \to ∞\), we get \[ V + \frac{m}{1-γ} \ONES \ge V^*. \]
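The bound in the remark, \(\NORM{V^* - V}_{∞} \le \NORM{\BELLMAN^* V - V}_{∞}/(1-γ)\), can be checked numerically. A sketch on a made-up 2-state, 2-action MDP with an arbitrary guess \(V\):

```python
import numpy as np

gamma = 0.9
# P[a, s, s'] = P_{ss'}(a); c[s, a] is the per-step cost (made-up numbers).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
c = np.array([[1.0, 0.5],
              [2.0, 3.0]])

def bellman_opt(v):
    Q = np.array([[c[s, a] + gamma * P[a, s] @ v for a in range(2)]
                  for s in range(2)])
    return Q.min(axis=1)

# Compute V* to high accuracy by value iteration.
V_star = np.zeros(2)
for _ in range(2000):
    V_star = bellman_opt(V_star)

# For an arbitrary guess V, the one-step Bellman error bounds the value error.
V = np.array([5.0, -3.0])
lhs = np.max(np.abs(V_star - V))                       # ||V* - V||
rhs = np.max(np.abs(bellman_opt(V) - V)) / (1 - gamma)  # ||B* V - V|| / (1 - gamma)
print(lhs <= rhs)  # → True
```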

15.6 Countable or continuous state and action spaces

The discussion above was restricted to finite state and action spaces. In general, when working with non-finite state and action spaces, we need to be careful about existence of solutions. See Section 5.7 for a discussion.

For infinite horizon problems, there are additional technical challenges. We first point out via examples that the Bellman equation may not have a unique solution.

Example 15.2 (Phantom solutions of DP (Bertsekas 2011)) Consider an MDP with \(\ALPHABET S = [0,∞)\), \(\ALPHABET A = \{a_\circ\}\) (dummy action), \(p(ds'\mid s,a) = δ_{s' - s/γ}\) (so the state deterministically transitions to \(s/γ\)) and \(c(s,a) ≡ 0\). It is clear that the optimal value function \(V^*(s) ≡ 0\). However, the dynamic programming equation is \[ V^*(s) = γ V^*(s/γ) \] and is satisfied by any linear function \(V^*(s) = α s\)!

Warning: Did we just violate the Banach fixed point theorem?

To apply the Banach fixed point theorem, we need to be in a Banach space, i.e., a complete normed metric space. So, what was the Banach space in Theorem 15.1? Implicitly, we had used the space \[ \ALPHABET V^M \coloneqq \{ v \colon \ALPHABET S \to \reals \text{ such that } \NORM{v}_{∞} \le M \} \] where \(M\) is any constant greater than \(\NORM{c}_{∞}/(1-γ)\). Banach fixed point theorem says that DP has a unique fixed point in \(\ALPHABET V^M\). It does not say anything about solutions outside \(\ALPHABET V^M\)!

Among all the solutions, only the one corresponding to \(α = 0\) is bounded, which is also the optimal value function.

Example 15.3 Consider a birth-death Markov chain with \(0\) as an absorbing state. In particular, \(\ALPHABET S = \integers_{\ge 0}\), \(\ALPHABET A = \{a_\circ\}\) (dummy action), and \(c(s,a) = \IND\{s > 0\}\). The transition dynamics are given by: for \(s = 0\), \(P(s'|0,a) = \IND\{s' = 0\}\) and for \(s > 0\), we have \[ P(s'|s,a) = p \IND\{ s' = s+1\} + (1-p) \IND\{s' = s-1\}. \] So, the DP is given by \(V^*(0) = 0\) and for \(s > 0\) \[ V^*(s) = 1 + γ p V^*(s+1) + γ(1-p) V^*(s-1), \] which is a harmonic equation and has a generic solution of the form: \[ V^*(s) = \frac{1}{1-γ} - \biggl[ \frac{1}{1-γ} + C \biggr] q_1^s + C q_2^s \] where \(C\) is an arbitrary constant and \[ q_{1,2} = \frac{1 \pm \sqrt{1 - 4 γ^2 p (1-p)}}{2 γp}. \] It can be verified that \(q_1 \in (0,1)\) and \(q_2 \in (1,∞)\). So, every solution except \(C = 0\) is unbounded.

Again we have multiple solutions of the DP and the correct solution is the only bounded solution (which corresponds to \(C = 0\)).
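The multiplicity of solutions in Example 15.3 is easy to verify numerically: for the parameter values below (chosen arbitrarily for illustration), every member of the family satisfies the recursion and the boundary condition, but only \(C = 0\) yields a bounded solution since \(q_2 > 1\).

```python
import numpy as np

gamma, p = 0.9, 0.4  # arbitrary illustrative parameters
disc = np.sqrt(1 - 4 * gamma**2 * p * (1 - p))
q1 = (1 - disc) / (2 * gamma * p)   # in (0, 1)
q2 = (1 + disc) / (2 * gamma * p)   # in (1, infinity)

def V(s, C):
    """Candidate solution of the DP for an arbitrary constant C."""
    return 1/(1 - gamma) - (1/(1 - gamma) + C) * q1**s + C * q2**s

# Every choice of C satisfies V(0) = 0 and, for s > 0,
# V(s) = 1 + gamma p V(s+1) + gamma (1-p) V(s-1).
ok = True
for C in [0.0, 1.0, -3.0]:
    ok &= abs(V(0, C)) < 1e-12
    for s in range(1, 10):
        resid = V(s, C) - (1 + gamma*p*V(s+1, C) + gamma*(1-p)*V(s-1, C))
        ok &= abs(resid) < 1e-9
print(ok)  # → True
```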

Exercises

Exercise 15.1 (One-step look-ahead error bounds.) Given any \(V \in \reals^n\), let \(π \in \GREEDY(V)\) be a greedy policy with respect to \(V\), i.e., \[\BELLMAN^* V = \BELLMAN^π V.\] Let \(V^π\) denote the performance of policy \(π\), i.e., \(V^π\) is the unique fixed point of \(\BELLMAN^π\), and let \(V^*\) denote the optimal performance, i.e., \(V^*\) is the unique fixed point of \(\BELLMAN^*\). We refer to \(\NORM{V - V^*}_{∞}\) as the value error and \(\NORM{V - V^π}_{∞}\) as the policy error.

Show the following.

  1. The value approximation error of \(V\) can be bounded in terms of the one-step Bellman error of \(V\) as follows:

\[\begin{align*} \NORM{V^* - V}_{∞} &\le \frac{1}{1-γ} \NORM{\BELLMAN^* V - V}_{∞}, \\ \NORM{V^π - V}_{∞} &\le \frac{1}{1-γ} \NORM{\BELLMAN^π V - V}_{∞}. \end{align*}\]

  2. Performing an additional Bellman update on \(V\) reduces the value approximation error by a factor of \(γ\):

\[\begin{align*} \NORM{V^* - \BELLMAN^* V}_{∞} &\le \frac{γ}{1-γ} \NORM{\BELLMAN^* V - V}_{∞}, \\ \NORM{V^π - \BELLMAN^π V}_{∞} &\le \frac{γ}{1-γ} \NORM{\BELLMAN^π V - V}_{∞}. \end{align*}\]

  3. The sub-optimality gap of using the policy \(π\) can be bounded in terms of the one-step Bellman error of \(V\):

\[ \NORM{V^π - V^*}_{∞} \le \frac{2}{1-γ} \NORM{\BELLMAN^* V - V}_{∞}. \]

  4. A tighter bound on the sub-optimality gap can be obtained in terms of the value error of \(V\):

\[ \NORM{V^π - V^*}_{∞} \le \frac{2γ}{1 - γ} \NORM{V - V^*}_{∞}. \]

Exercise 15.2 Let \(\BELLMAN_1\) and \(\BELLMAN_2\) be two Bellman operators on the same space with the same unique fixed point \(v^*\). Show that \(α \BELLMAN_1 + (1-α) \BELLMAN_2\), where \(α \in (0,1)\) is a constant, also has \(v^*\) as its unique fixed point.

Exercise 15.3 Given a Bellman operator \(\BELLMAN\) and \(λ \in (0,1)\), define the operator \(\BELLMAN_{λ}\) as follows: \[ \BELLMAN_{λ} v = (1-λ)\sum_{n=1}^∞ λ^{n-1} \BELLMAN^n v \] where \(\BELLMAN^n\) is the Bellman operator \(\BELLMAN\) applied \(n\)-times. Use the result of Exercise 15.2 to show that \(\BELLMAN_{λ}\) has the same fixed point as \(\BELLMAN\).

Exercise 15.4 Let \(V^*\) denote the optimal value function of an MDP. Suppose that two deterministic policies \(π_1\) and \(π_2\) are both greedy with respect to \(V^*\).

  1. Consider the stochastic policy \(π = α π_1 + (1-α) π_2\), which, at each stage picks policy \(π_1\) with probability \(α\) and picks policy \(π_2\) with probability \((1-α)\), where \(α \in (0,1)\) is a constant. Show that policy \(π\) is also optimal.

  2. Consider the periodic policy \(\bar π = (π_1, π_2, π_1, π_2, \dots)\) which chooses policy \(π_1\) at odd times and policy \(π_2\) at even times. Show that \(\bar π\) is also an optimal policy.

Notes

The discounted cost formulation for infinite-horizon MDPs is due to Blackwell (1965). The operator viewpoint and the use of contraction mappings in this setting were introduced by Denardo (1967).

The material included here draws on several sources. Perhaps the best sources to study this material are the books by Puterman (2014), Whittle (1982), and Bertsekas (2011).

The model of Example 15.1 is due to Phelps (1962).