12  Infinite horizon MDPs

Updated

December 3, 2024

Keywords

infinite horizon, discounted cost, Bellman operator

A common way to approximate systems that run for a very large horizon is to assume that they run for an infinite horizon. There is an inherent homogeneity over time for infinite horizon systems: the future depends only on the current state and not on the current time. Due to this homogeneity over time, we expect that the optimal policy should also be time-homogeneous. Therefore, the optimal policy for an infinite-horizon system should be easier to implement than the optimal policy for a finite horizon system, especially when the horizon is large. This is one of the motivations for studying infinite horizon models.

The most common formulation for infinite horizon models is the discounted setup, where the cost function is assumed to be \[ J(π) = \EXP^{π}\Bigl[ \sum_{t=1}^\infty γ^{t-1} c(S_t, A_t) \Bigr] \] where \(γ \in (0,1)\) is called the discount factor.
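To make the objective concrete, here is a minimal Python sketch that estimates \(J(π)\) by Monte Carlo simulation for a toy two-state MDP under a fixed policy. The transition kernel, costs, and policy below are made-up numbers for illustration only; the infinite sum is truncated at a horizon \(T\) where the tail \(γ^T/(1-γ)\) is negligible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-state, two-action MDP (all numbers are made up for illustration).
# P[a][s, s'] = probability of moving from s to s' under action a; c[s, a] = per-step cost.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
c = np.array([[1.0, 2.0], [0.5, 1.5]])
gamma = 0.9
policy = np.array([0, 1])          # a fixed deterministic policy: policy[s] = action

def discounted_cost(s, T=500):
    """One Monte-Carlo rollout of sum_{t>=1} gamma^{t-1} c(S_t, A_t), truncated at T steps."""
    total = 0.0
    for t in range(T):
        a = policy[s]
        total += gamma**t * c[s, a]
        s = rng.choice(2, p=P[a][s])
    return total

# Average over rollouts to estimate J(pi) when S_1 = 0.
print(np.mean([discounted_cost(0) for _ in range(2000)]))
```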

Discount Factor

The idea of using discounting in MDPs is due to Blackwell (1965).

There are two interpretations of the discount factor \(γ\). The first interpretation is an economic interpretation to determine the present value of a utility that will be received in the future. For example, suppose a decision maker is indifferent between receiving 1 dollar today or \(s\) dollars tomorrow. This means that the decision maker discounts the future at a rate \(1/s\), so \(γ = 1/s\).

The second interpretation is that of an absorbing state. Suppose we are operating a machine that generates a value of $1 each day. However, there is a probability \(p\) that the machine will break down at the end of the day. Thus, the expected return for today is $1 while the expected return for tomorrow is \((1-p)\) (which is the probability that the machine is still working tomorrow). In this case, the discount factor is defined as \((1-p)\). See Shwartz (2001) for a detailed discussion of this alternative.

In the remainder of this section, we will study how to obtain a solution for such infinite horizon discounted cost models.

Note: Throughout this section, we assume that \(\ALPHABET S\) and \(\ALPHABET A\) are finite and \(|\ALPHABET S|= n\) and \(|\ALPHABET A| = m\).

Summary of the main result

The main result for infinite horizon MDPs is that there is no loss of optimality in restricting attention to time-homogeneous deterministic Markov policies. In other words, for any time-varying, possibly history-dependent, and randomized policy, there is a time-homogeneous deterministic Markov policy that performs at least as well.

For finite horizon models, we have already established that there is no loss of optimality in restricting attention to deterministic Markov policies. So, the only new bit here is that for infinite horizon discounted cost models, we can restrict attention to time-homogeneous policies as well.

Intuitively, this seems obvious. The system from time 1 onwards looks the same as the system from time 2 onwards. So, the decision rule which is optimal at time 1 should also be optimal at time 2. This intuition is correct as long as the future behavior is stable, which is always the case for discounted models with bounded per-step cost. We formally establish this result below. Formally establishing this intuition is more subtle in other models such as the infinite-horizon total cost, infinite-horizon long-term average cost, and even discounted cost models with unbounded per-step cost.

12.1 Performance of a time-homogeneous Markov policy

An infinite horizon Markov policy \((π_1, π_2, \ldots)\) is called time-homogeneous1 if \(π_1 = π_2 = π_3 = \cdots\). With a slight abuse of notation, we will denote a time-homogeneous policy \((π,π,\ldots)\) simply by \(π\).

1 Time-homogeneous policies are sometimes also called stationary policies in the literature. I personally feel that the term stationary can be a bit confusing because stationary process has a precise but different meaning in probability theory.

The expected discounted cost of a time-homogeneous Markov policy is given by \[ V^π(s) = \EXP^π\bigg[ \sum_{t=1}^∞ γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg].\]

We now introduce a vector notation. Let \(V^π\) denote the \(n × 1\) vector given by \([V^π]_s = V^π(s)\). Furthermore, define the \(n × 1\) vector \(c_π\) and the \(n × n\) matrix \(P_π\) as follows: \[ [c_π]_s = c(s, π(s)) \quad\text{and}\quad [P_π]_{ss'} = P_{ss'}(π(s)). \]

Proposition 12.1 The performance of a time-homogeneous Markov policy may be computed by solving a system of linear equations given by: \[ V^{π} = (I - γ P_{π})^{-1} c_{π}. \]

We first observe that the dynamics under policy \(π\) are Markovian with transition probability matrix \(P_π\) and cost function \(c_π\). Therefore, \[ \begin{align*} \EXP^π\big[ c(S_t, π(S_t)) \bigm| S_1 = s \big] &= \sum_{s' \in \ALPHABET S} \PR^π(S_t = s' | S_1 = s) c(s', π(s')) \\ &= \sum_{s' \in \ALPHABET S} [P_π^{t-1}]_{ss'} [c_π]_{s'} \\ &= δ_s P_π^{t-1} c_π, \end{align*} \] where \(δ_s\) denotes the row vector with a \(1\) in position \(s\) and zeros elsewhere.

Now consider the performance of policy \(π\), which is given by \[ \begin{align*} V^π &= c_π + γ P_π c_π + γ^2 P_π^2 c_π + \cdots \\ &= c_π + γ P_π \big( c_π + γ P_π c_π + \cdots \big) \\ &= c_π + γ P_π V^π. \end{align*} \] Rearranging terms, we have \[ (I - γ P_π) V^π = c_π. \]

The spectral radius \(ρ(γ P_π)\) of a matrix is upper bounded by any induced matrix norm; in particular, \(ρ(γ P_π) \le \lVert γ P_π \rVert_{∞} = γ < 1\) since the rows of \(P_π\) sum to one. Therefore, the matrix \((I - γ P_π)\) has an inverse and by left multiplying both sides by \((I - γ P_π)^{-1}\), we get \[ V^π = (I - γP_π)^{-1} c_π. \]
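As a sanity check, Proposition 12.1 can be implemented in a few lines. The sketch below uses a toy two-state MDP (all numbers are made-up assumptions) and computes \(V^π\) by solving the linear system \((I - γ P_π) V^π = c_π\).

```python
import numpy as np

# Toy two-state, two-action MDP (all numbers are made up for this sketch).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[a][s, s'] = P_{ss'}(a)
              [[0.5, 0.5], [0.7, 0.3]]])
c = np.array([[1.0, 2.0], [0.5, 1.5]])      # c[s, a]
gamma = 0.9
pi = np.array([0, 1])                       # deterministic policy: pi[s] = action

n = c.shape[0]
P_pi = np.array([P[pi[s]][s] for s in range(n)])   # [P_pi]_{ss'} = P_{ss'}(pi(s))
c_pi = np.array([c[s, pi[s]] for s in range(n)])   # [c_pi]_s = c(s, pi(s))

# Proposition 12.1: V^pi = (I - gamma * P_pi)^{-1} c_pi
V_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)
print(V_pi)
```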

12.2 Bellman operators

To succinctly describe the analysis, we define two operators:

  1. The Bellman operator for policy \(π\) denoted by \(\BELLMAN^π\): For any \(v \in \reals^n\), \[ \BELLMAN^π v = c_{π} + γ P_{π} v. \]

  2. The Bellman optimality operator denoted by \(\BELLMAN^*\): For any \(v \in \reals^n\), \[ [\BELLMAN^* v]_s = \min_{a \in \ALPHABET A} \Big\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) v_{s'} \Big\}.\] We say that a policy \(π\) is a greedy policy with respect to \(v\), written as \(π \in \GREEDY(v)\), if \[ \BELLMAN^* v = \BELLMAN^π v. \]

Note that the Bellman optimality operator may also be written as2 \[ \BELLMAN^* v = \min_{π \in \Pi} \BELLMAN^π v, \] where \(\Pi\) denotes the set of all deterministic Markov policies.

2 This is true for general models only when the arg min at each state exists.
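For concreteness, here is a minimal sketch of the two operators for a finite MDP stored as NumPy arrays; the toy numbers are again made-up assumptions. Note how a greedy policy is obtained by taking an argmin of the state-action expression inside \(\BELLMAN^*\) (stored as the array Q in the sketch).

```python
import numpy as np

# Toy MDP (made-up numbers): P[a][s, s'] and c[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
c = np.array([[1.0, 2.0], [0.5, 1.5]])
gamma = 0.9

def bellman_policy(v, pi):
    """B^pi v = c_pi + gamma * P_pi v for a deterministic policy pi."""
    return np.array([c[s, pi[s]] + gamma * P[pi[s]][s] @ v for s in range(len(v))])

def q_values(v):
    """Q[s, a] = c(s, a) + gamma * sum_{s'} P_{ss'}(a) v(s')."""
    return c + gamma * (P @ v).T   # (P @ v)[a, s] = sum_{s'} P[a, s, s'] v[s']

def bellman_optimal(v):
    """[B^* v]_s = min_a Q[s, a]."""
    return q_values(v).min(axis=1)

def greedy(v):
    """A deterministic policy pi with B^pi v = B^* v (ties broken arbitrarily)."""
    return q_values(v).argmin(axis=1)

v = np.zeros(2)
print(bellman_optimal(v), greedy(v), bellman_policy(v, greedy(v)))
```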

For any \(v \in \reals^n\), let \(\NORM{v}_{∞}\) denote the sup-norm of \(v\), i.e., \(\NORM{v}_{∞} := \sup_{s \in \ALPHABET S} \ABS{v(s)}\). Recall that \((\reals^n, \NORM{\cdot}_{∞})\) is a Banach space.

Proposition 12.2 The operators \(\BELLMAN^π\) and \(\BELLMAN^*\) are contractions under the sup-norm, i.e., for any \(v, w \in \reals^n\), \[ \NORM{\BELLMAN v - \BELLMAN w}_{∞} \le γ \NORM{v - w}_{∞}, \] where \(\BELLMAN \in \{\BELLMAN^π, \BELLMAN^*\}\).

First consider the case of \(\BELLMAN^π\). Note that \[ \BELLMAN^π v - \BELLMAN^π w = (c_{π} + γ P_{π} v) - (c_{π} + γ P_{π} w) = γ P_{π} (v - w). \] Therefore, componentwise, \[ \ABS{\BELLMAN^π v - \BELLMAN^π w} \le γ P_{π} \ABS{v - w} \le γ \NORM{v - w}_{∞} P_{π} \mathbb{1} = γ \NORM{v - w}_{∞} \mathbb{1}, \] which implies \(\NORM{\BELLMAN^π v - \BELLMAN^π w}_{∞} \le γ \NORM{v - w}_{∞}\).

Now consider the case of \(\BELLMAN^*\). Fix a state \(s \in \ALPHABET S\) and consider \([\BELLMAN^* v](s) - [\BELLMAN^* w](s)\). In particular, let \(a^*\) be the optimal action in the right hand side of \([\BELLMAN^* w](s)\). Then, \[\begin{align*} [\BELLMAN^* v - \BELLMAN^* w](s) &= \min_{a \in \ALPHABET A}\bigl\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) v(s') \bigr\} - \min_{a \in \ALPHABET A}\bigl\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) w(s') \bigr\} \\ &\le c(s,a^*) + γ \sum_{s'\in \ALPHABET S} P_{ss'}(a^*) v(s') - c(s,a^*) - γ \sum_{s'\in \ALPHABET S} P_{ss'}(a^*) w(s') \\ &\le γ \sum_{s' \in \ALPHABET S} P_{ss'}(a^*) \| v - w \| \\ &= γ \| v - w \|. \end{align*} \]

By a similar argument, we can show that \([\BELLMAN^* w - \BELLMAN^* v](s) \le γ \| v - w \|\), which proves the other side of the inequality.

Let’s revisit the result of Proposition 12.1, which may be written as \[ V^π = c_π + γ P_π V^π \] We can write the above using Bellman operators as: \[\begin{equation}\label{eq:policy-evaluation} V^π = \BELLMAN^π V^π. \end{equation}\]

Thus, we can equivalently state the result of Proposition 12.1 by saying that \(V^π\) is the unique bounded solution of \(\eqref{eq:policy-evaluation}\), i.e., the unique bounded fixed point of \(\BELLMAN^π\).

What about the Bellman optimality operator? Does the equation \(V^* = \BELLMAN^* V^*\) have a solution? The contraction property (Proposition 12.2) and the Banach fixed point theorem imply the following.

Theorem 12.1 There is a unique bounded \(V^* \in \reals^n\) that satisfies the Bellman equation \[ V^* = \BELLMAN^* V^* \]

Moreover, if we start from any \(V^0 \in \reals^n\) and recursively define \(V^n = \BELLMAN^* V^{n-1}\) then \[ \lim_{n \to ∞} V^n = V^*. \]
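Theorem 12.1 suggests a simple algorithm, usually called value iteration: start from any \(V^0\) and repeatedly apply \(\BELLMAN^*\). A minimal sketch for the same kind of toy MDP (made-up numbers) is given below; the stopping threshold and iteration cap are arbitrary choices.

```python
import numpy as np

# Toy MDP (made-up numbers): P[a][s, s'] and c[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
c = np.array([[1.0, 2.0], [0.5, 1.5]])
gamma = 0.9

def bellman_optimal(v):
    return (c + gamma * (P @ v).T).min(axis=1)

# Value iteration: V^n = B^* V^{n-1} converges to V^* from any starting point.
V = np.zeros(2)
for _ in range(10_000):
    V_next = bellman_optimal(V)
    if np.max(np.abs(V_next - V)) < 1e-10:
        break
    V = V_next
print(V_next)   # numerical approximation of V^*
```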

12.3 Optimal time-homogeneous policy

What does \(V^*\) signify? What about a policy \(π^* \in \GREEDY(V^*)\)? We answer these questions below.

Proposition 12.3 Define \[ V^{\text{opt}}_∞(s) := \min_{π} \EXP^π \bigg[ \sum_{t=1}^∞ γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg], \] where the minimum is over all (possibly randomized) history dependent policies. Then, \[ V^{\text{opt}}_∞ = V^*, \] where \(V^*\) is the solution of the Bellman equation.

Since the state and action spaces are finite, the per-step cost is bounded. Moreover, adding a constant to the per-step cost changes the value of every policy by the same amount; so, without loss of generality, we can assume that \(0 \le c(s,a) \le M\).

Consider the finite horizon truncation \[ V^{\text{opt}}_T(s) = \min_{π} \EXP^π\bigg[ \sum_{t=1}^T γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg]. \] From the results for finite horizon MDPs, we have that \[ V^{\text{opt}}_T = (\BELLMAN^*)^T V^0 \] where \(V^0\) is the all-zeros vector.

Now, since the per-step cost is non-negative, \[V^{\text{opt}}_∞(s) \ge V^{\text{opt}}_T(s) = [(\BELLMAN^*)^T V^0](s). \] Taking the limit as \(T \to ∞\), we get that \[\begin{equation}\label{eq:1} V^{\text{opt}}_∞(s) \ge \lim_{T \to ∞} [(\BELLMAN^*)^T V^0](s) = V^*(s). \end{equation}\]

Since \(0 \le c(s,a) \le M\), for any \(T\), \[ \begin{align*} V^{\text{opt}}_∞(s) &\le \min_π \EXP^π \bigg[ \sum_{t=1}^T γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg] + \sum_{t=T+1}^∞ γ^{t-1} M \\ &= V^{\text{opt}}_T(s) + γ^T M / (1 - γ) \\ &= [(\BELLMAN^*)^T V^0](s) + γ^T M / (1-γ). \end{align*} \] Taking limit as \(T \to ∞\), we get that \[\begin{equation}\label{eq:2} V^{\text{opt}}_∞(s) \le \lim_{T \to ∞} \big\{ [(\BELLMAN^*)^T V^0](s) + γ^T M / (1-γ) \big\} = V^*(s). \end{equation}\]

From \eqref{eq:1} and \eqref{eq:2}, we get that \(V^{\text{opt}}_∞ = V^*\).

Thus, \(V^*\) is the optimal performance. Now we can formally state the main result.

Theorem 12.2 A time-homogeneous Markov policy \(π^*\) is optimal if and only if it satisfies \(π^* \in \GREEDY(V^*)\), where \(V^*\) is the unique uniformly bounded solution of the following fixed point equation: \[ V^* = \BELLMAN^* V^*. \]

Proof

From Theorem 12.1, we know that the equation \(V^* = \BELLMAN^* V^*\) has a unique bounded solution \(V^*\). From Proposition 12.3, we know that \(V^*\) is the optimal value function.

Take a policy \(π^*\) such that \(π^* \in \GREEDY(V^*)\). Note that \[ V^* = \BELLMAN^* V^* = \BELLMAN^{π^*} V^*. \] Thus, \(V^*\) is a fixed point of \(\BELLMAN^{π^*}\) and therefore equal to \(V^{π^*}\). Thus, any policy that is greedy with respect to \(V^*\) has performance \(V^*\) and (by Proposition 12.3) is optimal. Conversely, suppose a time-homogeneous Markov policy \(π^*\) is optimal. Then \(V^{π^*} = V^*\), and therefore \[ \BELLMAN^{π^*} V^* = \BELLMAN^{π^*} V^{π^*} = V^{π^*} = V^* = \BELLMAN^* V^*, \] which shows that \(π^* \in \GREEDY(V^*)\).
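To illustrate Theorem 12.2 numerically, the sketch below (same kind of toy MDP with made-up numbers) computes \(V^*\) by value iteration, extracts a greedy policy \(π^*\), evaluates it by the linear solve of Proposition 12.1, and checks that \(V^{π^*} = V^*\) up to numerical precision.

```python
import numpy as np

# Toy MDP (made-up numbers): P[a][s, s'] and c[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
c = np.array([[1.0, 2.0], [0.5, 1.5]])
gamma, n = 0.9, 2

def bellman_optimal(v):
    return (c + gamma * (P @ v).T).min(axis=1)

# Compute V^* by value iteration.
V = np.zeros(n)
for _ in range(2000):
    V = bellman_optimal(V)

# Extract a greedy policy and evaluate it via Proposition 12.1.
pi_star = (c + gamma * (P @ V).T).argmin(axis=1)
P_pi = np.array([P[pi_star[s]][s] for s in range(n)])
c_pi = np.array([c[s, pi_star[s]] for s in range(n)])
V_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)

print(np.max(np.abs(V_pi - V)))   # should be (numerically) zero, i.e., V^{pi*} = V^*
```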

12.4 Properties of Bellman operator

Proposition 12.4 The Bellman operators \(\BELLMAN^π\) and \(\BELLMAN^*\) satisfy the following properties:

  • Monotonicity. For any \(v, w \in \reals^n\), if \(v \le w\), then \(\BELLMAN^π v \le \BELLMAN^π w\) and \(\BELLMAN^* v \le \BELLMAN^* w\).
  • Discounting. For any \(v \in \reals^n\) and \(m \in \reals\), \(\BELLMAN^π (v + m \ONES) = \BELLMAN^π v + γ m \ONES\) and \(\BELLMAN^* (v + m \ONES) = \BELLMAN^* v + γ m \ONES\).

Recall that \[ \BELLMAN^π v = c_π + γ P_π v. \] So, monotonicity of \(\BELLMAN^π\) follows immediately from the fact that \(P_π\) is a non-negative matrix: if \(v \le w\), then \(P_π v \le P_π w\).

Now consider \(\BELLMAN^*\). Let \(μ\) be a policy such that \(\BELLMAN^* w = \BELLMAN^μ w\). Then, \[ \BELLMAN^* v \le \BELLMAN^μ v \stackrel{(a)} \le \BELLMAN^μ w = \BELLMAN^* w, \] where \((a)\) uses the monotonicity of \(\BELLMAN^μ\).

Recall that \[ \BELLMAN^π v = c_π + γ P_π v. \] Thus, \[ \BELLMAN^π(v+m \ONES) = c_π + γ P_π (v+m \ONES) = c_π + γ P_π v + γ m \ONES = \BELLMAN^π v + γ m \ONES.\] Thus, \(\BELLMAN^π\) is discounting. Now consider \[ \BELLMAN^* (v + m \ONES ) = \min_{π} \BELLMAN^π (v+m \ONES) = \min_π (\BELLMAN^π v + γ m \ONES) = \BELLMAN^* v + γ m \ONES.\] Thus, \(\BELLMAN^*\) is discounting.
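As a quick numerical illustration of Proposition 12.4, the sketch below checks monotonicity and discounting of \(\BELLMAN^*\) on the toy MDP used earlier (made-up numbers); the random vectors \(v\) and \(w\) are arbitrary.

```python
import numpy as np

# Toy MDP (made-up numbers): P[a][s, s'] and c[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
c = np.array([[1.0, 2.0], [0.5, 1.5]])
gamma = 0.9

def bellman_optimal(v):
    return (c + gamma * (P @ v).T).min(axis=1)

rng = np.random.default_rng(1)
v = rng.uniform(0, 10, size=2)
w = v + rng.uniform(0, 5, size=2)                 # so that v <= w componentwise
m = 3.0

print(np.all(bellman_optimal(v) <= bellman_optimal(w)))                      # monotonicity
print(np.allclose(bellman_optimal(v + m), bellman_optimal(v) + gamma * m))   # discounting
```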

Proposition 12.5 For any \(V \in \reals^n\),

  • If \(V \ge \BELLMAN^* V\), then \(V \ge V^*\);
  • If \(V \le \BELLMAN^* V\), then \(V \le V^*\);
  • If \(V = \BELLMAN^* V\), then \(V\) is the only vector with this property and \(V = V^*\).

The same bounds are true when \((\BELLMAN^*, V^*)\) is replaced with \((\BELLMAN^π, V^π)\).

We prove the first part. The proof of the other parts is similar.

We are given that \[V \ge \BELLMAN^* V.\] Then, by monotonicity of the Bellman operator, \[ \BELLMAN^* V \ge (\BELLMAN^*)^2 V.\] Continuing this way, we get \[ (\BELLMAN^*)^k V \ge (\BELLMAN^*)^{k+1} V.\] Combining the above inequalities, we get \[ V \ge (\BELLMAN^*)^{k+1} V.\] Taking the limit as \(k \to ∞\), we get \[V \ge V^*.\]

Proposition 12.6 For any \(V \in \reals^n\) and \(m \in \reals\),

  • If \(V + m \ONES \ge \BELLMAN^* V\), then \(V + m \ONES/(1-γ) \ge V^*\);
  • If \(V + m \ONES \le \BELLMAN^* V\), then \(V + m \ONES/(1-γ) \le V^*\);

The same bounds are true when \((\BELLMAN^*, V^*)\) is replaced with \((\BELLMAN^π, V^π)\).

Remark

The above result can also be stated as follows:

  • \(\displaystyle \| V^π - V \| \le \frac{1}{1-γ}\| \BELLMAN^π V - V \|\).
  • \(\displaystyle \| V^* - V \| \le \frac{1}{1-γ}\| \BELLMAN^* V - V \|\).

Again, we only prove the first part. The proof of the second part is the same. We have that \[ V + m \ONES \ge \BELLMAN^* V. \] From the discounting and monotonicity properties, we get \[ \BELLMAN^* V + γ m \ONES \ge (\BELLMAN^*)^2 V. \] Again, from the discounting and monotonicity properties, we get \[ (\BELLMAN^*)^2 V + γ^2 m \ONES \ge (\BELLMAN^*)^3 V. \] Continuing this way, we get \[ (\BELLMAN^*)^k V + γ^k m \ONES \ge (\BELLMAN^*)^{k+1} V. \] Combining all the above inequalities, we get \[ V + \sum_{\ell = 0}^k γ^\ell m \ONES \ge (\BELLMAN^*)^{k+1} V. \] Taking the limit as \(k \to ∞\), we get \[ V + m \ONES/(1-γ) \ge V^*. \]
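The remark above gives a computable bound on how far an arbitrary guess \(V\) is from \(V^π\). The sketch below checks the first bullet numerically on the toy MDP used earlier (made-up numbers); the guess \(V\) is a random vector.

```python
import numpy as np

# Toy MDP (made-up numbers): P[a][s, s'] and c[s, a]; fixed policy pi.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
c = np.array([[1.0, 2.0], [0.5, 1.5]])
gamma, n = 0.9, 2
pi = np.array([0, 1])

P_pi = np.array([P[pi[s]][s] for s in range(n)])
c_pi = np.array([c[s, pi[s]] for s in range(n)])
V_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)   # exact V^pi (Proposition 12.1)

rng = np.random.default_rng(0)
V = rng.uniform(0, 20, size=n)                            # an arbitrary guess
BV = c_pi + gamma * P_pi @ V                              # B^pi V

lhs = np.max(np.abs(V_pi - V))                            # ||V^pi - V||
rhs = np.max(np.abs(BV - V)) / (1 - gamma)                # ||B^pi V - V|| / (1 - gamma)
print(lhs, rhs, lhs <= rhs + 1e-12)                       # the bound holds
```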

12.5 Countable or continuous state and action spaces

The discussion above was restricted to finite state and action spaces. In general, when working with non-finite state and action spaces, we need to be careful about existence of solutions. See Section 5.7 for a discussion.

For infinite horizon problems, there are additional technical challenges. We first point out via examples that the Bellman equation may not have a unique solution.

Example 12.1 (Phantom solutions of DP (Bertsekas 2011)) Consider an MDP with \(\ALPHABET S = [0,∞)\), \(\ALPHABET A = \{a_\circ\}\) (dummy action), \(p(ds'\mid s,a) = δ_{s' - s/γ}\) (so the state deterministically transitions to \(s/γ\)) and \(c(s,a) ≡ 0\). It is clear that the optimal value function \(V^*(s) ≡ 0\). However, the dynamic programming equation is \[ V^*(s) = γ V^*(s/γ) \] and is satisfied by any linear function \(V^*(s) = α s\)!

Did we just violate the Banach fixed point theorem?

To apply the Banach fixed point theorem, we need to be in a Banach space, i.e., a complete normed vector space. So, what was the Banach space in Theorem 12.1? Implicitly, we had used the space \[ \ALPHABET V^M \coloneqq \{ v \colon \ALPHABET S \to \reals \text{ such that } \| v \|_{∞} \le M \} \] where \(M\) is any constant greater than \(\|c\|_{∞}/(1-γ)\). The Banach fixed point theorem says that the DP equation has a unique solution in \(\ALPHABET V^M\). It does not say anything about solutions outside \(\ALPHABET V^M\)!

Among all the solutions, only the one corresponding to \(α = 0\) is bounded, which is also the optimal value function.

Example 12.2 Consider a birth-death Markov chain with \(0\) as an absorbing state. In particular, \(\ALPHABET S = \integers_{\ge 0}\), \(\ALPHABET A = \{a_\circ\}\) (dummy action), and \(c(s,a) = \IND\{s > 0\}\). The transition dynamics are given by: for \(s = 0\), \(P(s'|0,a) = \IND\{s' = 0\}\) and for \(s > 0\), we have \[ P(s'|s,a) = p \IND\{ s' = s+1\} + (1-p) \IND\{s' = s-1\}. \] So, the DP is given by \(V^*(0) = 0\) and for \(s > 0\) \[ V^*(s) = 1 + γ p V^*(s+1) + γ(1-p) V^*(s-1), \] which is a harmonic equation and has a generic solution of the form: \[ V^*(s) = \frac{1}{1-γ} - \biggl[ \frac{1}{1-γ} + C \biggr] q_1^s + C q_2^s \] where \(C\) is an arbitrary constant and \[ q_{1,2} = \frac{1 \pm \sqrt{1 - 4 γ^2 p (1-p)}}{2 γp}. \] It can be verified that \(q_1 \in (0,1)\) and \(q_2 \in (1,∞)\). So, every solution except the one with \(C = 0\) is unbounded.

Again we have multiple solutions of the DP and the correct solution is the only bounded solution (which corresponds to \(C = 0\)).
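The multiplicity of solutions in Example 12.2 is easy to check numerically. The sketch below plugs the generic solution into the DP recursion for a few values of \(C\) (the values of \(γ\), \(p\), and \(C\) are arbitrary choices): every \(C\) satisfies the recursion and the boundary condition \(V^*(0) = 0\), but only \(C = 0\) stays bounded as \(s\) grows.

```python
import numpy as np

gamma, p = 0.9, 0.4                       # arbitrary illustrative values
disc = np.sqrt(1 - 4 * gamma**2 * p * (1 - p))
q1 = (1 - disc) / (2 * gamma * p)         # q1 in (0, 1)
q2 = (1 + disc) / (2 * gamma * p)         # q2 in (1, infinity)

def V(s, C):
    """Generic solution of the DP equation in Example 12.2."""
    return 1/(1 - gamma) - (1/(1 - gamma) + C) * q1**s + C * q2**s

for C in [0.0, 1.0, -2.5]:
    # Residual of V(s) = 1 + gamma*p*V(s+1) + gamma*(1-p)*V(s-1) for s = 1, ..., 19.
    residual = max(abs(V(s, C) - (1 + gamma*p*V(s+1, C) + gamma*(1-p)*V(s-1, C)))
                   for s in range(1, 20))
    print(f"C={C:5.1f}  V(0)={V(0, C):.2e}  residual={residual:.2e}  V(50)={V(50, C):.2e}")
```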

Exercises

Exercise 12.1 (One-step look-ahead error bounds.) Given any \(V \in \reals^n\), let \(π\) be such that \(\BELLMAN^* V = \BELLMAN^π V\). Moreover, let \(V^*\) denote the unique fixed point of \(\BELLMAN^*\) and \(V^π\) denote the unique fixed point of \(\BELLMAN^π\). Then, show that

  1. \[ \| V^* - V \| \le \frac{1}{1-γ} \| \BELLMAN^* V - V \|. \]
  2. \[ \| V^* - \BELLMAN^* V \| \le \frac{γ}{1-γ} \| \BELLMAN^* V - V \|. \]
  3. \[ \| V^π - V \| \le \frac{1}{1-γ} \| \BELLMAN^π V - V \|. \]
  4. \[ \| V^π - \BELLMAN^π V \| \le \frac{γ}{1-γ} \| \BELLMAN^π V - V \|. \]
  5. \[ \| V^π - V^* \| \le \frac{2}{1-γ} \| \BELLMAN^* V - V \|. \]
  6. \[ \| V^π - V^* \| \le \frac{2γ}{1 - γ} \| V - V^* \|. \]

Exercise 12.2 Let \(\BELLMAN_1\) and \(\BELLMAN_2\) be two Bellman operators on the same space. Show that if \(\BELLMAN_1\) and \(\BELLMAN_2\) have the same unique fixed point, then \(α \BELLMAN_1 + (1-α) \BELLMAN_2\), where \(α \in (0,1)\) is a constant, also has that point as its unique fixed point.

Exercise 12.3 Given a Bellman operator \(\BELLMAN\) and \(λ \in (0,1)\), define the operator \(\BELLMAN_{λ}\) as follows: \[ \BELLMAN_{λ} v = \sum_{n=1}^∞ λ^{n-1} \BELLMAN^n v \] where \(\BELLMAN^n\) is the Bellman operator \(\BELLMAN\) applied \(n\)-times. Use the result of Exercise 12.2 to show that \(\BELLMAN_{λ}\) has the same fixed point as \(\BELLMAN\).

Exercise 12.4 Let \(V^*\) denote the optimal value function of an MDP. Suppose that two deterministic policies \(π_1\) and \(π_2\) are both greedy with respect to \(V^*\).

  1. Consider the stochastic policy \(π = α π_1 + (1-α) π_2\), which, at each stage picks policy \(π_1\) with probability \(α\) and picks policy \(π_2\) with probability \((1-α)\), where \(α \in (0,1)\) is a constant. Show that policy \(π\) is also optimal.

  2. Consider the periodic policy \(\bar π = (π_1, π_2, π_1, π_2, \dots)\) which chooses policy \(π_1\) at odd times and policy \(π_2\) at even times. Show that \(\bar π\) is also an optimal policy.

Notes

The material included here is referenced from different sources. Perhaps the best sources to study this material are the books by Puterman (2014), Whittle (1982), and Bertsekas (2011).