12 Infinite horizon MDPs
infinite horizon, discounted cost, Bellman operator
A common way to approximate systems that run for a very large horizon is to assume that they run for an infinite horizon. There is an inherent homogeneity over time in infinite horizon systems: the future depends only on the current state and not on the current time. Due to this homogeneity over time, we expect that the optimal policy should also be time-homogeneous. Therefore, the optimal policy for an infinite-horizon system should be easier to implement than the optimal policy for a finite horizon system, especially when the horizon is large. This is one of the motivations for studying infinite horizon models.
The most common formulation for infinite horizon models is the discounted setup, where the cost function is assumed to be \[ J(π) = \EXP^{π}\Bigl[ \sum_{t=1}^\infty γ^{t-1} c_t(S_t, A_t) \Bigr] \] where \(γ \in (0,1)\) is called the discount factor.
The idea of using discounting in MDPs is due to Blackwell (1965).
There are two interpretations of the discount factor \(γ\). The first is an economic interpretation: the discount factor captures the present value of a utility that will be received in the future. For example, suppose a decision maker is indifferent between receiving 1 dollar today or \(s\) dollars tomorrow. This means that the decision maker discounts the future at a rate \(1/s\); each dollar received tomorrow is worth \(1/s\) dollars today, so \(γ = 1/s\).
The second interpretation is that of an absorbing state. Suppose we are operating a machine that generates a value of $1 each day. However, there is a probability \(p\) that the machine will break down at the end of the day. Thus, the expected return for today is $1 while the expected return for tomorrow is \((1-p)\) (which is the probability that the machine is still working tomorrow). In this case, the discount factor is defined as \((1-p)\). See Shwartz (2001) for a detailed discussion of this alternative.
In the remainder of this section, we will study how to obtain a solution for such infinite horizon discounted cost models.
Note: Throughout this section, we assume that \(\ALPHABET S\) and \(\ALPHABET A\) are finite and \(|\ALPHABET S|= n\) and \(|\ALPHABET A| = m\).
12.1 Performance of a time-homogeneous Markov policy
An infinite horizon Markov policy \((π_1, π_2, \ldots)\) is called time-homogeneous1 if \(π_1 = π_2 = π_3 = \cdots\). With a slight abuse of notation, we will denote a time-homogeneous policy \((π,π,\ldots)\) simply by \(π\).
1 Time-homogeneous policies are sometimes also called stationary policies in the literature. I personally feel that the term stationary can be a bit confusing because a stationary process has a precise but different meaning in probability theory.
The expected discounted cost of a time-homogeneous Markov policy is given by \[ V^π(s) = \EXP^π\bigg[ \sum_{t=1}^∞ γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg].\]
We now introduce a vector notation. Let \(V^π\) denote the \(n × 1\) vector given by \([V^π]_s = V^π(s)\). Furthermore, define the \(n × 1\) vector \(c_π\) and the \(n × n\) matrix \(P_π\) as follows: \[ [c_π]_s = c(s, π(s)) \quad\text{and}\quad [P_π]_{ss'} = P_{ss'}(π(s)). \]
Proposition 12.1 The performance of a time-homogeneous Markov policy may be computed by solving a system of linear equations given by: \[ V^{π} = (I - γ P_{π})^{-1} c_{π}. \]
We first observe that the dynamics under policy \(π\) are Markovian with transition probability matrix \(P_π\) and cost function \(c_π\). Therefore, \[ \begin{align*} \EXP^π\big[ c(S_t, π(S_t)) \bigm| S_1 = s \big] &= \sum_{s' \in \ALPHABET S} \PR^π(S_t = s' | S_1 = s) c(s', π(s')) \\ &= \sum_{s' \in \ALPHABET S} [P_π^{t-1}]_{ss'} [c_π]_{s'} \\ &= δ_s P_π^{t-1} c_π, \end{align*} \] where \(δ_s\) denotes the \(1 × n\) row vector with a \(1\) in position \(s\) and \(0\) elsewhere.
Now consider the performance of policy \(π\), which is given by \[ \begin{align*} V^π &= c_π + γ P_π c_π + γ^2 P_π^2 c_π + \cdots \\ &= c_π + γ P_π \big( c_π + γ P_π c_π + \cdots \big) \\ &= c_π + γ P_π V^π. \end{align*} \] Rearranging terms, we have \[ (I - γ P_π) V^π = c_π. \]
The spectral radius \(ρ(γ P_π)\) is upper bounded by the induced norm \(\lVert γ P_π \rVert_{∞} = γ < 1\) (each row of \(P_π\) sums to one). Therefore, the matrix \((I - γ P_π)\) is invertible and by left multiplying both sides by \((I - γ P_π)^{-1}\), we get \[ V^π = (I - γP_π)^{-1} c_π. \]
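As an illustration, here is a minimal numpy sketch of this computation. The names `policy_evaluation`, `P_pi`, and `c_pi` are my own (they do not appear in the text): `P_pi` stores the \(n × n\) matrix \(P_π\) and `c_pi` the \(n × 1\) vector \(c_π\).

```python
import numpy as np

def policy_evaluation(P_pi, c_pi, gamma):
    """Return V^pi by solving the linear system (I - gamma * P_pi) V = c_pi."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)

# A toy two-state example (illustrative numbers only).
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
c_pi = np.array([1.0, 0.0])
V_pi = policy_evaluation(P_pi, c_pi, gamma=0.9)
```

Solving the linear system directly costs \(O(n^3)\); for large \(n\), iterative methods based on the Bellman operators of the next section are typically preferred.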
12.2 Bellman operators
To succinctly describe the analysis, we define two operators:
The Bellman operator for policy \(π\) denoted by \(\BELLMAN^π\): For any \(v \in \reals^n\), \[ \BELLMAN^π v = c_{π} + γ P_{π} v. \]
The Bellman optimality operator denoted by \(\BELLMAN^*\): For any \(v \in \reals^n\): \[ [\BELLMAN^* v]_s = \min_{a \in \ALPHABET A} \Big\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) v_{s'} \Big\}.\] We say that a policy \(π\) is a greedy policy with respect to \(v\), written as \(π \in \GREEDY(v)\), if \[ \BELLMAN^* v = \BELLMAN^π v. \]
Note that the Bellman optimality operator may also be written as2 \[ \BELLMAN^* v = \min_{π \in \Pi} \BELLMAN^π v, \] where \(\Pi\) denotes the set of all deterministic Markov policies.
2 This is true for general models only when the arg min at each state exists.
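For concreteness, here is one possible numpy encoding of these operators (a sketch under my own conventions, which are not from the text): the costs are stored as an \(n × m\) array `c` with `c[s, a] = c(s,a)`, and the transitions as an \(m × n × n\) array `P` with `P[a, s, s'] = P_{ss'}(a)`.

```python
import numpy as np

def bellman_policy(v, P, c, pi, gamma):
    """Apply the Bellman operator B^pi to v, where pi[s] is the action taken in state s."""
    n = len(pi)
    c_pi = c[np.arange(n), pi]        # [c_pi]_s = c(s, pi(s))
    P_pi = P[pi, np.arange(n), :]     # [P_pi]_{s s'} = P_{s s'}(pi(s))
    return c_pi + gamma * P_pi @ v

def bellman_optimality(v, P, c, gamma):
    """Apply the Bellman optimality operator B^* to v.

    Returns B^* v together with a greedy policy with respect to v."""
    Q = c + gamma * (P @ v).T         # Q[s, a] = c(s,a) + gamma * sum_{s'} P_{ss'}(a) v(s')
    return Q.min(axis=1), Q.argmin(axis=1)
```

With this encoding, a policy `pi` is greedy with respect to `v` precisely when `bellman_policy(v, P, c, pi, gamma)` equals the first output of `bellman_optimality(v, P, c, gamma)`.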
For any \(v \in \reals^n\), let \(\NORM{v}_{∞}\) denote the sup-norm of \(v\), i.e., \(\NORM{v}_{∞} := \sup_{s \in \ALPHABET S} \ABS{v(s)}\). Recall that \((\reals^n, \NORM{\cdot}_{∞})\) is a Banach space.
Proposition 12.2 The operators \(\BELLMAN^π\) and \(\BELLMAN^*\) are contractions under the sup-norm, i.e., for any \(v, w \in \reals^n\), \[ \NORM{\BELLMAN v - \BELLMAN w}_{∞} \le γ \NORM{v - w}_{∞}, \] where \(\BELLMAN \in \{\BELLMAN^π, \BELLMAN^*\}\).
First consider the case of \(\BELLMAN^π\). Note that \[ \BELLMAN^π v - \BELLMAN^π w = (c_{π} + γ P_{π} v) - (c_{π} + γ P_{π} w) = γ P_{π} (v - w). \] Therefore, \[ \NORM{\BELLMAN^π v - \BELLMAN^π w}_{∞} = γ \NORM{P_{π} (v - w)}_{∞} \le γ \NORM{v - w}_{∞}, \] where the inequality holds because each component of \(P_{π}(v - w)\) is a convex combination of the components of \(v - w\) (recall that \(P_{π} \mathbb{1} = \mathbb{1}\)).
Now consider the case of \(\BELLMAN^*\). Fix a state \(s \in \ALPHABET S\) and consider \([\BELLMAN^* v](s) - [\BELLMAN^* w](s)\). In particular, let \(a^*\) be an action that achieves the minimum in the definition of \([\BELLMAN^* w](s)\). Then, \[\begin{align*} [\BELLMAN^* v - \BELLMAN^* w](s) &= \min_{a \in \ALPHABET A}\bigl\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) v(s') \bigr\} - \min_{a \in \ALPHABET A}\bigl\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) w(s') \bigr\} \\ &\le c(s,a^*) + γ \sum_{s'\in \ALPHABET S} P_{ss'}(a^*) v(s') - c(s,a^*) - γ \sum_{s'\in \ALPHABET S} P_{ss'}(a^*) w(s') \\ &= γ \sum_{s' \in \ALPHABET S} P_{ss'}(a^*) \bigl[ v(s') - w(s') \bigr] \\ &\le γ \NORM{ v - w }_{∞}. \end{align*} \]
By a similar argument, we can show that \([\BELLMAN^* w - \BELLMAN^* v](s) \le γ \NORM{ v - w }_{∞}\), which proves the other side of the inequality.
Let’s revisit the result of Proposition 12.1, which may be written as \[ V^π = c_π + γ P_π V^π \] We can write the above using Bellman operators as: \[\begin{equation}\label{eq:policy-evaluation} V^π = \BELLMAN^π V^π. \end{equation}\]
Thus, we can equivalently state the result of Proposition 12.1 by saying that \(V^π\) is the unique bounded solution of \(\eqref{eq:policy-evaluation}\), i.e., the unique bounded fixed point of \(\BELLMAN^π\).
What about the Bellman optimality operator? Does the equation \(V^* = \BELLMAN^* V^*\) have a solution? The contraction property (Proposition 12.2) and the Banach fixed point theorem imply the following.
Theorem 12.1 There is a unique bounded \(V^* \in \reals^n\) that satisfies the Bellman equation \[ V^* = \BELLMAN^* V^* \]
Moreover, if we start from any \(V^0 \in \reals^n\) and recursively define \(V^k = \BELLMAN^* V^{k-1}\), then \[ \lim_{k \to ∞} V^k = V^*. \]
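Theorem 12.1 is the basis of the value iteration algorithm: start from an arbitrary \(V^0\) and repeatedly apply \(\BELLMAN^*\). Below is a minimal sketch that reuses the hypothetical `bellman_optimality` helper from the previous snippet; the sup-norm stopping rule is justified by the one-step look-ahead bounds in Exercise 12.1.

```python
def value_iteration(P, c, gamma, tol=1e-8, max_iter=100_000):
    """Iterate V_{k+1} = B^* V_k until the sup-norm change drops below tol."""
    n = c.shape[0]
    v = np.zeros(n)
    for _ in range(max_iter):
        v_next, _ = bellman_optimality(v, P, c, gamma)
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
    return v
```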
12.3 Optimal time-homogeneous policy
What does \(V^*\) signify? What about a policy \(π^* \in \GREEDY(V^*)\)? We answer these questions below.
Proposition 12.3 Define \[ V^{\text{opt}}_∞(s) := \min_{π} \EXP^π \bigg[ \sum_{t=1}^∞ γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg], \] where the minimum is over all (possibly randomized) history dependent policies. Then, \[ V^{\text{opt}}_∞ = V^*, \] where \(V^*\) is the solution of the Bellman equation.
Since the state and action spaces are finite, the per-step cost is bounded. Moreover, adding a constant to the cost shifts both \(V^{\text{opt}}_∞\) and \(V^*\) by the same amount, so without loss of generality we can assume that \(0 \le c(s,a) \le M\).
Consider the finite horizon truncation \[ V^{\text{opt}}_T(s) = \min_{π} \EXP^π\bigg[ \sum_{t=1}^T γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg]. \] From the results for finite horizon MDPs, we have that \[ V^{\text{opt}}_T = (\BELLMAN^*)^T V^0 \] where \(V^0\) is the all zeros vector.
Since the per-step cost is non-negative, the infinite horizon cost of any policy is at least its \(T\)-horizon cost. Therefore, \[V^{\text{opt}}_∞(s) \ge V^{\text{opt}}_T(s) = [(\BELLMAN^*)^T V^0](s). \] Taking the limit as \(T \to ∞\), we get that \[\begin{equation}\label{eq:1} V^{\text{opt}}_∞(s) \ge \lim_{T \to ∞} [(\BELLMAN^*)^T V^0](s) = V^*(s). \end{equation}\]
Since \(0 \le c(s,a) \le M\), for any \(T\), \[ \begin{align*} V^{\text{opt}}_∞(s) &\le \min_π \EXP^π \bigg[ \sum_{t=1}^T γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg] + \sum_{t=T+1}^∞ γ^{t-1} M \\ &= V^{\text{opt}}_T(s) + γ^T M / (1 - γ) \\ &= [(\BELLMAN^*)^T V^0](s) + γ^T M / (1-γ). \end{align*} \] Taking the limit as \(T \to ∞\), we get that \[\begin{equation}\label{eq:2} V^{\text{opt}}_∞(s) \le \lim_{T \to ∞} \big\{ [(\BELLMAN^*)^T V^0](s) + γ^T M / (1-γ) \big\} = V^*(s). \end{equation}\]
From \eqref{eq:1} and \eqref{eq:2}, we get that \(V^{\text{opt}}_∞ = V^*\).
Thus, \(V^*\) is the optimal performance. Now we can formally state the main result.
Theorem 12.2 A time-homogeneous Markov policy \(π^*\) is optimal if and only if it satisfies \(π^* \in \GREEDY(V^*)\), where \(V^*\) is the unique uniformly bounded solution of the following fixed point equation: \[ V^* = \BELLMAN^* V^*. \]
From Theorem 12.1, we know that the equation \(V^* = \BELLMAN^* V^*\) has a unique bounded solution \(V^*\). From Proposition 12.3, we know that \(V^*\) is the optimal value function.
Take a policy \(π^*\) such that \(π^* \in \GREEDY(V^*)\). Note that \[ V^* = \BELLMAN^* V^* = \BELLMAN^{π^*} V^*. \] Thus, \(V^*\) is a fixed point of \(\BELLMAN^{π^*}\) and therefore equal to \(V^{π^*}\). Thus, any policy that is greedy with respect to \(V^*\) has performance \(V^*\) and (by Proposition 12.3) is optimal. Conversely, suppose a time-homogeneous Markov policy \(π\) is optimal. Then \(V^π = V^*\), and since \(V^π = \BELLMAN^π V^π\), we get \(\BELLMAN^π V^* = V^* = \BELLMAN^* V^*\), i.e., \(π \in \GREEDY(V^*)\).
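In code, Theorem 12.2 says that an (approximately) optimal time-homogeneous policy can be read off from \(V^*\) by choosing a greedy action in every state. Here is a sketch that combines the earlier hypothetical helpers (`value_iteration`, `bellman_optimality`, `policy_evaluation`) and sanity-checks the conclusion \(V^{π^*} = V^*\) up to the numerical accuracy of value iteration.

```python
def optimal_policy(P, c, gamma, tol=1e-8):
    """Return a greedy policy with respect to (an approximation of) V*, and V* itself."""
    n = c.shape[0]
    v_star = value_iteration(P, c, gamma, tol)
    _, pi_star = bellman_optimality(v_star, P, c, gamma)
    # Sanity check: the performance of pi_star, computed via Proposition 12.1,
    # should agree with v_star up to a term of order tol / (1 - gamma)
    # (cf. the one-step look-ahead bounds in Exercise 12.1).
    v_pi = policy_evaluation(P[pi_star, np.arange(n), :],
                             c[np.arange(n), pi_star], gamma)
    assert np.max(np.abs(v_pi - v_star)) <= 3 * tol / (1 - gamma)
    return pi_star, v_star
```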
12.4 Properties of Bellman operator
Proposition 12.4 The Bellman operator satisfies the following properties
- Monotonicity. For any \(v, w \in \reals^n\), if \(v \le w\), then \(\BELLMAN^π v \le \BELLMAN^π w\) and \(\BELLMAN^* v \le \BELLMAN^* w\).
- Discounting. For any \(v \in \reals^n\) and \(m \in \reals\), \(\BELLMAN^π (v + m \ONES) = \BELLMAN^π v + γ m \ONES\) and \(\BELLMAN^* (v + m \ONES) = \BELLMAN^* v + γ m \ONES\).
Recall that \[ \BELLMAN^π v = c_π + γ P_π v. \] So, monotonicity of \(\BELLMAN^π\) follows immediately from the fact that \(P_π\) has non-negative entries, so \(P_π v \le P_π w\) whenever \(v \le w\).
Now consider \(\BELLMAN^*\). Let \(μ\) be a greedy policy with respect to \(w\), i.e., \(\BELLMAN^* w = \BELLMAN^μ w\). Then, \[ \BELLMAN^* v \le \BELLMAN^μ v \stackrel{(a)} \le \BELLMAN^μ w = \BELLMAN^* w, \] where \((a)\) uses the monotonicity of \(\BELLMAN^μ\).
Recall that \[ \BELLMAN^π v = c_π + γ P_π v. \] Thus, \[ \BELLMAN^π(v+m \ONES) = c_π + γ P_π (v+m \ONES) = c_π + γ P_π v + γ m \ONES = \BELLMAN^π v + γ m \ONES.\] Thus, \(\BELLMAN^π\) is discounting. Now consider \[ \BELLMAN^* (v + m \ONES ) = \min_{π} \BELLMAN^π (v+m \ONES) = \min_π (\BELLMAN^π v + γ m \ONES) = \BELLMAN^* v + γ m \ONES.\] Thus, \(\BELLMAN^*\) is discounting.
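Both properties are easy to verify numerically. A small sketch on randomly generated data, reusing the hypothetical `bellman_optimality` helper and the array conventions from the earlier snippets:

```python
rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9
c = rng.random((n, m))
P = rng.random((m, n, n))
P /= P.sum(axis=2, keepdims=True)        # make each P[a, s, :] a probability vector

v = rng.random(n)
w = v + rng.random(n)                    # componentwise v <= w
Bv, _ = bellman_optimality(v, P, c, gamma)
Bw, _ = bellman_optimality(w, P, c, gamma)
assert np.all(Bv <= Bw + 1e-12)          # monotonicity

Bv_shift, _ = bellman_optimality(v + 2.0, P, c, gamma)
assert np.allclose(Bv_shift, Bv + gamma * 2.0)   # discounting with m = 2
```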
Proposition 12.5 For any \(V \in \reals^n\),
- If \(V \ge \BELLMAN^* V\), then \(V \ge V^*\);
- If \(V \le \BELLMAN^* V\), then \(V \le V^*\);
- If \(V = \BELLMAN^* V\), then \(V\) is the only vector with this property and \(V = V^*\).
The same bounds are true when \((\BELLMAN^*, V^*)\) is replaced with \((\BELLMAN^π, V^π)\).
We prove the first part. The proof of the other parts is similar.
We are given that \[V \ge \BELLMAN^* V.\] Then, by monotonicity of the Bellman operator, \[ \BELLMAN^* V \ge (\BELLMAN^*)^2 V.\] Continuing this way, we get \[ (\BELLMAN^*)^k V \ge (\BELLMAN^*)^{k+1} V.\] Chaining the above inequalities, we get \[ V \ge (\BELLMAN^*)^{k+1} V.\] Taking the limit as \(k \to ∞\) and using Theorem 12.1, we get \[V \ge V^*.\]
Proposition 12.6 For any \(V \in \reals^n\) and \(m \in \reals\),
- If \(V + m \ONES \ge \BELLMAN^* V\), then \(V + m \ONES/(1-γ) \ge V^*\);
- If \(V + m \ONES \le \BELLMAN^* V\), then \(V + m \ONES/(1-γ) \le V^*\);
The same bounds are true when \((\BELLMAN^*, V^*)\) is replaced with \((\BELLMAN^π, V^π)\).
By choosing \(m = \pm \NORM{\BELLMAN V - V}_{∞}\) in the above result, it can also be stated as follows:
- \(\displaystyle \| V^π - V \| \le \frac{1}{1-γ}\| \BELLMAN^π V - V \|\).
- \(\displaystyle \| V^* - V \| \le \frac{1}{1-γ}\| \BELLMAN^* V - V \|\).
Again, we only prove the first part. The proof of the second part is the same. We have that \[ V + m \ONES \ge \BELLMAN^* V. \] From the discounting and monotonicity properties, we get \[ \BELLMAN^* V + γ m \ONES \ge (\BELLMAN^*)^2 V. \] Again, from the discounting and monotonicity properties, we get \[ (\BELLMAN^*)^2 V + γ^2 m \ONES \ge (\BELLMAN^*)^3 V. \] Continuing this way, we get \[ (\BELLMAN^*)^k V + γ^k m \ONES \ge (\BELLMAN^*)^{k+1} V. \] Chaining the above inequalities, we get \[ V + \sum_{\ell = 0}^k γ^\ell m \ONES \ge (\BELLMAN^*)^{k+1} V. \] Taking the limit as \(k \to ∞\), we get \[ V + m \ONES/(1-γ) \ge V^*. \]
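The norm form of the bound is convenient computationally: a single Bellman backup at a candidate \(V\) certifies how far \(V\) can be from \(V^π\) (or \(V^*\)). A sketch for the \(\BELLMAN^π\) case, reusing the hypothetical `policy_evaluation` and `bellman_policy` helpers from the earlier snippets:

```python
def one_step_bound(v, P, c, pi, gamma):
    """Compare ||V^pi - v||_inf with its upper bound ||B^pi v - v||_inf / (1 - gamma)."""
    n = len(pi)
    v_pi = policy_evaluation(P[pi, np.arange(n), :], c[np.arange(n), pi], gamma)
    lhs = np.max(np.abs(v_pi - v))
    rhs = np.max(np.abs(bellman_policy(v, P, c, pi, gamma) - v)) / (1 - gamma)
    return lhs, rhs      # lhs <= rhs for every v, by the bound above
```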
12.5 Countable or continuous state and action spaces
The discussion above was restricted to finite state and action spaces. In general, when working with non-finite state and action spaces, we need to be careful about existence of solutions. See Section 5.7 for a discussion.
For infinite horizon problems, there are additional technical challenges. We first point out via examples that the Bellman equation may not have a unique solution.
Example 12.1 (Phantom solutions of DP (Bertsekas 2011)) Consider an MDP with \(\ALPHABET S = [0,∞)\), \(\ALPHABET A = \{a_\circ\}\) (dummy action), \(p(ds'\mid s,a) = δ_{s' - s/γ}\) (so the state deterministically transitions to \(s/γ\)) and \(c(s,a) ≡ 0\). It is clear that the optimal value function is \(V^*(s) ≡ 0\). However, the dynamic programming equation is \[ V^*(s) = γ V^*(s/γ) \] and is satisfied by any linear function \(V^*(s) = α s\)!
To apply the Banach fixed point theorem, we need to be in a Banach space, i.e., a complete normed vector space. So, what was the Banach space in Theorem 12.1? Implicitly, we had used the space \[ \ALPHABET V^M \coloneqq \{ v \colon \ALPHABET S \to \reals \text{ such that } \| v \|_{∞} \le M \}, \] where \(M\) is any constant greater than \(\|c\|_{∞}/(1-γ)\). The Banach fixed point theorem says that the Bellman equation has a unique solution in \(\ALPHABET V^M\). It does not say anything about solutions outside \(\ALPHABET V^M\)!
Among all the solutions, only the one corresponding to \(α = 0\) is bounded, which is also the optimal value function.
Example 12.2 Consider a birth-death Markov chain with \(0\) as an absorbing state. In particular, \(\ALPHABET S = \integers_{\ge 0}\), \(\ALPHABET A = \{a_\circ\}\) (dummy action), and \(c(s,a) = \IND\{s > 0\}\). The transition dynamics are given by: for \(s = 0\), \(P(s'|0,a) = \IND\{s' = 0\}\) and for \(s > 0\), we have \[ P(s'|s,a) = p \IND\{ s' = s+1\} + (1-p) \IND\{s' = s-1\}. \] So, the DP is given by \(V^*(0) = 0\) and for \(s > 0\) \[ V^*(s) = 1 + γ p V^*(s+1) + γ(1-p) V^*(s-1), \] which is a harmonic equation and has a generic solution of the form: \[ V^*(s) = \frac{1}{1-γ} - \biggl[ \frac{1}{1-γ} + C \biggr] q_1^s + C q_2^s \] where \(C\) is an arbitrary constant and \[ q_{1} = \frac{1 - \sqrt{1 - 4 γ^2 p (1-p)}}{2 γp}, \qquad q_{2} = \frac{1 + \sqrt{1 - 4 γ^2 p (1-p)}}{2 γp}. \] It can be verified that \(q_1 \in (0,1)\) and \(q_2 \in (1,∞)\). So, every solution except the one with \(C = 0\) is unbounded.
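For completeness, here is a quick calculation (not spelled out above) of where the constant \(1/(1-γ)\) and the roots \(q_1, q_2\) come from. Substituting the trial solution \(V^*(s) = q^s\) into the homogeneous part of the recursion gives the characteristic equation \[ γ p q^2 - q + γ(1-p) = 0, \] whose roots are \(q_1\) and \(q_2\) above. A constant particular solution \(V^*(s) = K\) must satisfy \(K = 1 + γ K\), i.e., \(K = 1/(1-γ)\). The general solution is therefore \(\frac{1}{1-γ} + A q_1^s + C q_2^s\), and the boundary condition \(V^*(0) = 0\) forces \(A = -\bigl[\frac{1}{1-γ} + C\bigr]\), which gives the family of solutions stated above.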
Again we have multiple solutions of the DP and the correct solution is the only bounded solution (which corresponds to \(C = 0\)).
Exercises
Exercise 12.1 (One-step look-ahead error bounds.) Given any \(V \in \reals^n\), let \(π\) be such that \(\BELLMAN^* V = \BELLMAN^π V\). Moreover, let \(V^*\) denote the unique fixed point of \(\BELLMAN^*\) and \(V^π\) denote the unique fixed point of \(\BELLMAN^π\). Then, show that
- \[ \| V^* - V \| \le \frac{1}{1-γ} \| \BELLMAN^* V - V \|. \]
- \[ \| V^* - \BELLMAN^* V \| \le \frac{γ}{1-γ} \| \BELLMAN^* V - V \|. \]
- \[ \| V^π - V \| \le \frac{1}{1-γ} \| \BELLMAN^π V - V \|. \]
- \[ \| V^π - \BELLMAN^π V \| \le \frac{γ}{1-γ} \| \BELLMAN^π V - V \|. \]
- \[ \| V^π - V^* \| \le \frac{2}{1-γ} \| \BELLMAN^* V - V \|. \]
- \[ \| V^π - V^* \| \le \frac{2γ}{1 - γ} \| V - V^* \|. \]
Notes
The material included here is referenced from different sources. Perhaps the best sources to study this material are the books by Puterman (2014), Whittle (1982), and Bertsekas (2011).