10 Infinite horizon MDPs
A common way to approximate systems that run for a very large horizon is to assume that they run for an infinite horizon. There is an inherent homogeneity over time for infinite horizon systems: the future depends only on the current state and not on the current time. Due to this homogeneity over time, we expect the optimal policy to be time-homogeneous as well. Therefore, the optimal policy for an infinite-horizon system should be easier to implement than the optimal policy for a finite-horizon system, especially when the horizon is large. This is one of the motivations for studying infinite horizon models.
The most common formulation for infinite horizon models is the discounted setup, where the cost function is assumed to be \[ J(π) = \EXP\Bigl[ \sum_{t=1}^\infty γ^{t-1} c_t(S_t, A_t) \Bigr] \] where \(γ \in (0,1)\) is called the discount factor.
In the remainder of this section, we will study how to obtain a solution for such infinite horizon discounted cost models.
Note: Throughout this section, we assume that \(\ALPHABET S\) and \(\ALPHABET A\) are finite and \(|\ALPHABET S|= n\) and \(|\ALPHABET A| = m\).
10.1 Performance of a time-homogeneous Markov policy
For any \(π \colon \ALPHABET S \to \ALPHABET A\), consider the time homogeneous policy \((π, π, \dots)\). For ease of notation, we denote this policy simply by \(π\). The expected discounted cost under this policy is given by \[ V_π(s) = \EXP^π\bigg[ \sum_{t=1}^∞ γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg].\]
To get a compact expression for this, define an \(n × 1\) vector \(c_π\) and an \(n × n\) matrix \(P_π\) as follows: \[ [c_π]_s = c(s, π(s)) \quad\text{and}\quad [P_π]_{ss'} = P_{ss'}(π(s)). \] Then the dynamics under policy \(π\) are Markovian with transition probability matrix \(P_π\) and cost function \(c_π\), and \[ \begin{align*} \EXP^π\big[ c(S_t, π(S_t)) \bigm| S_1 = s \big] &= \sum_{s' \in \ALPHABET S} \PR^π(S_t = s' | S_1 = s) c(s', π(s')) \\ &= \sum_{s' \in \ALPHABET S} [P_π^{t-1}]_{ss'} [c_π]_{s'} \\ &= δ_s P_π^{t-1} c_π, \end{align*} \] where \(δ_s\) denotes the \(1 × n\) indicator (row) vector with a one in position \(s\).
Let \(V_π\) denote the \(n × 1\) vector given by \([V_π]_s = V_π(s)\). Then, \[ \begin{align*} V_π &= c_π + γ P_π c_π + γ^2 P_π^2 c_π + \cdots \\ &= c_π + γ P_π \big( c_π + γ P_π c_π + \cdots \big) \\ &= c_π + γ P_π V_π, \end{align*} \] which can be rewritten as \[ (I - γ P_π) V_π = c_π. \]
The spectral radius \(ρ(γ P_π)\) is upper bounded by the induced norm \(\lVert γ P_π \rVert_{∞} = γ < 1\) (the rows of the stochastic matrix \(P_π\) sum to one). Therefore, the matrix \((I - γ P_π)\) has an inverse and by left multiplying both sides by \((I - γ P_π)^{-1}\), we get \[ V_π = (I - γP_π)^{-1} c_π. \]
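As a concrete illustration, the linear system above can be solved numerically. The sketch below uses a hypothetical 2-state MDP (the entries of \(P_π\) and \(c_π\) are made-up numbers) and checks the answer against a truncated version of the infinite series defining \(V_π\).

```python
import numpy as np

# Hypothetical 2-state MDP under a fixed policy π (made-up numbers).
gamma = 0.9
P_pi = np.array([[0.7, 0.3],    # [P_π]_{ss'} = P_{ss'}(π(s))
                 [0.4, 0.6]])
c_pi = np.array([1.0, 2.0])     # [c_π]_s = c(s, π(s))

# V_π = (I − γ P_π)^{-1} c_π: solve the linear system instead of inverting.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, c_pi)

# Sanity check against the truncated series Σ_t γ^{t-1} P_π^{t-1} c_π.
V_series = np.zeros(2)
Pt = np.eye(2)                  # holds P_π^{t-1}
for t in range(1, 1001):
    V_series += gamma ** (t - 1) * (Pt @ c_pi)
    Pt = Pt @ P_pi

assert np.allclose(V_pi, V_series)
```

Solving the system with `np.linalg.solve` is preferable to forming the inverse explicitly; both are \(O(n^3)\), but the solve is numerically better behaved.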
The equation \[ V_π = c_π + γ P_π V_π \] is sometimes also written as \[ V_π = \mathcal B_π V_π \] where the operator \(\mathcal B_π\), which is called the Bellman operator, is an operator from \(\reals^n\) to \(\reals^n\) given by \[ \mathcal B_π v = c_π + γ P_π v.\]
10.2 Bellman operators
Definition 10.1 Define the Bellman operator \(\mathcal B : \reals^n \to \reals^n\) as follows: for any \(v \in \reals^n\), \[ [\mathcal B v]_s = \min_{a \in \ALPHABET A} \Big\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) v_{s'} \Big\}. \]
Note that the above may also be written as1 \[ \mathcal B v = \min_{π \in \Pi} \mathcal B_π v, \] where \(\Pi\) denotes the set of all deterministic Markov policies.
1 This is true for general models only when the arg min at each state exists.
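As a sanity check of this identity, the following sketch implements both \(\mathcal B\) and \(\mathcal B_π\) on a hypothetical 2-state, 2-action MDP (all numbers made up) and verifies \(\mathcal B v = \min_{π} \mathcal B_π v\) by enumerating all four deterministic policies.

```python
import numpy as np

# Hypothetical MDP with n = 2 states, m = 2 actions (made-up numbers).
gamma = 0.9
P = np.array([[[0.7, 0.3], [0.4, 0.6]],   # P[a, s, s'] = P_{ss'}(a)
              [[0.2, 0.8], [0.9, 0.1]]])
c = np.array([[1.0, 0.5],                 # c[s, a] = c(s, a)
              [2.0, 1.5]])

def bellman(v):
    """[B v]_s = min_a { c(s,a) + γ Σ_{s'} P_{ss'}(a) v_{s'} }."""
    q = c + gamma * (P @ v).T   # q[s, a]; (P @ v)[a, s] = Σ_{s'} P[a,s,s'] v[s']
    return q.min(axis=1)

def bellman_pi(pi, v):
    """[B_π v]_s = c(s, π(s)) + γ Σ_{s'} P_{ss'}(π(s)) v_{s'}."""
    q = c + gamma * (P @ v).T
    return q[np.arange(len(v)), pi]

# B v equals the componentwise minimum of B_π v over all deterministic π.
v = np.array([1.0, -2.0])
assert np.allclose(bellman(v),
                   np.minimum(bellman_pi([0, 0], v),
                              np.minimum(bellman_pi([0, 1], v),
                                         np.minimum(bellman_pi([1, 0], v),
                                                    bellman_pi([1, 1], v)))))
```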
Proposition 10.1 For any \(v \in \reals^n\), define the norm \(\NORM{v} := \sup_{s \in \ALPHABET S} \ABS{v_s}\). Then, the Bellman operator is a contraction, i.e., for any \(v, w \in \reals^n\), \[ \NORM{\mathcal B v - \mathcal B w} \le γ \NORM{v - w}. \]
An immediate consequence of the contraction property is the following.
Theorem 10.1 There is a unique bounded \(V^* \in \reals^n\) that satisfies the Bellman equation \[ V = \mathcal B V \]
Moreover, if we start from any \(V_0 \in \reals^n\) and recursively define \[ V_n = \mathcal B V_{n-1} \] then \[ \lim_{n \to ∞} V_n = V^*. \]
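Theorem 10.1 suggests a simple numerical procedure, value iteration: iterate \(V_n = \mathcal B V_{n-1}\) until successive iterates agree. A sketch on a hypothetical 2-state, 2-action MDP (made-up numbers):

```python
import numpy as np

# Hypothetical MDP (made-up numbers).
gamma = 0.9
P = np.array([[[0.7, 0.3], [0.4, 0.6]],   # P[a, s, s'] = P_{ss'}(a)
              [[0.2, 0.8], [0.9, 0.1]]])
c = np.array([[1.0, 0.5],                 # c[s, a] = c(s, a)
              [2.0, 1.5]])

def bellman(v):
    return (c + gamma * (P @ v).T).min(axis=1)

v = np.zeros(2)                 # any V_0 works; convergence is geometric
for n in range(1000):
    v_next = bellman(v)
    # Contraction gives ‖V_n − V*‖ ≤ γ^n ‖V_0 − V*‖, so the gap shrinks
    # by a factor γ per iteration.
    if np.max(np.abs(v_next - v)) < 1e-10:
        break
    v = v_next

assert np.allclose(v, bellman(v))   # v is (numerically) the fixed point V*
```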
10.3 Optimal time-homogeneous policy
Proposition 10.2 Define \[ V^{\text{opt}}_∞(s) := \min_{π} \EXP^π \bigg[ \sum_{t=1}^∞ γ^{t-1} c(S_t, A_t) \biggm| S_1 = s \bigg], \] where the minimum is over all (possibly randomized) history dependent policies. Then, \[ V^{\text{opt}}_∞ = V^*, \] where \(V^*\) is the solution of the Bellman equation.
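In a finite MDP, an optimal time-homogeneous policy can be read off from \(V^*\) as the arg min inside the Bellman operator. The sketch below, on a hypothetical 2-state, 2-action MDP (made-up numbers), computes \(V^*\) by value iteration, extracts the greedy policy \(π\), and verifies that \(V_π = V^*\).

```python
import numpy as np

# Hypothetical MDP (made-up numbers).
gamma = 0.9
P = np.array([[[0.7, 0.3], [0.4, 0.6]],   # P[a, s, s'] = P_{ss'}(a)
              [[0.2, 0.8], [0.9, 0.1]]])
c = np.array([[1.0, 0.5],                 # c[s, a] = c(s, a)
              [2.0, 1.5]])

def bellman(v):
    return (c + gamma * (P @ v).T).min(axis=1)

v_star = np.zeros(2)                      # V* by value iteration
for _ in range(2000):
    v_star = bellman(v_star)

# Greedy policy: the arg min of the expression inside the Bellman operator.
q = c + gamma * (P @ v_star).T
pi = q.argmin(axis=1)

# V_π solves (I − γ P_π) V_π = c_π and coincides with V*.
P_pi = P[pi, np.arange(2), :]
c_pi = c[np.arange(2), pi]
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, c_pi)
assert np.allclose(V_pi, v_star)
```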
10.4 Properties of Bellman operator
Proposition 10.3 The Bellman operator satisfies the following properties:
- Monotonicity. For any \(v, w \in \reals^n\), if \(v \le w\), then \(\mathcal B_π v \le \mathcal B_π w\) and \(\mathcal B v \le \mathcal B w\).
- Discounting. For any \(v \in \reals^n\) and \(m \in \reals\), \(\mathcal B_π (v + m \ONES) = \mathcal B_π v + γ m \ONES\) and \(\mathcal B (v + m \ONES) = \mathcal B v + γ m \ONES\).
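Both properties are easy to check numerically. The sketch below does so for \(\mathcal B\) on a hypothetical 2-state, 2-action MDP (made-up numbers); the discounting property holds because each row of \(P_π\) sums to one, so \(P_π \ONES = \ONES\).

```python
import numpy as np

# Hypothetical MDP (made-up numbers).
gamma = 0.9
P = np.array([[[0.7, 0.3], [0.4, 0.6]],   # P[a, s, s'] = P_{ss'}(a)
              [[0.2, 0.8], [0.9, 0.1]]])
c = np.array([[1.0, 0.5],                 # c[s, a] = c(s, a)
              [2.0, 1.5]])

def bellman(v):
    return (c + gamma * (P @ v).T).min(axis=1)

v = np.array([1.0, -3.0])
w = v + np.array([0.5, 2.0])        # w ≥ v componentwise

# Monotonicity: v ≤ w implies B v ≤ B w.
assert np.all(bellman(v) <= bellman(w))

# Discounting: B(v + m·1) = B v + γ m·1 (the scalar m broadcasts to m·1).
m = 2.5
assert np.allclose(bellman(v + m), bellman(v) + gamma * m)
```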
Proposition 10.4 For any \(V \in \reals^n\),
- If \(V \ge \mathcal B V\), then \(V \ge V^*\);
- If \(V \le \mathcal B V\), then \(V \le V^*\);
- If \(V = \mathcal B V\), then \(V\) is the only vector with this property and \(V = V^*\).
The same bounds are true when \((\mathcal B, V^*)\) is replaced with \((\mathcal B_π, V_π)\).
Proposition 10.5 For any \(V \in \reals^n\) and \(m \in \reals\),
- If \(V + m \ONES \ge \mathcal B V\), then \(V + m \ONES/(1-γ) \ge V^*\);
- If \(V + m \ONES \le \mathcal B V\), then \(V + m \ONES/(1-γ) \le V^*\).
The same bounds are true when \((\mathcal B, V^*)\) is replaced with \((\mathcal B_π, V_π)\).
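These bounds can be checked numerically. For any \(V\), taking \(m = \max_s [\mathcal B V - V]_s\) gives the smallest \(m\) with \(V + m\ONES \ge \mathcal B V\) (and the minimum gives the lower bound). The sketch below uses a hypothetical 2-state, 2-action MDP with made-up numbers.

```python
import numpy as np

# Hypothetical MDP (made-up numbers).
gamma = 0.9
P = np.array([[[0.7, 0.3], [0.4, 0.6]],   # P[a, s, s'] = P_{ss'}(a)
              [[0.2, 0.8], [0.9, 0.1]]])
c = np.array([[1.0, 0.5],                 # c[s, a] = c(s, a)
              [2.0, 1.5]])

def bellman(v):
    return (c + gamma * (P @ v).T).min(axis=1)

v_star = np.zeros(2)                   # V* by value iteration
for _ in range(2000):
    v_star = bellman(v_star)

V = np.array([3.0, -1.0])
# Smallest m with V + m·1 ≥ B V, and largest m' with V + m'·1 ≤ B V.
m_hi = np.max(bellman(V) - V)
m_lo = np.min(bellman(V) - V)

assert np.all(V + m_hi / (1 - gamma) >= v_star)   # upper bound on V*
assert np.all(V + m_lo / (1 - gamma) <= v_star)   # lower bound on V*
```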
Exercises
Exercise 10.1 (One-step look-ahead error bounds.) Given any \(V \in \reals^n\), let \(π\) be such that \(\mathcal B V = \mathcal B_π V\). Moreover, let \(V^*\) denote the unique fixed point of \(\mathcal B\) and \(V_π\) denote the unique fixed point of \(\mathcal B_π\). Then, show that
- \[ \| V^* - V \| \le \frac{1}{1-γ} \| \mathcal B V - V \|. \]
- \[ \| V^* - \mathcal B V \| \le \frac{γ}{1-γ} \| \mathcal B V - V \|. \]
- \[ \| V_π - V \| \le \frac{1}{1-γ} \| \mathcal B_π V - V \|. \]
- \[ \| V_π - \mathcal B_π V \| \le \frac{γ}{1-γ} \| \mathcal B_π V - V \|. \]
- \[ \| V_π - V^* \| \le \frac{2}{1-γ} \| \mathcal B V - V \|. \]
- \[ \| V_π - V^* \| \le \frac{2γ}{1 - γ} \| V - V^* \|. \]
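Before proving these bounds, it can be instructive to verify them numerically. The sketch below checks all six on a hypothetical 2-state, 2-action MDP (made-up numbers); \(π\) is greedy with respect to \(V\), so \(\mathcal B V = \mathcal B_π V\), \(V^*\) comes from value iteration, and \(V_π\) from the linear system of Section 10.1.

```python
import numpy as np

# Hypothetical MDP (made-up numbers).
gamma = 0.9
P = np.array([[[0.7, 0.3], [0.4, 0.6]],   # P[a, s, s'] = P_{ss'}(a)
              [[0.2, 0.8], [0.9, 0.1]]])
c = np.array([[1.0, 0.5],                 # c[s, a] = c(s, a)
              [2.0, 1.5]])

def bellman(v):
    return (c + gamma * (P @ v).T).min(axis=1)

sup = lambda x: np.max(np.abs(x))         # the sup-norm of Proposition 10.1

V = np.array([5.0, 0.0])
BV = bellman(V)
pi = (c + gamma * (P @ V).T).argmin(axis=1)   # greedy, so B V = B_π V

v_star = np.zeros(2)                          # V* by value iteration
for _ in range(2000):
    v_star = bellman(v_star)

P_pi = P[pi, np.arange(2), :]                 # V_π = (I − γ P_π)^{-1} c_π
c_pi = c[np.arange(2), pi]
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, c_pi)

err = sup(BV - V)
assert sup(v_star - V)    <= err / (1 - gamma)
assert sup(v_star - BV)   <= gamma * err / (1 - gamma)
assert sup(V_pi - V)      <= err / (1 - gamma)
assert sup(V_pi - BV)     <= gamma * err / (1 - gamma)
assert sup(V_pi - v_star) <= 2 * err / (1 - gamma)
assert sup(V_pi - v_star) <= 2 * gamma * sup(V - v_star) / (1 - gamma)
```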
Notes
The material included here is referenced from different sources. Perhaps the best sources to study this material are the books by Puterman (2014), Whittle (1982), and Bertsekas (2011).