ECSE 506: Stochastic Control and Decision Theory
Aditya Mahajan
Winter 2022
About  Lectures  Notes  Coursework
Consider an MDP with continuous state space \(\ALPHABET X\) and finite action space \(\ALPHABET U\). We denote this MDP by \(M = (\ALPHABET X, \ALPHABET U, c, p)\), where for simplicity we assume that the \(p\) is the density of the transition kernel.
Since the state space is continuous, in general, we cannot compute the value functions exactly. The simplest way to proceed is to discretize the state space \(\ALPHABET X\). In particular, let \(\{\ALPHABET X_1, \dots \ALPHABET X_n\}\) denote a partition of \(\ALPHABET X\) (i.e., \(\bigcup_{i=1}^n \ALPHABET X_i = \ALPHABET X\) and for any \(i \neq j\), \(\ALPHABET X_i \cap \ALPHABET X_j = \emptyset\)). Pick a representative point \(x_i \in \ALPHABET X_i\) and consider the “grid” \(\bar {\ALPHABET X} = \{x_1, \dots, x_n\}\) as the state space of the discretized MDP. We consider a finite state MDP \(\bar M = (\bar {\ALPHABET X}, \ALPHABET U, \bar c, \bar P)\), where \(\bar c\) is the restriction of \(c\) onto \(\bar {\ALPHABET X}\), and \(\bar P\) is given by \[\bar P(x_j  x_i, u) = \int_{\ALPHABET X_j} p(y  x_i, u) dy = p(\ALPHABET X_j  x_i, u). \]
Let’s consider infinite horizon discounted setup. Let \(V \colon \ALPHABET X \to \reals\) and \(g \colon \ALPHABET X \to \ALPHABET U\) be the value function and optimal policy of the original MDP \(M\) and let \(\bar W \colon \bar {\ALPHABET X} \to \reals\) and \(\bar h \colon \bar {\ALPHABET X} \to \ALPHABET U\) be the value function and optimal policy of the discretized MDP \(\bar M\). Define \(W \colon \ALPHABET X \to \reals\) and \(h \colon \ALPHABET X \to \reals\) to be piecewise constant extrapolation of \(\bar W\) and \(\bar h\) from \(\bar {\ALPHABET X}\) to \(\ALPHABET X\).
Note that the policy \(h\) is choosing the same action on all states in \(\ALPHABET X_i\).
We are interested in two questions:
 What is the error if \(W\) is used as an approximation for \(V\)?
 What is the error if the policy \(h\) is used instead of the optimal policy \(g\)?
We will answer these questions under the assumption that \(M\) is a \((L_c, L_p)\)Lipschitz MDP. In particular, we make the following assumptions:
 Assumpt. 1

 \((\ALPHABET X, d_X)\) is a bounded metric space
 For every \(u \in \ALPHABET U\), \(c(\cdot, u)\) is \(L_c\)Lipschitz.
 For every \(u \in \ALPHABET U\), \(p(\cdot  x, u)\) is \(L_p\)Lipschitz (with respect to the Kantorovich distance on probability measures).
Given any set \(\ALPHABET S \subset \ALPHABET X\), the diameter \(\text{diam}(\ALPHABET S)\) is defined as \(\sup_{x,y \in \ALPHABET S} d_X(x,y)\). Since \(d_X\) is assumed to be bounded, diameter of any subset of \(\ALPHABET X\) is finite. Let \(d := \max_{1 \le i \le n} \text{diam}(\ALPHABET X_i)\) denote the largest diameter of the grid cells \(\ALPHABET X_i\).
Let \(\mathcal B\) and \(\bar {\mathcal B}\) denote the Bellman update operators for MDP \(M\) and \(\bar M\).
 Lemma 1

For any function \(\bar w \colon \bar {\ALPHABET X} \to \reals\), let \(w \colon \ALPHABET X \to \reals\) be the piecewise constant extrapolation of \(\bar w\) from \(\bar {\ALPHABET X}\) to \(\ALPHABET X\). Then for any \(x_i \in \bar {\ALPHABET X}\), \[ [\bar {\mathcal B} \bar w](x_i) = [\mathcal B w](x_i).\]
 Remark

The above result states that \(\bar {\mathcal B} \bar w\) and \(\mathcal B w\) agree on the grid points \(\{x_1, \dots, x_n\}\).
Proof
Observe that \[\begin{align} [\bar {\mathcal B} \bar w](x_i) &= \min_{u \in \ALPHABET U} \Bigl\{ c(x_i, u) + β \sum_{j=1}^n \bar P(x_j  x_i, u) \bar w(x_j)\Bigr\} \notag \\ &= \min_{u \in \ALPHABET U} \Bigl\{ c(x_i, u) + β \sum_{j=1}^n p(\ALPHABET X_j  x_i, u) \bar w(x_j) \Bigr\} \notag \\ &= \min_{u \in \ALPHABET U} \Bigl\{ c(x_i, u) + β \sum_{j=1}^n \int_{\ALPHABET X_j} p(y  x_i, u) w(y)dy \Bigr\} \notag \\ &= \min_{u \in \ALPHABET U} \Bigl\{ c(x_i, u) + β \int_{\ALPHABET X} p(y  x_i, u) w(y)dy \Bigr\} \notag \\ &= [\mathcal B w](x_i) \tag*{$\Box$} \end{align}\]
 Lemma 2

For any grid point \(\bar x_i \in \bar {\ALPHABET X}\), we have \[ V(x_i)  \bar W(x_i) \le β \ V  W \. \]
Proof
Note that \(\bar W(x_i) = [\bar {\mathcal B} \bar W](x_i) = [\mathcal B W](x_i)\), where the last equality follows from Lemma 1. Thus, \[\begin{align} V(x_i)  \bar W(x_i) &=  [\mathcal B V](x_i)  [\mathcal B W](x_i)  \notag \\ &\le \ \mathcal BV  \mathcal BW \ \notag \\ &\stackrel{(a)}\le β\ V  W\ \notag \end{align}\] where \((a)\) follows from the discounting property. \(\Box\)
 Theorem 1

Suppose \(\beta L_p < 1\). Then the value functions \(V\) is Lipschitz with Lipschitz constant \(L_v = L_c/(1  β L_p)\). Moreover, \[ \ V  W \ \le \frac{L_v d}{1  β} = \frac{ L_c d } { (1β)(1  βL_p)}. \]
Proof
The Lipschitz continuity of \(V\) follows the results for Lipschitz continuous MDPs.
Now, consider a state \(x \in \ALPHABET X_i\). Then, \[\begin{align*}  V(x)  W(x)  &\le  V(x)  V(x_i) + V(x_i)  W(x) \\ &\le L_v d_X(x, x_i) +  V(x_i)  \bar W(x_i) \\ &\stackrel{(a)}\le L_v d + β\ V  W\ \end{align*}\] where (a) follows from Lemma 1. We get the result by rearranging terms. \(\Box\)
1 State abstraction or compression (Latent space models)
Let \(M = (\ALPHABET X, \ALPHABET U, p, c)\) be an MDP with discrete or continuous state space. Let \((\bar {\ALPHABET X}, d_{\bar X})\) be a metric space. Consider a compression or an abstraction function \(φ \colon \ALPHABET X \to \bar {\ALPHABET X}\). For any \(x \in \ALPHABET X\), \(φ(x)\) is the compressed or abstract state corresponding to the ground state \(x\) and the inverse image \(φ^{1}(\bar x)\) with \(\bar x \in \bar {\ALPHABET X}\) is the set of ground states that correspond to \(\bar x\) under abstraction function \(φ\). Note that \(\{ φ^{1}(\bar x) \mid \bar x \in \bar {\ALPHABET X} \}\) partitions the state space \(\ALPHABET X\).
For example, suppose \(\ALPHABET X = [0,1]^2\) and \(\bar {\ALPHABET X} = [0, 2]\) with \(φ(x_1, x_2) = x_1 + x_2\). Then \(φ^{1}(\bar x) = \{ (x_1, x_2) \in [0, 1]^2 : x_1 + x_2 = \bar x \}\) which is a staight line as shown in the figure on the right.
We want to define an abstract or compressed MDP \(\bar M = (\bar {\ALPHABET X}, \ALPHABET U, \bar p, \bar c)\). Unlike the situation with state quantization where \(\bar {\ALPHABET X}\) was a subset of \(\ALPHABET X\), here we have not imposed any relationship between \(\bar {\ALPHABET X}\) and \(\ALPHABET X\). Therefore, to define \(\bar p\) and \(\bar c\), we need a weighting function \(α \colon \ALPHABET X \to [0, 1]\), such that for each \(\bar x \in \bar {\ALPHABET X}\), \(\int_{x \in φ^{1}(\bar x)} α(x) dx = 1\). Then, for any \(\bar x, \bar y \in \bar{\ALPHABET X}\), define \[\bar c(\bar x, u) = \int_{x \in φ^{1}(\bar x)} α(x)c(x,u)dx, \qquad \bar p(\bar y  \bar x, u) = \int_{x \in φ^{1}(\bar x)} α(x) \int_{y \in φ^{1}(\bar y)} p(y  x, u) dx dy. \]
For example, in the example above, we need to choose a weight function \(α : [0, 1]^2 \to [0, 1]\) such that for any set \(A(\bar x) = \{ (x_1, x_2) \in [0, 1]^2 : x_1 + x_2 = 1\}\), we have that \(\int_{A} α(x) dx = 1\). One way to obtain such a weight function is to pick a probability density \(μ\) over \(\ALPHABET X\) and pick \(α(x)\) in \(φ^{1}(\bar x)\) as the corresponding conditional density given \(φ(x) = \bar x\). Then \(\bar c(\bar x, u)\) may be viewed as the conditional cost \(\EXP_μ[ c(X, u)  φ(X) = \bar x]\) and \(\bar p(\bar y  \bar x, u)\) may be viewed as \(\EXP_μ[ p(\{ Y \in \ALPHABET X : φ(Y) = \bar y \}  X, u)  φ(X) = x]\). If \(\ALPHABET X\) is compact, then the simplest option is to pick \(μ\) as the uniform density over \(\ALPHABET X\).
Now, as before, suppose \(V \colon \ALPHABET X \to \reals\) and \(g \colon \ALPHABET X \to \ALPHABET U\) are the value function and the optimal policy of the original MDP \(M\) and \(\bar W \colon \bar{\ALPHABET X} \to \reals\) and \(\bar h \colon \bar {\ALPHABET X} \to \ALPHABET U\) be the value function and optimal policy of the abstract MDP \(\bar M\). Define \(W \colon \ALPHABET X \to \reals\) and \(h \colon \ALPHABET X \to \ALPHABET U\) as \(W = \bar W ∘ φ\) and \(h = \bar h ∘ φ\).
We are interested in two questions:
 What is the error if \(W\) is used as an approximation for \(V\)?
 What is the error if the policy \(h\) is used instead of policy \(g\)?
We will answer these questions under the assumption that \(\bar M\) is a \((L_{\bar c}, L_{\bar p})\)Lipschitz. In particular, we will make the following assumptions:
For any \(x \in \ALPHABET X\) and \(\bar y \in \bar {\ALPHABET X}\), define \[ \hat p(\bar y  x, u) = \int_{y \in φ^{1}(\bar y)} p(y x, u) dy. \] Define \[\begin{align*} L_c^∞ &= \sup_{x \in \ALPHABET X, u \in \ALPHABET U} \big c(x,u)  \bar c(φ(x), u) \bigr, \\ L_p^∞ &= \sup_{x \in \ALPHABET X, u \in \ALPHABET U} K(\hat p(⋅  x, u), \bar p(⋅  φ(x), u)). \end{align*}\]
 Assumpt.

 The value function \(\bar W\) is Lipschitz
 \(L_c^∞\) and \(L_p^∞\) are bounded.
Note that this is different than state quantization where we were assuming that the original MDP \(M\) is Lipschitz. In this model, we haven’t even assumed that the original state space is a metric space.
 Lemma 3

Suppose the value function \(\bar W\) is Lipschitz with Lipschitz constant \(L_{\bar W}\). Then \[ \ V_h  W \ \le \frac{L_c^∞ + β L_{\bar W} L_p^∞}{1  β}. \]
Proof
Consider a state \(x \in \ALPHABET X\). Then, \[\begin{align} V_h(x) & W(x) =  V_h(x)  \bar W_{\bar h}(φ(x))  \notag \\ &\stackrel{(a)}\le c(x, h(x))  \bar c(φ(x), \bar h(φ(x)) \notag \\ & \quad + β \left \int_{\ALPHABET X} p(y  x, h(x)) V_h(y)dy  \int_{\bar {\ALPHABET X}} p(\bar y φ(x), \bar h(φ(x)) ) \bar W(\bar y) d \bar y \right \label{eq:split} \end{align}\] where \((a)\) follows from triangle inequality. Recall that \(h(x) = \bar h(φ(x))\). Thus, the first term is bounded by \(L_c^∞\). Now, let’s consider the second term of \eqref{eq:split} \[\begin{align} \hskip 2em & \hskip 2em \left \int_{\ALPHABET X} p(y  x, h(x)) V_h(y)dy  \int_{\bar {\ALPHABET X}} p(\bar y φ(x), \bar h(φ(x)) ) \bar W(\bar y) d \bar y \right \notag \\ &\stackrel{(b)}\le \left \int_{\ALPHABET X} p(y  x, h(x)) V_h(y)dy  \int_{\ALPHABET X} p(y x, h(x)) W(y) dy \right \notag \\ &\quad + \left \int_{\ALPHABET X} p(y  x, h(x)) W(y)dy  \int_{\bar {\ALPHABET X}} \bar p(\bar y φ(x), \bar h(\bar φ(x))) \bar W(\bar y) d\bar y \right \notag \\ &\stackrel{(c)}\le \int_{\ALPHABET X} p(y  x, h(x)) \left V_h(y)  W(y) \right dy \notag \\ &\quad + \left \int_{\bar {\ALPHABET X}} \int_{y \in φ^{1}(y)} p(y  x, h(x)) \bar W(\bar y) d y d\bar y  \int_{\bar {\ALPHABET X}} \bar p(\bar y φ(x), \bar h(\bar φ(x))) \bar W(\bar y) d\bar y \right \notag \\ &\le \V_h  W \ + L_{\bar W} L_p^∞ \label{eq:2nd} \end{align}\] where \((b)\) follows from triangle inequality and the first term of \((c)\) follows from triangle inequality and the second term follows from Lipschitz continuity of \(W\) and the definition of Kantorovich distance. We get the result by substituting \eqref{eq:2nd} in \eqref{eq:split} and rearranging terms. \(\Box\)
 Lemma 4

Suppose the value function \(\bar W\) is Lipschitz with Lipschitz constant \(L_{\bar W}\). Then, \[  V_h(x_1)  V_h(x_2)  \le L_{\bar W} d_{\bar X}(φ(x_1), φ(x_2)) + 2 \frac{L_c^∞ + β L_{\bar W} L_P^∞}{1  β}. \]
Proof
From the triangle inequality, we have \[V_h(x_1)  V_h(x_2)  \le V_h(x_1)  W(x_1) + V_h(x_2)  W(x_2) + W(x_1)  W(x_2). \]
Now, the first two terms can be bounded using Lemma 3. To bound the third term, note that \(W(x) = \bar W(φ(x))\). Thus, the third term is bounded by \(L_{\bar W}d_{\bar X}(φ(x_1), φ(x_2))\). Substituting these back in the above equation, give the bounds of the Lemma. \(\Box\)
 Theorem 2

Suppose the value function \(\bar W\) is Lipschitz with Lipschitz constant \(L_{\bar W}\). Then, \[ \ V  W \ \le \frac{L_c^∞ + β L_{\bar W} L_p^∞}{1  β}. \]
Proof
The proof idea is similar to Lemma 3. However, unlike the proof of Lemma 3, the policy used in value function \(V\) is not the same as the policy used in the value function \(W\). Nonetheless, we can exploit the fact that the policy used is optimal for the respective models to get bounds on the difference. In particular, \[ V(x)  W(x) \le V_h(x)  W(x) \le \frac{L_c^∞ + β L_{\bar W} L_p^∞}{1  β}.\]
To bound the other direction, note that \[\begin{align*} W(x) & V(x) = \bar W(φ(x))  V(x) \\ &\le \bar c(φ(x), g(x)) + β\int_{\bar {\ALPHABET X}} \bar p(\bar y  φ(x), g(x)) d\bar y \\ &\quad  \left[ c(x, g(x)) + β\int_{\ALPHABET X} p(y  x, g(x)) dy \right]. \end{align*}\] We can now follow steps similar to Lemma 3 to show that the above difference is bounded by \(\frac{L_c^∞ + β L_{\bar W} L_p^∞}{1  β}\).
Combining the two inequalities above establishes the result. \(\Box\)
 Theorem 3

Suppose the value function \(\bar W\) is Lipschitz with Lipschitz constant \(L_{\bar W}\). Then, \[ \V  V_h\ \le 2\frac{L_c^∞ + β L_{\bar W} L_p^∞}{1  β}. \]
Proof
From triangle inequality, we have \[ V(x)  V_h(x) \le V(x)  W(x) + V_h(x)  W(x). \] the result then follows from Lemma 3 and Theorem 2. \(\Box\)
References
The material on state quantization is taken from Hinderer (2005). These results first appeared in Bertsekas (1975).
The material on state abstraction is taken from Gelada et al. (2019). Surprisingly, Gelada et al. (2019) do not completely specify how to construct \(\bar c\) and \(\bar p\). The construction for these functions is taken from Li et al. (2006). There has been a lot of interest in understanding approximation bounds for state compression. See the references in Gelada et al. (2019).
This entry was last updated on 05 May 2022 and posted in MDP and tagged infinite horizon, discounted cost, lipschitz continuity, approximation bounds, state aggregation.