ECSE 506: Stochastic Control and Decision Theory

Aditya Mahajan
Winter 2022

About  |  Lectures  |  Notes  |  Coursework

Theory: Feature abstraction in MDPs

Consider an MDP with continuous state space \(\ALPHABET X\) and finite action space \(\ALPHABET U\). We denote this MDP by \(M = (\ALPHABET X, \ALPHABET U, c, p)\), where for simplicity we assume that the \(p\) is the density of the transition kernel.

Since the state space is continuous, in general, we cannot compute the value functions exactly. The simplest way to proceed is to discretize the state space \(\ALPHABET X\). In particular, let \(\{\ALPHABET X_1, \dots \ALPHABET X_n\}\) denote a partition of \(\ALPHABET X\) (i.e., \(\bigcup_{i=1}^n \ALPHABET X_i = \ALPHABET X\) and for any \(i \neq j\), \(\ALPHABET X_i \cap \ALPHABET X_j = \emptyset\)). Pick a representative point \(x_i \in \ALPHABET X_i\) and consider the “grid” \(\bar {\ALPHABET X} = \{x_1, \dots, x_n\}\) as the state space of the discretized MDP. We consider a finite state MDP \(\bar M = (\bar {\ALPHABET X}, \ALPHABET U, \bar c, \bar P)\), where \(\bar c\) is the restriction of \(c\) onto \(\bar {\ALPHABET X}\), and \(\bar P\) is given by \[\bar P(x_j | x_i, u) = \int_{\ALPHABET X_j} p(y | x_i, u) dy = p(\ALPHABET X_j | x_i, u). \]

Let’s consider infinite horizon discounted setup. Let \(V \colon \ALPHABET X \to \reals\) and \(g \colon \ALPHABET X \to \ALPHABET U\) be the value function and optimal policy of the original MDP \(M\) and let \(\bar W \colon \bar {\ALPHABET X} \to \reals\) and \(\bar h \colon \bar {\ALPHABET X} \to \ALPHABET U\) be the value function and optimal policy of the discretized MDP \(\bar M\). Define \(W \colon \ALPHABET X \to \reals\) and \(h \colon \ALPHABET X \to \reals\) to be piecewise constant extrapolation of \(\bar W\) and \(\bar h\) from \(\bar {\ALPHABET X}\) to \(\ALPHABET X\).

Note that the policy \(h\) is choosing the same action on all states in \(\ALPHABET X_i\).

We are interested in two questions:

  1. What is the error if \(W\) is used as an approximation for \(V\)?
  2. What is the error if the policy \(h\) is used instead of the optimal policy \(g\)?

We will answer these questions under the assumption that \(M\) is a \((L_c, L_p)\)-Lipschitz MDP. In particular, we make the following assumptions:

Assumpt. 1
  • \((\ALPHABET X, d_X)\) is a bounded metric space
  • For every \(u \in \ALPHABET U\), \(c(\cdot, u)\) is \(L_c\)-Lipschitz.
  • For every \(u \in \ALPHABET U\), \(p(\cdot | x, u)\) is \(L_p\)-Lipschitz (with respect to the Kantorovich distance on probability measures).

Given any set \(\ALPHABET S \subset \ALPHABET X\), the diameter \(\text{diam}(\ALPHABET S)\) is defined as \(\sup_{x,y \in \ALPHABET S} d_X(x,y)\). Since \(d_X\) is assumed to be bounded, diameter of any subset of \(\ALPHABET X\) is finite. Let \(d := \max_{1 \le i \le n} \text{diam}(\ALPHABET X_i)\) denote the largest diameter of the grid cells \(\ALPHABET X_i\).

Let \(\mathcal B\) and \(\bar {\mathcal B}\) denote the Bellman update operators for MDP \(M\) and \(\bar M\).

Lemma 1

For any function \(\bar w \colon \bar {\ALPHABET X} \to \reals\), let \(w \colon \ALPHABET X \to \reals\) be the piecewise constant extrapolation of \(\bar w\) from \(\bar {\ALPHABET X}\) to \(\ALPHABET X\). Then for any \(x_i \in \bar {\ALPHABET X}\), \[ [\bar {\mathcal B} \bar w](x_i) = [\mathcal B w](x_i).\]

Remark

The above result states that \(\bar {\mathcal B} \bar w\) and \(\mathcal B w\) agree on the grid points \(\{x_1, \dots, x_n\}\).

Proof

Observe that \[\begin{align} [\bar {\mathcal B} \bar w](x_i) &= \min_{u \in \ALPHABET U} \Bigl\{ c(x_i, u) + β \sum_{j=1}^n \bar P(x_j | x_i, u) \bar w(x_j)\Bigr\} \notag \\ &= \min_{u \in \ALPHABET U} \Bigl\{ c(x_i, u) + β \sum_{j=1}^n p(\ALPHABET X_j | x_i, u) \bar w(x_j) \Bigr\} \notag \\ &= \min_{u \in \ALPHABET U} \Bigl\{ c(x_i, u) + β \sum_{j=1}^n \int_{\ALPHABET X_j} p(y | x_i, u) w(y)dy \Bigr\} \notag \\ &= \min_{u \in \ALPHABET U} \Bigl\{ c(x_i, u) + β \int_{\ALPHABET X} p(y | x_i, u) w(y)dy \Bigr\} \notag \\ &= [\mathcal B w](x_i) \tag*{$\Box$} \end{align}\]

Lemma 2

For any grid point \(\bar x_i \in \bar {\ALPHABET X}\), we have \[ |V(x_i) - \bar W(x_i)| \le β \| V - W \|. \]

Proof

Note that \(\bar W(x_i) = [\bar {\mathcal B} \bar W](x_i) = [\mathcal B W](x_i)\), where the last equality follows from Lemma 1. Thus, \[\begin{align} |V(x_i) - \bar W(x_i)| &= | [\mathcal B V](x_i) - [\mathcal B W](x_i) | \notag \\ &\le \| \mathcal BV - \mathcal BW \| \notag \\ &\stackrel{(a)}\le β\| V - W\| \notag \end{align}\] where \((a)\) follows from the discounting property. \(\Box\)

Theorem 1

Suppose \(\beta L_p < 1\). Then the value functions \(V\) is Lipschitz with Lipschitz constant \(L_v = L_c/(1 - β L_p)\). Moreover, \[ \| V - W \| \le \frac{L_v d}{1 - β} = \frac{ L_c d } { (1-β)(1 - βL_p)}. \]

Proof

The Lipschitz continuity of \(V\) follows the results for Lipschitz continuous MDPs.

Now, consider a state \(x \in \ALPHABET X_i\). Then, \[\begin{align*} | V(x) - W(x) | &\le | V(x) - V(x_i)| + |V(x_i) - W(x)| \\ &\le L_v d_X(x, x_i) + | V(x_i) - \bar W(x_i)| \\ &\stackrel{(a)}\le L_v d + β\| V - W\| \end{align*}\] where (a) follows from Lemma 1. We get the result by rearranging terms. \(\Box\)

1 State abstraction or compression (Latent space models)

Let \(M = (\ALPHABET X, \ALPHABET U, p, c)\) be an MDP with discrete or continuous state space. Let \((\bar {\ALPHABET X}, d_{\bar X})\) be a metric space. Consider a compression or an abstraction function \(φ \colon \ALPHABET X \to \bar {\ALPHABET X}\). For any \(x \in \ALPHABET X\), \(φ(x)\) is the compressed or abstract state corresponding to the ground state \(x\) and the inverse image \(φ^{-1}(\bar x)\) with \(\bar x \in \bar {\ALPHABET X}\) is the set of ground states that correspond to \(\bar x\) under abstraction function \(φ\). Note that \(\{ φ^{-1}(\bar x) \mid \bar x \in \bar {\ALPHABET X} \}\) partitions the state space \(\ALPHABET X\).

Example of latent space embedding
Example of latent space embedding

For example, suppose \(\ALPHABET X = [0,1]^2\) and \(\bar {\ALPHABET X} = [0, 2]\) with \(φ(x_1, x_2) = x_1 + x_2\). Then \(φ^{-1}(\bar x) = \{ (x_1, x_2) \in [0, 1]^2 : x_1 + x_2 = \bar x \}\) which is a staight line as shown in the figure on the right.

We want to define an abstract or compressed MDP \(\bar M = (\bar {\ALPHABET X}, \ALPHABET U, \bar p, \bar c)\). Unlike the situation with state quantization where \(\bar {\ALPHABET X}\) was a subset of \(\ALPHABET X\), here we have not imposed any relationship between \(\bar {\ALPHABET X}\) and \(\ALPHABET X\). Therefore, to define \(\bar p\) and \(\bar c\), we need a weighting function \(α \colon \ALPHABET X \to [0, 1]\), such that for each \(\bar x \in \bar {\ALPHABET X}\), \(\int_{x \in φ^{-1}(\bar x)} α(x) dx = 1\). Then, for any \(\bar x, \bar y \in \bar{\ALPHABET X}\), define \[\bar c(\bar x, u) = \int_{x \in φ^{-1}(\bar x)} α(x)c(x,u)dx, \qquad \bar p(\bar y | \bar x, u) = \int_{x \in φ^{-1}(\bar x)} α(x) \int_{y \in φ^{-1}(\bar y)} p(y | x, u) dx dy. \]

For example, in the example above, we need to choose a weight function \(α : [0, 1]^2 \to [0, 1]\) such that for any set \(A(\bar x) = \{ (x_1, x_2) \in [0, 1]^2 : x_1 + x_2 = 1\}\), we have that \(\int_{A} α(x) dx = 1\). One way to obtain such a weight function is to pick a probability density \(μ\) over \(\ALPHABET X\) and pick \(α(x)\) in \(φ^{-1}(\bar x)\) as the corresponding conditional density given \(φ(x) = \bar x\). Then \(\bar c(\bar x, u)\) may be viewed as the conditional cost \(\EXP_μ[ c(X, u) | φ(X) = \bar x]\) and \(\bar p(\bar y | \bar x, u)\) may be viewed as \(\EXP_μ[ p(\{ Y \in \ALPHABET X : φ(Y) = \bar y \} | X, u) | φ(X) = x]\). If \(\ALPHABET X\) is compact, then the simplest option is to pick \(μ\) as the uniform density over \(\ALPHABET X\).

Now, as before, suppose \(V \colon \ALPHABET X \to \reals\) and \(g \colon \ALPHABET X \to \ALPHABET U\) are the value function and the optimal policy of the original MDP \(M\) and \(\bar W \colon \bar{\ALPHABET X} \to \reals\) and \(\bar h \colon \bar {\ALPHABET X} \to \ALPHABET U\) be the value function and optimal policy of the abstract MDP \(\bar M\). Define \(W \colon \ALPHABET X \to \reals\) and \(h \colon \ALPHABET X \to \ALPHABET U\) as \(W = \bar W ∘ φ\) and \(h = \bar h ∘ φ\).

We are interested in two questions:

  1. What is the error if \(W\) is used as an approximation for \(V\)?
  2. What is the error if the policy \(h\) is used instead of policy \(g\)?

We will answer these questions under the assumption that \(\bar M\) is a \((L_{\bar c}, L_{\bar p})\)-Lipschitz. In particular, we will make the following assumptions:

For any \(x \in \ALPHABET X\) and \(\bar y \in \bar {\ALPHABET X}\), define \[ \hat p(\bar y | x, u) = \int_{y \in φ^{-1}(\bar y)} p(y| x, u) dy. \] Define \[\begin{align*} L_c^∞ &= \sup_{x \in \ALPHABET X, u \in \ALPHABET U} \big| c(x,u) - \bar c(φ(x), u) \bigr|, \\ L_p^∞ &= \sup_{x \in \ALPHABET X, u \in \ALPHABET U} K(\hat p(⋅ | x, u), \bar p(⋅ | φ(x), u)). \end{align*}\]

Assumpt.
  • The value function \(\bar W\) is Lipschitz
  • \(L_c^∞\) and \(L_p^∞\) are bounded.

Note that this is different than state quantization where we were assuming that the original MDP \(M\) is Lipschitz. In this model, we haven’t even assumed that the original state space is a metric space.

Lemma 3

Suppose the value function \(\bar W\) is Lipschitz with Lipschitz constant \(L_{\bar W}\). Then \[ \| V_h - W \| \le \frac{L_c^∞ + β L_{\bar W} L_p^∞}{1 - β}. \]

Proof

Consider a state \(x \in \ALPHABET X\). Then, \[\begin{align} |V_h(x) &- W(x)| = | V_h(x) - \bar W_{\bar h}(φ(x)) | \notag \\ &\stackrel{(a)}\le |c(x, h(x)) - \bar c(φ(x), \bar h(φ(x))| \notag \\ & \quad + β \left| \int_{\ALPHABET X} p(y | x, h(x)) V_h(y)dy - \int_{\bar {\ALPHABET X}} p(\bar y| φ(x), \bar h(φ(x)) ) \bar W(\bar y) d \bar y \right| \label{eq:split} \end{align}\] where \((a)\) follows from triangle inequality. Recall that \(h(x) = \bar h(φ(x))\). Thus, the first term is bounded by \(L_c^∞\). Now, let’s consider the second term of \eqref{eq:split} \[\begin{align} \hskip 2em & \hskip -2em \left| \int_{\ALPHABET X} p(y | x, h(x)) V_h(y)dy - \int_{\bar {\ALPHABET X}} p(\bar y| φ(x), \bar h(φ(x)) ) \bar W(\bar y) d \bar y \right| \notag \\ &\stackrel{(b)}\le \left| \int_{\ALPHABET X} p(y | x, h(x)) V_h(y)dy - \int_{\ALPHABET X} p(y| x, h(x)) W(y) dy \right| \notag \\ &\quad + \left| \int_{\ALPHABET X} p(y | x, h(x)) W(y)dy - \int_{\bar {\ALPHABET X}} \bar p(\bar y| φ(x), \bar h(\bar φ(x))) \bar W(\bar y) d\bar y \right| \notag \\ &\stackrel{(c)}\le \int_{\ALPHABET X} p(y | x, h(x)) \left| V_h(y) - W(y) \right| dy \notag \\ &\quad + \left| \int_{\bar {\ALPHABET X}} \int_{y \in φ^{-1}(y)} p(y | x, h(x)) \bar W(\bar y) d y d\bar y - \int_{\bar {\ALPHABET X}} \bar p(\bar y| φ(x), \bar h(\bar φ(x))) \bar W(\bar y) d\bar y \right| \notag \\ &\le \|V_h - W \| + L_{\bar W} L_p^∞ \label{eq:2nd} \end{align}\] where \((b)\) follows from triangle inequality and the first term of \((c)\) follows from triangle inequality and the second term follows from Lipschitz continuity of \(W\) and the definition of Kantorovich distance. We get the result by substituting \eqref{eq:2nd} in \eqref{eq:split} and rearranging terms. \(\Box\)

Lemma 4

Suppose the value function \(\bar W\) is Lipschitz with Lipschitz constant \(L_{\bar W}\). Then, \[ | V_h(x_1) - V_h(x_2) | \le L_{\bar W} d_{\bar X}(φ(x_1), φ(x_2)) + 2 \frac{L_c^∞ + β L_{\bar W} L_P^∞}{1 - β}. \]

Proof

From the triangle inequality, we have \[|V_h(x_1) - V_h(x_2) | \le |V_h(x_1) - W(x_1)| + |V_h(x_2) - W(x_2)| + |W(x_1) - W(x_2)|. \]

Now, the first two terms can be bounded using Lemma 3. To bound the third term, note that \(W(x) = \bar W(φ(x))\). Thus, the third term is bounded by \(L_{\bar W}d_{\bar X}(φ(x_1), φ(x_2))\). Substituting these back in the above equation, give the bounds of the Lemma. \(\Box\)

Theorem 2

Suppose the value function \(\bar W\) is Lipschitz with Lipschitz constant \(L_{\bar W}\). Then, \[ \| V - W \| \le \frac{L_c^∞ + β L_{\bar W} L_p^∞}{1 - β}. \]

Proof

The proof idea is similar to Lemma 3. However, unlike the proof of Lemma 3, the policy used in value function \(V\) is not the same as the policy used in the value function \(W\). Nonetheless, we can exploit the fact that the policy used is optimal for the respective models to get bounds on the difference. In particular, \[ V(x) - W(x) \le V_h(x) - W(x) \le \frac{L_c^∞ + β L_{\bar W} L_p^∞}{1 - β}.\]

To bound the other direction, note that \[\begin{align*} W(x) &- V(x) = \bar W(φ(x)) - V(x) \\ &\le \bar c(φ(x), g(x)) + β\int_{\bar {\ALPHABET X}} \bar p(\bar y | φ(x), g(x)) d\bar y \\ &\quad - \left[ c(x, g(x)) + β\int_{\ALPHABET X} p(y | x, g(x)) dy \right]. \end{align*}\] We can now follow steps similar to Lemma 3 to show that the above difference is bounded by \(\frac{L_c^∞ + β L_{\bar W} L_p^∞}{1 - β}\).

Combining the two inequalities above establishes the result. \(\Box\)

Theorem 3

Suppose the value function \(\bar W\) is Lipschitz with Lipschitz constant \(L_{\bar W}\). Then, \[ \|V - V_h\| \le 2\frac{L_c^∞ + β L_{\bar W} L_p^∞}{1 - β}. \]

Proof

From triangle inequality, we have \[ |V(x) - V_h(x)| \le |V(x) - W(x)| + |V_h(x) - W(x)|. \] the result then follows from Lemma 3 and Theorem 2\(\Box\)

References

The material on state quantization is taken from Hinderer (2005). These results first appeared in Bertsekas (1975).

The material on state abstraction is taken from Gelada et al. (2019). Surprisingly, Gelada et al. (2019) do not completely specify how to construct \(\bar c\) and \(\bar p\). The construction for these functions is taken from Li et al. (2006). There has been a lot of interest in understanding approximation bounds for state compression. See the references in Gelada et al. (2019).


Bertsekas, D. 1975. Convergence of discretization procedures in dynamic programming. IEEE Transactions on Automatic Control 20, 3, 415–419. DOI: 10.1109/TAC.1975.1100984.
Gelada, C., Kumar, S., Buckman, J., Nachum, O., and Bellemare, M.G. 2019. DeepMDP: Learning continuous latent space models for representation learning. Proceedings of the 36th international conference on machine learning, PMLR, 2170–2179. Available at: http://proceedings.mlr.press/v97/gelada19a.html.
Hinderer, K. 2005. Lipschitz continuity of value functions in Markovian decision processes. Mathematical Methods of Operations Research 62, 1, 3–22. DOI: 10.1007/s00186-005-0438-1.
Li, L., Walsh, T.J., and Littman, M.L. 2006. Towards a unified theory of state abstraction for MDPs. ISAIM. Available at: http://anytime.cs.umass.edu/aimath06/proceedings/P21.pdf.

This entry was last updated on 05 May 2022 and posted in MDP and tagged infinite horizon, discounted cost, lipschitz continuity, approximation bounds, state aggregation.