ECSE 506: Stochastic Control and Decision Theory

Aditya Mahajan
Winter 2022

About  |  Lectures  |  Notes  |  Coursework

Theory: State aggregation or discretization or quantization

So far, we have studied exact solutions to the dynamic program. When the state space is large (or possibly continuous), an exact solution is not possible due to computational limitations. So, we need to look at approximate solutions.

The simplest form of approximate solution is state aggregation, in which we partition the state space into equivalence classes and assign one state in each class as a representative element of that class. When the state space is continuous, the procedure is called state discretization or state quantization. We will use the state quantization terminology in these notes.

1 System model

Consider an MDP with abstract state space \(\ALPHABET S\) and finite action space \(\ALPHABET A\). We denote this MDP by \(M = (\ALPHABET S, \ALPHABET A, c, p)\). For simplicity, we assume that \(\ALPHABET S\) is continuous (and compact), and that the \(p\) is the density of the transition kernel. Note that we are using the term “probability density” in the engineering sense (so, it may be a combination of a continuous function and delta functions) rather than in the precise measure theoretic sense (where it is the Radon-Nikodym derivative with respect to the Lebesque measure).

If exact computations were possible, we can find an optimal solution by solving the following dynamic program: \[ V = \mathcal B V, \] that is \[ V(s) = \min_{a \in \ALPHABET A} \biggl\{ c(s,a) + \gamma \int_{\ALPHABET S} p(s'|s,a) V(s') ds' \biggr\}, \quad \forall s \in \ALPHABET S. \] Let \(V^*\) denote the optimal value function and \(π^*\) denote the optimal policy.

However, since the state space is continuous, we cannot compute the value functions exactly. The simplest way to proceed is to discretize or quantize the state space \(\ALPHABET S\). In particular, let \(\{\ALPHABET S_1, \dots \ALPHABET S_n\}\) denote a partition of \(\ALPHABET S\) (i.e., \(\bigcup_{i=1}^n \ALPHABET S_i = \ALPHABET S\) and for any \(i \neq j\), \(\ALPHABET S_i \cap \ALPHABET S_j = \emptyset\)). Pick a representative point \(\hat s_i \in \ALPHABET S_i\). We can think of the “grid points” \(\hat {\ALPHABET S} = \{\hat s_1, \dots, \hat s_n\}\) as quantization of the state space \(\ALPHABET S\). To simplify the notation, we define a quantization function \(\phi \colon \ALPHABET S \to \hat {\ALPHABET S}\) which maps all points in \(\ALPHABET S_i\) to the representative element \(\hat s_i\).

We consider a finite state MDP \(\hat M = (\hat {\ALPHABET S}, \ALPHABET A, \hat c, \hat P)\), where \(\hat c\) is the restriction of \(c\) onto \(\hat {\ALPHABET S}\), and \(\hat P\) is given by \[\hat P(\hat s_j | \hat s_i, a) = \int_{\ALPHABET S_j} p(s' | \hat s_i, a) dy = p(\ALPHABET S_j | \hat s_i, a). \]

Suppose \(\hat W^* \colon \hat S \to \reals\) be the optimal value function and \(\hat μ^* \colon \hat {\ALPHABET S} \to \ALPHABET A\) be the optimal policy for the approximate model. Define \(W^* \colon \ALPHABET S \to \reals\) and \(μ^* \colon \ALPHABET S \to \reals\) to be piecewise constant extrapolation of \(\hat W^*\) and \(\hat μ^*\) from \(\hat {\ALPHABET S}\) to \(\ALPHABET S\), i.e., \[ W^*(s) = \hat W^*(\phi(s)) \quad\text{and}\quad μ^*(s) = \hat μ^*(\phi(s)). \] Note that the policy \(μ^*\) chooses the same action on all states in a quantization cell \(\ALPHABET S_i\).

We are interested in two questions:

  1. Error in value approximation: What is the error if \(W^*\) is used as an approximation for \(V^*\)?
  2. Error in policy approximation: What is the error if the policy \(μ^*\) is used instead of the optimal policy \(π^*\)?

2 Bounds on value and policy approximation error

2.1 Preliminary results

We start by some preliminary results to build the intuition behind the approximation bounds. We start with a property of the discretized transition matrix.

Lemma 1

For any \(\hat V \colon \hat {\ALPHABET S} \to \reals\), let \(V \colon \ALPHABET S \to \reals\) be its piecewise constant extrapolation from \(\hat {\ALPHABET S}\) to \(\ALPHABET S\) (i.e., \(V = \hat V \circ \phi\)). Then, for any \(\hat s \in \ALPHABET S\) and \(a \in \ALPHABET A\), we have \[ \int_{\ALPHABET S} p(s' | \hat s,a) V(s') ds' = \sum_{\hat s' \in \hat {\ALPHABET S}} \hat P(\hat s' | \hat s, a) \hat V(\hat s'). \]

Proof

Recall that \(\{\ALPHABET S_1, \dots, \ALPHABET S_n\}\) is a partition of \(\ALPHABET S\) and \(\ALPHABET S_i = \phi^{-1}(\hat s_i)\). Therefore, \[\begin{align*} \int_{\ALPHABET S} p(s'|\hat s, a) V(s') ds' &= \sum_{\hat s' \in \hat {\ALPHABET S}} \int_{\phi^{-1}(\hat s')} p(s' | \hat s; a) \hat V(\phi(s')) ds' \\ &= \sum_{\hat s' \in \hat {\ALPHABET S}} \hat V(\hat s') \int_{\phi^{-1}(\hat s')} p(s' | \hat s; a) ds' \\ &= \sum_{\hat s' \in \hat {\ALPHABET S}} \hat V(\hat s') \hat P(\hat s' | \hat s, a). \end{align*}\]

An immediate consequence of Lemma 1 is the following:

Lemma 2

For any \(\hat V \colon \hat {\ALPHABET S} \to \reals\), let \(V \colon \ALPHABET S \to \reals\) be its piecewise constant extrapolation from \(\hat {\ALPHABET S}\) to \(\ALPHABET S\) (i.e., \(V = \hat V \circ \phi\)). Define, one-step update functions: \[\begin{align} W(s) &= \min_{a \in \ALPHABET A}\biggl\{ c(s,a) + γ \int_{\ALPHABET S} p(s' | s,a) V(s') ds' \biggr\}, \label{eq:one-step-a}\\ \hat W(\hat s) &= \min_{a \in \ALPHABET A} \biggl\{ \hat c(\hat s,a) + γ \sum_{\hat s \in \ALPHABET S} \hat P(\hat s' | s,a) \hat V(\hat s') \biggr\}. \label{eq:one-step-b} \end{align}\] Then, \[\begin{equation}\label{eq:one-step} W(\hat s) = \hat W(\hat s), \quad \forall \hat s \in \hat {\ALPHABET S}. \end{equation}\]

Proof

Let \(π\) be the optimal policy for \eqref{eq:one-step-a} and \(\hat π\) be the optimal policy for \eqref{eq:one-step-b}. Fix a state \(\hat s \in \hat {\ALPHABET S}\) and let \(a = π(\hat s)\) and \(\hat a = \hat π(\hat s)\). Then, \[\begin{align*} W(\hat s) &= c(\hat s, a) + γ \int_{\ALPHABET S} p( s' | \hat s, a) V(s') ds' \\ &\stackrel{(a)}= \hat c(\hat s, a) + γ \sum_{\hat s' \in \hat {\ALPHABET S}} \hat P(\hat s'|\hat s, a) \hat V(\hat s') \\ &\ge \hat W(\hat s), \end{align*}\] where \((a)\) follows from definition of \(\hat c\) and Lemma 1.

Similarly, we have \[\begin{align*} \hat W(\hat s) &= \hat c(\hat s, \hat a) + γ \sum_{\hat s' \in \hat {\ALPHABET S}} \hat P(\hat s'|\hat s, \hat a) \hat V(\hat s') \\ &\stackrel{(b)}= c(\hat s, \hat a) + γ \int_{\ALPHABET S} p( s' | \hat s, \hat a) \hat V(s') ds' \\ &\ge W(\hat s), \end{align*}\] where \((b)\) follows from definition of \(\hat c\) and Lemma 1.

Thus, \(W(\hat s) = \hat W(\hat s)\).

Lemma 1 shows that for any quantization point \(\hat s\) and action \(a\), computing the expectation of the future cost to go function \(\hat V \colon \hat {\ALPHABET S} \to \reals\) with respect to the measure \(\hat P\) is the same as computing the expectation of the piecewise linear extrapolation \(V\) of \(\hat V\) with respect to the original measure \(p\). Lemma 2 shows that the one step Bellman update of a function \(\hat V\) coincides with the one-step Bellman update of its piecewise constant extrapolation \(V\) at quantization points \(\hat s \in \hat {\ALPHABET S}\).

Note that these equivalences are valid only at quantization points \(\hat s \in \hat {\ALPHABET S}\) and not for other points in \(\ALPHABET S\).

2.2 Bounding the error for value function approximation

We will present two bounds on the value function approximation. For the first bound, define \[ H_{\max} = \sup_{s \in \ALPHABET S}| V^*(s) - V^*(\phi(s)) |. \]

Then, we can bound the error for value function approximation as follows.

Proposition 1

\[\NORM{ V^* - W^*}_∞ \le \frac{H_{\max}}{1-γ} . \]

Proof

Consider any \(s \in \ALPHABET S\). Then, \[\begin{align} \bigl| V^*(s) - W^*(s) \bigr| &\le \bigl| V^*(s) - V^*(\phi(s)) \bigr| + \bigl| V^*(\phi(s)) - \hat W^*(\phi(s)) \bigr| \notag \\ &\le H_{\max} + \bigl| V^*(\phi(s)) - \hat W^*(\phi(s)) \bigr| \label{eq:step-a-1} \end{align}\]

For the ease of notation let \(\mathcal B\) and \(\hat {\mathcal B}\) denote the Bellman operators for model \(M\) and \(\hat M\), respectively. Now, we know that \(\hat W^* = \hat {\mathcal B} \hat W^*\) and by Lemma 2 \([\hat {\mathcal B} \hat W^*](\hat s) = [\mathcal B W^*](\hat s)\). Thus, we can write the second term of \eqref{eq:step-a-1} as follows: \[\begin{align*} \bigl| V^*(\phi(s)) - \hat W^*(\phi(s)) \bigr| &= \bigl| [\mathcal B V^*](\phi(s)) - [ \mathcal B W^* ](\phi(s)) \bigr| \\ &\le γ \NORM{V^* - W^*}_∞. \end{align*}\] Substituting back in \eqref{eq:step-a-1}, we get \[\bigl| V^*(s) - W^*(s) \bigr| \le H_{\max} + γ \NORM{V^* - W^*}_∞.\] We get the result by supermizing over \(s\) and rearranging the terms.

Note that the result of Proposition 1 is what is called an “instance-dependent” bound: it depends on the value function \(V^*\). In some situations, we want an ’instance-indendent” bound, which does not depend the value function.

As a first step to obtain an instance-indendent bound, we assume that \(\ALPHABET S\) is a bounded metric space with metric \(d_S\). Suppose the original MDP \(M\) satisfies some regularity properties such that the value function \(V^*\) is Lipschitz with Lipschitz constant \(\NORM{V^*}_L\), i.e., for any \(s,s' \in \ALPHABET S\) \[ \bigl| V^*(s) - V^*(s') \bigr| \le \NORM{V^*}_L d_S(s,s'). \] This means that for any \(s \in \ALPHABET S\), \[\begin{equation}\label{eq:lip} \bigl| V(s) - V(\phi(s)) \bigr| \le \NORM{V^*}_L D, \end{equation}\] where \(D = \sup_{s \in \ALPHABET S} d_S(s,\phi(s))\) is the largest radius of the quantization cells. Note that since \(\ALPHABET S\) is assumed to be a bounded metric space, \(D\) is finite. An immediate implication of \eqref{eq:lip} is that \[ H_{\max} \le \NORM{V^*}_L D. \]

Substituting this in Proposition 1, we get

Proposition 2

\[\NORM{ V^* - W^*}_∞ \le \frac{D}{1-γ} \NORM{V^*}_L . \]

The advantage of the bound in Proposition 2 is that we can obtain an “instance-independent” upper bound on \(\NORM{V^*}_L\). In particular, assume that the model \(M\) is a \((L_c, L_p)\)-Lipschitz MDP.

Assumpt. 1
  • For every \(a \in \ALPHABET A\), \(c(s, a)\) is \(L_c\)-Lipschitz in \(s\)
  • For every \(a \in \ALPHABET A\), \(p(\cdot | s, a)\) is \(L_p\)-Lipschitz in \(s\) (with respect to the Kantorovich distance on probability measures).

Under this assumption, Theorem 1 of Lipschitz MDPs, implies that \(\NORM{V^*}_L \le L_c/(1 - γ L_p)\). Thus, we have the following:

Corollary 1

Under Assumpt. 1, if \(γ L_p < 1\), then \[ \NORM{V^* - W^*}_∞ \le \frac{L_c}{(1-γ)(1-γL_p)} D. \]

2.3 Bounding the error for policy approximation

Using the same idea as Proposition 2, it is possible to show that \[ \NORM{V^{μ^*} - W^*}_∞ \le \frac{D}{1-γ} \NORM{V^{μ^*}}_L. \] Combining this with Proposition 2, we get \[ \NORM{V^{μ^*} - V^*}_∞ \le \NORM{V^{μ^*} - W^*}_∞ + \NORM{V^* - W^*}_∞ \le \frac{D}{1-γ} [ \NORM{V^*}_L + \NORM{V^{μ^*}}_L ]. \]

As in Corollary 1, if we assume that the model is Lipschitz, then we can get a bound on \(\NORM{V^*}_L\) in terms of Lipschitz constants on the cost function and the transition dynamics. However, it is difficult to bound \(\NORM{V^μ}_L\) because that bound will be interms of the Lipschitz constant of the policy \(μ^* = \hat μ^* \circ \phi\). So, we provide an alternative bound on \(\NORM{V^{μ^*} - W^*}_∞\) in this section.

Proposition 3

Under Assumpt. 1, we have \[\NORM{V^{μ^*} - V^*}_∞ \le \frac{D}{1-γ} \biggl[ L_c + γ L_p \NORM{V^*}_L + \frac{1 + γ}{1-γ} \NORM{V^*}_L \biggr]. \]

Furthermore, if \(γ L_p < 1\), then from properties of Lipschitz MDPs, we know that \(\NORM{V^*}_L \le L_c/(1- γ L_p)\). Thus, \[\NORM{V^{μ^*} - V^*}_∞ \le \frac{2 D L_c }{ (1-γ)^2 (1-γ L_p) }. \]

Proof

Fix a state \(s \in \ALPHABET S\). Let \(\hat s = \phi(s)\) and \(a = \hat μ^*(\hat s) = μ^*(s)\). By construction, \(W^*(s) = \hat W^*(\hat s)\). Thus, \[\begin{align*} W^*(s) &= \hat W^*(\hat s) = c(\hat s, a) + γ \sum_{\hat s' \in \hat {\ALPHABET S}} \hat P(\hat s' | \hat s, a) \hat W^*(\hat s'). \\ &= c(\hat s, a) + γ \int_{\ALPHABET S} p(s' | \hat s, a) W^*(s') ds'. \end{align*}\] Moreover, \[ V^{μ^*}(s) = c(s, a) + γ \int_{\ALPHABET S} p(s'|s,a) V^{μ^*}(s') ds'. \] Thus, \[\begin{align} \bigl| V^{μ^*}(s) - W^*(s) \bigr| &\stackrel{(a)}\le \bigl| c(s,a) - c(\hat s,a) \bigr| \notag \\ &\quad + γ \biggl| \int_{\ALPHABET S} p(s'|s,a) V^{μ^*}(s') ds' - \int_{\ALPHABET S} p(s'|s,a) V^*(s') ds' \biggr| \notag \\ &\quad + γ \biggl| \int_{\ALPHABET S} p(s'|s,a) V^{*}(s') ds' - \int_{\ALPHABET S} p(s'|\phi(s),a) V^*(s') ds' \biggr| \notag \\ &\quad + γ \biggl| \int_{\ALPHABET S} p(s'|\phi(s),a) V^{*}(s') ds' - \int_{\ALPHABET S} p(s'|\phi(s),a) W^*(s') ds' \biggr| \label{eq:step-c-1} \end{align}\] where \((a)\) follows from the triangle inequality. Now, we bound each of the terms in \eqref{eq:step-b-1}. Since \(c\) is Lipschitz, we have \[\begin{equation} \label{eq:step-c-2} \bigl| c(s,a) - c(\hat s,a) \bigr| \le L_c D. \end{equation}\] Moreover, from triangle inequality, we have \[\begin{equation} \label{eq:step-c-3} \biggl| \int_{\ALPHABET S} p(s'|s,a) V^{μ^*}(s') ds' - \int_{\ALPHABET S} p(s'|s,a) V^*(s') ds' \biggr| \le \NORM{ V^{μ^*} - V^*}_∞. \end{equation}\] From the Kantorovich-Rubinstein duality, we have \[\begin{equation} \label{eq:step-c-5} \biggl| \int_{\ALPHABET S} p(s'|s,a) V^{*}(s') ds' - \int_{\ALPHABET S} p(s'|\phi(s),a) V^*(s') ds' \biggr| \le \mathcal {W}(p( \cdot | s,a), p(\cdot | \phi(s), a)) \NORM{V^*}_L \le L_p D \NORM{V^*}_L. \end{equation}\] Finally, from triangle inequality and Proposition 2, we have \[\begin{equation} \label{eq:step-c-6} \biggl| \int_{\ALPHABET S} p(s'|\phi(s),a) V^{*}(s') ds' - \int_{\ALPHABET S} p(s'|\phi(s),a) W^*(s') ds' \biggr| \le \NORM{V^* - W^*}_∞ \le \frac{D}{1-γ} \NORM{V^*}_L. \end{equation}\] Substituting \eqref{eq:step-c-2}–\eqref{eq:step-c-6} in \eqref{eq:step-c-1} and rearranging, we get \[\begin{equation}\label{eq:step-c-7} \NORM{V^{μ^*} - W^*}_∞ \le D \biggl[ L_c + γ L_p \NORM{V^*}_L + \frac{γ}{1-γ} \NORM{V^*}_L \biggr] + γ \NORM{V^{μ^*} - V^*}_∞. \end{equation}\]

Now, by triangle inequality \[\NORM{V^{μ^*} - V^*}_∞ \le \NORM{V^{μ^*} - W^*}_∞ + \NORM{V^* - W^*}_∞ \le D \biggl[ L_c + γ L_p \NORM{V^*}_L + \frac{1+γ}{1-γ} \NORM{V^*}_L \biggr] + γ \NORM{V^{μ^*} - V^*}. \] where the last inequality follows from Proposition 2 and \eqref{eq:step-c-7}. Rearranging the terms proves the first result of the Proposition. The second result follows from simple algebra.

Alternative Proof

Fix a state \(s \in \ALPHABET S\). Let \(\hat s = \phi(s)\) and \(a = \hat μ^*(\hat s) = μ^*(s)\). By construction, \(W^*(s) = \hat W^*(\hat s)\). Thus, \[\begin{align*} W^*(s) &= \hat W^*(\hat s) = c(\hat s, a) + γ \sum_{\hat s' \in \hat {\ALPHABET S}} \hat P(\hat s' | \hat s, a) \hat W^*(\hat s'). \\ &= c(\hat s, a) + γ \int_{\ALPHABET S} p(s' | \hat s, a) W^*(s') ds'. \end{align*}\] Moreover, \[ V^{μ^*}(s) = c(s, a) + γ \int_{\ALPHABET S} p(s'|s,a) V^{μ^*}(s') ds'. \] Thus, \[\begin{align} \bigl| V^{μ^*}(s) - W^*(s) \bigr| &\stackrel{(a)}\le \bigl| c(s,a) - c(\hat s,a) \bigr| \notag \\ &\quad + γ \biggl| \int_{\ALPHABET S} p(s'|s,a) V^{μ^*}(s') ds' - \int_{\ALPHABET S} p(s'|s,a) W^*(s') ds' \biggr| \notag \\ &\quad + γ \biggl| \int_{\ALPHABET S} p(s'|s,a) W^{*}(s') ds' - \int_{\ALPHABET S} p(s'|s,a) V^*(s') ds' \biggr| \notag \\ &\quad + γ \biggl| \int_{\ALPHABET S} p(s'|s,a) V^{*}(s') ds' - \int_{\ALPHABET S} p(s'|\phi(s),a) V^*(s') ds' \biggr| \notag \\ &\quad + γ \biggl| \int_{\ALPHABET S} p(s'|\phi(s),a) V^{*}(s') ds' - \int_{\ALPHABET S} p(s'|\phi(s),a) W^*(s') ds' \biggr| \label{eq:step-b-1} \end{align}\] where \((a)\) follows from the triangle inequality. Now, we bound each of the terms in \eqref{eq:step-b-1}. Since \(c\) is Lipschitz, we have \[\begin{equation} \label{eq:step-b-2} \bigl| c(s,a) - c(\hat s,a) \bigr| \le L_c D. \end{equation}\] Moreover, \[\begin{equation} \label{eq:step-b-3} \biggl| \int_{\ALPHABET S} p(s'|s,a) V^{μ^*}(s') ds' - \int_{\ALPHABET S} p(s'|s,a) W^*(s') ds' \biggr| \le \NORM{ V^{μ^*} - W^*}_∞. \end{equation}\] From Proposition 2 we have \[\begin{equation} \label{eq:step-b-4} \biggl| \int_{\ALPHABET S} p(s'|s,a) W^{*}(s') ds' - \int_{\ALPHABET S} p(s'|s,a) V^*(s') ds' \biggr| \le \NORM{ W^* - V^*}_∞ \le \frac{D}{1-γ} \NORM{V^*}_L \end{equation}\] From the Kantorovich-Rubinstein duality, we have \[\begin{equation} \label{eq:step-b-5} \biggl| \int_{\ALPHABET S} p(s'|s,a) V^{*}(s') ds' - \int_{\ALPHABET S} p(s'|\phi(s),a) V^*(s') ds' \biggr| \le \mathcal {W}(p( \cdot | s,a), p(\cdot | \phi(s), a)) \NORM{V^*}_L \le L_p D \NORM{V^*}_L. \end{equation}\] Finally, from Proposition 2, we have \[\begin{equation} \label{eq:step-b-6} \biggl| \int_{\ALPHABET S} p(s'|\phi(s),a) V^{*}(s') ds' - \int_{\ALPHABET S} p(s'|\phi(s),a) W^*(s') ds' \biggr| \le \NORM{V^* - W^*}_∞ \le \frac{D}{1-γ} \NORM{V^*}_L. \end{equation}\] Substituting \eqref{eq:step-b-2}–\eqref{eq:step-b-6} in \eqref{eq:step-b-1} and rearranging, we get \[\begin{equation} \NORM{V^{μ^*} - W^*}_∞ \le \frac{D}{1-γ} \biggl[ L_c + γ L_p \NORM{V^*}_L + \frac{2γ}{1-γ} \NORM{V^*}_L \biggr] \end{equation}\]

This proves the first result of the Proposition. The second result follows from simple algebra.

3 Variations of a theme

\(\newcommand\red[1]{{\color{red} #1}}\) In the discussion above, we assumed that the approximate model \((\hat {\ALPHABET S}, \ALPHABET A, \hat P, \hat c)\) was defined as follows: \[ \hat c(\hat s, a) = c(s, a) \quad\text{and}\quad \hat P(\hat s' | \hat s, a) = p(\phi^{-1}(\hat s') | \hat s, a). \]

A slightly general method of defining the approximate model is as follows. Consider a weight function \(α \colon \ALPHABET S \to [0,1]\) which satisfies the following property: \[\begin{equation}\label{eq:property} \int_{\phi^{-1}(\hat s)} α(s) ds = 1, \quad \forall \hat s \in \hat {\ALPHABET S}. \end{equation}\] For example, such a weight function can be defined via an arbitrary measure \(λ\) on \(\ALPHABET S\) by viewing \(α(s)\) as the conditional probability distribution \[ α(s) = \dfrac{ λ(s) }{ λ(\phi^{-1}(\phi(s))) }, \] which satisfies \eqref{eq:property} by construction.

Now given a weight function which satisfies \eqref{eq:property}, we can define an approximate model \((\hat {\ALPHABET S}, \ALPHABET A, \hat P, \hat c)\) as follows: for all \(\hat s \in \hat {\ALPHABET S}\) and \(a \in \ALPHABET A\),

\[\hat c(\hat s, a) = \int_{\phi^{-1}(\hat s)} c(s,a) α(s) ds \quad\text{and}\quad \hat P(\hat s' | \hat s, a) = \int_{\phi^{-1}(s)} p(\phi^{-1}(\hat s') | s, a) α(s) ds. \] Note that if we take \(α(s) = \sum_{\hat s \in \hat {\ALPHABET S}}δ(s - \hat s)\), then we recover the approximation function used in the previous section.

Using a weighed approximate model does not fundamentally change any of the results. Here we rederive all the previous results for the weighted approximate model.

3.1 Preliminary results

As before, we start by some preliminary results to build the intuition behind the approximation bounds. We start with a property of the discretized transition matrix.

Lemma 3

For any \(\hat V \colon \hat {\ALPHABET S} \to \reals\), let \(V \colon \ALPHABET S \to \reals\) be its piecewise constant extrapolation from \(\hat {\ALPHABET S}\) to \(\ALPHABET S\) (i.e., \(V = \hat V \circ \phi\)). Then, for any \(\hat s_k \in \ALPHABET S\) and \(a \in \ALPHABET A\), we have \[ \int_{\ALPHABET S} \red{\int_{\hat {\ALPHABET S}_k}} p(s' | \hat s_k,a) V(s') \red{α(s) ds} ds' = \sum_{\hat s' \in \hat {\ALPHABET S}} \hat P(\hat s' | \hat s_k, a) \hat V(\hat s'). \]

Proof

Recall that \(\{\ALPHABET S_1, \dots, \ALPHABET S_n\}\) is a partition of \(\ALPHABET S\) and \(\ALPHABET S_i = \phi^{-1}(\hat s_i)\). Therefore, \[\begin{align*} \int_{\ALPHABET S} \red{\int_{\hat {\ALPHABET S}_k}} p(s' | s,a) V(s') \red{α(s) ds} ds' &= \sum_{\hat s'_i \in \hat {\ALPHABET S}} \int_{\hat {\ALPHABET S}_i} \biggl[ \red{\int_{\hat{\ALPHABET S}_k}} p(s' | \hat s,a) \red{α(s) ds}\biggr] V(s') ds' \\ &= \sum_{\hat s' \in \hat {\ALPHABET S}}\hat V(\hat s'_i) \biggl[\int_{\hat {\ALPHABET S}_i} \red{\int_{\hat{\ALPHABET S}_k}} p(s' | s,a) \red{α(s) ds}ds' \biggr] \\ &= \sum_{\hat s' \in \hat {\ALPHABET S}} \hat V(\hat s') \hat P(\hat s' | \hat s, a). \end{align*}\]

As before, an immediate consequence of Lemma 3 is the following:

Lemma 4

For any \(\hat V \colon \hat {\ALPHABET S} \to \reals\), let \(V \colon \ALPHABET S \to \reals\) be its piecewise constant extrapolation from \(\hat {\ALPHABET S}\) to \(\ALPHABET S\) (i.e., \(V = \hat V \circ \phi\)). As in Lemma 2, define the one-step update functions: \[\begin{align} W(s) &= \min_{a \in \ALPHABET A}\biggl\{ c(s,a) + γ \int_{\ALPHABET S} p(s' | s,a) V(s') ds' \biggr\}, \label{eq:one-step-a-w}\\ \hat W(\hat s) &= \min_{a \in \ALPHABET A} \biggl\{ \hat c(\hat s,a) + γ \sum_{\hat s \in \ALPHABET S} \hat P(\hat s' | s,a) \hat V(\hat s') \biggr\}. \label{eq:one-step-b-w} \end{align}\] Then, \[\begin{equation}\label{eq:one-step-w} \red{\int_{\phi^{-1}(\hat s)}} W(s) \red{α(s) ds} = \hat W(\hat s), \quad \forall \hat s \in \hat {\ALPHABET S}. \end{equation}\]

Proof

Let \(π\) be the optimal policy for \eqref{eq:one-step-a-w} and \(\hat π\) be the optimal policy for \eqref{eq:one-step-b-w}. Fix a state \(\hat s_k \in \hat {\ALPHABET S}\) and let \(a = π(\hat s_k)\) and \(\hat a = \hat π(\hat s_k)\). Then, \[\begin{align*} \red{\int_{\hat {\ALPHABET S}_k}} W(s) \red{α(s) ds} &= \red{\int_{\hat {\ALPHABET S}_k}} c(s, a) \red{α(s) ds} + γ \int_{\ALPHABET S} \red{\int_{\hat {\ALPHABET S}_k}} p( s' | s, a) V(s') \red{α(s) ds} ds' \\ &\stackrel{(a)}= \hat c(\hat s_k, a) + γ \sum_{\hat s' \in \hat {\ALPHABET S}} \hat P(\hat s'|\hat s_k, a) \hat V(\hat s') \\ &\ge \hat W(\hat s_k), \end{align*}\] where \((a)\) follows from definition of \(\hat c\) and Lemma 3.

Similarly, we have \[\begin{align*} \hat W(\hat s_k) &= \hat c(\hat s_k, \hat a) + γ \sum_{\hat s' \in \hat {\ALPHABET S}} \hat P(\hat s'|\hat s_k, \hat a) \hat V(\hat s') \\ &\stackrel{(b)}= \red{\int_{\hat {\ALPHABET S}_k}} c(s, \hat a) \red{α(s) ds} + γ \int_{\ALPHABET S} \red{\int_{\hat {\ALPHABET S}_k} } P( s' | s, \hat a) V(s') \red{α(s) ds} ds' \\ &\ge \red{\int_{\hat {\ALPHABET S}_k}} W(s) \red{α(s) ds}, \end{align*}\] where \((b)\) follows from definition of \(\hat c\) and Lemma 3.

Thus, \(\int_{\hat {\ALPHABET S}_k} W(s) α(s) ds = \hat W(\hat s_k)\).

The interpretations of Lemma 3 and Lemma 4 are similar to those of Lemma 1 and Lemma 2. The one step Bellman update of any cost-to-go function \(\hat V \colon \ALPHABET S \to \reals\) in the approximate model is the same as one-step Bellman update of its piecewise constant extrapolation \(V\) averaged using the weight function \(α\).

3.2 Bounding the error for value function approximation

The intuition of bounding the value function approximation is the same as before. We assume that \(\ALPHABET S\) is a bounded metric space with metric \(d_{\ALPHABET S}\) and the orignal MDP satisfies regularity conditions (such as Assumpt. 1) such that the value function \(V^*\) is Lipschitz with Lipschitz constant \(\NORM{V^*}_L\).

Then, as for Proposition 2, we can show the following.

Proposition 4

Proposition 2 holds for the weighted approximate model as well.

Proof

Consider any \(s \in \ALPHABET S\). Define the piecewise constant function \[ \bar V(s) = \int_{\phi^{-1}( \phi(s) )} V^*(s) α(s) ds. \] For the ease of notation, let \(\hat {\ALPHABET S}_k\) denote the code cell where \(s\) lies. Then, we can write \(\bar V(s) = \int_{\hat{\ALPHABET S}_k} V^*(s) α(s) ds\) for all \(s \in \hat {\ALPHABET S}_k\). Note that, by Lipschitz continuity of \(V^*\), we have for any \(s \in \hat {\ALPHABET S}_k\), \[\begin{align} \bigl| V^*(s) - \bar V(s) \bigr| &= \biggl| \int_{\hat {\ALPHABET S}_k} V^*(s) α(s') ds' - \int_{\hat {\ALPHABET S}_k} V^*(s') α(s') ds' \biggr| \notag \\ &\le \int_{\hat {\ALPHABET S}_k} \bigl| V^*(s) - V^*(s') \bigr| α(s') ds' \notag \\ &\stackrel{(a)}{\le} \int_{\hat {\ALPHABET S}_k} \NORM{V^*}_L D α(s') ds' \notag \\ &= \NORM{V^*}_L D \label{eq:lip-w} \end{align}\] where \((a)\) follows from \eqref{eq:lip}. Note that since the code cell \(\hat {\ALPHABET S}_k\) was arbitrary, \eqref{eq:lip-w} holds for all \(s \in \ALPHABET S\). Thus, \eqref{eq:lip-w} may be viewed as the analog of \eqref{eq:lip} for the weighted model.

Now consider, \[\begin{align} \bigl| V^*(s) - W^*(s) \bigr| &\le \bigl| V^*(s) - \bar V(s) \bigr| + \bigl| \bar V(s) - W^*(s) \bigr| \notag \\ &\le \NORM{V^*}_L D + \bigl| \bar V(s) - W^*(s) \bigr| \label{eq:step-a-1-w} \end{align}\] where the first term follows from \eqref{eq:lip-w}.

Now we bound the second term. For the ease of notation let \(\mathcal B\) and \(\hat {\mathcal B}\) denote the Bellman operators for model \(M\) and \(\hat M\), respectively. Without loss of generality, we assume that \(s \in \hat {\ALPHABET S}_k\). Then, \[ \bar V(s) = \int_{\hat {\ALPHABET S}_k} V^*(s) α(s) ds = \int_{\hat {\ALPHABET S}_k} [ \mathcal B V^*](s) α(s) ds.\] Furthermore, \[ W^*(s) = \hat W^*(\hat s_k) = [\hat {\mathcal B} \hat W^*](\hat s_k) = \int_{\hat {\ALPHABET S}_k} [\mathcal B W^* ](s) α(s) ds,\] where the last equality follows from Lemma 4.

Combining the above two equations, we have \[\begin{align} \bigl| \bar V(s) - W^*(s) \bigr| &\le \int_{\hat {\ALPHABET S}_k} \bigl| [\mathcal B V^*](s) - [\mathcal B W^*](s) \bigr| α(s) ds \notag \\ &\le γ \NORM{V^* - W^*}_∞ \label{eq:step-a-2-w} \end{align}\]

Substituting \eqref{eq:step-a-2-w} back in \eqref{eq:step-a-1-w}, we get \[\bigl| V^*(s) - W^*(s) \bigr| \le \NORM{V^*}_L D + γ \NORM{V^* - W^*}_∞.\] We get the result by supermizing over \(s\) and rearranging the terms.

Note that since Corollary 1 only relies on the bound on \(\NORM{V^*}_L\) it is also applicable to the weighted approximate model.

3.3 Bounding the error for policy approximation

The bound and the proof for policy approximation goes through with very minor changes.

Proposition 5

Proposition 3 holds for the weighted approximate model as well.

Proof

We will follow the alternative proof of Proposition 3. Fix a state \(s \in \ALPHABET S\) and suppose \(s \in \hat {\ALPHABET S}_k\). Then \(\phi(s) = \hat s_k\). Let \(a = \hat μ^*(\hat s_k) = μ^*(s)\). By construction, \(W^*(s) = \hat W^*(\hat s_k)\). Thus, \[\begin{align*} W^*(s) &= \hat W^*(\hat s_k) = \hat c(\hat s_k, a) + γ \sum_{\hat s' \in \hat {\ALPHABET S}} \hat P(\hat s' | \hat s, a) \hat W^*(\hat s'). \\ &= \red{\int_{\hat {\ALPHABET S}_k}}c(\tilde s, a) \red{α(\tilde s) d\tilde s} + γ \int_{\ALPHABET S} \red{\int_{\hat {\ALPHABET S}_k}}p(s' | \tilde s, a) W^*(s') \red{α(\tilde s) d\tilde s}ds'. \end{align*}\] Moreover, \[ V^{μ^*}(s) = c(s, a) + γ \int_{\ALPHABET S} p(s'|s,a) V^{μ^*}(s') ds'. \] Thus, \[\begin{align} \bigl| V^{μ^*}(s) - W^*(s) \bigr| &\stackrel{(a)}\le \biggl| c(s,a) - \red{\int_{\hat {\ALPHABET S}_k}}c(\tilde s,a) \red{α(\tilde s) d\tilde s}\biggr| \notag \\ &\quad + γ \biggl| \int_{\ALPHABET S} p(s'|s,a) V^{μ^*}(s') ds' - \int_{\ALPHABET S} p(s'|s,a) W^*(s') ds' \biggr| \notag \\ &\quad + γ \biggl| \int_{\ALPHABET S} p(s'|s,a) W^{*}(s') ds' - \int_{\ALPHABET S} p(s'|s,a) V^*(s') ds' \biggr| \notag \\ &\quad + γ \biggl| \int_{\ALPHABET S} p(s'|s,a) V^{*}(s') ds' - \int_{\ALPHABET S}\red{\int_{\hat {\ALPHABET S}_k}} p(s'|\tilde s,a) V^*(s') \red{α(\tilde s) d\tilde s}ds' \biggr| \notag \\ &\quad + γ \biggl| \int_{\ALPHABET S} \red{\int_{\hat {\ALPHABET S}_k}}p(s'|\tilde s,a) V^{*}(s')\red{α(\tilde s) d\tilde s} ds' - \int_{\ALPHABET S} \red{\int_{\hat {\ALPHABET S}_k}}p(s'|\tilde s,a) W^*(s')\red{α(\tilde s) d\tilde s} ds' \biggr| \label{eq:step-b-1-w} \end{align}\] where \((a)\) follows from the triangle inequality. Now, we bound each of the terms in \eqref{eq:step-b-1-w}. Since \(c\) is Lipschitz and the weight function \(α\) satisfies \eqref{eq:property}, we have \[\begin{equation} \label{eq:step-b-2-w} \biggl| c(s,a) - \red{\int_{\hat {\ALPHABET S}_k}}c(\tilde s,a) \red{α(\tilde s) d\tilde s}\biggr| \le \red{\int_{\hat {\ALPHABET S}_k}} \bigl| c(s,a) - c(\tilde s,a) \bigr| \red{α(\tilde s) d\tilde s}\le L_c D. \end{equation}\] The next two terms can be bound as before. Fron triangle inequality, we have \[\begin{equation} \label{eq:step-b-3-w} \biggl| \int_{\ALPHABET S} p(s'|s,a) V^{μ^*}(s') ds' - \int_{\ALPHABET S} p(s'|s,a) W^*(s') ds' \biggr| \le \NORM{ V^{μ^*} - W^*}_∞. \end{equation}\] From Proposition 4 we have \[\begin{equation} \label{eq:step-b-4-w} \biggl| \int_{\ALPHABET S} p(s'|s,a) W^{*}(s') ds' - \int_{\ALPHABET S} p(s'|s,a) V^*(s') ds' \biggr| \le \NORM{ W^* - V^*}_∞ \le \frac{D}{1-γ} \NORM{V^*}_L \end{equation}\] From the Kantorovich-Rubinstein duality, we have \[\begin{align} \hskip 2em & \hskip -2em \biggl| \int_{\ALPHABET S} p(s'|s,a) V^{*}(s') ds' - \int_{\ALPHABET S}\red{\int_{\hat {\ALPHABET S}_k}} p(s'|\tilde s,a) V^*(s') \red{α(\tilde s) d\tilde s}ds' \biggr| \notag \\ &\le \red{\int_{\hat {\ALPHABET S}_k}} \biggl| \int_{\ALPHABET S} p(s'|s,a) V^{*}(s') ds' - \int_{\ALPHABET S} p(s'|\tilde s,a) V^*(s') ds' \biggr| \red{α(\tilde s) d\tilde s} \notag \\ &\le \mathcal {W}(p( \cdot | s,a), p(\cdot | \phi(s), a)) \NORM{V^*}_L \notag \\ &\le L_p D \NORM{V^*}_L. \label{eq:step-b-5-w} \end{align}\] Finally, from Proposition 4, we have \[\begin{equation} \label{eq:step-b-6-w} \biggl| \int_{\ALPHABET S} \red{\int_{\hat {\ALPHABET S}_k}}p(s'|\tilde s,a) V^{*}(s')\red{α(\tilde s) d\tilde s} ds' - \int_{\ALPHABET S} \red{\int_{\hat {\ALPHABET S}_k}}p(s'|\tilde s,a) W^*(s')\red{α(\tilde s) d\tilde s} ds' \biggr| \le \NORM{V^* - W^*}_∞ \le \frac{D}{1-γ} \NORM{V^*}_L. \end{equation}\] Substituting \eqref{eq:step-b-2}–\eqref{eq:step-b-6} in \eqref{eq:step-b-1} and rearranging, we get \[\begin{equation} \NORM{V^{μ^*} - W^*}_∞ \le \frac{D}{1-γ} \biggl[ L_c + γ L_p \NORM{V^*}_L + \frac{2γ}{1-γ} \NORM{V^*}_L \biggr] \end{equation}\]

This proves the first result of the Proposition. The second result follows from simple algebra.

References

The results on state discretization (or quantization) first appeared in Bertsekas (1975). The material here is adapted from Bertsekas (1975) and Hinderer (2005). The alternative proof of Proposition 3 is from Kara et al. (2021).

The generalization to a weighted model is based on Li et al. (2006). Similar model is also considered in Kara et al. (2021).

Bertsekas, D. 1975. Convergence of discretization procedures in dynamic programming. IEEE Transactions on Automatic Control 20, 3, 415–419. DOI: 10.1109/TAC.1975.1100984.
Hinderer, K. 2005. Lipschitz continuity of value functions in Markovian decision processes. Mathematical Methods of Operations Research 62, 1, 3–22. DOI: 10.1007/s00186-005-0438-1.
Kara, A.D., Saldi, N., and Yüksel, S. 2021. Q-learning for MDPs with general spaces: Convergence and near optimality via quantization under weak continuity. Available at: https://arxiv.org/abs/2111.06781v1.
Li, L., Walsh, T.J., and Littman, M.L. 2006. Towards a unified theory of state abstraction for MDPs. ISAIM. Available at: http://anytime.cs.umass.edu/aimath06/proceedings/P21.pdf.

This entry was last updated on 16 Jun 2022 and posted in MDP and tagged infinite horizon, discounted cost, lipschitz continuity, approximation bounds, state aggregation.