25  Model approximation

Updated

February 6, 2024

Keywords

infinite horizon, discounted cost, Lipschitz continuity, approximation bounds, state aggregation

There are many instances where approximate models are used. For example, in many applications with large state spaces, one often constructs a simulation model of the system and uses it to identify an optimal policy via simulation-based methods (such as reinforcement learning). Often, the simulation model is only an approximation of the true model. In such instances, we want to know the error incurred when the policy obtained from the simulation model is deployed in the real world. In this section, we present bounds on such sim-to-real transfer.

Consider an MDP \(\ALPHABET M = \langle \ALPHABET S, \ALPHABET A, P, c, γ \rangle\). Suppose the components \(\langle \ALPHABET S, \ALPHABET A, γ \rangle\) are known exactly but the components \((P,c)\) are known only approximately. Consider the approximate MDP \(\widehat {\ALPHABET M} = \langle \ALPHABET S, \ALPHABET A, \hat P, \hat c, γ \rangle\). We will call \(\ALPHABET M\) the true model and \(\widehat {\ALPHABET M}\) the approximate model.

Let \(V^*\) and \(\hat V^*\) denote the optimal value functions of the true model \(\ALPHABET M\) and the approximate model \(\widehat {\ALPHABET M}\), respectively. Moreover, let \(π^*\) and \(\hat π^*\) be optimal policies for the true model \(\ALPHABET M\) and the approximate model \(\widehat {\ALPHABET M}\), respectively.

We are interested in the following questions:

  1. Policy error bounds: Given a policy \(π\), what is the error if \(\hat V^π\) is used as an approximation for \(V^π\)?

  2. Value error bounds: What is the error if \(\hat V^*\) is used as an approximation for \(V^*\)?

  3. Model approximation error: What is the error if the policy \(\hat π^*\) is used instead of the optimal policy \(π^*\)?

25.1 Bounds for model approximation

25.1.1 Policy and value error bounds

Let \(\BELLMAN^π\) and \(\BELLMAN^*\) denote the Bellman operator for policy \(π\) and the optimality Bellman operator for model \(\ALPHABET M\). Let \(\hat {\BELLMAN}^π\) and \(\hat {\BELLMAN}^*\) denote the corresponding quantities for model \(\widehat {\ALPHABET M}\). Define the Bellman mismatch functionals \(\MISMATCH^π\) and \(\MISMATCH^*\) as follows: \[\begin{align*} \MISMATCH^π v &= \| \BELLMAN^π v - \hat {\BELLMAN}^π v \|_∞, \\ \MISMATCH^* v &= \| \BELLMAN^* v - \hat {\BELLMAN}^* v \|_∞ . \end{align*}\]
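For reference, these operators take the standard discounted-cost form (stated here for a deterministic policy \(π\)): \[ (\BELLMAN^π v)(s) = c(s, π(s)) + γ \sum_{s' \in \ALPHABET S} P(s'|s, π(s))\, v(s'), \qquad (\BELLMAN^* v)(s) = \min_{a \in \ALPHABET A} \Bigl\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P(s'|s,a)\, v(s') \Bigr\}, \] and \(\hat{\BELLMAN}^π\) and \(\hat{\BELLMAN}^*\) are defined analogously with \((\hat P, \hat c)\) in place of \((P, c)\).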

Also define the maximum Bellman mismatch as \[\begin{align*} \MISMATCH^{\max} v &= \max_{(s,a) \in \ALPHABET S × \ALPHABET A} \biggl\lvert c(s,a) + γ \sum_{s' \in \ALPHABET S} P(s'|s,a)v(s') \notag \\ & \hskip 6em -\hat c(s,a) - γ \sum_{s' \in \ALPHABET S} \hat P(s'|s,a) v(s') \biggr\rvert. \end{align*}\]
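For finite models, these mismatch functionals are easy to compute directly from the definition. The following is a minimal NumPy sketch (the array conventions and function name are ours, purely for illustration) that evaluates \(\MISMATCH^{\max} v\) for tabular \((P, c)\) and \((\hat P, \hat c)\):

```python
import numpy as np

def bellman_mismatch_max(P, c, P_hat, c_hat, gamma, v):
    """Maximum Bellman mismatch between two tabular models, evaluated at v.

    P, P_hat : arrays of shape (S, A, S) with P[s, a, s'] = P(s' | s, a)
    c, c_hat : arrays of shape (S, A) of per-step costs
    v        : array of shape (S,)
    """
    q = c + gamma * (P @ v)             # c(s,a) + γ Σ_{s'} P(s'|s,a) v(s')
    q_hat = c_hat + gamma * (P_hat @ v)
    return float(np.max(np.abs(q - q_hat)))
```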

Lemma 25.1 The following inequalities hold:

  • \(\sup_{π \in Π} \MISMATCH^π v = \MISMATCH^{\max} v\)
  • \(\MISMATCH^* v \le \MISMATCH^{\max} v\).
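Both claims can be verified directly from the definitions. For any (possibly randomized) policy \(π\), \((\BELLMAN^π v - \hat{\BELLMAN}^π v)(s)\) is an average, over actions drawn from \(π(\cdot \mid s)\), of the per-\((s,a)\) differences appearing in \(\MISMATCH^{\max} v\); hence \(\MISMATCH^π v \le \MISMATCH^{\max} v\), and the deterministic policy that picks a maximizing action in each state attains the bound. The second claim follows from the elementary inequality \(\bigl\lvert \min_{a} f(a) - \min_{a} g(a) \bigr\rvert \le \max_{a} \lvert f(a) - g(a) \rvert\), applied state by state.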

The Bellman mismatch functional can be used to bound the performance difference of a policy between the true and approximate models.

Proposition 25.1 (Policy error) For any (possibly randomized) policy \(π\), \[\begin{equation}\label{eq:policy-error} \| V^{π} - \hat V^{π} \|_∞ \le \frac{1}{1-γ} \min\{ \MISMATCH^π V^{π}, \MISMATCH^π \hat V^{π} \}. \end{equation}\]

We bound the left hand side of \eqref{eq:policy-error} in two ways. The first way is as follows: \[\begin{align} \| V^{π} - \hat V^{π} \|_∞ &= \| \BELLMAN^π V^π - \hat {\ALPHABET B}^π \hat V^π \|_∞ \notag \\ &\le \| \BELLMAN^π V^π - \hat {\ALPHABET B}^π V^π \|_∞ + \| \hat {\BELLMAN}^π V^π - \hat {\ALPHABET B}^π \hat V^π \|_∞ \notag \\ &\le \MISMATCH^π V^π + γ \| V^π - \hat V^π \|_∞ \label{eq:ineq-3} \end{align}\] where the first inequality follows from the triangle inequality, and the second inequality follows from the definition of the Bellman mismatch functional and the contraction property of Bellman operators. Rearranging terms in \eqref{eq:ineq-3} gives us \[\begin{equation} \| V^{π} - \hat V^{π} \|_∞ \le \frac{ \MISMATCH^π V^{π}}{1 - γ}. \label{eq:ineq-4}\end{equation}\] This gives the first bound.

The second bound is symmetric and obtained by interchanging the roles of \(V^π\) and \(\hat V^π\). \[\begin{align} \| V^{π} - \hat V^{π} \|_∞ &= \| \BELLMAN^π V^π - \hat {\ALPHABET B}^π \hat V^π \|_∞ \notag \\ &\le \| \BELLMAN^π V^π - \ALPHABET B^π \hat V^π \|_∞ + \| \BELLMAN^π \hat V^π - \hat {\ALPHABET B}^π \hat V^π \|_∞ \notag \\ &\le γ \| V^π - \hat V^π \|_∞ + \MISMATCH^π \hat V^π \label{eq:ineq-13} \end{align}\] Rearranging terms in \eqref{eq:ineq-13} gives us \[\begin{equation} \| V^{π} - \hat V^{π} \|_∞ \le \frac{ \MISMATCH^π \hat V^{π}}{1 - γ}. \label{eq:ineq-14}\end{equation}\] This gives the second bound.
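As a sanity check, the bound in Proposition 25.1 can be verified numerically on a small randomly generated example. The sketch below is a hypothetical illustration (arbitrary model sizes and perturbation levels, not part of the formal development): it evaluates a fixed deterministic policy in a true and a perturbed model and compares \(\| V^π - \hat V^π \|_∞\) with the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
idx = np.arange(S)

# A random "true" model and a small perturbation of it (the approximate model).
P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
c = rng.random((S, A))
Q = rng.random((S, A, S)); Q /= Q.sum(axis=-1, keepdims=True)
P_hat, c_hat = 0.9 * P + 0.1 * Q, c + 0.05 * rng.standard_normal((S, A))

pi = rng.integers(A, size=S)            # an arbitrary deterministic policy

def policy_value(P, c):
    """Solve (I - γ P_π) V = c_π for the fixed deterministic policy pi."""
    return np.linalg.solve(np.eye(S) - gamma * P[idx, pi], c[idx, pi])

def mismatch(v):
    """Bellman mismatch D^π(v) = || B^π v - B̂^π v ||_∞ evaluated at v."""
    b = c[idx, pi] + gamma * P[idx, pi] @ v
    b_hat = c_hat[idx, pi] + gamma * P_hat[idx, pi] @ v
    return np.max(np.abs(b - b_hat))

V, V_hat = policy_value(P, c), policy_value(P_hat, c_hat)
lhs = np.max(np.abs(V - V_hat))
rhs = min(mismatch(V), mismatch(V_hat)) / (1 - gamma)
print(lhs, rhs)                         # lhs ≤ rhs, as the proposition guarantees
```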

Similar to the above, we can also bound the difference between the optimal value function of the true and approximate models.

Proposition 25.2 (Value error) Let \(V^*\) and \(\hat V^*\) denote the optimal value functions for \(\ALPHABET M\) and \(\widehat {\ALPHABET M}\), respectively. Then, \[\begin{equation}\label{eq:value-error} \| V^* - \hat V^* \|_∞ \le \frac{1}{1-γ} \min\{ \MISMATCH^* V^*, \MISMATCH^* \hat V^* \} \end{equation}\]

The proof argument is almost the same as that for Proposition 25.1. The first bound is obtained as follows: \[\begin{align} \| V^{*} - \hat V^{*} \|_∞ &= \| \BELLMAN^* V^* - \hat {\ALPHABET B}^* \hat V^* \|_∞ \notag \\ &\le \| \BELLMAN^* V^* - \hat {\ALPHABET B}^* V^* \|_∞ + \| \hat {\BELLMAN}^* V^* - \hat {\ALPHABET B}^* \hat V^* \|_∞ \notag \\ &\le \MISMATCH^* V^* + γ \| V^* - \hat V^* \|_∞ \label{eq:ineq-1} \end{align}\] where the first inequality follows from the triangle inequality, and the second inequality follows from the definition of the Bellman mismatch functional and the contraction property of Bellman operators. Rearranging terms in \eqref{eq:ineq-1} gives us \[\begin{equation} \| V^* - \hat V^* \|_∞ \le \frac{ \MISMATCH^* V^*}{1 - γ}. \label{eq:ineq-2}\end{equation}\] This gives the first bound.

The second bound is symmetric and obtained by interchanging the roles of \(V^*\) and \(\hat V^*\). \[\begin{align} \| V^{*} - \hat V^{*} \|_∞ &= \| \BELLMAN^* V^* - \hat {\ALPHABET B}^* \hat V^* \|_∞ \notag \\ &\le \| \BELLMAN^* V^* - \ALPHABET B^* \hat V^* \|_∞ + \| \BELLMAN^* \hat V^* - \hat {\ALPHABET B}^* \hat V^* \|_∞ \notag \\ &\le γ \| V^* - \hat V^* \|_∞ + \MISMATCH^* \hat V^* \label{eq:ineq-11} \end{align}\] Rearranging terms in \eqref{eq:ineq-11} gives us \[\begin{equation} \| V^{*} - \hat V^{*} \|_∞ \le \frac{ \MISMATCH^* \hat V^{*}}{1 - γ}. \label{eq:ineq-12}\end{equation}\] This gives the second bound.

25.1.2 Model approximation error

To bound the model approximation error, we observe that the triangle inequality gives \[\begin{equation} \label{eq:triangle-1} \| V^* - V^{\hat π^*} \|_∞ \le \| V^* - \hat V^{\hat π^*} \|_∞ + \| V^{\hat π^*} - \hat V^{\hat π^*} \|_∞. \end{equation}\] Note that \(\hat V^{\hat π^*} = \hat V^{*}\) because \(\hat π^*\) is an optimal policy for the approximate model \(\widehat {\ALPHABET M}\).

Proposition 25.2 and Proposition 25.1 provide bounds on the first and the second term of \(\eqref{eq:triangle-1}\), respectively. Choosing the appropriate form of each bound gives us the following.

Theorem 25.1 (Model approximation error) The policy \(\hat π^*\) is an \(α\)-optimal policy of \(\ALPHABET M\) where \[ α := \| V^* - V^{\hat π^*} \|_∞ \le \frac{1}{1-γ} \bigl[ \MISMATCH^* \hat V^* + \MISMATCH^{\hat π^*} \hat V^* \bigr]. \] Moreover, since \(\MISMATCH^{\max} \hat V^*\) is an upper bound for both \(\MISMATCH^{\hat π^*} \hat V^*\) and \(\MISMATCH^* \hat V^*\), we have \[ α \le \frac{2}{(1-γ)} \MISMATCH^{\max} \hat V^*. \]
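The same kind of numerical check can be done for Theorem 25.1: solve both models, deploy the approximately optimal policy in the true model, and compare the loss with \(\tfrac{2}{1-γ}\,\MISMATCH^{\max} \hat V^*\). The sketch below is again a hypothetical illustration with random models and arbitrary perturbation sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 6, 3, 0.9
idx = np.arange(S)

def random_mdp():
    P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
    return P, rng.random((S, A))

def value_iteration(P, c, iters=5000):
    v = np.zeros(S)
    for _ in range(iters):
        v = np.min(c + gamma * (P @ v), axis=1)
    return v, np.argmin(c + gamma * (P @ v), axis=1)

def policy_value(P, c, pi):
    return np.linalg.solve(np.eye(S) - gamma * P[idx, pi], c[idx, pi])

P, c = random_mdp()                                          # true model
Q, c2 = random_mdp()
P_hat, c_hat = 0.95 * P + 0.05 * Q, c + 0.02 * (c2 - 0.5)    # approximate model

V_star, _ = value_iteration(P, c)
V_hat_star, pi_hat_star = value_iteration(P_hat, c_hat)

alpha = np.max(np.abs(V_star - policy_value(P, c, pi_hat_star)))
d_max = np.max(np.abs((c + gamma * P @ V_hat_star)
                      - (c_hat + gamma * P_hat @ V_hat_star)))
print(alpha, 2 * d_max / (1 - gamma))    # α ≤ 2 D^max(V̂*) / (1 − γ)
```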

In some applications, it is useful to have a bound on model approximation error that depends on \(V^*\) rather than \(\hat V^*\). We provide such a bound below.

Theorem 25.2 (Model approximation error) The policy \(\hat π^*\) is an \(α\)-optimal policy of \(\ALPHABET M\) where \[ α := \| V^* - V^{\hat π^*} \|_∞ \le \frac{1}{1-γ} \MISMATCH^{\hat π^*} V^* + \frac{(1+γ)}{(1-γ)^2} \MISMATCH^* V^* . \]

Moreover, since \(\MISMATCH^{\max} V^*\) is an upper bound for both \(\MISMATCH^{\hat π^*} V^*\) and \(\MISMATCH^* V^*\), we have \[ α \le \frac{2}{(1-γ)^2} \MISMATCH^{\max} V^*. \]

We bound the first term of \(\eqref{eq:triangle-1}\) by Proposition 25.2. But instead of bounding the second term of \(\eqref{eq:triangle-1}\) by Proposition 25.1, we consider the following: \[\begin{align} \| V^{\hat π^*} - \hat V^{\hat π^*} \|_∞ &= \| V^{\hat π^*} - \hat V^{*} \|_∞ = \| \BELLMAN^{\hat π^*} V^{\hat π^*} - \hat {\BELLMAN}^{\hat π^*} \hat V^{*} \|_∞ \notag \\ &\le \| \BELLMAN^{\hat π^*} V^{\hat π^*} - \BELLMAN^{\hat π^*} V^{*} \|_∞ + \| \BELLMAN^{\hat π^*} V^{*} - \hat {\BELLMAN}^{\hat π^*} V^{*} \|_∞ + \| \hat {\BELLMAN}^{\hat π^*} V^{*} - \hat {\BELLMAN}^{\hat π^*} \hat V^{*} \|_∞ \notag \\ &\le γ \| V^* - V^{\hat π^*} \|_∞ + \MISMATCH^{\hat π^*} V^* + γ \| V^* - \hat V^* \|_∞ \label{eq:ineq-21}. \end{align}\] where the first inequality follows from the triangle inequality and the second inequality follows from the definition of the Bellman mismatch functional and the contraction property of the Bellman operators.

Substituting \(\eqref{eq:ineq-21}\) in \(\eqref{eq:triangle-1}\) and rearranging terms, we get \[\begin{align} \| V^* - V^{\hat π^*} \|_∞ &\le \frac{1}{1-γ} \MISMATCH^{\hat π^*} V^* + \frac{1+γ}{1-γ} \| V^* - \hat V^* \|_∞ \notag \\ &\le \frac{1}{1-γ} \MISMATCH^{\hat π^*} V^* + \frac{(1+γ)}{(1-γ)^2} \MISMATCH^* V^* . \end{align}\] where the second inequality follows from Proposition 25.2.

Remark:

Note that the bound of Theorem 25.1 is tighter by a factor of \(1/(1-γ)\), but that bound is in terms of \(\hat V^*\). In some settings, a bound in terms of \(V^*\) is more desirable. Using Theorem 25.2 in such settings leads to an additional scaling by \(1/(1-γ)\).

Comparison with policy error bound

It is interesting to compare the bounds of Theorem 25.1 and Theorem 25.2 with the policy error bound derived in Theorem 24.1. In particular, instead of using policy \(\hat π^*\), suppose we use the policy \(μ = \GREEDY(\hat V^*)\), which is the greedy policy (in the true model) w.r.t. \(\hat V^*\). From Theorem 24.1, we know that \[ \NORM{V^* - V^{μ}}_{∞} \le \frac{2 γ}{1 - γ} \NORM{V^* - \hat V^*}_{∞} \le \frac{2 γ}{(1-γ)^2} \MISMATCH^{\max} \hat V^* \] where we have used Proposition 25.2 and Lemma 25.1 for the last inequality. In contrast, the bound of Theorem 25.1 is tighter by a factor of \(1/(1-γ)\). In principle, this policy should be better than \(\hat π^*\). The above comparison shows that it may be possible to tighten the policy error bound in Theorem 24.1.

Note that when we use the upper bounds in terms of \(\MISMATCH^{\max} V^*\), then the two bounds match in \(1/(1-γ)\) factors. In particular, \[ \NORM{V^* - V^{μ}}_{∞} \le \frac{2 γ}{1 - γ} \NORM{V^* - \hat V^*}_{∞} \le \frac{2 γ}{(1-γ)^2} \MISMATCH^{\max} V^* \] which is slightly tighter than the bound in Theorem 25.2 on \(\NORM{V^* - V^{\hat π^*}}_{∞}\).

25.2 IPM based bounds on model approximation error

Sometimes, it is easier to think in terms of explicit bounds between the models. For example, we may characterize the error between models \(\ALPHABET M\) and \(\hat {\ALPHABET M}\) as the distance between their cost functions and transition dynamics. To formalize this notion, we need to specify a metric on probability spaces. It turns out that integral probability metrics (IPM) are ideally suited for this task.

Let \(\def\F{\mathfrak{F}}\F\) be a convex and balanced set of functions from \(\ALPHABET S\) to \(\reals\). Then, the IPM distance (w.r.t. \(\F\)) between two probability laws \(μ_1\) and \(μ_2\) is given by \[ d_{\F}(μ_1, μ_2) = \sup_{f \in \F} \left| \int f d μ_1 - \int f d μ_2 \right|. \] For our discussion, we will assume that \(\F\) is a maximal generator. See the notes on IPM for more details, in particular the notion of gauge or Minkowski functional \(ρ_{\F}\) of an IPM and Proposition 35.4, which states that for any function \(f\), \[\begin{equation}\label{eq:IPM-ineq} \left| \int f d μ_1 - \int f d μ_2 \right| \le ρ_{\F}(f) d_{\F}(μ_1, μ_2). \end{equation}\]

Now, we define a notion of distance between models.

Definition 25.1 Given a function class \(\F\), we say that a model \(\hat {\ALPHABET M}\) is an \((ε,δ)\)-approximation of model \(\ALPHABET M\) if for all \((s,a) \in \ALPHABET S × \ALPHABET A\), we have:

  1. \(\ABS{ c(s,a) - \hat c(s,a) } \le ε\)
  2. \(d_{\F}( P(\cdot \mid s,a) , \hat P(\cdot \mid s,a) ) \le δ\).

Note that given any two models \(\ALPHABET M\) and \(\hat {\ALPHABET M}\), we can always say that \(\hat {\ALPHABET M}\) is an \((ε,δ)\) approximation of \(\ALPHABET M\) with \[ ε = \NORM{ c - \hat c }_{∞} \quad\text{and}\quad δ = \sup_{(s,a) \in \ALPHABET S × \ALPHABET A} d_{\F}( P(\cdot \mid s,a) , \hat P(\cdot \mid s,a) ). \]
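For the special case where \(\F\) is the set of 1-Lipschitz functions (so that \(d_{\F}\) is the Wasserstein distance) and the states can be identified with points on the real line, both quantities can be computed directly. The sketch below is illustrative (the names and array conventions are ours); the 1-Wasserstein distance is computed via the CDF formula used later in the inventory example.

```python
import numpy as np

def w1_on_grid(p, q, grid):
    """1-Wasserstein distance between two pmfs on the same sorted grid,
    computed via  K(X, Y) = ∫ |F_X(t) - F_Y(t)| dt."""
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(cdf_gap * np.diff(grid)))

def eps_delta(P, c, P_hat, c_hat, grid):
    """(ε, δ) of Definition 25.1 w.r.t. the Wasserstein distance.

    P, P_hat : (S, A, S) transition arrays;  c, c_hat : (S, A) costs;
    grid     : (S,) sorted real-valued locations of the states.
    """
    eps = float(np.max(np.abs(c - c_hat)))
    S, A = c.shape
    delta = max(w1_on_grid(P[s, a], P_hat[s, a], grid)
                for s in range(S) for a in range(A))
    return eps, delta
```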

An immediate implication of the above definition is the following.

Lemma 25.2 If \(\hat {\ALPHABET M}\) is an \((ε,δ)\) approximation of \(\ALPHABET M\) with respect to \(\F\), then for any \(v \colon \ALPHABET S \to \reals\) \[ \MISMATCH^{\max} v \le ε + γ δ \, ρ_{\F}(v). \]

From the definition of the maximum Bellman mismatch, we have \[\begin{align*} \MISMATCH^{\max} v &= \max_{(s,a) \in \ALPHABET S × \ALPHABET A} \biggl\lvert c(s,a) + γ \sum_{s' \in \ALPHABET S} P(s'|s,a)v(s') -\hat c(s,a) - γ \sum_{s' \in \ALPHABET S} \hat P(s'|s,a) v(s') \biggr\rvert \\ &\stackrel{(a)}\le \max_{(s,a) \in \ALPHABET S × \ALPHABET A} \biggl\{ \bigl\lvert c(s,a) - \hat c(s,a) \bigr\rvert + γ \biggl\lvert \sum_{s' \in \ALPHABET S} P(s' | s,a) v(s') - \sum_{s' \in \ALPHABET S} \hat P(s' | s,a) v(s') \biggr\rvert \biggr\} \\ &\stackrel{(b)}\le \max_{(s,a) \in \ALPHABET S × \ALPHABET A} \Bigl\{ \bigl\lvert c(s,a) - \hat c(s,a) \bigr\rvert + γ ρ_{\F}(v) d_{\F}(P(\cdot | s,a), \hat P(\cdot | s,a) ) \Bigr\} \\ &\stackrel{(c)}\le ε + γ δ ρ_{\F}(v) \end{align*}\] where \((a)\) follows from the triangle inequality, \((b)\) follows from \(\eqref{eq:IPM-ineq}\), and \((c)\) follows from the definition of an \((ε,δ)\)-approximation.

An immediate consequence of Lemma 25.2 when combined with Theorem 25.2 and Theorem 25.1 is the following.

Theorem 25.3 (Model approximation error) If model \(\hat {\ALPHABET M}\) is an \((ε,δ)\)-approximation of model \(\ALPHABET M\), then the policy \(\hat π^*\) is an \(α\)-optimal policy of \(\ALPHABET M\) where \[ α := \| V^* - V^{\hat π^*} \|_∞ \le \frac{2}{1-γ} \bigl[ ε + γ δ ρ_{\F}(\hat V^*)\bigr]. \] Another upper bound on \(α\) is \[ α \le \frac{2}{(1-γ)^2} \bigl[ ε + γ δ ρ_{\F}(V^*)\bigr]. \]

Note that the above bounds require knowledge of \(\hat V^*\) (or \(V^*\)). For specific choices of the IPM, it is possible to obtain looser upper bounds that do not require explicit knowledge of these value functions.

Corollary 25.1 (Instance independent model approximation error bounds)  

  1. If model \(\hat {\ALPHABET M}\) is an \((ε,δ)\)-approximation of model \(\ALPHABET M\) with respect to total variation, then the policy \(\hat π^*\) is an \(α\)-optimal policy of \(\ALPHABET M\) where \[ α \le \frac{2 ε}{1-γ} + \frac{ γ δ\SPAN(\hat c)}{(1-γ)^2}, \] or \[ α \le \frac{2 ε}{(1-γ)^2} + \frac{ γ δ\SPAN(c)}{(1-γ)^3}. \]

  2. If the approximation is with respect to the Wasserstein distance, then we have the following:

    1. If the approximate model \(\hat {\ALPHABET M}\) is \((\hat L_r, \hat L_P)\) Lipschitz (see Definition 19.1) with \(γ \hat L_P < 1\), then \[ α \le \frac{2}{1-γ}\biggl[ ε + \frac{γ δ\hat L_r}{1 - γ \hat L_{P}} \biggr] \]
    2. If the original model \({\ALPHABET M}\) is \((L_r, L_P)\) Lipschitz with \(γ L_P < 1\), then \[ α \le \frac{2}{(1-γ)^2}\biggl[ ε + \frac{γ δ L_r}{1 - γ L_{P}} \biggr] \]
  1. The result follows from Theorem 25.3 and the observation that \(\SPAN(\hat V^*) \le (1-γ)^{-1} \SPAN(\hat c)\) and \(\SPAN(V^*) \le (1-γ)^{-1} \SPAN(c)\).

  2. The result follows from Theorem 25.3 and Theorem 19.1.

25.3 Example: Inventory management with incorrect demand distribution

Let’s consider the inventory management problem. Suppose that the demand process has a PDF \(f_W\) but we choose a policy \(\hat π^*\) according to an estimated demand PDF \(f_{\hat W}\). What is the model approximation error if the policy \(\hat π^*\) is used instead of \(π^*\)?

To upper bound the model approximation error, we use the instance-independent bounds of Corollary 25.1 in terms of the Wasserstein distance. Note that since there is no error in modeling the per-step cost, \(ε = 0\). To bound \(δ\), note that the pdf of the dynamics is given by \[ p(\cdot | s,a) = f_W(\cdot -s+a) \] which is the same as the demand distribution shifted by \((s-a)\). Thus, \[ \ALPHABET K(p(\cdot | s,a), \hat p(\cdot | s,a)) = \ALPHABET K(f_W(\cdot - s + a), f_{\hat W}(\cdot - s + a)) = \ALPHABET K(f_W, f_{\hat W}) \] where the last equality uses the fact that shifting both distributions by the same amount does not change their Wasserstein distance1. Thus, \[ δ = \ALPHABET K(f_W, f_{\hat W}) \] is the Wasserstein distance between the true and the estimated demand distributions.

1 For our one-dimensional setting, this fact can be seen immediately from the formula for the Wasserstein distance in terms of the CDFs of the random variables: \[ \ALPHABET K(X,Y) = \int_{-∞}^∞ \ABS{ F_X(t) - F_Y(t) } dt. \] For a more general argument, see the next section.
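The shift-invariance fact is also easy to check numerically. The snippet below is a quick sanity check with hypothetical gamma-distributed demand samples (any distributions would do): shifting both sample sets by the same constant leaves the empirical Wasserstein distance unchanged.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
w = rng.gamma(2.0, 1.0, size=10_000)        # samples from the "true" demand f_W
w_hat = rng.gamma(2.5, 0.8, size=10_000)    # samples from the estimated demand

shift = 3.7                                 # an arbitrary value of s - a
print(wasserstein_distance(w, w_hat))
print(wasserstein_distance(w - shift, w_hat - shift))   # identical to the line above
```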

We know from Example 19.1 that the inventory management model is \((p + \max\{c_h,c_s\}, 1)\)-Lipschitz. Therefore, the approximate value function \(\hat V^*\) is \(\hat L_V\)-Lipschitz (see Theorem 19.1) with \[ \hat L_V \le \frac{p + \max\{c_h, c_s\}}{1 - γ}. \]

Therefore, we get that the policy \(\hat π^*\) is \(α\)-optimal, where \[ α \le \frac{2 γ}{(1-γ)^2}(p + \max\{c_h, c_s\}) \ALPHABET K(f_W, f_{\hat W}). \]
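To get a feel for the magnitude of this bound, consider hypothetical parameter values \(γ = 0.95\), \(p = 4\), \(c_h = 1\), \(c_s = 2\), and \(\ALPHABET K(f_W, f_{\hat W}) = 0.1\). Then \[ α \le \frac{2 (0.95)}{(0.05)^2} \times 6 \times 0.1 = 456, \] which illustrates how conservative the \(1/(1-γ)^2\) factor becomes as \(γ\) approaches 1.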

25.4 Example: Performance loss in using certainty equivalent control

Certainty equivalence refers to the following methodology for designing a control policy for a stochastic control problem: replace the random variables in the stochastic control problem by their (conditional) expectations, solve the resulting deterministic control problem to obtain a feedback control policy, and use the resulting certainty equivalent control policy in the original stochastic system. See the notes on linear quadratic regulation for an example where certainty equivalence is optimal.

However, certainty equivalence is not optimal in general. In this section, we use the results of Theorem 25.3 to characterize the performance loss when using certainty equivalent policies for dynamic models.

Consider a system with state space \(\reals^n\), action space \(\reals^m\), and dynamics \[\begin{equation}\label{eq:stochastic} S_{t+1} = f(S_t, A_t) + N_t \end{equation}\] where \(f\) is a measurable function and \(\{N_t\}_{t \ge 1}\) is a zero-mean i.i.d. noise sequence with probability law \(\nu_N\). The per-step cost is given by \(c(S_t, A_t)\).

Now consider a deterministic model obtained by assuming that the noise sequence in \(\eqref{eq:stochastic}\) takes its expected value, i.e., the dynamics are \[\begin{equation}\label{eq:deterministic} S_{t+1} = f(S_t, A_t). \end{equation}\] The per-step cost is the same as before.

Let \(\ALPHABET M\) denote the stochastic model and \(\hat {\ALPHABET M}\) denote the deterministic model. Then, the certainty equivalent design is to use the control policy \(\hat \pi^*\) in the original stochastic model \(\ALPHABET M\). We use the Wasserstein distance based bounds in Corollary 25.1 to bound \(\NORM{V^{\hat \pi^*} - V^*}_{∞}\). We assume that there is some norm \(\| \cdot \|\) on \(\reals^n\) and that the Wasserstein distance and the Lipschitz constants are computed with respect to this norm.

Since the costs are the same for both models, \(ε = 0\). We now characterize \(\delta\). For ease of notation, given random variables \(X\) and \(Y\) with probability laws \(\nu_X\) and \(\nu_Y\), we will use \(\ALPHABET K(X,Y)\) to denote \(\ALPHABET K(\nu_X, \nu_Y)\). Recall that the Wasserstein distance is defined as (Villani et al. 2008) \[\begin{equation}\label{eq:Kantorovich} \ALPHABET K(\nu_X, \nu_Y) = \inf_{ \substack{ \tilde X \sim \nu_X \\ \tilde Y \sim \nu_Y} } \EXP[ \| \tilde X - \tilde Y \| ], \end{equation}\] where the infimum is over all couplings (joint distributions) of \((\tilde X, \tilde Y)\) with the specified marginals. Now, for a fixed \((s,a)\), define \(X = f(s,a) + N\), where \(N \sim \nu_N\), and \(Y = f(s,a)\). Then, the Wasserstein distance between \(P(\cdot | s,a)\) and \(\hat P(\cdot | s,a)\) equals \(\ALPHABET K(X,Y)\). Since \(Y\) is deterministic, there is only one coupling of \((X,Y)\), so \(\eqref{eq:Kantorovich}\) equals \(\EXP[\| N \|]\), which does not depend on \((s,a)\). Thus, \[ δ = \EXP[\NORM{N}]. \]

Thus, by Corollary 25.1, we get \[\begin{equation}\label{eq:CE-bound} \NORM{ V^{\hat \pi^*} - V^*}_{∞} \le \frac{2\gamma}{1- \gamma} \EXP[ \| N \| ] L_{\hat V^*}, \end{equation}\] where \(L_{\hat V^*}\) denotes the Lipschitz constant of \(\hat V^*\). This bound precisely quantifies the engineering intuition that certainty equivalent control laws are good when the noise is “small”. This bound may be viewed as a generalization of the bounds on certainty equivalence for stochastic optimization presented earlier.
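For instance, if the noise is Gaussian, \(N \sim \mathcal{N}(0, σ^2 I_n)\), and \(\|\cdot\|\) is the Euclidean norm, then by Jensen's inequality \(\EXP[\NORM{N}] \le \sqrt{\EXP[\NORM{N}^2]} = σ\sqrt{n}\), so the right hand side of \(\eqref{eq:CE-bound}\) is at most \(\frac{2γ σ \sqrt{n}}{1-γ} L_{\hat V^*}\), which vanishes as the noise level \(σ \to 0\).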

Notes

The material in this section is adapted from Bozkurt et al. (2023), where the results were presented for unbounded per-step cost. The IPM-based bounds of Theorem 25.3 are due to Müller (1997), but the proof is adapted from Bozkurt et al. (2023), where some generalizations of Theorem 25.3 are also presented. The total variation bound in Corollary 25.1 is due to Müller (1997). The Wasserstein distance based bound in Corollary 25.1 is due to Asadi et al. (2018).

The approximation bound for the inventory management example is from Müller (1997). The approximation bound for certainty equivalence is from Bozkurt et al. (2023).