25  Model approximation

Updated

February 6, 2024

Keywords

infinite horizon, discounted cost, Lipschitz continuity, approximation bounds, state aggregation

There are many instances where approximate models are used. For example, in many applications with large state spaces, one often constructs a simulation model of the system and uses it to identify an optimal policy via simulation-based methods (such as reinforcement learning). Often, the simulation model is only an approximation of the true model. In such instances, we want to know the error incurred when the policy obtained from the simulation model is deployed in the real world. In this section, we present bounds on such sim-to-real transfer.

Consider an MDP \(\ALPHABET M = \langle \ALPHABET S, \ALPHABET A, P, c, γ \rangle\). Suppose the components \(\langle \ALPHABET S, \ALPHABET A, γ \rangle\) are known exactly but the components \((P,c)\) are known only approximately. Consider the approximate MDP \(\widehat {\ALPHABET M} = \langle \ALPHABET S, \ALPHABET A, \hat P, \hat c, γ \rangle\). We will call \(\ALPHABET M\) the true model and \(\widehat {\ALPHABET M}\) the approximate model.

Let \(V^*\) and \(\hat V^*\) denote the optimal value functions of the true model \(\ALPHABET M\) and the approximate model \(\widehat {\ALPHABET M}\), respectively. Moreover, let \(π^*\) and \(\hat π^*\) be optimal policies for the true model \(\ALPHABET M\) and the approximate model \(\widehat {\ALPHABET M}\), respectively.

We are interested in the following questions:

  1. Policy error bounds: Given a policy \(π\), what is the error if \(\hat V^π\) is used as an approximation for \(V^π\)?

  2. Value error bounds: What is the error if \(\hat V^*\) is used as an approximation for \(V^*\)?

  3. Model approximation error: What is the error if the policy \(\hat π^*\) is used instead of the optimal policy \(π^*\)?

25.1 Bounds for model approximation

25.1.1 Policy and value error bounds

Let \(\BELLMAN^π\) and \(\BELLMAN^*\) denote the Bellman operator for policy \(π\) and the optimality Bellman operator for model \(\ALPHABET M\). Let \(\hat {\BELLMAN}^π\) and \(\hat {\BELLMAN}^*\) denote the corresponding quantities for model \(\widehat {\ALPHABET M}\). Define the Bellman mismatch functionals \(\MISMATCH^π\) and \(\MISMATCH^*\) as follows: \[\begin{align*} \MISMATCH^π v &= \| \BELLMAN^π v - \hat {\BELLMAN}^π v \|_∞, \\ \MISMATCH^* v &= \| \BELLMAN^* v - \hat {\BELLMAN}^* v \|_∞ . \end{align*}\]
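For reference, these operators take the standard discounted-cost form (stated here for a deterministic policy \(π\)): \[ (\BELLMAN^π v)(s) = c(s, π(s)) + γ \sum_{s' \in \ALPHABET S} P(s'|s, π(s))\, v(s'), \qquad (\BELLMAN^* v)(s) = \min_{a \in \ALPHABET A} \Bigl\{ c(s,a) + γ \sum_{s' \in \ALPHABET S} P(s'|s,a)\, v(s') \Bigr\}, \] and \(\hat{\BELLMAN}^π\) and \(\hat{\BELLMAN}^*\) are defined analogously with \((\hat P, \hat c)\) in place of \((P, c)\).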

Also define the maximum Bellman mismatch as \[\begin{align*} \MISMATCH^{\max} v &= \max_{(s,a) \in \ALPHABET S × \ALPHABET A} \biggl\lvert c(s,a) + γ \sum_{s' \in \ALPHABET S} P(s'|s,a)v(s') \notag \\ & \hskip 6em -\hat c(s,a) - γ \sum_{s' \in \ALPHABET S} \hat P(s'|s,a) v(s') \biggr\rvert. \end{align*}\]
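For finite models, these mismatch functionals are easy to compute directly from the definition. The following is a minimal NumPy sketch (the array conventions and function name are ours, purely for illustration) that evaluates \(\MISMATCH^{\max} v\) for tabular \((P, c)\) and \((\hat P, \hat c)\):

```python
import numpy as np

def bellman_mismatch_max(P, c, P_hat, c_hat, gamma, v):
    """Maximum Bellman mismatch between two tabular models, evaluated at v.

    P, P_hat : arrays of shape (S, A, S) with P[s, a, s'] = P(s' | s, a)
    c, c_hat : arrays of shape (S, A) of per-step costs
    v        : array of shape (S,)
    """
    q = c + gamma * (P @ v)             # c(s,a) + γ Σ_{s'} P(s'|s,a) v(s')
    q_hat = c_hat + gamma * (P_hat @ v)
    return float(np.max(np.abs(q - q_hat)))
```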

Lemma 25.1 The following inequalities hold:

  • \(\sup_{π \in Π} \MISMATCH^π v = \MISMATCH^{\max} v\)
  • \(\MISMATCH^* v \le \MISMATCH^{\max} v\).
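Both claims can be verified directly from the definitions. For any (possibly randomized) policy \(π\), \((\BELLMAN^π v - \hat{\BELLMAN}^π v)(s)\) is an average, over actions drawn from \(π(\cdot \mid s)\), of the per-\((s,a)\) differences appearing in \(\MISMATCH^{\max} v\); hence \(\MISMATCH^π v \le \MISMATCH^{\max} v\), and the deterministic policy that picks a maximizing action in each state attains the bound. The second claim follows from the elementary inequality \(\bigl\lvert \min_{a} f(a) - \min_{a} g(a) \bigr\rvert \le \max_{a} \lvert f(a) - g(a) \rvert\), applied state by state.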

The Bellman mismatch functional can be used to bound the performance difference of a policy between the true and approximate models.

Proposition 25.1 (Policy error) For any (possibly randomized) policy \(π\), \[\begin{equation}\label{eq:policy-error} \| V^{π} - \hat V^{π} \|_∞ \le \frac{1}{1-γ} \min\{ \MISMATCH^π V^{π}, \MISMATCH^π \hat V^{π} \}. \end{equation}\]

We bound the left hand side of \eqref{eq:policy-error} in two ways. The first way is as follows: \[\begin{align} \| V^{π} - \hat V^{π} \|_∞ &= \| \BELLMAN^π V^π - \hat {\ALPHABET B}^π \hat V^π \|_∞ \notag \\ &\le \| \BELLMAN^π V^π - \hat {\ALPHABET B}^π V^π \|_∞ + \| \hat {\BELLMAN}^π V^π - \hat {\ALPHABET B}^π \hat V^π \|_∞ \notag \\ &\le \MISMATCH^π V^π + γ \| V^π - \hat V^π \|_∞ \label{eq:ineq-3} \end{align}\] where the first inequality follows from the triangle inequality, and the second inequality follows from the definition of the Bellman mismatch functional and the contraction property of Bellman operators. Rearranging terms in \eqref{eq:ineq-3} gives us \[\begin{equation} \| V^{π} - \hat V^{π} \|_∞ \le \frac{ \MISMATCH^π V^{π}}{1 - γ}. \label{eq:ineq-4}\end{equation}\] This gives the first bound.

The second bound is symmetric and obtained by interchanging the roles of \(V^π\) and \(\hat V^π\). \[\begin{align} \| V^{π} - \hat V^{π} \|_∞ &= \| \BELLMAN^π V^π - \hat {\ALPHABET B}^π \hat V^π \|_∞ \notag \\ &\le \| \BELLMAN^π V^π - \ALPHABET B^π \hat V^π \|_∞ + \| \BELLMAN^π \hat V^π - \hat {\ALPHABET B}^π \hat V^π \|_∞ \notag \\ &\le γ \| V^π - \hat V^π \|_∞ + \MISMATCH^π \hat V^π \label{eq:ineq-13} \end{align}\] Rearranging terms in \eqref{eq:ineq-13} gives us \[\begin{equation} \| V^{π} - \hat V^{π} \|_∞ \le \frac{ \MISMATCH^π \hat V^{π}}{1 - γ}. \label{eq:ineq-14}\end{equation}\] This gives the second bound.
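As a sanity check, the bound in Proposition 25.1 can be verified numerically on a small randomly generated example. The sketch below is a hypothetical illustration (arbitrary model sizes and perturbation levels, not part of the formal development): it evaluates a fixed deterministic policy in a true and a perturbed model and compares \(\| V^π - \hat V^π \|_∞\) with the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
idx = np.arange(S)

# A random "true" model and a small perturbation of it (the approximate model).
P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
c = rng.random((S, A))
Q = rng.random((S, A, S)); Q /= Q.sum(axis=-1, keepdims=True)
P_hat, c_hat = 0.9 * P + 0.1 * Q, c + 0.05 * rng.standard_normal((S, A))

pi = rng.integers(A, size=S)            # an arbitrary deterministic policy

def policy_value(P, c):
    """Solve (I - γ P_π) V = c_π for the fixed deterministic policy pi."""
    return np.linalg.solve(np.eye(S) - gamma * P[idx, pi], c[idx, pi])

def mismatch(v):
    """Bellman mismatch D^π(v) = || B^π v - B̂^π v ||_∞ evaluated at v."""
    b = c[idx, pi] + gamma * P[idx, pi] @ v
    b_hat = c_hat[idx, pi] + gamma * P_hat[idx, pi] @ v
    return np.max(np.abs(b - b_hat))

V, V_hat = policy_value(P, c), policy_value(P_hat, c_hat)
lhs = np.max(np.abs(V - V_hat))
rhs = min(mismatch(V), mismatch(V_hat)) / (1 - gamma)
print(lhs, rhs)                         # lhs ≤ rhs, as the proposition guarantees
```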

Similar to the above, we can also bound the difference between the optimal value function of the true and approximate models.

Proposition 25.2 (Value error) Let \(V^*\) and \(\hat V^*\) denote the optimal value functions for \(\ALPHABET M\) and \(\widehat {\ALPHABET M}\), respectively. Then, \[\begin{equation}\label{eq:value-error} \| V^* - \hat V^* \|_∞ \le \frac{1}{1-γ} \min\{ \MISMATCH^* V^*, \MISMATCH^* \hat V^* \} \end{equation}\]

The proof argument is almost the same as that for Proposition 25.1. The first bound is obtained as follows: \[\begin{align} \| V^{*} - \hat V^{*} \|_∞ &= \| \BELLMAN^* V^* - \hat {\ALPHABET B}^* \hat V^* \|_∞ \notag \\ &\le \| \BELLMAN^* V^* - \hat {\ALPHABET B}^* V^* \|_∞ + \| \hat {\BELLMAN}^* V^* - \hat {\ALPHABET B}^* \hat V^* \|_∞ \notag \\ &\le \MISMATCH^* V^* + γ \| V^* - \hat V^* \|_∞ \label{eq:ineq-1} \end{align}\] where the first inequality follows from the triangle inequality, and the second inequality follows from the definition of the Bellman mismatch functional and the contraction property of Bellman operators. Rearranging terms in \eqref{eq:ineq-1} gives us \[\begin{equation} \| V^* - \hat V^* \|_∞ \le \frac{ \MISMATCH^* V^*}{1 - γ}. \label{eq:ineq-2}\end{equation}\] This gives the first bound.

The second bound is symmetric and obtained by interchanging the roles of \(V^*\) and \(\hat V^*\). \[\begin{align} \| V^{*} - \hat V^{*} \|_∞ &= \| \BELLMAN^* V^* - \hat {\ALPHABET B}^* \hat V^* \|_∞ \notag \\ &\le \| \BELLMAN^* V^* - \ALPHABET B^* \hat V^* \|_∞ + \| \BELLMAN^* \hat V^* - \hat {\ALPHABET B}^* \hat V^* \|_∞ \notag \\ &\le γ \| V^* - \hat V^* \|_∞ + \MISMATCH^* \hat V^* \label{eq:ineq-11} \end{align}\] Rearranging terms in \eqref{eq:ineq-11} gives us \[\begin{equation} \| V^{*} - \hat V^{*} \|_∞ \le \frac{ \MISMATCH^* \hat V^{*}}{1 - γ}. \label{eq:ineq-12}\end{equation}\] This gives the second bound.

25.1.2 Model approximation error

To bound the model approximation error, we observe that the triangle inequality gives \[\begin{equation} \label{eq:triangle-1} \| V^* - V^{\hat π^*} \|_∞ \le \| V^* - \hat V^{\hat π^*} \|_∞ + \| V^{\hat π^*} - \hat V^{\hat π^*} \|_∞. \end{equation}\] Note that \(\hat V^{\hat π^*} = \hat V^{*}\) because \(\hat π^*\) is an optimal policy for the approximate model \(\widehat {\ALPHABET M}\).

Proposition 25.2 and Proposition 25.1 provide bounds on the first and the second term of \(\eqref{eq:triangle-1}\), respectively. Choosing the appropriate form of each bound gives us the following.

Theorem 25.1 (Model approximation error) The policy \(\hat π^*\) is an \(α\)-optimal policy of \(\ALPHABET M\) where \[ α := \| V^* - V^{\hat π^*} \|_∞ \le \frac{1}{1-γ} \bigl[ \MISMATCH^* \hat V^* + \MISMATCH^{\hat π^*} \hat V^* \bigr]. \] Moreover, since \(\MISMATCH^{\max} \hat V^*\) is an upper bound for both \(\MISMATCH^{\hat π^*} \hat V^*\) and \(\MISMATCH^* \hat V^*\), we have \[ α \le \frac{2}{(1-γ)} \MISMATCH^{\max} \hat V^*. \]
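The same kind of numerical check can be done for Theorem 25.1: solve both models, deploy the approximately optimal policy in the true model, and compare the loss with \(\tfrac{2}{1-γ}\,\MISMATCH^{\max} \hat V^*\). The sketch below is again a hypothetical illustration with random models and arbitrary perturbation sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 6, 3, 0.9
idx = np.arange(S)

def random_mdp():
    P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
    return P, rng.random((S, A))

def value_iteration(P, c, iters=5000):
    v = np.zeros(S)
    for _ in range(iters):
        v = np.min(c + gamma * (P @ v), axis=1)
    return v, np.argmin(c + gamma * (P @ v), axis=1)

def policy_value(P, c, pi):
    return np.linalg.solve(np.eye(S) - gamma * P[idx, pi], c[idx, pi])

P, c = random_mdp()                                          # true model
Q, c2 = random_mdp()
P_hat, c_hat = 0.95 * P + 0.05 * Q, c + 0.02 * (c2 - 0.5)    # approximate model

V_star, _ = value_iteration(P, c)
V_hat_star, pi_hat_star = value_iteration(P_hat, c_hat)

alpha = np.max(np.abs(V_star - policy_value(P, c, pi_hat_star)))
d_max = np.max(np.abs((c + gamma * P @ V_hat_star)
                      - (c_hat + gamma * P_hat @ V_hat_star)))
print(alpha, 2 * d_max / (1 - gamma))    # α ≤ 2 D^max(V̂*) / (1 − γ)
```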

In some applications, it is useful to have a bound on model approximation error that depends on \(V^*\) rather than \(\hat V^*\). We provide such a bound below.

Theorem 25.2 (Model approximation error) The policy \(\hat π^*\) is an \(α\)-optimal policy of \(\ALPHABET M\) where \[ α := \| V^* - V^{\hat π^*} \|_∞ \le \frac{1}{1-γ} \MISMATCH^{\hat π^*} V^* + \frac{(1+γ)}{(1-γ)^2} \MISMATCH^* V^* . \]

Moreover, since \(\MISMATCH^{\max} V^*\) is an upper bound for both \(\MISMATCH^{\hat π^*} V^*\) and \(\MISMATCH^* V^*\), we have \[ α \le \frac{2}{(1-γ)^2} \MISMATCH^{\max} V^*. \]

We bound the first term of \(\eqref{eq:triangle-1}\) by Proposition 25.2. But instead of bounding the second term of \(\eqref{eq:triangle-1}\) by Proposition 25.1, we consider the following: \[\begin{align} \| V^{\hat π^*} - \hat V^{\hat π^*} \|_∞ &= \| V^{\hat π^*} - \hat V^{*} \|_∞ = \| \BELLMAN^{\hat π^*} V^{\hat π^*} - \hat {\BELLMAN}^{\hat π^*} \hat V^{*} \|_∞ \notag \\ &\le \| \BELLMAN^{\hat π^*} V^{\hat π^*} - \BELLMAN^{\hat π^*} V^{*} \|_∞ + \| \BELLMAN^{\hat π^*} V^{*} - \hat {\BELLMAN}^{\hat π^*} V^{*} \|_∞ + \| \hat {\BELLMAN}^{\hat π^*} V^{*} - \hat {\BELLMAN}^{\hat π^*} \hat V^{*} \|_∞ \notag \\ &\le γ \| V^* - V^{\hat π^*} \|_∞ + \MISMATCH^{\hat π^*} V^* + γ \| V^* - \hat V^* \|_∞ \label{eq:ineq-21}. \end{align}\] where the first inequality follows from the triangle inequality and the second inequality follows from the definition of the Bellman mismatch functional and the contraction property of the Bellman operators.

Substituting \(\eqref{eq:ineq-21}\) in \(\eqref{eq:triangle-1}\) and rearranging terms, we get \[\begin{align} \| V^* - V^{\hat π^*} \|_∞ &\le \frac{1}{1-γ} \MISMATCH^{\hat π^*} V^* + \frac{1+γ}{1-γ} \| V^* - \hat V^* \|_∞ \notag \\ &\le \frac{1}{1-γ} \MISMATCH^{\hat π^*} V^* + \frac{(1+γ)}{(1-γ)^2} \MISMATCH^* V^* . \end{align}\] where the second inequality follows from Proposition 25.2.

Remark:

Note that the bound of Theorem 25.1 is tighter by a factor of \(1/(1-γ)\), but that bound is in terms of \(\hat V^*\). In some settings, a bound in terms of \(V^*\) is more desirable. Using Theorem 25.2 in such settings leads to an additional scaling by \(1/(1-γ)\).

Comparison with policy error bound

It is interesting to compare the bounds of Theorem 25.1 and Theorem 25.2 with the policy error bound derived in Theorem 24.1. In particular, instead of using policy \(\hat π^*\), suppose we use the policy \(μ = \GREEDY(\hat V^*)\), which is the greedy policy (in the true model) w.r.t. \(\hat V^*\). From Theorem 24.1, we know that \[ \NORM{V^* - V^{μ}}_{∞} \le \frac{2 γ}{1 - γ} \NORM{V^* - \hat V^*}_{∞} \le \frac{2 γ}{(1-γ)^2} \MISMATCH^{\max} \hat V^* \] where we have used Proposition 25.2 and Lemma 25.1 for the last inequality. In contrast, the bound of Theorem 25.1 is tighter by a factor of \(1/(1-γ)\). In principle, this policy should be better than \(\hat π^*\). The above comparison shows that it may be possible to tighten the policy error bound in Theorem 24.1.

Note that when we use the upper bounds in terms of \(\MISMATCH^{\max} V^*\), then the two bounds match in \(1/(1-γ)\) factors. In particular, \[ \NORM{V^* - V^{μ}}_{∞} \le \frac{2 γ}{1 - γ} \NORM{V^* - \hat V^*}_{∞} \le \frac{2 γ}{(1-γ)^2} \MISMATCH^{\max} V^* \] which is slightly tighter than the bound in Theorem 25.2 on \(\NORM{V^* - V^{\hat π^*}}_{∞}\).

25.2 IPM based bounds on model approximation error

Sometimes, it is easier to think in terms of explicit bounds between the models. For example, we may characterize the error between models \(\ALPHABET M\) and \(\hat {\ALPHABET M}\) as the distance between their cost functions and transition dynamics. To formalize this notion, we need to specify a metric on probability spaces. It turns out that integral probability metrics (IPM) are ideally suited for this task.

Let \(\def\F{\mathfrak{F}}\F\) be a convex and balanced set of functions from \(\ALPHABET S\) to \(\reals\). Then, the IPM distance (w.r.t. \(\F\)) between two probability laws \(μ_1\) and \(μ_2\) is given by \[ d_{\F}(μ_1, μ_2) = \sup_{f \in \F} \left| \int f d μ_1 - \int f d μ_2 \right|. \] For our discussion, we will assume that \(\F\) is a maximal generator. See the notes on IPM for more details, in particular the notion of gauge or Minkowski functional \(ρ_{\F}\) of an IPM and Proposition 35.4, which states that for any function \(f\), \[\begin{equation}\label{eq:IPM-ineq} \left| \int f d μ_1 - \int f d μ_2 \right| \le ρ_{\F}(f) d_{\F}(μ_1, μ_2). \end{equation}\]

Now, we define a notion of distance between models.

Definition 25.1 Given a function class \(\F\), we say that a model \(\hat {\ALPHABET M}\) is an \((ε,δ)\)-approximation of model \(\ALPHABET M\) if for all \((s,a) \in \ALPHABET S × \ALPHABET A\), we have:

  1. \(\ABS{ c(s,a) - \hat c(s,a) } \le ε\)
  2. \(d_{\F}( P(\cdot \mid s,a) , \hat P(\cdot \mid s,a) ) \le δ\).

Note that given any two models \(\ALPHABET M\) and \(\hat {\ALPHABET M}\), we can always say that \(\hat {\ALPHABET M}\) is an \((ε,δ)\) approximation of \(\ALPHABET M\) with \[ ε = \NORM{ c - \hat c }_{∞} \quad\text{and}\quad δ = \sup_{(s,a) \in \ALPHABET S × \ALPHABET A} d_{\F}( P(\cdot \mid s,a) , \hat P(\cdot \mid s,a) ). \]
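For the special case where \(\F\) is the set of 1-Lipschitz functions (so that \(d_{\F}\) is the Wasserstein distance) and the states can be identified with points on the real line, both quantities can be computed directly. The sketch below is illustrative (the names and array conventions are ours); the 1-Wasserstein distance is computed via the CDF formula used later in the inventory example.

```python
import numpy as np

def w1_on_grid(p, q, grid):
    """1-Wasserstein distance between two pmfs on the same sorted grid,
    computed via  K(X, Y) = ∫ |F_X(t) - F_Y(t)| dt."""
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(cdf_gap * np.diff(grid)))

def eps_delta(P, c, P_hat, c_hat, grid):
    """(ε, δ) of Definition 25.1 w.r.t. the Wasserstein distance.

    P, P_hat : (S, A, S) transition arrays;  c, c_hat : (S, A) costs;
    grid     : (S,) sorted real-valued locations of the states.
    """
    eps = float(np.max(np.abs(c - c_hat)))
    S, A = c.shape
    delta = max(w1_on_grid(P[s, a], P_hat[s, a], grid)
                for s in range(S) for a in range(A))
    return eps, delta
```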

An immediate implication of the above definition is the following.

Lemma 25.2 If \(\hat {\ALPHABET M}\) is an \((ε,δ)\) approximation of \(\ALPHABET M\) with respect to \(\F\), then for any \(v \colon \ALPHABET S \to \reals\) \[ \MISMATCH^{\max} v \le ε + γ δ \, ρ_{\F}(v). \]

From the definition of the maximum Bellman mismatch, we have \[\begin{align*} \MISMATCH^{\max} v &= \max_{(s,a) \in \ALPHABET S × \ALPHABET A} \biggl\lvert c(s,a) + γ \sum_{s' \in \ALPHABET S} P(s'|s,a)v(s') -\hat c(s,a) - γ \sum_{s' \in \ALPHABET S} \hat P(s'|s,a) v(s') \biggr\rvert \\ &\stackrel{(a)}\le \max_{(s,a) \in \ALPHABET S × \ALPHABET A} \biggl\{ \bigl\lvert c(s,a) - \hat c(s,a) \bigr\rvert + γ \biggl\lvert \sum_{s' \in \ALPHABET S} P(s' | s,a) v(s') - \sum_{s' \in \ALPHABET S} \hat P(s' | s,a) v(s') \biggr\rvert \biggr\} \\ &\stackrel{(b)}\le \max_{(s,a) \in \ALPHABET S × \ALPHABET A} \Bigl\{ \bigl\lvert c(s,a) - \hat c(s,a) \bigr\rvert + γ ρ_{\F}(v) d_{\F}(P(\cdot | s,a), \hat P(\cdot | s,a) ) \Bigr\} \\ &\stackrel{(c)}\le ε + γ δ ρ_{\F}(v) \end{align*}\] where \((a)\) follows from the triangle inequality, \((b)\) follows from \(\eqref{eq:IPM-ineq}\), and \((c)\) follows from the definition of an \((ε,δ)\)-approximation.

An immediate consequence of Lemma 25.2 when combined with Theorem 25.2 and Theorem 25.1 is the following.

Theorem 25.3 (Model approximation error) If model \(\hat {\ALPHABET M}\) is an \((ε,δ)\)-approximation of model \(\ALPHABET M\), then the policy \(\hat π^*\) is an \(α\)-optimal policy of \(\ALPHABET M\) where \[ α := \| V^* - V^{\hat π^*} \|_∞ \le \frac{2}{1-γ} \bigl[ ε + γ δ ρ_{\F}(\hat V^*)\bigr]. \] Another upper bound on \(α\) is \[ α \le \frac{2}{(1-γ)^2} \bigl[ ε + γ δ ρ_{\F}(V^*)\bigr]. \]

Note that the above bounds require knowledge of \(\hat V^*\) (or \(V^*\)). For specific choices of the IPM, it is possible to obtain looser upper bounds that do not require explicit knowledge of these value functions.

Corollary 25.1 (Instance independent model approximation error bounds)  

  1. If model \(\hat {\ALPHABET M}\) is an \((ε,δ)\)-approximation of model \(\ALPHABET M\) with respect to total variation, then the policy \(\hat π^*\) is an \(α\)-optimal policy of \(\ALPHABET M\) where \[ α \le \frac{2 ε}{1-γ} + \frac{ γ δ\SPAN(\hat c)}{(1-γ)^2}, \] or \[ α \le \frac{2 ε}{(1-γ)^2} + \frac{ γ δ\SPAN(c)}{(1-γ)^3}. \]

  2. If the approximation is with respect to the Wasserstein distance, then we have the following:

    1. If the approximate model \(\hat {\ALPHABET M}\) is \((\hat L_r, \hat L_P)\) Lipschitz (see Definition 19.1) with \(γ \hat L_P < 1\), then \[ α \le \frac{2}{1-γ}\biggl[ ε + \frac{γ δ\hat L_r}{1 - γ \hat L_{P}} \biggr] \]
    2. If the original model \({\ALPHABET M}\) is \((L_r, L_P)\) Lipschitz with \(γ L_P < 1\), then \[ α \le \frac{2}{(1-γ)^2}\biggl[ ε + \frac{γ δ L_r}{1 - γ L_{P}} \biggr] \]
  1. The result follows from Theorem 25.3 and the observation that \(\SPAN(\hat V^*) \le (1-γ)^{-1} \SPAN(\hat c)\) and \(\SPAN(V^*) \le (1-γ)^{-1} \SPAN(c)\).

  2. The result follows from Theorem 25.3 and Theorem 19.1.

25.3 Example: Inventory management with incorrect demand distribution

Let’s consider the inventory management problem. Suppose that the demand process has a PDF \(f_W\) but we choose a policy \(\hat π^*\) according to an estimated demand PDF \(f_{\hat W}\). What is the model approximation error if the policy \(\hat π^*\) is used instead of \(π^*\)?

To upper bound the model approximation error, we use the instance-independent bounds of Corollary 25.1 in terms of the Wasserstein distance. Note that since there is no error in modeling the per-step cost, \(ε = 0\). To bound \(δ\), note that the pdf of the dynamics is given by \[ p(\cdot | s,a) = f_W(\cdot -s+a) \] which is the same as the demand distribution shifted by \((s-a)\). Thus, \[ \ALPHABET K(p(\cdot | s,a), \hat p(\cdot | s,a)) = \ALPHABET K(f_W(\cdot - s + a), f_{\hat W}(\cdot - s + a)) = \ALPHABET K(f_W, f_{\hat W}) \] where the last equality uses the fact that shifting both distributions by the same amount does not change their Wasserstein distance1. Thus, \[ δ = \ALPHABET K(f_W, f_{\hat W}) \] is the Wasserstein distance between the true and the estimated demand distributions.

1 For our one-dimensional setting, this fact can be seen immediately from the formula for the Wasserstein distance in terms of the CDFs of the random variables: \[ \ALPHABET K(X,Y) = \int_{-∞}^∞ \ABS{ F_X(t) - F_Y(t) } dt. \] For a more general argument, see the next section.
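The shift-invariance fact is also easy to check numerically. The snippet below is a quick sanity check with hypothetical gamma-distributed demand samples (any distributions would do): shifting both sample sets by the same constant leaves the empirical Wasserstein distance unchanged.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
w = rng.gamma(2.0, 1.0, size=10_000)        # samples from the "true" demand f_W
w_hat = rng.gamma(2.5, 0.8, size=10_000)    # samples from the estimated demand

shift = 3.7                                 # an arbitrary value of s - a
print(wasserstein_distance(w, w_hat))
print(wasserstein_distance(w - shift, w_hat - shift))   # identical to the line above
```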

We know from Example 19.1 that the inventory management model is \((p + \max\{c_h,c_s\}, 1)\)-Lipschitz. Therefore, the approximate value function \(\hat V^*\) is \(\hat L_V\)-Lipschitz (see Theorem 19.1) with \[ \hat L_V \le \frac{p + \max\{c_h, c_s\}}{1 - γ}. \]

Therefore, we get that the policy \(\hat π^*\) is \(α\)-optimal, where \[ α \le \frac{2 γ}{(1-γ)^2}(p + \max\{c_h, c_s\}) \ALPHABET K(f_W, f_{\hat W}). \]
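To get a feel for the magnitude of this bound, consider hypothetical parameter values \(γ = 0.95\), \(p = 4\), \(c_h = 1\), \(c_s = 2\), and \(\ALPHABET K(f_W, f_{\hat W}) = 0.1\). Then \[ α \le \frac{2 (0.95)}{(0.05)^2} \times 6 \times 0.1 = 456, \] which illustrates how conservative the \(1/(1-γ)^2\) factor becomes as \(γ\) approaches 1.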

25.4 Example: Performance loss in using certainty equivalent control

Certainty equivalence refers to the following methodology for designing a control policy for a stochastic control problem: replace the random variables in the stochastic control problem by their (conditional) expectations, solve the resulting deterministic control problem to obtain a feedback control policy, and use the resulting certainty equivalent control policy in the original stochastic system. See the notes on linear quadratic regulation for an example where certainty equivalence is optimal.

However, certainty equivalence is not optimal in general. In this section, we use the results of Theorem 25.3 to characterize the performance loss when using certainty equivalent policies for dynamic models.

Consider a system with state space \(\reals^n\), action space \(\reals^m\), and dynamics \[\begin{equation}\label{eq:stochastic} S_{t+1} = f(S_t, A_t) + N_t \end{equation}\] where \(f\) is a measurable function and \(\{N_t\}_{t \ge 1}\) is a zero-mean i.i.d. noise sequence with probability law \(\nu_N\). The per-step cost is given by \(c(S_t, A_t)\).

Now consider a deterministic model obtained by assuming that the noise sequence in \(\eqref{eq:stochastic}\) takes its expected value, i.e., the dynamics are \[\begin{equation}\label{eq:deterministic} S_{t+1} = f(S_t, A_t). \end{equation}\] The per-step cost is the same as before.

Let \(\ALPHABET M\) denote the stochastic model and \(\hat {\ALPHABET M}\) denote the deterministic model. Then, the certainty equivalent design is to use the control policy \(\hat \pi^*\) in the original stochastic model \(\ALPHABET M\). We use the Wasserstein distance based bounds in Corollary 25.1 to bound \(\NORM{V^{\hat \pi^*} - V^*}_{∞}\). We assume that there is some norm \(\| \cdot \|\) on \(\reals^n\) and that the Wasserstein distance and the Lipschitz constants are computed with respect to this norm.

Since the costs are the same for both models, \(ε = 0\). We now characterize \(\delta\). For ease of notation, given random variables \(X\) and \(Y\) with probability laws \(\nu_X\) and \(\nu_Y\), we will use \(\ALPHABET K(X,Y)\) to denote \(\ALPHABET K(\nu_X, \nu_Y)\). Recall that the Wasserstein distance is defined as (Villani et al. 2008) \[\begin{equation}\label{eq:Kantorovich} \ALPHABET K(\nu_X, \nu_Y) = \inf_{ \substack{ \tilde X \sim \nu_X \\ \tilde Y \sim \nu_Y} } \EXP[ \| \tilde X - \tilde Y \| ], \end{equation}\] where the infimum is over all couplings (joint distributions) of \((\tilde X, \tilde Y)\) with the specified marginals. Now, for a fixed \((s,a)\), define \(X = f(s,a) + N\), where \(N \sim \nu_N\), and \(Y = f(s,a)\). Then, the Wasserstein distance between \(P(\cdot | s,a)\) and \(\hat P(\cdot | s,a)\) equals \(\ALPHABET K(X,Y)\). Since \(Y\) is deterministic, there is only one coupling of \((X,Y)\), so \(\eqref{eq:Kantorovich}\) equals \(\EXP[\| N \|]\), which does not depend on \((s,a)\). Thus, \[ δ = \EXP[\NORM{N}]. \]

Thus, by Corollary 25.1, we get \[\begin{equation}\label{eq:CE-bound} \NORM{ V^{\hat \pi^*} - V^*}_{∞} \le \frac{2\gamma}{1- \gamma} \EXP[ \| N \| ] L_{\hat V^*}, \end{equation}\] where \(L_{\hat V^*}\) denotes the Lipschitz constant of \(\hat V^*\). This bound precisely quantifies the engineering intuition that certainty equivalent control laws are good when the noise is “small”. This bound may be viewed as a generalization of the bounds on certainty equivalence for stochastic optimization presented earlier.
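For instance, if the noise is Gaussian, \(N \sim \mathcal{N}(0, σ^2 I_n)\), and \(\|\cdot\|\) is the Euclidean norm, then by Jensen's inequality \(\EXP[\NORM{N}] \le \sqrt{\EXP[\NORM{N}^2]} = σ\sqrt{n}\), so the right hand side of \(\eqref{eq:CE-bound}\) is at most \(\frac{2γ σ \sqrt{n}}{1-γ} L_{\hat V^*}\), which vanishes as the noise level \(σ \to 0\).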

Notes

The material in this section is adapted from Bozkurt et al. (2023), where the results were presented for unbounded per-step cost. The IPM-based bounds of Theorem 25.3 are due to Müller (1997), but the proof is adapted from Bozkurt et al. (2023), where some generalizations of Theorem 25.3 are also presented. The total variation bound in Corollary 25.1 is due to Müller (1997). The Wasserstein distance based bound in Corollary 25.1 is due to Asadi et al. (2018).

The approximation bound for the inventory management example is from Müller (1997). The approximation bound for certainty equivalence is from Bozkurt et al. (2023).