35 Integral Probability Metrics

Updated

December 18, 2023

Keywords

IPM

[TODO: Add introduction]

35.1 Definition of Integral Probability Metrics

Notation

  1. Let \((\ALPHABET X,d)\) be a complete separable metric space, i.e., a Polish space.

  2. Let \(w \colon \ALPHABET X \to [1,∞)\) be a measurable function and define the weighted norm of any function \(f \colon \ALPHABET X \to \reals\) as \[ \NORM{f}_w = \sup_{x \in \ALPHABET X} \frac{f(x)}{w(x)}. \]

  3. Let \(\ALPHABET M(\ALPHABET X)\) denote the set of all measurable real-valued functions on \(\ALPHABET X\) and let \(\ALPHABET M_w(\ALPHABET X)\) denote the subset with bounded \(w\)-norm, i.e., \[ \ALPHABET M_w(\ALPHABET X) = \{ f \in \ALPHABET M(\ALPHABET X) : \NORM{f}_w < ∞ \} \]

  4. Let \(\mathfrak X\) denote the Borel \(σ\)-algebra of \(\ALPHABET X\).

  5. Let \(\ALPHABET P(\ALPHABET X)\) denote the set of all probability measures on \((\ALPHABET X, \mathfrak X)\) and \(\ALPHABET P_w(\ALPHABET X) = \{ μ \in \ALPHABET P(\ALPHABET X) : \int w dμ < ∞ \}\).

\(\def\F{\mathfrak F}\) Integral probability metrics (IPMs) are pseudo-metrics on \(\ALPHABET P_w(\ALPHABET X)\) defined as follows.

Definition 35.1 The integral probability metric \(d_{\F}\) on \(\ALPHABET P_w(\ALPHABET X)\) generated by the function class \(\F \subset \ALPHABET M_w(\ALPHABET X)\) is the pseudo-metric defined as \[ d_{\F}(μ_1, μ_2) = \sup_{f \in \F} \biggl| \int f d μ_1 - \int f d μ_2 \biggr|, \quad \forall μ_1, μ_2 \in \ALPHABET P_w(\ALPHABET X). \] The maximizing function \(f\) is called the witness function and the function class \(\F\) is called the generator.
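When \(\ALPHABET X\) is a finite set and the generator is a finite family of functions, the supremum in the definition is a maximum and can be computed directly. A minimal sketch (function and variable names are hypothetical):

```python
# Exact IPM on a finite space: the supremum over a finite generator F
# reduces to a maximum over witness functions.
def ipm(mu, nu, F):
    """mu, nu: dicts mapping points to probabilities; F: list of witnesses."""
    return max(abs(sum(p * f(x) for x, p in mu.items())
                   - sum(p * f(x) for x, p in nu.items())) for f in F)

mu = {0: 0.5, 1: 0.5}
nu = {0: 0.2, 1: 0.8}
# The Kolmogorov-Smirnov generator on {0, 1}: indicators of half-lines [t, oo).
F = [lambda x, t=t: 1.0 if x >= t else 0.0 for t in (0, 1)]
print(round(ipm(mu, nu, F), 6))  # 0.3 = |mu({1}) - nu({1})|
```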

An IPM is a pseudo-metric, i.e., it satisfies the following properties: for all \(μ_1, μ_2, μ_3 \in \ALPHABET P_w(\ALPHABET X)\)

  1. \(d_{\F}(μ_1, μ_1) = 0\)
  2. Symmetry. \(d_{\F}(μ_1, μ_2) = d_{\F}(μ_2, μ_1)\)
  3. Triangle inequality. \(d_{\F}(μ_1, μ_3) \le d_{\F}(μ_1, μ_2) + d_{\F}(μ_2, μ_3)\).

An IPM is a metric when, in addition, \(d_{\F}(μ_1, μ_2) = 0 \implies μ_1 = μ_2\).

Examples

  1. The Kolmogorov-Smirnov metric. The Kolmogorov-Smirnov metric on \(\reals\) is defined as \[ \text{KS}(μ, ν) = \sup_{t \in \reals} \ABS{ M(t) - N(t) } \] where \(M\) and \(N\) are CDFs corresponding to \(μ\) and \(ν\), respectively. It has the following properties:

    • \(\text{KS}(μ,ν) \in [0, 1]\)
    • \(\text{KS}(μ,ν) = 0\) iff \(μ = ν\)
    • \(\text{KS}(μ, ν) = 1\) if the supports of the two distributions are separated, i.e., there exists an \(s \in \reals\) such that \(μ( (-∞, s] ) = 1\) and \(ν((-∞, s]) = 0\)
    • \(\text{KS}\) is a metric

    The Kolmogorov-Smirnov metric is an IPM with \(\F = \{ \IND_{[t, ∞)} : t \in \reals \}\).

  2. Total variation metric. Let \((\ALPHABET S, \mathfrak S)\) be a measurable space. The total variation metric between two probability measures \(μ\) and \(ν\) on \((\ALPHABET S, \mathfrak S)\) is defined as \[ \text{TV}(μ,ν) = \sup_{B \in \mathfrak S} \ABS{ μ(B) - ν(B) }. \] It has the following properties:

    • \(\text{TV}(μ,ν) \in [0, 1]\)
    • \(\text{TV}(μ,ν) = 0\) iff \(μ = ν\)
    • \(\text{TV}(μ, ν) = 1\) iff there exists an event \(B \in \mathfrak S\) such that \(μ(B) = 1\) and \(ν(B) = 0\).
    • \(\text{TV}\) is a metric

    When \(\ALPHABET S = \reals^k\) and the measures \(μ\) and \(ν\) have densities \(f\) and \(g\) with respect to a common measure \(λ\), then:

    • \(\text{TV}(μ,ν) = \tfrac 12 \int \ABS{ f(x) - g(x) } λ(dx)\).
    • \(\text{TV}(μ,ν) = 1 - \int \min\{ f(x), g(x) \} λ(dx)\).
    • \(\text{TV}(μ,ν) = μ(B) - ν(B)\), where \(B = \{ x : f(x) \ge g(x) \}\).

    One may view the total variation metric as half the total variation norm of the signed measure \(μ - ν\). An implication of the first property is that total variation can be expressed as an IPM with the generator \[ \F = \{ f : \NORM{f}_{∞} \le \tfrac 12 \}. \]

  3. Kantorovich-Rubinstein metric or Wasserstein distance.

  4. Maximum mean discrepancy.

  5. Weighted total variation.
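The three density formulas for the total-variation metric can be spot-checked numerically on a finite space, where densities reduce to probability mass functions and \(λ\) is the counting measure. A sketch with hypothetical names:

```python
from itertools import chain, combinations

# Two pmfs on {0, 1, 2}; lambda is the counting measure, so densities = pmfs.
f = [0.2, 0.5, 0.3]
g = [0.4, 0.4, 0.2]
pts = range(len(f))

def events(xs):
    """All subsets of xs (the sigma-algebra of a finite space)."""
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

tv_sup = max(abs(sum(f[x] - g[x] for x in B)) for B in events(pts))  # sup over B
tv_l1  = 0.5 * sum(abs(a - b) for a, b in zip(f, g))                 # half L1 norm
tv_min = 1 - sum(min(a, b) for a, b in zip(f, g))                    # 1 - overlap
B_star = [x for x in pts if f[x] >= g[x]]                            # B = {f >= g}
tv_set = sum(f[x] - g[x] for x in B_star)
print(tv_sup, tv_l1, tv_min, tv_set)  # all four agree (~0.2)
```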

35.2 Maximal generators

So far, we haven’t imposed any restriction on the generator \(\F\). We now present a canonical representation of a generator.

Definition 35.2 For a function class \(\F \subset \ALPHABET M_w(\ALPHABET X)\), define the maximal generator \(\ALPHABET G_{\F}\) as \[ \ALPHABET G_{\F} = \biggl\{ f \in \ALPHABET M_w(\ALPHABET X) : \biggl| \int f d μ_1 - \int f d μ_2 \biggr| \le d_{\F}(μ_1, μ_2), \space \text{for all } μ_1, μ_2 \in \ALPHABET P_w(\ALPHABET X) \biggr\}. \]

Maximal generators have the following properties.

Proposition 35.1 Let \(\F \subset \ALPHABET M_w(\ALPHABET X)\) be an arbitrary generator. Then:

  1. \(\ALPHABET G_{\F}\) contains the convex hull of \(\F\).
  2. If \(f \in \ALPHABET G_{\F}\), then \(α f + β \in \ALPHABET G_{\F}\) for all \(|α| \le 1\) and \(β \in \reals\).
  3. If the sequence \(\{f_n\}_{n \ge 1} \in \ALPHABET G_{\F}\) converges uniformly to \(f\), then \(f \in \ALPHABET G_{\F}\).

Maximal generators also allow us to compare IPMs with different generators.

Balanced set

A set \(\ALPHABET S\) is said to be balanced if for all scalars \(a\) satisfying \(|a| \le 1\), we have \(a \ALPHABET S \subset \ALPHABET S\).

Proposition 35.2 Let \(\F \subset \mathfrak D \subset \ALPHABET M_w(\ALPHABET X)\) and \(μ_1, μ_2 \in \ALPHABET P_w(\ALPHABET X)\). Then

  1. \(d_{\F}(μ_1, μ_2) \le d_{\mathfrak D}(μ_1, μ_2)\).
  2. \(\ALPHABET G_{\F} \subset \ALPHABET G_{\mathfrak D}\).
  3. If \(\mathfrak D \subset \ALPHABET G_{\F}\) then \(d_{\F}\) and \(d_{\mathfrak D}\) are identical.
  4. If \(\F \subset \mathfrak D \subset \ALPHABET G_{\F}\) and \(\mathfrak D\) is convex and balanced, contains the constant functions, and is closed with respect to pointwise convergence, then \(\mathfrak D = \ALPHABET G_{\F}\).

Examples

  1. Kolmogorov-Smirnov metric. The maximal generator of the Kolmogorov-Smirnov metric is \[ \ALPHABET G_{\F} = \{ f : V(f) \le 1 \} \] where \(V(f)\) denotes the total variation of the function \(f\), defined as \[ V(f) = \sup_{P} \sum_{k=1}^{n} \ABS{ f(x_k) - f(x_{k-1}) }, \] where the supremum is over all finite sets of points \(P = \{x_0 < x_1 < \dots < x_n\}\) in \(\reals\).

  2. Total-variation metric. The maximal generator of the total-variation metric is \[ \ALPHABET G_{\F} = \{ f : \SPAN(f) \le 1 \} \] where \(\SPAN(f) = \sup_x f(x) - \inf_x f(x)\).

  3. Kantorovich-Rubinstein metric or Wasserstein distance.

  4. Maximum mean discrepancy.

  5. Weighted total variation.
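Under the convention \(\text{TV}(μ,ν) = \sup_B \ABS{μ(B) - ν(B)}\) used above, the supremum in the IPM over witnesses of span at most one is attained by the indicator of \(B = \{f \ge g\}\); adding a constant to a witness does not change its integral against \(μ - ν\), so witnesses may be taken with values in \([0,1]\). A sketch illustrating this (hypothetical names, random search in place of a formal argument):

```python
import random

f = [0.2, 0.5, 0.3]                      # pmf of mu
g = [0.4, 0.4, 0.2]                      # pmf of nu
tv = 0.5 * sum(abs(a - b) for a, b in zip(f, g))

# The indicator of B = {x : f(x) >= g(x)} has span 1 and attains TV...
ind_B = [1.0 if f[x] >= g[x] else 0.0 for x in range(3)]
attained = sum((f[x] - g[x]) * ind_B[x] for x in range(3))

# ...and a random search over other witnesses with span(w) <= 1 never beats it.
random.seed(0)
best = 0.0
for _ in range(10_000):
    w = [random.random() for _ in range(3)]   # values in [0, 1] => span(w) <= 1
    best = max(best, abs(sum((f[x] - g[x]) * w[x] for x in range(3))))
print(attained, best)  # attained ~ 0.2; best never exceeds it
```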

35.3 Minkowski functional or gauge of an IPM

We will assume that the generator \(\F\) is a maximal generator. In particular, \(\F\) is convex and balanced and contains the zero function. Define \((0,∞)\F \coloneqq \cup_{0 < r < ∞} r\F\), which is a subset of \(\ALPHABET M_w(\ALPHABET X)\).

For any function \(f \in \ALPHABET M_w(\ALPHABET X)\), define the Minkowski functional (or gauge) as follows: \[ ρ_{\F}(f) = \inf \bigl\{ ρ > 0 : f/ρ \in \F \bigr\}, \] where the infimum of an empty set is defined as infinity. Therefore, \(ρ_{\F}(f) < ∞\) if \(f \in (0,∞)\F\); otherwise \(ρ_{\F}(f) = ∞\).

In all of the above examples, the Minkowski functional is a semi-norm. This is, in fact, a general property.

Proposition 35.3 The Minkowski functional is a semi-norm, i.e., it satisfies the following properties:

  1. Non-negativity. \(ρ_{\F}(f) \ge 0\).
  2. Absolute homogeneity. \(ρ_{\F}(a f) = |a| ρ_{\F}(f)\) for all scalars \(a\).
  3. Subadditivity. \(ρ_{\F}(f_1 + f_2) \le ρ_{\F}(f_1) + ρ_{\F}(f_2)\).

Non-negativity holds by definition. Absolute homogeneity holds because \(\F\) is balanced, in particular \(\F = -\F\). Note that absolute homogeneity implies \(ρ_{\F}(0) = 0\).

To see why subadditivity holds, let \(ρ_1 = ρ_{\F}(f_1)\) and \(ρ_2 = ρ_{\F}(f_2)\), and assume \(ρ_1, ρ_2 \in (0,∞)\) (the degenerate cases are straightforward). Since \(\F\) is closed, \(f_1/ρ_1, f_2/ρ_2 \in \F\), and since \(\F\) is convex, we have for any \(λ \in [0,1]\), \[ λ \frac{f_1}{ρ_1} + (1-λ) \frac{f_2}{ρ_2} \in \F. \] Take \(λ = ρ_1/(ρ_1 + ρ_2)\), which implies \((f_1 + f_2)/(ρ_1 + ρ_2) \in \F\). Thus, \(ρ_{\F}(f_1 + f_2) \le ρ_1 + ρ_2\).
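When \(\F\) is the unit ball of a norm, the Minkowski functional recovers that norm; for instance, for the sup-norm ball \(\{f : \NORM{f}_∞ \le 1\}\) on a finite space the gauge is \(\NORM{f}_∞\) itself. A quick numerical spot-check of homogeneity and subadditivity (hypothetical names):

```python
# Gauge of the sup-norm unit ball F = {f : ||f||_inf <= 1} on a finite space:
# rho(f) = inf{rho > 0 : f / rho in F} = max_x |f(x)|.
def gauge(f):
    return max(abs(v) for v in f)

f1 = [0.5, -2.0, 1.0]
f2 = [1.5, 1.0, -0.5]
r1, r2 = gauge(f1), gauge(f2)
r_sum = gauge([a + b for a, b in zip(f1, f2)])

print(r_sum <= r1 + r2)                      # subadditivity: True
print(gauge([3 * v for v in f1]) == 3 * r1)  # absolute homogeneity: True
```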

Proposition 35.4 For any \(f \in (0,∞)\F\) such that \(ρ_{\F}(f) \neq 0\) and any \(μ_1, μ_2 \in \ALPHABET P_w(\ALPHABET X)\), we have \[ \left| \int f d μ_1 - \int f d μ_2 \right| \le ρ_{\F}(f) d_{\F}(μ_1, μ_2). \]

By the assumptions imposed on \(f\), we know that \(ρ_{\F}(f) \in (0,∞)\). Define \(f' = f/ρ_{\F}(f) \in \F\). Then, \[ \left| \int f d μ_1 - \int f d μ_2 \right| = ρ_{\F}(f) \left| \int f' d μ_1 - \int f' d μ_2 \right| \le ρ_{\F}(f) d_{\F}(μ_1, μ_2) \] where the last inequality uses the fact that \(f' \in \F\).

Why care about maximal generators

Proposition 35.2 implies that the maximal generator of \(\F\) is the smallest superset of \(\F\) that is convex and balanced, contains the constant functions, and is closed with respect to pointwise convergence. The IPM distances with respect to a generator and its maximal generator are the same, but the Minkowski functionals are not! In particular, if \(\F \subset \mathfrak D\), then \(ρ_{\F}(f) \ge ρ_{\mathfrak D}(f)\). So, working with the maximal generator gives us the tightest version of the bound in Proposition 35.4.

For example, we know that \(\{ f : \NORM{f}_{∞} \le \tfrac 12 \}\) is a generator for the total-variation distance, and its Minkowski functional is \(2\NORM{f}_{∞}\). Using this, the result of Proposition 35.4 implies that \[\begin{equation}\label{eq:IPM-TV-1} \left| \int f d μ_1 - \int f d μ_2 \right| \le 2\NORM{f}_{∞} \text{TV}(μ_1, μ_2). \end{equation}\] Note that the left hand side is invariant to addition of a constant to \(f\), but the right hand side is not. Therefore, this bound cannot be tight.

Now consider the maximal generator for the total-variation distance, \(\{ f : \SPAN(f) \le 1 \}\), whose Minkowski functional is \(\SPAN(f)\). Using this, the result of Proposition 35.4 implies that \[\begin{equation}\label{eq:IPM-TV-2} \left| \int f d μ_1 - \int f d μ_2 \right| \le \SPAN(f)\,\text{TV}(μ_1, μ_2). \end{equation}\] In this case, the right hand side is also invariant to addition of a constant. Moreover, \(\tfrac 12 \SPAN(f) \le \NORM{f}_{∞}\) (see Exercise 35.1). Therefore, the bound of \(\eqref{eq:IPM-TV-2}\) is tighter than that of \(\eqref{eq:IPM-TV-1}\).
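A numerical illustration of the two bounds, again with the convention \(\text{TV}(μ,ν) = \sup_B \ABS{μ(B) - ν(B)}\): for a function with a large constant offset, the span-based bound \(\SPAN(f)\,\text{TV}\) is dramatically tighter than the sup-norm bound \(2\NORM{f}_∞\text{TV}\). A sketch with hypothetical names:

```python
mu = [0.2, 0.5, 0.3]
nu = [0.4, 0.4, 0.2]
tv = 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))    # = sup_B |mu(B) - nu(B)|

f = [100.0, 101.0, 100.5]     # small span, large constant offset
lhs = abs(sum(p * v for p, v in zip(mu, f)) - sum(p * v for p, v in zip(nu, f)))
sup_bound  = 2 * max(abs(v) for v in f) * tv          # 2 ||f||_inf TV
span_bound = (max(f) - min(f)) * tv                   # span(f) TV
print(lhs, span_bound, sup_bound)  # lhs <= span_bound << sup_bound
```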

Examples

  1. Kolmogorov-Smirnov metric. The Minkowski functional of a function \(f\) for the Kolmogorov-Smirnov metric is its total variation \(V(f)\). Thus, for any function with bounded total variation: \[ \biggl| \int f d μ - \int f dν \biggr| \le \text{KS}(μ,ν) V(f). \]

  2. Total variation metric. The Minkowski functional of a function \(f\) for the total variation metric is \(\SPAN(f)\). Thus, for any function with bounded span: \[ \biggl| \int f d μ - \int f dν \biggr| \le \text{TV}(μ,ν) \SPAN(f). \]

  3. Kantorovich-Rubinstein metric or Wasserstein distance.

  4. Maximum mean discrepancy.

  5. Weighted total variation.

35.4 Properties of IPMs

Definition 35.3 Let \((\ALPHABET S, d)\) be a metric vector space, \(w \colon \ALPHABET S \to [1, ∞)\) a weight function, and \(d_{\F}\) any semi-metric on \(\ALPHABET P_w(\ALPHABET S)\). Then, \(d_{\F}\) is said to have

  • Property (R) if \(d_{\F}(δ_a, δ_b) = d(a,b)\) for all \(a, b \in \ALPHABET S\).
  • Property (M) if \(d_{\F}(aX, aY) = a\, d_{\F}(X,Y)\) for all scalars \(a > 0\), where \(aX\) denotes the law of the scaled random variable.
  • Property (C) if \(d_{\F}(P_1 * Q, P_2 * Q) \le d_{\F}(P_1, P_2)\) for all probability measures \(Q\), where \(*\) denotes convolution.

The next result is from Müller (1997), Theorem 4.7.

Suppose \(d_{\F}\) is an IPM with a maximal generator \(\ALPHABET G_{\F}\). Then

  1. Property (R) holds if and only if \[ \sup_{f \in \F} \frac{ \ABS{ f(x) - f(y) } }{ d(x,y) } = 1 , \quad \text{for all } x,y \in \ALPHABET S, x \neq y. \]
  2. Property (C) holds if and only if \(\ALPHABET G_{\F}\) is invariant under translations, i.e., \(f \in \ALPHABET G_{\F}\) implies \(f(\cdot + x) \in \ALPHABET G_{\F}\) for all \(x \in \ALPHABET S\).
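Property (C) can be spot-checked for the total-variation metric on the integers, where convolving both measures with a common kernel \(Q\) corresponds to adding an independent noise variable. A sketch (names are hypothetical):

```python
def conv(p, q):
    """Convolution of two pmfs on {0, 1, ...} given as lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def tv(p, q):
    """TV = sup_B |mu(B) - nu(B)| = half the L1 distance between pmfs."""
    n = max(len(p), len(q))
    p = p + [0.0] * (n - len(p))
    q = q + [0.0] * (n - len(q))
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

P1, P2 = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]
Q = [0.5, 0.25, 0.25]      # common convolution kernel
print(tv(conv(P1, Q), conv(P2, Q)) <= tv(P1, P2))  # True: smoothing contracts TV
```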

Exercises

Exercise 35.1 Show that for any function \(f\): \[ \tfrac 12 \SPAN(f) \le \NORM{f}_{∞}. \]

Hint: Suppose \(f\) is such that \(\max(f) = - \min(f)\), i.e., the function is centered at \(0\). Then, the inequality holds with equality. What can you say when the function is not centered at \(0\)?

Notes

IPMs were first defined by Zolotarev (1984), where they were called probability metrics with \(ζ\)-structure. They are studied in detail in Rachev (1991). The term IPM was coined by Müller (1997). The discussion here is mostly borrowed from Müller (1997).