36  Integral Probability Metrics

Updated

July 19, 2024

Keywords

IPM

Integral probability metrics (IPMs) are a family of pseudo-metrics on spaces of probability measures: the distance between two measures is the largest discrepancy between the integrals of a test function under the two measures, as the test function ranges over a chosen function class. Many standard distances, including the Kolmogorov-Smirnov metric, total variation, and the Wasserstein distance, arise as special cases. This section defines IPMs, characterizes their maximal generators, and uses the associated Minkowski functional to obtain change-of-measure bounds.

36.1 Definition of Integral Probability Metrics

Notation

  1. Let \((\ALPHABET X,d)\) be a complete separable metric space, i.e., a Polish space.

  2. Let \(w \colon \ALPHABET X \to [1,∞)\) be a measurable function and define the weighted norm of any function \(f \colon \ALPHABET X \to \reals\) as \[ \NORM{f}_w = \sup_{x \in \ALPHABET X} \frac{\ABS{f(x)}}{w(x)}. \]

  3. Let \(\ALPHABET M(\ALPHABET X)\) denote the set of all measurable real-valued functions on \(\ALPHABET X\) and let \(\ALPHABET M_w(\ALPHABET X)\) denote the subset with bounded \(w\)-norm, i.e., \[ \ALPHABET M_w(\ALPHABET X) = \{ f \in \ALPHABET M(\ALPHABET X) : \NORM{f}_w < ∞ \} \]

  4. Let \(\mathfrak X\) denote the Borel \(σ\)-algebra of \(\ALPHABET X\).

  5. Let \(\ALPHABET P(\ALPHABET X)\) denote the set of all probability measures on \((\ALPHABET X, \mathfrak X)\) and \(\ALPHABET P_w(\ALPHABET X) = \{ μ \in \ALPHABET P(\ALPHABET X) : \int w dμ < ∞ \}\).

\(\def\F{\mathfrak F}\) Integral probability metrics (IPMs) are pseudo-metrics on \(\ALPHABET P_w(\ALPHABET X)\) defined as follows.

Definition 36.1 The integral probability metric \(d_{\F}\) generated by the function class \(\F \subset \ALPHABET M_w(\ALPHABET X)\) is the pseudo-metric on \(\ALPHABET P_w(\ALPHABET X)\) defined as \[ d_{\F}(μ_1, μ_2) = \sup_{f \in \F} \biggl| \int f d μ_1 - \int f d μ_2 \biggr|, \quad \forall μ_1, μ_2 \in \ALPHABET P_w(\ALPHABET X). \] The functions \(f \in \F\) are called witness functions and the function class \(\F\) is called the generator.

An IPM is a pseudo-metric, i.e., it satisfies the following properties: for all \(μ_1, μ_2, μ_3 \in \ALPHABET P_w(\ALPHABET X)\)

  1. Reflexivity. \(d_{\F}(μ_1, μ_1) = 0\)
  2. Symmetry. \(d_{\F}(μ_1, μ_2) = d_{\F}(μ_2, μ_1)\)
  3. Triangle inequality. \(d_{\F}(μ_1, μ_3) \le d_{\F}(μ_1, μ_2) + d_{\F}(μ_2, μ_3)\).

An IPM is a metric when, in addition, \(d_{\F}(μ_1, μ_2) = 0 \implies μ_1 = μ_2\).
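
Before turning to standard examples, it may help to see Definition 36.1 computed by brute force in the simplest possible setting: a finite space and a finite generator, where each integral reduces to a dot product. The following is a minimal sketch, assuming numpy is available; the space, the measures, and the witness functions are arbitrary illustrative choices.

```python
# Toy computation of an IPM (Definition 36.1) on a finite space.
# A minimal sketch, assuming numpy; all inputs are illustrative.
import numpy as np

X = np.array([0.0, 1.0, 2.0])       # a finite state space
mu1 = np.array([0.5, 0.3, 0.2])     # two probability measures on X
mu2 = np.array([0.2, 0.3, 0.5])

# A finite generator: each row is one witness function f evaluated on X.
F = np.array([
    [0.0, 1.0, 1.0],   # indicator of [1, oo)
    [0.0, 0.0, 1.0],   # indicator of [2, oo)
    [0.0, 0.5, 1.0],   # a (1/2)-Lipschitz witness w.r.t. |x - y|
])

def ipm(F, mu1, mu2):
    """d_F(mu1, mu2) = max over witnesses of |int f dmu1 - int f dmu2|."""
    return np.max(np.abs(F @ (mu1 - mu2)))

print(ipm(F, mu1, mu2))             # 0.3 for the choices above
```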

Examples

  1. The Kolmogorov-Smirnov metric. The Kolmogorov-Smirnov metric on \(\reals\) is defined as \[ \text{KS}(μ, ν) = \sup_{t \in \reals} \ABS{ M(t) - N(t) } \] where \(M\) and \(N\) are CDFs corresponding to \(μ\) and \(ν\), respectively. It has the following properties:

    • \(\text{KS}(μ,ν) \in [0, 1]\)
    • \(\text{KS}(μ,ν) = 0\) iff \(μ = ν\)
    • \(\text{KS}(μ, ν) = 1\) iff the supports of the two distributions are separated, i.e., there exists an \(s \in \reals\) such that \(μ( (-∞, s] ) = 1\) and \(ν((-∞, s]) = 0\), or vice versa
    • \(\text{KS}\) is a metric

    The Kolmogorov-Smirnov metric is an IPM with \(\F = \{ \IND_{[t, ∞)} : t \in \reals \}\) (see the numerical sketch following these examples).

  2. Total variation metric. Let \((\ALPHABET S, \mathfrak S)\) be a measurable space. The total variation metric between two probability measures \(μ\) and \(ν\) on \((\ALPHABET S, \mathfrak S)\) is defined as \[ \text{TV}(μ,ν) = \sup_{B \in \mathfrak S} \ABS{ μ(B) - ν(B) }. \] It has the following properties:

    • \(\text{TV}(μ,ν) \in [0, 1]\)
    • \(\text{TV}(μ,ν) = 0\) iff \(μ = ν\)
    • \(\text{TV}(μ, ν) = 1\) iff there exists an event \(B \in \mathfrak S\) such that \(μ(B) = 1\) and \(ν(B) = 0\).
    • \(\text{TV}\) is a metric

    When \(\ALPHABET S = \reals^k\) and the measures \(μ\) and \(ν\) have densities \(f\) and \(g\) with respect to a common measure \(λ\), then:

    • \(\text{TV}(μ,ν) = \tfrac 12 \int \ABS{ f(x) - g(x) } λ(dx)\).
    • \(\text{TV}(μ,ν) = 1 - \int \min\{ f(x), g(x) \} λ(dx)\).
    • \(\text{TV}(μ,ν) = μ(B) - ν(B)\), where \(B = \{ x : f(x) \ge g(x) \}\).

    One may view the total variation metric as one half of the total variation norm of the signed measure \(μ - ν\). An implication of the first property is that total variation can be expressed (up to this factor of two, which we ignore in what follows) as an IPM with the generator \[ \F = \{ f : \NORM{f}_{∞} \le 1 \}. \]

  3. Kantorovich-Rubinstein metric or Wasserstein distance. Let \((\ALPHABET S, \mathfrak S)\) be a measurable space and \(d_{\ALPHABET S}\) be a metric on \(\ALPHABET S\). Let \(p \in [1, ∞)\). Then the Wasserstein \(p\)-distance between two measures \(μ\) and \(ν\) on \((\ALPHABET S, \mathfrak S)\) is defined as \[ \text{Was}_p(μ,ν) = \left(\inf_{γ \in Γ(μ,ν)} \int_{\ALPHABET S × \ALPHABET S} d_{\ALPHABET S}(x,y)^p \, d γ(x,y) \right)^{1/p} \] where \(Γ(μ,ν)\) is the set of all couplings (i.e., joint measures on \((\ALPHABET S \times \ALPHABET S, \mathfrak S \otimes \mathfrak S)\) with marginals \(μ\) and \(ν\)). A compact way to write this is \[ \text{Was}_p(μ,ν) = \left( \inf_{γ \in Γ(μ,ν)} \EXP_{(X,Y) \sim γ} \bigl[ d_{\ALPHABET S}(X,Y)^p \bigr] \right)^{1/p} \]

    In this section, we focus on \(\text{Was}_1\) distance (i.e., \(p = 1\)), and will drop the subscript \(1\) for convenience and write \[ \text{Was}(μ,ν) = \inf_{γ \in Γ(μ,ν)} \EXP_{(X,Y) \sim γ} \bigl[ d_{\ALPHABET S}(X,Y) \bigr]. \]

    Theorem 36.1 (Kantorovich-Rubinstein duality) The Wasserstein distance is equivalent to the following: \[ \text{Was}(μ,ν) = \sup_{f : \| f \|_{L} \le 1} \Bigl| \EXP_{X \sim μ}[ f(X) ] - \EXP_{Y \sim ν}[ f(Y) ] \Bigr| \] where \(\| f \|_{L} = \sup_{x \neq y} \ABS{f(x) - f(y)}/d_{\ALPHABET S}(x,y)\) denotes the Lipschitz seminorm of \(f\).

    The original definition is an (infinite-dimensional) linear program: \(\EXP_{(X,Y) \sim γ}\bigl[ d_{\ALPHABET S}(X,Y) \bigr]\) is a linear function of \(γ\) and the set of all couplings \(Γ(μ,ν)\) is a convex polytope. Taking its dual, we get \[ \text{Was}(μ,ν) = \sup_{f, g \colon \ALPHABET S \to \reals} \bigl[ \EXP_{X \sim μ}[ f(X) ] + \EXP_{Y \sim ν}[ g(Y) ] \bigr] \quad\text{such that}\quad f(x) + g(y) \le d(x,y), \quad \forall x, y \in \ALPHABET S. \]

    Similar in spirit to the Legendre transform, define the \(d_{\ALPHABET S}\)-dual of the function \(g\) as \[ g^*(x) = \inf_{u \in \ALPHABET S} \biggl[ d(x,u) - g(u) \biggr]. \] Observe that, by the triangle inequality, for any \(u\) and \(y\), \[ g^*(x) \le d(x,u) - g(u) \le d(x,y) + d(y,u) - g(u). \] Therefore, \[ g^*(x) \le d(x,y) + \inf_{u \in \ALPHABET S} \biggl[ d(y,u) - g(u) \biggr] = d(x,y) + g^*(y). \] Thus, \(g^*(x) - g^*(y) \le d_{\ALPHABET S}(x,y)\). Interchanging the roles of \(x\) and \(y\), we get \[ | g^*(x) - g^*(y) | \le d_{\ALPHABET S}(x,y). \] Thus, \(\| g^* \|_{L} \le 1\).

    Now consider any \(f\) and \(g\) that satisfy the constraint \(f(x) + g(y) \le d(x,y)\) for all \(x, y\). For any \(x\), let \(u^*\) be a value of \(u\) that achieves the infimum in the definition of \(g^*(x)\) (assuming, for simplicity, that the infimum is attained; otherwise repeat the argument along a minimizing sequence). Then, using the constraint with \(y = u^*\), \[ g^*(x) = d(x,u^*) - g(u^*) \ge f(x). \] Moreover, taking \(u = x\) in the definition of \(g^*\), \[ g^*(x) \le d(x,x) - g(x) = -g(x). \] Thus, for any \(f\) and \(g\) that satisfy the constraint, we have \[\begin{align} \text{Was}(μ,ν) &= \sup_{ f, g : f(x) + g(y) \le d_{\ALPHABET S}(x,y) } \bigl[ \EXP_{X \sim μ}[ f(X) ] + \EXP_{Y \sim ν}[ g(Y) ] \bigr] \notag \\ &\le \sup_{ g^* } \bigl[ \EXP_{X \sim μ}[ g^*(X) ] - \EXP_{Y \sim ν}[ g^*(Y) ] \bigr] \notag \\ &\le \sup_{ f : \|f\|_{L} \le 1} \bigl[ \EXP_{X \sim μ}[ f(X) ] - \EXP_{Y \sim ν}[ f(Y) ] \bigr] \label{eq:lower-bound} \end{align}\] where the last inequality follows because we have shown that \(\|g^*\|_{L} \le 1\).

    Now observe that for any \(f\) such that \(\|f\|_{L} \le 1\), we have \(f(x) - f(y) \le d(x,y)\). Hence, for any coupling \(γ \in Γ(μ,ν)\), we have \[ \EXP_{(X,Y) \sim γ}[ d(X,Y) ] \ge \EXP_{(X,Y) \sim γ}[ f(X) - f(Y) ] = \EXP_{X \sim μ}[ f(X) ] - \EXP_{Y \sim ν} [ f(Y) ]. \] Thus, for any function \(f\) such that \(\|f\|_{L} \le 1\), we have \[ \EXP_{X \sim μ}[ f(X) ] - \EXP_{Y \sim ν} [ f(Y) ] \le \inf_{γ \in Γ(μ,ν)} \EXP_{(X,Y) \sim γ}[ d(X,Y) ] = \text{Was}(μ,ν). \] Consequently, \[\begin{equation}\label{eq:upper-bound} \sup_{ f : \|f\|_{L} \le 1} \bigl[ \EXP_{X \sim μ}[ f(X) ] - \EXP_{Y \sim ν}[ f(Y) ] \bigr] \le \text{Was}(μ,ν). \end{equation}\]

    The result then follows from \(\eqref{eq:lower-bound}\) and \(\eqref{eq:upper-bound}\).

    Remark about absolute value

    Note that in the proof, we have shown that \[ \text{Was}(μ,ν) = \sup_{f : \| f \|_{L} \le 1} \EXP_{X \sim μ}[ f(X) ] - \EXP_{Y \sim ν}[ f(Y) ] \] but in Theorem 36.1, we use the absolute value on the right-hand side. This is because (using a slightly terse notation) \[ \left| \int f d μ - \int f d ν \right| \text{ is either} \int f d μ - \int f d ν \text{ or } \int (-f) d μ - \int (-f) d ν \] and both \(f\) and \(-f\) have the same Lipschitz constant. So \[ \sup_{f : \|f\|_L \le 1} \left[ \int f d μ - \int f d ν \right] = \sup_{f : \|f\|_L \le 1} \left| \int f d μ - \int f d ν \right| \]

    Theorem 36.1 implies that the Wasserstein distance can be expressed as an IPM with the generator \[ \mathfrak F = \{ f : \| f\|_{L} \le 1 \}. \]

  4. Maximum-mean discrepancy.

    TBW

  5. Weighted total variation.

    TBW
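
The first three examples can be checked numerically. The sketch below (assuming numpy and scipy are available; the samples, pmfs, and support points are arbitrary illustrative choices) computes the Kolmogorov-Smirnov, total variation, and Wasserstein-1 distances, and verifies Theorem 36.1 on a finite metric space by solving the primal and dual linear programs directly.

```python
# Numerical companion to Examples 1-3: a sketch, assuming numpy and scipy.
import numpy as np
from scipy import stats
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Kolmogorov-Smirnov and Wasserstein-1 between two empirical samples.
x = rng.normal(0.0, 1.0, size=1000)
y = rng.normal(0.5, 1.5, size=1000)
ks = stats.ks_2samp(x, y).statistic       # sup_t |M(t) - N(t)|
was = stats.wasserstein_distance(x, y)    # equals int |M(t) - N(t)| dt in 1-D

# Total variation between two pmfs p, q on a common finite support.
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
tv = 0.5 * np.abs(p - q).sum()            # (1/2) int |f - g| dlambda
B = p >= q                                # the optimal event B = {f >= g}
assert np.isclose(tv, p[B].sum() - q[B].sum())
assert np.isclose(tv, 1.0 - np.minimum(p, q).sum())

# Kantorovich-Rubinstein duality on a finite metric space (Theorem 36.1).
pts = np.array([0.0, 1.0, 3.0])
n = len(pts)
d = np.abs(pts[:, None] - pts[None, :])   # cost matrix d(x_i, x_j)

# Primal: minimize <d, gamma> over couplings (row sums p, column sums q).
A_eq = np.vstack([np.kron(np.eye(n), np.ones(n)),    # row-sum constraints
                  np.kron(np.ones(n), np.eye(n))])   # column-sum constraints
primal = linprog(d.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                 bounds=(0, None))

# Dual: maximize <p, f> + <q, g> subject to f(x_i) + g(x_j) <= d(x_i, x_j).
A_ub = np.hstack([np.kron(np.eye(n), np.ones((n, 1))),
                  np.kron(np.ones((n, 1)), np.eye(n))])
dual = linprog(-np.concatenate([p, q]), A_ub=A_ub, b_ub=d.ravel(),
               bounds=(None, None))

# Strong duality: primal value = dual value = Was_1(p, q).
assert np.isclose(primal.fun, -dual.fun)
assert np.isclose(primal.fun, stats.wasserstein_distance(pts, pts, p, q))
```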

36.2 Maximal generators

So far, we haven’t imposed any restriction on the generator \(\F\). We now present a canonical representation of a generator.

Definition 36.2 For a function class \(\F \subset \ALPHABET M_w(\ALPHABET X)\), define the maximal generator \(\ALPHABET G_{\F}\) as \[ \ALPHABET G_{\F} = \biggl\{ f \in \ALPHABET M_w(\ALPHABET X) : \biggl| \int f d μ_1 - \int f d μ_2 \biggr| \le d_{\F}(μ_1, μ_2), \quad \text{for all } μ_1, μ_2 \in \ALPHABET P_w(\ALPHABET X) \biggr\}. \]

Maximal generators have the following properties.

Proposition 36.1 Let \(\F \subset \ALPHABET M_w(\ALPHABET X)\) be an arbitrary generator. Then:

  1. \(\ALPHABET G_{\F}\) contains the convex hull of \(\F\).
  2. If \(f \in \ALPHABET G_{\F}\), then \(α f + β \in \ALPHABET G_{\F}\) for all \(|α| \le 1\) and \(β \in \reals\).
  3. If the sequence \(\{f_n\}_{n \ge 1} \in \ALPHABET G_{\F}\) converges uniformly to \(f\), then \(f \in \ALPHABET G_{\F}\).

Maximal generators also allow us to compare IPMs with different generators.

Balanced set

A set \(\ALPHABET S\) is said to be balanced if for all scalars \(a\) satisfying \(|a| \le 1\), we have \(a \ALPHABET S \subset \ALPHABET S\).

Proposition 36.2 Let \(\F \subset \mathfrak D \subset \ALPHABET M_w(\ALPHABET X)\) and \(μ_1, μ_2 \in \ALPHABET P_w(\ALPHABET X)\). Then

  1. \(d_{\F}(μ_1, μ_2) \le d_{\mathfrak D}(μ_1, μ_2)\).
  2. \(\ALPHABET G_{\F} \subset \ALPHABET G_{\mathfrak D}\).
  3. If \(\mathfrak D \subset \ALPHABET G_{\F}\) then \(d_{\F}\) and \(d_{\mathfrak D}\) are identical (see the numerical sketch after this list).
  4. If \(\F \subset \mathfrak D \subset \ALPHABET G_{\F}\) and \(\mathfrak D\) is convex and balanced, contains the constant functions, and is closed with respect to pointwise convergence, then \(\mathfrak D = \ALPHABET G_{\F}\).
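Item 3 can be illustrated numerically for the Kolmogorov-Smirnov metric, whose maximal generator is the class \(\{ f : V(f) \le 1 \}\) discussed in the next example: randomly sampled witnesses from the larger class never separate two measures by more than the indicator witnesses do. A sketch, assuming numpy; the grid and the random measures are illustrative.

```python
# Check of Proposition 36.2 (item 3) for the Kolmogorov-Smirnov metric:
# witnesses of variation at most 1 never exceed the KS distance.
# A sketch, assuming numpy; all inputs are illustrative.
import numpy as np

rng = np.random.default_rng(1)

grid = np.linspace(-3, 3, 50)              # common finite support
p = rng.dirichlet(np.ones(grid.size))      # two random pmfs on the grid
q = rng.dirichlet(np.ones(grid.size))

# d_F with F = {indicators of [t, oo)}: the Kolmogorov-Smirnov distance.
ks = np.max(np.abs(np.cumsum(p - q)))

# Sample witnesses from D = {f : V(f) <= 1} (piecewise constant on the grid):
# increments with total absolute sum <= 1, plus an arbitrary constant offset.
for _ in range(1000):
    incr = rng.normal(size=grid.size - 1)
    incr /= max(1.0, np.abs(incr).sum())   # enforce V(f) <= 1
    f = rng.normal() + np.concatenate([[0.0], np.cumsum(incr)])
    assert abs(f @ (p - q)) <= ks + 1e-12  # never larger than d_F
```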

Examples

  1. Kolmogorov-Smirnov metric. The maximal generator of the Kolmogorov-Smirnov metric is \[ \ALPHABET G_{\F} = \{ f : V(f) \le 1 \} \] where \(V(f)\) denotes the total variation of the function \(f\), which is defined as \[ V(f) = \sup_{P} \sum_{k=1}^{n} \ABS{ f(x_k) - f(x_{k-1}) } \] where the supremum is over all finite partitions \(P = \{x_0 < x_1 < \dots < x_n\}\) of \(\reals\).

  2. Total-variation metric. The maximal generator of the total-variation metric is \[ \ALPHABET G_{\F} = \{ f : \tfrac 12 \SPAN(f) \le 1 \} \] where \(\SPAN(f) = \sup_{x} f(x) - \inf_{x} f(x)\).

  3. Kantorovich-Rubinstein metric or Wasserstein distance.

    The generator of Wasserstein distance given above is actually a maximal generator. Thus, \[ \ALPHABET G_{\F} = \{ f : \| f\|_{L} \le 1 \} \]

  4. Maximum-mean discrepancy.

  5. Weighted total variation.

36.3 Minkowski functional or gauge of an IPM

We will assume that the generator \(\F\) is the maximal generator. In particular, this means that \(\F\) is convex and balanced and that \(0 \in \F\). Define \((0,∞)\F \coloneqq \cup_{0 < r < ∞} r\F\), which is a subset of \(\ALPHABET M_w(\ALPHABET X)\).

For any function \(f \in \ALPHABET M_w(\ALPHABET X)\) define the Minkowski functional (or gauge) as follows: \[ ρ_{\F}(f) = \inf \bigl\{ ρ > 0 : f/ρ \in \F \bigr\} \] where the infimum of an empty set is defined to be infinite. Therefore, \(ρ_{\F}(f) < ∞\) if \(f \in (0,∞)\F\); otherwise \(ρ_{\F}(f) = ∞\).
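
As a quick sanity check of this definition, the gauge can be computed numerically by bisection over \(ρ\). The sketch below (assuming numpy; the test function is arbitrary) verifies that the gauge of the sup-norm ball \(\{ f : \NORM{f}_{∞} \le 1 \}\) recovers \(\NORM{f}_{∞}\) itself.

```python
# Minkowski functional by bisection: a sketch, assuming numpy.
# For F = {f : ||f||_inf <= 1}, the gauge rho_F(f) equals ||f||_inf.
import numpy as np

def gauge(f, in_F, lo=1e-9, hi=1e9, tol=1e-10):
    """rho_F(f) = inf{rho > 0 : f/rho in F}, assuming f/rho in F for rho large."""
    if not in_F(f / hi):
        return np.inf                      # f is not in (0, oo) F
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if in_F(f / mid) else (mid, hi)
    return hi

f = np.array([0.3, -1.2, 0.7])
in_sup_ball = lambda g: np.abs(g).max() <= 1.0
assert np.isclose(gauge(f, in_sup_ball), np.abs(f).max(), atol=1e-6)
```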

In all the above examples, the Minkowski functional is a semi-norm; this is in fact a general property.

Proposition 36.3 The Minkowski functional is a semi-norm, i.e., it satisfies the following properties:

  1. Non-negativity. \(ρ_{\F}(f) \ge 0\).
  2. Absolute homogeneity. \(ρ_{\F}(a f) = |a| ρ_{\F}(f)\) for all scalars \(a\).
  3. Subadditivity. \(ρ_{\F}(f_1 + f_2) \le ρ_{\F}(f_1) + ρ_{\F}(f_2)\).

Non-negativity holds by definition. Absolute homogeneity holds because \(\F\) is balanced, in particular \(\F = -\F\). Note that absolute homogeneity implies \(ρ_{\F}(0) = 0\).

To see why subadditivity holds, let \(ρ_1 = ρ_{\F}(f_1)\) and \(ρ_2 = ρ_{\F}(f_2)\), and assume that both are positive and finite (the remaining cases are immediate or follow from a limiting argument). Since \(\F\) is convex and \(f_1/ρ_1, f_2/ρ_2 \in \F\) (using the closedness of the maximal generator), we have for any \(λ \in [0,1]\), \[ λ \frac{f_1}{ρ_1} + (1-λ) \frac{f_2}{ρ_2} \in \F. \] Take \(λ = ρ_1/(ρ_1 + ρ_2)\), which implies \((f_1 + f_2)/(ρ_1 + ρ_2) \in \F\). Thus, \(ρ_{\F}(f_1 + f_2) \le ρ_1 + ρ_2\).

Proposition 36.4 For any \(f \in (0,∞)\F\) such that \(ρ_{\F}(f) \neq 0\) and any \(μ_1, μ_2 \in \ALPHABET P_w(\ALPHABET X)\), we have \[ \left| \int f d μ_1 - \int f d μ_2 \right| \le ρ_{\F}(f) d_{\F}(μ_1, μ_2). \]

By the assumptions imposed on \(f\), we know that \(ρ_{\F}(f) \in (0,∞)\). Define \(f' = f/ρ_{\F}(f) \in \F\). Then, \[ \left| \int f d μ_1 - \int f d μ_2 \right| = ρ_{\F}(f) \left| \int f' d μ_1 - \int f' d μ_2 \right| \le ρ_{\F}(f) d_{\F}(μ_1, μ_2) \] where the last inequality uses the fact that \(f' \in \F\).

Why care about maximal generators

Proposition 36.2 implies that the maximal generator of \(\F\) is the smallest superset of \(\F\) that is convex and balanced, contains the constant functions, and is closed with respect to pointwise convergence. The IPM distances with respect to a generator and its maximal generator are the same, but the corresponding Minkowski functionals are not! In particular, if \(\F \subset \mathfrak D\), then \(ρ_{\F}(f) \ge ρ_{\mathfrak D}(f)\). So, working with the maximal generator gives the tightest version of the bound in Proposition 36.4.

For example, we know that \(\{ f : \NORM{f}_{∞} \le 1 \}\) is a generator for the total-variation distance. Using this generator, Proposition 36.4 implies that \[\begin{equation}\label{eq:IPM-TV-1} \left| \int f d μ_1 - \int f d μ_2 \right| \le \NORM{f}_{∞} \text{TV}(μ_1, μ_2). \end{equation}\] Note that the left-hand side is invariant to adding a constant to \(f\), but the right-hand side is not. Therefore, this bound cannot be tight.

Now consider the maximal generator of the total-variation distance, \(\{ f : \tfrac 12 \SPAN(f) \le 1 \}\). Using it, Proposition 36.4 implies that \[\begin{equation}\label{eq:IPM-TV-2} \left| \int f d μ_1 - \int f d μ_2 \right| \le \tfrac 12 \SPAN(f)\,\text{TV}(μ_1, μ_2). \end{equation}\] In this case, the right-hand side is also invariant to adding a constant to \(f\). Moreover, \(\tfrac 12 \SPAN(f) \le \NORM{f}_{∞}\) (see Exercise 36.1). Therefore, the bound of \(\eqref{eq:IPM-TV-2}\) is tighter than that of \(\eqref{eq:IPM-TV-1}\).
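
The comparison can be seen concretely in a small numerical experiment (a sketch, assuming numpy; the measures and the witness are arbitrary illustrative choices): adding a constant to \(f\) leaves the left-hand side and the span unchanged while inflating the sup norm.

```python
# Why the span-based bound is tighter: a sketch, assuming numpy.
import numpy as np

rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(5))              # two pmfs on a 5-point space
q = rng.dirichlet(np.ones(5))
f = rng.normal(size=5)                     # an arbitrary witness

lhs = abs(f @ (p - q))                     # |int f dmu1 - int f dmu2|
for c in [0.0, 10.0, -100.0]:
    g = f + c                              # add a constant to the witness
    # The left-hand side and (1/2) span(g) are unchanged ...
    assert np.isclose(abs(g @ (p - q)), lhs)
    assert np.isclose(0.5 * (g.max() - g.min()), 0.5 * (f.max() - f.min()))
    # ... while ||g||_inf grows without bound.
    print(c, np.abs(g).max())

# (1/2) span(f) <= ||f||_inf (Exercise 36.1), so the span bound is tighter.
assert 0.5 * (f.max() - f.min()) <= np.abs(f).max() + 1e-12
```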

Examples

  1. Kolmogorov-Smirnov metric. The Minkowski functional of a function \(f\) for the Kolmogorov-Smirnov metric is its total variation \(V(f)\). Thus, for any function with bounded total variation: \[ \biggl| \int f d μ - \int f dν \biggr| \le \text{KS}(μ,ν) V(f). \] (A numerical check of these bounds appears after this list.)

  2. Total variation metric. The Minkowski functional of a function \(f\) for the total variation metric is \(\tfrac 12 \SPAN(f)\). Thus, for any function with bounded span: \[ \biggl| \int f d μ - \int f dν \biggr| \le \text{TV}(μ,ν) \tfrac 12 \SPAN(f). \]

  3. Kantorovich-Rubinstein metric or Wasserstein distance. The Minkowski functional of a function \(f\) for the Wasserstein distance is \(\|f\|_{L}\). Thus, for any function with bounded Lipschitz constant, we have \[ \biggl| \int f d μ - \int f dν \biggr| \le \text{Was}(μ,ν) \|f\|_{L}. \]

  4. Maximum-mean discrepancy.

  5. Weighted total variation.
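
The bounds in Examples 1 and 3 can be verified empirically on a grid. A sketch, assuming numpy and scipy; the grid, the random measures, and the witness are illustrative.

```python
# Empirical check of the Kolmogorov-Smirnov and Wasserstein bounds above.
# A sketch, assuming numpy and scipy; all inputs are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
grid = np.linspace(-2, 2, 40)
p = rng.dirichlet(np.ones(grid.size))      # two random pmfs on the grid
q = rng.dirichlet(np.ones(grid.size))
f = np.sin(3 * grid)                       # an arbitrary smooth witness

lhs = abs(f @ (p - q))                     # |int f dmu - int f dnu|

# |int f dmu - int f dnu| <= KS(mu, nu) V(f)
ks = np.max(np.abs(np.cumsum(p - q)))
V = np.abs(np.diff(f)).sum()               # variation of f along the grid
assert lhs <= ks * V + 1e-9

# |int f dmu - int f dnu| <= Was(mu, nu) ||f||_L
was = stats.wasserstein_distance(grid, grid, p, q)
lip = np.max(np.abs(np.diff(f)) / np.diff(grid))  # Lipschitz constant on grid
assert lhs <= was * lip + 1e-9
```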

36.4 Properties of IPMs

Definition 36.3 Let \((\ALPHABET S, d)\) be a metric vector space, \(w \colon \ALPHABET S \to [1, ∞)\) a weight function, and \(d_{\F}\) any semi-metric on \(\ALPHABET P_w(\ALPHABET S)\). Then, \(d_{\F}\) is said to have

  • Property (R) if \(d_{\F}(δ_a, δ_b) = d(a,b)\).
  • Property (M) if \(d_{\F}(aX, aY) = a\, d_{\F}(X,Y)\) for all scalars \(a > 0\), where \(d_{\F}(X,Y)\) is shorthand for the distance between the laws of \(X\) and \(Y\).
  • Property (C) if \(d_{\F}(P_1 * Q, P_2 * Q) \le d_{\F}(P_1, P_2)\) for all probability measures \(Q\).

The next result is from Müller (1997), Theorem 4.7.

Suppose \(d_{\F}\) is an IPM with a maximal generator \(\ALPHABET G_{\F}\). Then

  1. Property (R) holds if and only if \[ \sup_{f \in \F} \frac{ \ABS{ f(x) - f(y) } }{ d(x,y) } = 1 , \quad \text{for all } x,y \in \ALPHABET S, x \neq y. \]
  2. Property (C) holds if and only if \(\ALPHABET G_{\F}\) is invariant under translations.

Exercises

Exercise 36.1 Show that for any function \(f\): \[ \tfrac 12 \SPAN(f) \le \NORM{f}_{∞}. \]

Hint: Suppose \(f\) is such that \(\max(f) = - \min(f)\), i.e., the function is centered at \(0\). Then, the inequality holds with equality. What can you say when the function is not centered at \(0\)?

Exercise 36.2 (Convexity of IPMs) Show that IPM is convex, i.e., for any probability measures \(μ_1\), \(μ_2\), and \(ν\) and any constant \(λ \in [0,1]\), we have \[ d_{\F}\bigl( λ μ_1 + (1-λ)μ_2, ν \bigr) \le λ d_{\F}(μ_1, ν) + (1-λ) d_{\F}(μ_2, ν). \]

Hint: Use triangle inequality.

Notes

IPMs were first defined by Zolotarev (1984), where they were called probability metrics with \(ζ\)-structure. They are studied in detail in Rachev (1991). The term IPM was coined by Müller (1997). The discussion here is mostly borrowed from Müller (1997).