38 Integral Probability Metrics
[TODO: Add introduction]
38.1 Definition of Integral Probability Metrics
Notation

- Let $(\mathsf{X}, d)$ be a complete separable metric space, i.e., a Polish space.
- Let $w \colon \mathsf{X} \to [1, \infty)$ be a measurable function and define the weighted norm of any function $f \colon \mathsf{X} \to \mathbb{R}$ as
$$ \| f \|_w = \sup_{x \in \mathsf{X}} \frac{|f(x)|}{w(x)}. $$
- Let $\mathcal{F}(\mathsf{X})$ denote the set of all measurable real-valued functions on $\mathsf{X}$ and let $\mathcal{F}_w(\mathsf{X})$ denote the subset with bounded $w$-norm, i.e., $\mathcal{F}_w(\mathsf{X}) = \{ f \in \mathcal{F}(\mathsf{X}) : \| f \|_w < \infty \}$.
- Let $\mathcal{B}(\mathsf{X})$ denote the Borel $\sigma$-algebra of $\mathsf{X}$.
- Let $\mathcal{P}(\mathsf{X})$ denote the set of all probability measures on $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$ and $\mathcal{P}_w(\mathsf{X}) = \{ \mu \in \mathcal{P}(\mathsf{X}) : \int w \, d\mu < \infty \}$.
Definition 38.1 Given a class $\mathcal{F} \subseteq \mathcal{F}(\mathsf{X})$ of measurable functions, called the generator, the integral probability metric (IPM) between $\mu, \nu \in \mathcal{P}(\mathsf{X})$ is
$$ d_{\mathcal{F}}(\mu, \nu) = \sup_{f \in \mathcal{F}} \biggl| \int f \, d\mu - \int f \, d\nu \biggr|. $$

An IPM is a pseudo-metric, i.e., it satisfies the following properties: for all $\mu, \nu, \lambda \in \mathcal{P}(\mathsf{X})$,

- Symmetry. $d_{\mathcal{F}}(\mu, \nu) = d_{\mathcal{F}}(\nu, \mu)$.
- Triangle inequality. $d_{\mathcal{F}}(\mu, \lambda) \le d_{\mathcal{F}}(\mu, \nu) + d_{\mathcal{F}}(\nu, \lambda)$.

An IPM is a metric when $d_{\mathcal{F}}(\mu, \nu) = 0$ implies $\mu = \nu$, i.e., when the generator $\mathcal{F}$ is rich enough to separate probability measures.
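For intuition, the supremum in the definition becomes a maximum when the support and the generator are finite. A minimal numerical sketch (assuming Python with numpy; the helper `ipm` and the toy distributions are illustrative, not from the text):

```python
import numpy as np

def ipm(functions, support, mu, nu):
    """d_F(mu, nu) = max over f in F of |E_mu[f] - E_nu[f]| for
    probability vectors mu, nu on a finite support."""
    return max(abs(np.dot(mu, f(support)) - np.dot(nu, f(support)))
               for f in functions)

support = np.array([0.0, 1.0, 2.0])
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])

# Using indicator functions 1{x <= t} as the generator recovers
# sup_t |F_mu(t) - F_nu(t)|, the Kolmogorov-Smirnov metric below.
fs = [lambda x, t=t: (x <= t).astype(float) for t in support]
d = ipm(fs, support, mu, nu)
```

Different generators plugged into the same `ipm` helper yield the different metrics discussed next.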
Examples
The Kolmogorov-Smirnov metric. The Kolmogorov-Smirnov metric on $\mathcal{P}(\mathbb{R})$ is defined as
$$ d_{\mathrm{KS}}(\mu, \nu) = \sup_{t \in \mathbb{R}} \lvert F_\mu(t) - F_\nu(t) \rvert, $$
where $F_\mu$ and $F_\nu$ are CDFs corresponding to $\mu$ and $\nu$, respectively. It has the following properties:

- $d_{\mathrm{KS}}(\mu, \nu) = 0$ iff $\mu = \nu$.
- $d_{\mathrm{KS}}(\mu, \nu) = 1$ iff the two distributions have disjoint support, i.e., there exists a $t \in \mathbb{R}$ such that $F_\mu(t) = 1$ and $F_\nu(t) = 0$ (or vice versa).
- $d_{\mathrm{KS}}$ is a metric.

The Kolmogorov-Smirnov metric is an IPM with generator
$$ \mathcal{F}_{\mathrm{KS}} = \{ \mathbb{1}_{(-\infty, t]} : t \in \mathbb{R} \}. $$
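As a quick sanity check (assuming Python with numpy and scipy; the Gaussian samples and sample sizes are arbitrary), the empirical Kolmogorov-Smirnov distance computed directly as $\sup_t |F_x(t) - F_y(t)|$ matches the statistic returned by `scipy.stats.ks_2samp`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=2000)   # sample from mu
y = rng.normal(0.5, 1.0, size=2000)   # sample from nu

# Evaluate both empirical CDFs on the pooled sample and take the largest gap.
grid = np.sort(np.concatenate([x, y]))
F_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
F_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
d_manual = np.max(np.abs(F_x - F_y))

# scipy's two-sample KS statistic computes the same supremum.
d_scipy = stats.ks_2samp(x, y).statistic
```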
Total variation metric. Let $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$ be a measurable space. The total variation metric between two probability measures $\mu$ and $\nu$ on $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$ is defined as
$$ d_{\mathrm{TV}}(\mu, \nu) = \sup_{A \in \mathcal{B}(\mathsf{X})} \lvert \mu(A) - \nu(A) \rvert. $$
It has the following properties:

- $d_{\mathrm{TV}}(\mu, \nu) = 0$ iff $\mu = \nu$.
- $d_{\mathrm{TV}}(\mu, \nu) = 1$ iff there exists an event $A \in \mathcal{B}(\mathsf{X})$ such that $\mu(A) = 1$ and $\nu(A) = 0$.
- $d_{\mathrm{TV}}$ is a metric.

When the measures $\mu$ and $\nu$ have densities $p$ and $q$ with respect to a common measure $\lambda$, then

- $d_{\mathrm{TV}}(\mu, \nu) = \frac{1}{2} \int \lvert p - q \rvert \, d\lambda$.
- $d_{\mathrm{TV}}(\mu, \nu) = 1 - \int \min(p, q) \, d\lambda$.
- $d_{\mathrm{TV}}(\mu, \nu) = \int_B (p - q) \, d\lambda$, where $B = \{ x \in \mathsf{X} : p(x) \ge q(x) \}$.

One may view the total variation metric as half the total variation norm of the signed measure $\mu - \nu$. An implication of the first property is that total variation can be expressed as an IPM with the generator
$$ \mathcal{F}_{\mathrm{TV}} = \bigl\{ f \in \mathcal{F}(\mathsf{X}) : \| f \|_\infty \le \tfrac{1}{2} \bigr\}. $$
See Adell and Jodrá (2006) for exact TV distance between some common distributions.
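The three density formulas are easy to verify numerically (a sketch assuming Python with numpy; the random 6-point distributions are arbitrary), taking $\lambda$ to be the counting measure on a finite set:

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random(6); p /= p.sum()   # density of mu w.r.t. counting measure
q = rng.random(6); q /= q.sum()   # density of nu

tv_half_l1 = 0.5 * np.abs(p - q).sum()      # (1/2) * sum |p - q|
tv_min     = 1.0 - np.minimum(p, q).sum()   # 1 - sum min(p, q)
tv_event   = (p - q)[p >= q].sum()          # mu(B) - nu(B), B = {p >= q}

# All three formulas give the same value of d_TV(mu, nu).
assert np.isclose(tv_half_l1, tv_min) and np.isclose(tv_half_l1, tv_event)
```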
Kantorovich-Rubinstein metric or Wasserstein distance. Let $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$ be a measurable space and $d$ be a metric on $\mathsf{X}$. Let $p \ge 1$. Then the Wasserstein $p$-distance between two measures $\mu$ and $\nu$ on $\mathsf{X}$ is defined as
$$ W_p(\mu, \nu) = \biggl( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathsf{X} \times \mathsf{X}} d(x, y)^p \, d\gamma(x, y) \biggr)^{1/p}, $$
where $\Gamma(\mu, \nu)$ is the set of all couplings (i.e., joint measures on $\mathsf{X} \times \mathsf{X}$ with marginals $\mu$ and $\nu$). A compact way to write this is
$$ W_p(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \bigl( \mathbb{E}_{(X, Y) \sim \gamma} [ d(X, Y)^p ] \bigr)^{1/p}. $$
In this section, we focus on the $W_1$ distance (i.e., $p = 1$), and will drop the subscript for convenience and write $W(\mu, \nu) = W_1(\mu, \nu)$.

Theorem 38.1 (Kantorovich-Rubinstein duality) The Wasserstein distance is equivalent to the following:
$$ W(\mu, \nu) = \sup_{\| f \|_L \le 1} \biggl| \int f \, d\mu - \int f \, d\nu \biggr|, $$
where $\| f \|_L$ denotes the Lipschitz constant of $f$, i.e., the smallest $L$ such that $|f(x) - f(y)| \le L \, d(x, y)$ for all $x, y \in \mathsf{X}$.
Proof. The original definition is an (infinite dimensional) linear program: the objective $\int d(x, y) \, d\gamma(x, y)$ is a linear function of $\gamma$ and the set of all couplings is a convex polytope. Taking its dual, we get
$$ W(\mu, \nu) = \sup_{f, g} \biggl\{ \int f \, d\mu + \int g \, d\nu : f(x) + g(y) \le d(x, y) \ \text{for all } x, y \in \mathsf{X} \biggr\}. $$
Similar in spirit to the Legendre transform, define the $d$-dual of the function $f$ as
$$ f^d(y) = \inf_{x \in \mathsf{X}} \bigl[ d(x, y) - f(x) \bigr]. $$
Observe that for any $y, y' \in \mathsf{X}$,
$$ f^d(y) - f^d(y') \le \sup_{x \in \mathsf{X}} \bigl[ d(x, y) - d(x, y') \bigr] \le d(y, y'). $$
Interchanging the role of $y$ and $y'$, we get $f^d(y') - f^d(y) \le d(y, y')$. Thus, $\| f^d \|_L \le 1$.

Now consider any $f$ and $g$ that satisfy the constraint $f(x) + g(y) \le d(x, y)$ for all $x, y \in \mathsf{X}$. Taking the infimum over $x$ shows that $g(y) \le f^d(y)$ for all $y$. Moreover, taking $y = x$ in the definition of $f^d$ gives $f^d(x) \le d(x, x) - f(x) = -f(x)$, i.e., $f(x) \le -f^d(x)$. Thus, for any $f$ and $g$ that satisfy the constraint, we have
$$ \int f \, d\mu + \int g \, d\nu \le \int (-f^d) \, d\mu + \int f^d \, d\nu \le \sup_{\| h \|_L \le 1} \biggl[ \int h \, d\nu - \int h \, d\mu \biggr], $$
where the last inequality follows because we have shown that $\| f^d \|_L \le 1$.

Now observe that for any $f$ such that $\| f \|_L \le 1$, we have $f(x) - f(y) \le d(x, y)$. Hence, for any coupling $\gamma \in \Gamma(\mu, \nu)$, we have
$$ \int f \, d\mu - \int f \, d\nu = \int \bigl[ f(x) - f(y) \bigr] \, d\gamma(x, y) \le \int d(x, y) \, d\gamma(x, y). $$
Thus, for any function $f$ such that $\| f \|_L \le 1$, we have
$$ \int f \, d\mu - \int f \, d\nu \le W(\mu, \nu). $$
Consequently,
$$ \sup_{\| f \|_L \le 1} \biggl[ \int f \, d\mu - \int f \, d\nu \biggr] \le W(\mu, \nu). $$
The result then follows by combining the two inequalities. $\blacksquare$

Remark about absolute value. Note that in the proof, we have shown that
$$ W(\mu, \nu) = \sup_{\| f \|_L \le 1} \biggl[ \int f \, d\mu - \int f \, d\nu \biggr], $$
but in Theorem 38.1, we use absolute value in the right hand side. This is because (using a slightly terse notation)
$$ \int f \, d\mu - \int f \, d\nu = - \biggl[ \int (-f) \, d\mu - \int (-f) \, d\nu \biggr] $$
and both $f$ and $-f$ have the same Lipschitz constant. So the supremum is unchanged by adding the absolute value.

Theorem 38.1 implies that Wasserstein distance can be expressed as an IPM with the generator
$$ \mathcal{F}_W = \{ f \in \mathcal{F}(\mathsf{X}) : \| f \|_L \le 1 \}. $$
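A numerical illustration of the duality (assuming Python with numpy and scipy; the Gaussian samples are arbitrary): `scipy.stats.wasserstein_distance` computes $W_1$ between empirical measures on the line, and any particular 1-Lipschitz function, such as $f(x) = x$, gives a lower bound on it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=500)   # sample from mu
y = rng.normal(1.0, 2.0, size=500)   # sample from nu

w1 = stats.wasserstein_distance(x, y)   # W_1 between the empirical measures

# f(x) = x is 1-Lipschitz, so |int f dmu - int f dnu| <= W(mu, nu).
dual_lower_bound = abs(x.mean() - y.mean())
assert dual_lower_bound <= w1 + 1e-12
```

Computing the exact supremum over all 1-Lipschitz functions requires solving the linear program; any single Lipschitz test function only certifies a lower bound.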
Maximum-mean discrepancy.
TBW
Weighted total variation.
TBW
38.2 Maximal generators
So far, we haven’t imposed any restriction on the generator $\mathcal{F}$. Different generators may induce the same IPM, which motivates the notion of a maximal generator.

Definition 38.2 For a function class $\mathcal{F} \subseteq \mathcal{F}_w(\mathsf{X})$, the maximal generator of the IPM $d_{\mathcal{F}}$ is the function class
$$ \hat{\mathcal{F}} = \biggl\{ f \in \mathcal{F}_w(\mathsf{X}) : \biggl| \int f \, d\mu - \int f \, d\nu \biggr| \le d_{\mathcal{F}}(\mu, \nu) \ \text{for all } \mu, \nu \in \mathcal{P}_w(\mathsf{X}) \biggr\}. $$
Maximal generators have the following properties.
Proposition 38.1 Let $\hat{\mathcal{F}}$ be the maximal generator of $\mathcal{F}$. Then:

- $\hat{\mathcal{F}}$ contains the convex hull of $\mathcal{F}$.
- If $f \in \hat{\mathcal{F}}$, then $\alpha f + c \in \hat{\mathcal{F}}$ for all $\alpha \in [-1, 1]$ and $c \in \mathbb{R}$.
- If the sequence $\{ f_n \} \subseteq \hat{\mathcal{F}}$ converges uniformly to $f \in \mathcal{F}_w(\mathsf{X})$, then $f \in \hat{\mathcal{F}}$.
Maximal generators also allow us to compare IPMs with different generators.
A set $\mathcal{G} \subseteq \mathcal{F}(\mathsf{X})$ is called balanced if $\alpha f \in \mathcal{G}$ for all $f \in \mathcal{G}$ and all scalars $\alpha$ with $|\alpha| \le 1$.
Proposition 38.2 Let $\mathcal{F}$ and $\mathcal{G}$ be two generators with maximal generators $\hat{\mathcal{F}}$ and $\hat{\mathcal{G}}$, respectively. Then:

- $\mathcal{F} \subseteq \hat{\mathcal{F}}$.
- $d_{\hat{\mathcal{F}}} = d_{\mathcal{F}}$.
- If $\hat{\mathcal{F}} = \hat{\mathcal{G}}$, then $d_{\mathcal{F}}$ and $d_{\mathcal{G}}$ are identical.
- If $d_{\mathcal{F}} = d_{\mathcal{G}}$ and $\mathcal{G}$ is convex and balanced, contains the constant functions, and is closed with respect to pointwise convergence, then $\hat{\mathcal{F}} \subseteq \mathcal{G}$.
Examples
Kolmogorov-Smirnov metric. The maximal generator of the Kolmogorov-Smirnov metric is
$$ \hat{\mathcal{F}}_{\mathrm{KS}} = \{ f \in \mathcal{F}(\mathbb{R}) : V(f) \le 1 \}, $$
where $V(f)$ denotes the variation of function $f$, which is defined as
$$ V(f) = \sup \sum_{i=1}^{n} \lvert f(t_i) - f(t_{i-1}) \rvert, $$
where the supremum is over all finite partitions $t_0 < t_1 < \cdots < t_n$ of $\mathbb{R}$.

Total-variation metric. The maximal generator of the total-variation metric is
$$ \hat{\mathcal{F}}_{\mathrm{TV}} = \{ f \in \mathcal{F}(\mathsf{X}) : \sup f - \inf f \le 1 \}. $$
Kantorovich-Rubinstein metric or Wasserstein distance.
The generator of Wasserstein distance given above is actually a maximal generator. Thus,
$$ \hat{\mathcal{F}}_W = \{ f \in \mathcal{F}(\mathsf{X}) : \| f \|_L \le 1 \}. $$
Maximum-mean discrepancy.
Weighted total variation.
38.3 Minkowski functional or gauge of an IPM
We will assume that the generator $\mathcal{F}$ is convex and balanced.

For any function $f \in \mathcal{F}(\mathsf{X})$, the Minkowski functional (or gauge) of $\mathcal{F}$ is defined as
$$ \rho_{\mathcal{F}}(f) = \inf \{ \alpha > 0 : f \in \alpha \mathcal{F} \}, $$
with the convention that $\rho_{\mathcal{F}}(f) = \infty$ when the set is empty.
In all the above examples, the Minkowski functional is a semi-norm, which is a general property.
Proposition 38.3 The Minkowski functional is a semi-norm, i.e., it satisfies the following properties:

- Non-negativity. $\rho_{\mathcal{F}}(f) \ge 0$.
- Absolute homogeneity. $\rho_{\mathcal{F}}(\beta f) = |\beta| \, \rho_{\mathcal{F}}(f)$ for all scalars $\beta$.
- Subadditivity. $\rho_{\mathcal{F}}(f + g) \le \rho_{\mathcal{F}}(f) + \rho_{\mathcal{F}}(g)$.

Non-negativity holds by definition. Absolute homogeneity holds because $\mathcal{F}$ is balanced: for $\beta \ne 0$, $\beta f \in \alpha \mathcal{F}$ if and only if $f \in (\alpha / |\beta|) \mathcal{F}$.

To see why subadditivity holds, let $\alpha > \rho_{\mathcal{F}}(f)$ and $\beta > \rho_{\mathcal{F}}(g)$, so that $f / \alpha \in \mathcal{F}$ and $g / \beta \in \mathcal{F}$. Since $\mathcal{F}$ is convex,
$$ \frac{f + g}{\alpha + \beta} = \frac{\alpha}{\alpha + \beta} \cdot \frac{f}{\alpha} + \frac{\beta}{\alpha + \beta} \cdot \frac{g}{\beta} \in \mathcal{F}. $$
Hence $\rho_{\mathcal{F}}(f + g) \le \alpha + \beta$. Taking the infimum over all such $\alpha$ and $\beta$ gives the result.
Proposition 38.4 For any $f \in \mathcal{F}(\mathsf{X})$ with $\rho_{\mathcal{F}}(f) < \infty$ and any $\mu, \nu \in \mathcal{P}_w(\mathsf{X})$,
$$ \biggl| \int f \, d\mu - \int f \, d\nu \biggr| \le \rho_{\mathcal{F}}(f) \, d_{\mathcal{F}}(\mu, \nu). $$

By the assumptions imposed on $\mathcal{F}$, for any $\alpha > \rho_{\mathcal{F}}(f)$ we have $f / \alpha \in \mathcal{F}$, so $\bigl| \int f \, d\mu - \int f \, d\nu \bigr| \le \alpha \, d_{\mathcal{F}}(\mu, \nu)$. Taking the infimum over all such $\alpha$ gives the result.
Proposition 38.2 implies that the maximal generator of an IPM can be used to compare different IPMs: if $\mathcal{F} \subseteq \hat{\mathcal{G}}$, then $d_{\mathcal{F}} \le d_{\hat{\mathcal{G}}} = d_{\mathcal{G}}$.

For example, we know that the generator of the Kolmogorov-Smirnov metric is $\mathcal{F}_{\mathrm{KS}} = \{ \mathbb{1}_{(-\infty, t]} : t \in \mathbb{R} \}$. Now consider the maximal generator for total-variation distance, which is $\hat{\mathcal{F}}_{\mathrm{TV}} = \{ f : \sup f - \inf f \le 1 \}$. Every indicator function satisfies $\sup f - \inf f \le 1$, so $\mathcal{F}_{\mathrm{KS}} \subseteq \hat{\mathcal{F}}_{\mathrm{TV}}$ and therefore $d_{\mathrm{KS}} \le d_{\mathrm{TV}}$.
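The inequality $d_{\mathrm{KS}} \le d_{\mathrm{TV}}$ obtained from this generator inclusion can be checked numerically (a sketch assuming Python with numpy; the random discrete distributions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(100):
    p = rng.random(8); p /= p.sum()   # pmf of mu on {0, ..., 7}
    q = rng.random(8); q /= q.sum()   # pmf of nu
    d_ks = np.max(np.abs(np.cumsum(p) - np.cumsum(q)))  # sup_t |F_mu - F_nu|
    d_tv = 0.5 * np.abs(p - q).sum()
    assert d_ks <= d_tv + 1e-12   # generator inclusion implies d_KS <= d_TV
```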
Examples
Kolmogorov-Smirnov metric. The Minkowski functional of a function $f$ for the Kolmogorov-Smirnov metric is its variation $V(f)$. Thus, for any function with bounded variation:
$$ \biggl| \int f \, d\mu - \int f \, d\nu \biggr| \le V(f) \, d_{\mathrm{KS}}(\mu, \nu). $$

Total variation metric. The Minkowski functional of a function $f$ for the total variation metric is its span, $\operatorname{span}(f) = \sup f - \inf f$. Thus, for any function with bounded span:
$$ \biggl| \int f \, d\mu - \int f \, d\nu \biggr| \le \operatorname{span}(f) \, d_{\mathrm{TV}}(\mu, \nu). $$

Kantorovich-Rubinstein metric or Wasserstein distance. The Minkowski functional of a function $f$ for the Wasserstein distance is its Lipschitz constant $\| f \|_L$. Thus, for any function with bounded Lipschitz constant, we have
$$ \biggl| \int f \, d\mu - \int f \, d\nu \biggr| \le \| f \|_L \, W(\mu, \nu). $$

Maximum-mean discrepancy.

Weighted total variation.
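The span and Lipschitz bounds can be verified numerically for distributions on a uniform grid (a sketch assuming Python with numpy; the choice $f = \sin$ and the random distributions are arbitrary):

```python
import numpy as np

grid = np.linspace(0.0, 6.0, 50)
f = np.sin(grid)   # span(f) = max - min, and ||f||_L <= 1 on the line

rng = np.random.default_rng(4)
p = rng.random(50); p /= p.sum()   # pmf of mu on the grid
q = rng.random(50); q /= q.sum()   # pmf of nu

gap  = abs(np.dot(p, f) - np.dot(q, f))   # |int f dmu - int f dnu|
d_tv = 0.5 * np.abs(p - q).sum()
span = f.max() - f.min()

# On the real line, W_1 equals the integral of |F_mu - F_nu|.
w1 = np.sum(np.abs(np.cumsum(p) - np.cumsum(q))[:-1]) * (grid[1] - grid[0])

assert gap <= span * d_tv + 1e-12   # span bound (total variation metric)
assert gap <= 1.0 * w1 + 1e-12      # Lipschitz bound (Wasserstein distance)
```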
38.4 Properties of IPMs
Definition 38.3 Let $d_{\mathcal{F}}$ be an IPM with generator $\mathcal{F}$ and maximal generator $\hat{\mathcal{F}}$. We say that $d_{\mathcal{F}}$ satisfies:

- Property (R) if
- Property (M) if
- Property (C) if $d_{\mathcal{F}}(\mu \ast \lambda, \nu \ast \lambda) \le d_{\mathcal{F}}(\mu, \nu)$ for all probability measures $\lambda$.
The next result is from Müller (1997), Theorem 4.7.
Suppose $\mathsf{X} = \mathbb{R}^n$. Then:

- Property (R) holds if and only if
- Property (C) holds if and only if $\hat{\mathcal{F}}$ is invariant under translations.
Exercises
Exercise 38.1 Show that for any function
Hint: Suppose
Exercise 38.2 (Convexity of IPMs) Show that IPM is convex, i.e., for any probability measures $\mu_1, \mu_2, \nu \in \mathcal{P}(\mathsf{X})$ and any $\theta \in [0, 1]$,
$$ d_{\mathcal{F}}(\theta \mu_1 + (1 - \theta) \mu_2, \nu) \le \theta \, d_{\mathcal{F}}(\mu_1, \nu) + (1 - \theta) \, d_{\mathcal{F}}(\mu_2, \nu). $$
Hint: Use triangle inequality.
Notes
IPMs were first defined by Zolotarev (1984), where they were called probability metrics with a $\zeta$-structure.