38  Integral Probability Metrics

Updated

November 22, 2024

Keywords

IPM

[TODO: Add introduction]

38.1 Definition of Integral Probability Metrics

Notation

  1. Let (X,d) be a complete separable metric space, i.e., a Polish space.

  2. Let $w \colon X \to [1, \infty)$ be a measurable function and define the weighted norm of any function $f \colon X \to \mathbb{R}$ as $\|f\|_w = \sup_{x \in X} \frac{|f(x)|}{w(x)}$.

  3. Let $M(X)$ denote the set of all measurable real-valued functions on $X$ and let $M_w(X)$ denote the subset with bounded $w$-norm, i.e., $M_w(X) = \{ f \in M(X) : \|f\|_w < \infty \}$.

  4. Let $\mathcal{X}$ denote the Borel $\sigma$-algebra of $X$.

  5. Let $P(X)$ denote the set of all probability measures on $(X, \mathcal{X})$ and $P_w(X) = \{ \mu \in P(X) : \int w \, d\mu < \infty \}$.

Integral probability metrics (IPMs) are pseudo-metrics on $P_w(X)$ defined as follows.

Definition 38.1 The integral probability metric $d_F$ generated by a function class $F \subseteq M_w(X)$ is the pseudo-metric on $P_w(X)$ defined as $$d_F(\mu_1, \mu_2) = \sup_{f \in F} \Bigl| \int f \, d\mu_1 - \int f \, d\mu_2 \Bigr|, \qquad \mu_1, \mu_2 \in P_w(X).$$ The function $f$ is called the witness function and the function class $F$ is called the generator.

An IPM is a pseudo-metric, i.e., it satisfies the following properties for all $\mu_1, \mu_2, \mu_3 \in P_w(X)$:

  1. $d_F(\mu_1, \mu_1) = 0$.
  2. Symmetry: $d_F(\mu_1, \mu_2) = d_F(\mu_2, \mu_1)$.
  3. Triangle inequality: $d_F(\mu_1, \mu_3) \le d_F(\mu_1, \mu_2) + d_F(\mu_2, \mu_3)$.

An IPM is a metric when, in addition, $d_F(\mu_1, \mu_2) = 0 \implies \mu_1 = \mu_2$.
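To make the definition concrete, here is a minimal numerical sketch (the finite generator and the two distributions below are made-up illustrations, not from the notes): for distributions on a finite support and a finite generator, the supremum in the definition becomes a maximum.

```python
import numpy as np

# Minimal sketch (illustrative, made-up numbers): an IPM with a *finite*
# generator F, for two distributions mu and nu on a 3-point support.
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])

# Each row of F is a witness function evaluated on the three support points.
F = np.array([
    [1.0, 0.0, 0.0],   # indicator of the first point
    [1.0, 1.0, 0.0],   # indicator of the first two points
    [0.0, 1.0, 1.0],   # indicator of the last two points
])

# d_F(mu, nu) = max over f in F of | E_mu[f] - E_nu[f] |
d = np.max(np.abs(F @ mu - F @ nu))
print(d)
```

Richer generators (all indicators, all bounded functions, all Lipschitz functions) give the examples discussed next.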

Examples

  1. The Kolmogorov-Smirnov metric. The Kolmogorov-Smirnov metric on $\mathbb{R}$ is defined as $\mathrm{KS}(\mu, \nu) = \sup_{t \in \mathbb{R}} |M(t) - N(t)|$ where $M$ and $N$ are the CDFs corresponding to $\mu$ and $\nu$, respectively. It has the following properties:

    • $\mathrm{KS}(\mu, \nu) \in [0, 1]$
    • $\mathrm{KS}(\mu, \nu) = 0$ iff $\mu = \nu$
    • $\mathrm{KS}(\mu, \nu) = 1$ iff the two distributions have disjoint support, i.e., there exists an $s \in \mathbb{R}$ such that $\mu((-\infty, s]) = 1$ and $\nu((-\infty, s]) = 0$
    • KS is a metric

    The Kolmogorov-Smirnov metric is an IPM with $F = \{ \mathbf{1}_{[t, \infty)} : t \in \mathbb{R} \}$.
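    As a quick numerical illustration (with made-up distributions), the KS distance between two discrete distributions on a common finite support can be computed directly from their CDFs:

```python
import numpy as np

# Illustrative example (made-up distributions) on the support {0, 1, 2, 3}.
p = np.array([0.1, 0.4, 0.4, 0.1])   # pmf of mu
q = np.array([0.3, 0.3, 0.2, 0.2])   # pmf of nu

M = np.cumsum(p)   # CDF of mu at the support points
N = np.cumsum(q)   # CDF of nu at the support points

# Between support points the CDFs are constant, so the supremum over
# t in R is attained at a support point: a max over the grid suffices.
ks = np.max(np.abs(M - N))
print(ks)
```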

  2. Total variation metric. Let $(S, \mathcal{S})$ be a measurable space. The total variation metric between two probability measures $\mu$ and $\nu$ on $(S, \mathcal{S})$ is defined as $\mathrm{TV}(\mu, \nu) = \sup_{B \in \mathcal{S}} |\mu(B) - \nu(B)|$. It has the following properties:

    • $\mathrm{TV}(\mu, \nu) \in [0, 1]$
    • $\mathrm{TV}(\mu, \nu) = 0$ iff $\mu = \nu$
    • $\mathrm{TV}(\mu, \nu) = 1$ iff there exists an event $B \in \mathcal{S}$ such that $\mu(B) = 1$ and $\nu(B) = 0$
    • TV is a metric

    When $S = \mathbb{R}^k$ and the measures $\mu$ and $\nu$ have densities $f$ and $g$ with respect to a common measure $\lambda$, then

    • $\mathrm{TV}(\mu, \nu) = \frac12 \int |f(x) - g(x)| \, \lambda(dx)$.
    • $\mathrm{TV}(\mu, \nu) = 1 - \int \min\{ f(x), g(x) \} \, \lambda(dx)$.
    • $\mathrm{TV}(\mu, \nu) = \mu(B) - \nu(B)$, where $B = \{ x : f(x) \ge g(x) \}$.

    One may view the total variation metric as (one half of) the total variation norm of the signed measure $\mu - \nu$. An implication of the first property is that total variation can be expressed, up to a factor of 2, as an IPM: the generator $F = \{ f : \|f\|_\infty \le 1 \}$ yields $d_F(\mu, \nu) = \int |f - g| \, d\lambda = 2 \, \mathrm{TV}(\mu, \nu)$, and some authors take this IPM as the definition of the total variation metric.

    See Adell and Jodrá () for exact TV distance between some common distributions.
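    As a sanity check (with made-up distributions, taking $\lambda$ to be the counting measure), the three formulas above agree for discrete distributions:

```python
import numpy as np

# Illustrative check (made-up distributions): for discrete distributions,
# with lambda the counting measure, the three formulas for TV agree.
p = np.array([0.2, 0.5, 0.3])   # density of mu w.r.t. counting measure
q = np.array([0.4, 0.4, 0.2])   # density of nu w.r.t. counting measure

tv_l1  = 0.5 * np.sum(np.abs(p - q))       # (1/2) integral |f - g| dlambda
tv_min = 1.0 - np.sum(np.minimum(p, q))    # 1 - integral min(f, g) dlambda
B      = p >= q                            # B = {x : f(x) >= g(x)}
tv_set = np.sum(p[B]) - np.sum(q[B])       # mu(B) - nu(B)

print(tv_l1, tv_min, tv_set)
```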

  3. Kantorovich-Rubinstein metric or Wasserstein distance. Let $(S, \mathcal{S})$ be a measurable space and $d_S$ a metric on $S$. Let $p \in [1, \infty)$. Then the Wasserstein $p$-distance between two measures $\mu$ and $\nu$ on $(S, \mathcal{S})$ is defined as $$\mathrm{Was}_p(\mu, \nu) = \Bigl( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{S \times S} d_S(x, y)^p \, d\gamma(x, y) \Bigr)^{1/p}$$ where $\Gamma(\mu, \nu)$ is the set of all couplings (i.e., joint measures on $(S \times S, \mathcal{S} \otimes \mathcal{S})$ with marginals $\mu$ and $\nu$). A compact way to write this is $$\mathrm{Was}_p(\mu, \nu) = \Bigl( \inf_{\gamma \in \Gamma(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \gamma}[ d_S(X, Y)^p ] \Bigr)^{1/p}.$$

    In this section, we focus on the $\mathrm{Was}_1$ distance (i.e., $p = 1$); we drop the subscript 1 for convenience and write $\mathrm{Was}(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \gamma}[ d_S(X, Y) ]$.

    Theorem 38.1 (Kantorovich-Rubinstein duality) The Wasserstein distance is equivalent to the following: $$\mathrm{Was}(\mu, \nu) = \sup_{f : \|f\|_L \le 1} \bigl| \mathbb{E}_{X \sim \mu}[f(X)] - \mathbb{E}_{Y \sim \nu}[f(Y)] \bigr|$$ where $\|f\|_L$ denotes the Lipschitz constant of $f$ with respect to $d_S$.

    The original definition is an (infinite dimensional) linear program: $\mathbb{E}_{(X, Y) \sim \gamma}[d_S(X, Y)]$ is a linear function of $\gamma$ and the set of all couplings $\Gamma(\mu, \nu)$ is a convex polytope. Taking its dual, we get $$\mathrm{Was}(\mu, \nu) = \sup_{f, g : S \to \mathbb{R}} \bigl[ \mathbb{E}_{X \sim \mu}[f(X)] + \mathbb{E}_{Y \sim \nu}[g(Y)] \bigr] \quad \text{such that } f(x) + g(y) \le d(x, y).$$

    Similar in spirit to the Legendre transform, define the $d_S$-dual of the function $g$ as $$g^*(x) = \inf_{u \in S} [ d(x, u) - g(u) ].$$ Observe that for any $u$, $$g^*(x) \le d(x, u) - g(u) \le d(x, y) + d(y, u) - g(u).$$ Taking the infimum over $u$, $$g^*(x) \le d(x, y) + \inf_{u \in S} [ d(y, u) - g(u) ] = d(x, y) + g^*(y).$$ Thus, $g^*(x) - g^*(y) \le d_S(x, y)$. Interchanging the roles of $x$ and $y$, we get $|g^*(x) - g^*(y)| \le d_S(x, y)$. Thus, $\|g^*\|_L \le 1$.

    Now consider any $f$ and $g$ that satisfy the constraint $f(x) + g(y) \le d(x, y)$ for all $x, y$. For any $x$ and $u$, feasibility gives $f(x) \le d(x, u) - g(u)$; taking the infimum over $u$ yields $f(x) \le g^*(x)$. Moreover, $g^*(x) \le d(x, x) - g(x) = -g(x)$, i.e., $g(x) \le -g^*(x)$. Thus, for any $f$ and $g$ that satisfy the constraint, we have $$\mathrm{Was}(\mu, \nu) = \sup_{f, g : f(x) + g(y) \le d_S(x, y)} \bigl[ \mathbb{E}_{X \sim \mu}[f(X)] + \mathbb{E}_{Y \sim \nu}[g(Y)] \bigr] \le \sup_{g} \bigl[ \mathbb{E}_{X \sim \mu}[g^*(X)] - \mathbb{E}_{Y \sim \nu}[g^*(Y)] \bigr] \le \sup_{f : \|f\|_L \le 1} \bigl[ \mathbb{E}_{X \sim \mu}[f(X)] - \mathbb{E}_{Y \sim \nu}[f(Y)] \bigr] \tag{1}$$ where the last inequality follows because we have shown that $\|g^*\|_L \le 1$.

    Now observe that for any $f$ such that $\|f\|_L \le 1$, we have $f(x) - f(y) \le d(x, y)$. Hence, for any coupling $\gamma \in \Gamma(\mu, \nu)$, we have $$\mathbb{E}_{(X, Y) \sim \gamma}[d(X, Y)] \ge \mathbb{E}_{(X, Y) \sim \gamma}[f(X) - f(Y)] = \mathbb{E}_{X \sim \mu}[f(X)] - \mathbb{E}_{Y \sim \nu}[f(Y)].$$ Thus, for any function $f$ such that $\|f\|_L \le 1$, we have $\mathbb{E}_{X \sim \mu}[f(X)] - \mathbb{E}_{Y \sim \nu}[f(Y)] \le \inf_{\gamma \in \Gamma(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \gamma}[d(X, Y)] = \mathrm{Was}(\mu, \nu)$. Consequently, $$\sup_{f : \|f\|_L \le 1} \bigl[ \mathbb{E}_{X \sim \mu}[f(X)] - \mathbb{E}_{Y \sim \nu}[f(Y)] \bigr] \le \mathrm{Was}(\mu, \nu). \tag{2}$$

    The result then follows from (1) and (2).

    Remark about absolute value

    Note that in the proof, we have shown that $\mathrm{Was}(\mu, \nu) = \sup_{f : \|f\|_L \le 1} \bigl[ \mathbb{E}_{X \sim \mu}[f(X)] - \mathbb{E}_{Y \sim \nu}[f(Y)] \bigr]$ but in Theorem 38.1, we use an absolute value on the right-hand side. This is because (using a slightly terse notation) $\bigl| \int f \, d\mu - \int f \, d\nu \bigr|$ is either $\int f \, d\mu - \int f \, d\nu$ or $\int (-f) \, d\mu - \int (-f) \, d\nu$, and both $f$ and $-f$ have the same Lipschitz constant. So $$\sup_{f : \|f\|_L \le 1} \Bigl[ \int f \, d\mu - \int f \, d\nu \Bigr] = \sup_{f : \|f\|_L \le 1} \Bigl| \int f \, d\mu - \int f \, d\nu \Bigr|.$$

    Theorem 38.1 implies that the Wasserstein distance can be expressed as an IPM with the generator $F = \{ f : \|f\|_L \le 1 \}$.
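    A small numerical illustration (made-up distributions): on the real line, $\mathrm{Was}_1$ has the closed form $\int |M(t) - N(t)| \, dt$ in terms of the CDFs $M$ and $N$, and by duality any 1-Lipschitz witness provides a lower bound:

```python
import numpy as np

# Illustrative example (made-up distributions) on the real support {0, 1, 2, 3}.
support = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.1, 0.4, 0.4, 0.1])   # pmf of mu
q = np.array([0.3, 0.3, 0.2, 0.2])   # pmf of nu

# On R, Was_1(mu, nu) = integral |M(t) - N(t)| dt, where M, N are the CDFs.
# The CDFs are piecewise constant, so the integral is a finite sum.
M, N = np.cumsum(p), np.cumsum(q)
widths = np.diff(support)                  # lengths of the constant pieces
was1 = np.sum(np.abs(M - N)[:-1] * widths)

# Duality: every 1-Lipschitz witness gives a lower bound; f(x) = x is
# 1-Lipschitz, so |E_mu[X] - E_nu[X]| <= Was(mu, nu).
lower = abs(np.dot(support, p) - np.dot(support, q))
print(was1, lower)
```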

  4. Maximum mean discrepancy.

    TBW

  5. Weighted total variation.

    TBW

38.2 Maximal generators

So far, we haven’t imposed any restriction on the generator F. We now present a canonical representation of a generator.

Definition 38.2 For a function class $F \subseteq M_w(X)$, define the maximal generator $G_F$ as $$G_F = \Bigl\{ f \in M_w(X) : \Bigl| \int f \, d\mu_1 - \int f \, d\mu_2 \Bigr| \le d_F(\mu_1, \mu_2) \text{ for all } \mu_1, \mu_2 \in P_w(X) \Bigr\}.$$

Maximal generators have the following properties.

Proposition 38.1 Let $F \subseteq M_w(X)$ be an arbitrary generator. Then:

  1. $G_F$ contains the convex hull of $F$.
  2. If $f \in G_F$, then $\alpha f + \beta \in G_F$ for all $|\alpha| \le 1$ and $\beta \in \mathbb{R}$.
  3. If the sequence $\{ f_n \}_{n \ge 1} \subseteq G_F$ converges uniformly to $f$, then $f \in G_F$.
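For instance, property 2 can be verified directly from Definition 38.2: since $\mu_1$ and $\mu_2$ are probability measures, the constant $\beta$ integrates to $\beta$ under both and cancels, so

```latex
\left| \int (\alpha f + \beta)\, d\mu_1 - \int (\alpha f + \beta)\, d\mu_2 \right|
  = |\alpha| \left| \int f\, d\mu_1 - \int f\, d\mu_2 \right|
  \le |\alpha|\, d_F(\mu_1, \mu_2)
  \le d_F(\mu_1, \mu_2),
```

where the last step uses $|\alpha| \le 1$. Hence $\alpha f + \beta \in G_F$.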

Maximal generators also allow us to compare IPMs with different generators.

Balanced set

A set $S$ is said to be balanced if for all scalars $a$ satisfying $|a| \le 1$, we have $aS \subseteq S$.

Proposition 38.2 Let $F \subseteq D \subseteq M_w(X)$ and $\mu_1, \mu_2 \in P_w(X)$. Then

  1. $d_F(\mu_1, \mu_2) \le d_D(\mu_1, \mu_2)$.
  2. $G_F \subseteq G_D$.
  3. If $D \subseteq G_F$, then $d_F$ and $d_D$ are identical.
  4. If $F \subseteq D \subseteq G_F$ and $D$ is convex and balanced, contains the constant functions, and is closed with respect to pointwise convergence, then $D = G_F$.

Examples

  1. Kolmogorov-Smirnov metric. The maximal generator of the Kolmogorov-Smirnov metric is $G_F = \{ f : V(f) \le 1 \}$, where $V(f)$ denotes the total variation of the function $f$, defined as $$V(f) = \sup_P \sum_{k=1}^n | f(x_k) - f(x_{k-1}) |,$$ where the supremum is over all partitions $P = \{ x_0, \ldots, x_n \}$ of $\mathbb{R}$.

  2. Total-variation metric. The maximal generator of the total-variation metric is $G_F = \{ f : \frac12 \mathrm{sp}(f) \le 1 \}$, where $\mathrm{sp}(f) = \sup f - \inf f$ denotes the span of $f$.

  3. Kantorovich-Rubinstein metric or Wasserstein distance.

    The generator of the Wasserstein distance given above is already a maximal generator. Thus, $G_F = \{ f : \|f\|_L \le 1 \}$.

  4. Maximum mean discrepancy.

  5. Weighted total variation.

38.3 Minkowski functional or gauge of an IPM

We will assume that the generator $F$ is the maximal generator. In particular, this means that $0 \in F$. Define $(0, \infty)F := \bigcup_{0 < r < \infty} rF$, which is a subset of $M_w(X)$.

For any function $f \in M_w(X)$, define the Minkowski functional (or gauge) as follows: $$\rho_F(f) = \inf\{ \rho > 0 : f/\rho \in F \},$$ where the infimum of an empty set is defined as infinity. Therefore, $\rho_F(f) < \infty$ if $f \in (0, \infty)F$; otherwise $\rho_F(f) = \infty$.

In all the above examples, the Minkowski functional is a semi-norm. This is a general property.

Proposition 38.3 The Minkowski functional is a semi-norm, i.e., it satisfies the following properties:

  1. Non-negativity: $\rho_F(f) \ge 0$.
  2. Absolute homogeneity: $\rho_F(af) = |a| \, \rho_F(f)$ for all scalars $a$.
  3. Subadditivity: $\rho_F(f_1 + f_2) \le \rho_F(f_1) + \rho_F(f_2)$.

Non-negativity holds by definition. Absolute homogeneity holds because $F$ is balanced; in particular, $-F = F$. Note that absolute homogeneity implies $\rho_F(0) = 0$.

To see why subadditivity holds, let $\rho_1 = \rho_F(f_1)$ and $\rho_2 = \rho_F(f_2)$, and suppose $f_1/\rho_1, f_2/\rho_2 \in F$ (otherwise repeat the argument with $\rho_i + \varepsilon$ and let $\varepsilon \to 0$). Then, since $F$ is convex, we have for any $\lambda \in [0, 1]$, $$\lambda \frac{f_1}{\rho_1} + (1 - \lambda) \frac{f_2}{\rho_2} \in F.$$ Take $\lambda = \rho_1/(\rho_1 + \rho_2)$, which implies $(f_1 + f_2)/(\rho_1 + \rho_2) \in F$. Thus, $\rho_F(f_1 + f_2) \le \rho_1 + \rho_2$.

Proposition 38.4 For any $f \in (0, \infty)F$ such that $\rho_F(f) \neq 0$ and any $\mu_1, \mu_2 \in P_w(X)$, we have $$\Bigl| \int f \, d\mu_1 - \int f \, d\mu_2 \Bigr| \le \rho_F(f) \, d_F(\mu_1, \mu_2).$$

By the assumptions imposed on $f$, we know that $\rho_F(f) \in (0, \infty)$. Define $\tilde f = f/\rho_F(f) \in F$. Then, $$\Bigl| \int f \, d\mu_1 - \int f \, d\mu_2 \Bigr| = \rho_F(f) \Bigl| \int \tilde f \, d\mu_1 - \int \tilde f \, d\mu_2 \Bigr| \le \rho_F(f) \, d_F(\mu_1, \mu_2),$$ where the last inequality uses the fact that $\tilde f \in F$.

Why care about maximal generators

Proposition 38.2 implies that the maximal generator of $F$ is the smallest superset of $F$ that is convex and balanced, contains the constant functions, and is closed with respect to pointwise convergence. The IPM distances with respect to a generator and its maximal generator are the same, but the Minkowski functionals are not! In particular, if $F \subseteq D$, then $\rho_F(f) \ge \rho_D(f)$. So, working with the maximal generator gives us the tightest version of the bound in Proposition 38.4.

For example, we know that $\{ f : \|f\|_\infty \le 1 \}$ is a generator for the total-variation distance. Using this, Proposition 38.4 implies that $$\Bigl| \int f \, d\mu_1 - \int f \, d\mu_2 \Bigr| \le \|f\|_\infty \, \mathrm{TV}(\mu_1, \mu_2). \tag{3}$$ Note that the left-hand side is invariant to addition of a constant, but the right-hand side is not. Therefore, this bound cannot be tight.

Now consider the maximal generator for the total-variation distance, $\{ f : \frac12 \mathrm{sp}(f) \le 1 \}$. Using this, Proposition 38.4 implies that $$\Bigl| \int f \, d\mu_1 - \int f \, d\mu_2 \Bigr| \le \tfrac12 \mathrm{sp}(f) \, \mathrm{TV}(\mu_1, \mu_2). \tag{4}$$ In this case, the right-hand side is invariant to addition of a constant. Moreover, $\tfrac12 \mathrm{sp}(f) \le \|f\|_\infty$ (see Exercise 38.1). Therefore, the bound of (4) is tighter than that of (3).
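A small numerical illustration of this comparison (the function values below are made up): the span coefficient in (4) is unchanged by adding a constant, while the sup-norm coefficient in (3) is not:

```python
import numpy as np

# Illustrative comparison (made-up values): the coefficients of the two
# bounds for a function f given by its values on a finite grid.
f = np.array([2.0, 3.0, 2.5, 4.0])

sup_norm  = np.max(np.abs(f))                # ||f||_inf, coefficient in (3)
half_span = 0.5 * (np.max(f) - np.min(f))    # (1/2) sp(f), coefficient in (4)

# Adding a constant changes ||f||_inf but leaves (1/2) sp(f) unchanged,
# and (1/2) sp(f) <= ||f||_inf always, so bound (4) is never looser.
g = f + 10.0
print(sup_norm, half_span, np.max(np.abs(g)), 0.5 * (np.max(g) - np.min(g)))
```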

Examples

  1. Kolmogorov-Smirnov metric. The Minkowski functional of a function $f$ for the Kolmogorov-Smirnov metric is its total variation $V(f)$. Thus, for any function with bounded total variation: $$\Bigl| \int f \, d\mu - \int f \, d\nu \Bigr| \le \mathrm{KS}(\mu, \nu) \, V(f).$$

  2. Total variation metric. The Minkowski functional of a function $f$ for the total variation metric is $\frac12 \mathrm{sp}(f)$. Thus, for any function with bounded span: $$\Bigl| \int f \, d\mu - \int f \, d\nu \Bigr| \le \mathrm{TV}(\mu, \nu) \, \tfrac12 \mathrm{sp}(f).$$

  3. Kantorovich-Rubinstein metric or Wasserstein distance. The Minkowski functional of a function $f$ for the Wasserstein distance is $\|f\|_L$. Thus, for any function with bounded Lipschitz constant, we have $$\Bigl| \int f \, d\mu - \int f \, d\nu \Bigr| \le \mathrm{Was}(\mu, \nu) \, \|f\|_L.$$

  4. Maximum mean discrepancy.

  5. Weighted total variation.

38.4 Properties of IPMs

Definition 38.3 Let $(S, d)$ be a metric vector space, $w \colon S \to [1, \infty)$ a weight function, and $d_F$ any semi-metric on $P_w(S)$. Then, $d_F$ is said to have

  • Property (R) if $d_F(\delta_a, \delta_b) = d(a, b)$ for all $a, b \in S$.
  • Property (M) if $d_F(aX, aY) = a \, d_F(X, Y)$ for all scalars $a > 0$.
  • Property (C) if $d_F(P_1 * Q, P_2 * Q) \le d_F(P_1, P_2)$ for all probability measures $Q$, where $*$ denotes convolution.

The next result is from Müller (), Theorem 4.7.

Suppose $d_F$ is an IPM with maximal generator $G_F$. Then

  1. Property (R) holds if and only if $\sup_{f \in F} \frac{| f(x) - f(y) |}{d(x, y)} = 1$ for all $x, y \in S$, $x \neq y$.
  2. Property (C) holds if and only if $G_F$ is invariant under translations.

Exercises

Exercise 38.1 Show that for any function $f$: $\tfrac12 \mathrm{sp}(f) \le \|f\|_\infty$.

Hint: Suppose $f$ is such that $\max(f) = -\min(f)$, i.e., the function is centered at 0. Then, the inequality holds with equality. What can you say when the function is not centered at 0?

Exercise 38.2 (Convexity of IPMs) Show that an IPM is convex, i.e., for any probability measures $\mu_1$, $\mu_2$, and $\nu$ and any constant $\lambda \in [0, 1]$, we have $$d_F(\lambda \mu_1 + (1 - \lambda) \mu_2, \nu) \le \lambda \, d_F(\mu_1, \nu) + (1 - \lambda) \, d_F(\mu_2, \nu).$$

Hint: Use the triangle inequality.

Notes

IPMs were first defined by Zolotarev (), where they were called probability metrics with ζ-structure. They are studied in detail in Rachev (). The term IPM was coined by Müller (). The discussion here is mostly borrowed from Müller ().
