53 Convex Duality

Author

Affiliation

Aditya Mahajan

McGill University

Updated

July 19, 2024

53.1 Basic Intuition

We start by stating a basic fact

Characterization of convex functions

Any convex function can be represented as a pointwise supremum of convex functions. More formally, for any convex \(f \colon \reals^n \to \reals\), there exists a countable index set \(I\) and a family of \(\{a_i \in \reals^n\}_{i \in I}\) and \(\{b_i \in \reals\}_{i \in I}\) such that \[ f(x) = \sup_{i \in I} \{ a_i^\TRANS x + b_i \} \]

Smax = 2
n = 100
np = 3

f = function(s) { return s*s + 1 }
// Legendre transform of f
g = function(s) { return s*s/4 - 1 }

points = {

  var points = new Array()
  var idx = 0
  for( var i = -n ; i <= n; i++) {
    var s = Smax*i/n
    points[idx++] = { point: s, value: f(s) }
  }
  return points
  }

legendre = {
  var points = new Array()
  var idx = 0

  for( var i = -np; i <= np; i++) {
    var p = Smax*i/np
    var gp = g(p)
    points[idx++] = { x: -Smax, y: -Smax*p - gp, index: i }
    points[idx++] = { x:  Smax, y:  Smax*p - gp, index: i }
  }

  return points
}

averagePlot = Plot.plot({
  grid: true,
  y: { domain: [0, 5] },

  marks: [
    // Axes
    Plot.ruleX([0]),
    Plot.ruleY([0]),
    // Data
    Plot.line(legendre, {x:"x", y:"y", z:"index", stroke:"gray"} ),

    Plot.line(points, {x:"point", y:"value",
                      stroke:"red", strokeWidth: 4}),
  ]
})

Figure 53.1: Convex function as a supremum of affine functions

An illustration of this fact is shown in Figure 53.1.

Why countable index set?

From the figure you can convince yourself of the result for an uncountable index set \(\bar I\). Now consider the set \(\bar S = \{a_i, b_i\}_{i \in \bar I} \subset \reals^{n+1}\) (for uncountable \(\bar I\)). Since \(\reals^{n+1}\) is separable, let \(S = \{a_i, b_i\}_{i \in I}\) be a countable dense subset of \(\bar S\) (where \(I\) is countable). Since the supremum over any set is the same as the supremum of its dense subset, we can replace \(\bar I\) by \(I\).

The Lengendre-Fenchel transform is a compact way of representing this basic fact. To fix ideas, consider a twice differentiable and strictly convex function \(f \colon \reals \to \reals\). Arbitrarily fix a point \(x^∘\) and consider the tangent \(τ(x)\) to the plot of \(f(⋅)\) at \(x^∘\). This tangent has a slope \(p = f'(x^∘)\). Let \(g\) denote the negative \(y\)-intercept of the tangent (using the negative \(y\)-intercept is just a matter of convention). Thus, the equation for the tangent is: \[ τ(x) = p x - g, \quad \forall x \in \reals \] at at the point \(x\), the tangent \(τ\) intercepts the function \(f\); thus, \(τ(x^∘) = f(x^∘)\), or equivalently: \[ f(x^∘) = p x^∘ - g \quad\hbox{or, equivalently}\quad g = p x^∘ - f(x^∘) \]

Note that the \(g\) above depends on \(x^∘\). Since \(f\) is strictly convex, \(f'\) is strictly increasing. Thus, there is a one-to-one relationship between the point \(x^∘\) and its slope \(p = f'(x^∘)\). The Lengendre-Fenchel transform is a method to define the intercept \(g\) as a function of \(p\). In particular, it is defined as \[ g(p) = \sup_{x \in \reals} \{ px - f(x) \}. \]

To understand this definition, suppose \(x^∘\) achieves the supremum in the above equation. Then, (i) \(f(x^∘) = p x^∘ - g(p)\); that is the line \(px - g(p)\) touches the function \(f\) at \(x^∘\); and (ii) at all other points \(x \neq x^∘\), \(g(p) > p x - f(x)\) or \(f(x) > px - g(p)\); that is the line \(px - g(p)\) lies below the function \(f\). Hence, \(px - g(p)\) is a :supporting line of \(f\) (and since \(f\) is differentiable, equal to its tangent).

An implication of this definition is that for differentiable and convex \(f\), the Legendre transform \(g\) of \(f\) solves \[ g(p) = p x^∘ - f(x^∘) \] where \(x^∘\) solves \(p = f'(x^∘)\), which is called the duality condition.

Example 53.1 Let \(f(x) = x^2\). For a fixed \(p\), the duality condition is \(p = 2x^∘\) or \(x^∘ = p/2\). Therefore, \[ g(p) = p x^∘ - f(x^∘) = \frac{p^2}{2} - \frac{p^2}{4} = \frac{p^2}{4}. \]

Example 53.2 Let \(f(x) = (x^α)/α\), where \(α > 1\). For a fixed \(p\), the duality condition is \(p = (x^∘)^{α-1}\) or \(x^∘ = p^{1/(p-1)}\). Therefore, \[ g(p) = p x^∘ - f(x^∘) = p^{α/(α-1)} - \frac{1}{α} p^{α/(α-1)} = \frac{α-1}{α} p^{α/(α-1)}. \] We can compactly write the above expression is \(g(p) = p^β/β\), where \(1/β = 1 - 1/α\).

Example 53.3 Let \(f(x) = x \log x + (1-x) \log (1-x)\), where the domain is \([0,1]\). This is the negative binary entropy. For a fixed \(p\), the duality condition is \[ p = f'(x^∘) = \log(x^∘) + 1 - \log(1-x^∘) - 1 % = \log(x^∘) - \log(1-x^∘) = \log \frac{x^∘}{1-x^∘} \] Thus, \(x^∘ = e^p/(1 + e^p)\). Therefore, \[ g(p) = p x^∘ - f(x^∘) = \log(1 + e^p), \] where the last step follows after some algebra.

Fenchel-Young inequality

Let \(g\) be the Legendre-Fenchel transform of \(f\). Then, by definition (changing the variable names for convenience), we have \[\begin{equation}\tag{Fenchel's inequality} xy \le f(x) + g(y). \end{equation}\] For the special case of Example 53.2, we have \[\begin{equation} \tag{Young's inequality} xy \le \frac{x^p}{p} + \frac{y^q}{q} \end{equation}\] where \(p, q > 1\) are such that \(\frac 1p + \frac 1q = 1\).

An interesting example of Fenchel-Young inequality is the following. Consider a real-valued, strictly increasing, continuous function \(h\) on \(\reals\) which satisfies \(h(0) = 0\), \(h(x) \to -∞\) as \(x \to -∞\), and \(h(x) \to ∞\) as \(x \to ∞\). Since \(h\) is continuous and strictly increasing, it has an inverse. Define \[\begin{equation}\label{eq:legendre-example} f(x) = \int_{0}^x h(s) ds \quad\hbox{and}\quad g(y) = \int_{0}^y h^{-1}(t) dt. \end{equation}\]

exampleFunction = function(s) {
    if (s <= 1) {
        return (s - s**2/2)
    } else
    {
        return 1/2 + (s - 1)**2/2
    }
  }

examplePoints = {
  var points = new Array()
  var idx = 0

  const n = 500;
  const Smax = 2; 

  for(var i = 0; i <= n; i++) {
      var s = Smax*i/n
      points[idx++] = { x: s, y: exampleFunction(s) } 
  }
  return points
}

Plot.plot({
  grid: true,
  x : { label: "s"}, 
  y : { label: "t"}, 

  marks: [
    // Fill
    Plot.areaY(examplePoints.filter(pt => pt.x <= 1), {x:"x", y:"y",
               fill:"lightblue", fillOpacity: 0.5}),

    Plot.areaX(examplePoints.filter(pt => pt.y <= 0.7), {x:"x", y:"y",
               fill:"pink", fillOpacity: 0.5}),

    // Axes
    Plot.ruleX([0]),
    Plot.ruleY([0]),

    Plot.line([ {x:1, y:0}, {x:1, y: exampleFunction(1)} ], {x:"x", y:"y", stroke: "gray"} ),

    Plot.line([ {x:1, y: exampleFunction(1)}, {x:1, y: 0.7} ], {x:"x", y:"y", strokeDasharray: "4", stroke: "gray"} ),

    Plot.line([ {x:0, y:0.7}, {x:1.62, y:0.7} ], {x:"x", y:"y", stroke: "gray"} ),

    // Line
    Plot.line(examplePoints, {x:"x", y:"y",
              stroke:"red", strokeWidth: 4}),
    
  ]
})
Plot.plot({
  grid: true,
  x : { label: "s"}, 
  y : { label: "t"}, 

  marks: [
    // Fill
    Plot.areaY(examplePoints.filter(pt => pt.x <= 1.4), {x:"x", y:"y",
               fill:"lightblue", fillOpacity: 0.5}),

    Plot.areaX(examplePoints.filter(pt => pt.y <= 0.4), {x:"x", y:"y",
               fill:"pink", fillOpacity: 0.5}),

    // Axes
    Plot.ruleX([0]),
    Plot.ruleY([0]),

    Plot.line([ {x:1.4, y:0}, {x:1.4, y: exampleFunction(1.4)} ], {x:"x", y:"y", stroke: "gray"} ),

    Plot.line([ {x:0.55, y: 0.4}, {x:1.4, y: 0.4} ], {x:"x", y:"y", strokeDasharray: "4", stroke: "gray"} ),

    Plot.line([ {x:0, y:0.4}, {x:0.55, y:0.4} ], {x:"x", y:"y", stroke: "gray"} ),

    // Line
    Plot.line(examplePoints, {x:"x", y:"y",
              stroke:"red", strokeWidth: 4}),
    
  ]
})

(a) \(x < y\), with \(x = 1\) and \(y = 0.7\)

(b) \(x > y\), with \(x = 1.4\) and \(y = 0.4\)

Figure 53.2: The functions \(f(x)\) and \(g(y)\) defined in \eqref{eq:legendre-example}. The red curve shows the graph of \(t = h(s)\). The shaded blue region shows the value of \(f(x)\) for a particular value of \(x\). The red curve shown the value of \(g(y)\) for a particular value of \(y\).

The graph of \(t = h(s)\) is shown in Figure 53.2, where the shaded portions represent \(f(x)\) and \(g(y)\). From the plots, we can infer Fenchel-Young’s inequality: \[ xy \le f(x) + g(y) \] with equality if and only if \(y = h(x) = \dot f(x)\). This immediately implies that \[ g(y) = \sup_{x \in \reals} \bigl\{ xy - f(x) \bigr\} \] and \[ f(x) = \sup_{y \in \reals} \bigl\{ xy - g(y) \bigr\}. \]

An alternative view

Suppose \(f\) is a convex function. Then it can be shown that its Legendre transform \(g\) is the unique function such that:

\(f'\) and \(g'\) are inverses of each other, i.e., \(g'(f'(x)) = x\) and \(f'(g'(p)) = p\).
It satisfies the Fenchel inequality with equality, i.e., \(f(x) + g(p(x)) = x p(x)\), for all \(x\)

The first property shows that \(g\) can be computed up to additive constants by solving: \[ g(p) = \int_0^p (f')^{-1}(q) dq + \text{constant} \] where we can use the second property to find the constant.

We illustrate this via the above examples.

In Example 53.1, \(f'(x) = 2x\). Therefore, \((f')^{-1}(x) = x/2\). Hence,

\[ g(p) = \int_{0}^p \frac{q}{2} dq + c = \frac{p^2}{4} + c. \]
In Example 53.2, \(f'(x) = x^{α-1}\). Therefore, \((f')^{-1} = x^{1/(α-1)}\). Hence \[ g(p) = \int_{0}^p q^{1/(α-1)} dq + c = \frac{α-1}{α} p^{α/(α-1)} + c \]
In Example 53.3, \(f'(x) = \log(x) + \log(1-x)\). Therefore, \((f')^{-1}(x) = e^x/(1 + e^x)\). Hence \[ g(p) = \int_{0}^p \frac{e^q}{1 + e^q} dq + c = \log(1 + e^p) + c'. \]

53.2 General definition

Definition 53.1 For any function \(f \colon \reals^n \to \bar {\reals}\), the function \(f^* \colon \reals^n \to \bar \reals\) defined by \[\begin{equation}\label{eq:conjugate} f^*(p) \coloneqq \sup_{x \in \reals^n} \bigl\{ \langle p, x \rangle - f(x) \bigr\} \end{equation}\] is conjugate to \(f\), while the function \(f^{**} = (f^*)^*\) defined by \[\begin{equation}\label{eq:biconjugate} f^{**}(x) \coloneqq \sup_{p \in \reals^n} \bigl\{ \langle p, x \rangle - f^*(p) \bigr\} \end{equation}\] is biconjugate to \(f\). The mapping \(f \mapsto f^*\) is called Legendre-Fenchel transform.

The significance of the conjugate can be understood in terms of epigraph relationships. Note that \eqref{eq:conjugate} implies that \[ (p,β) \in \text{epi } f^* \iff β \ge \langle p, x \rangle - α \text{ for all } (x,α) \in \text{epi } f. \] Let \(\ell_{p,β}(x) \coloneqq \langle p, x \rangle - β\), then we can write the above relationship as \[ (p,β) \in \text{epi } f^* \iff \ell_{p,β} \le f, \] that is, \(f^*\) describes a family of affine functions majorized by \(f\). Similarly, \[ β \ge f^*(p) \iff β \ge \ell_{x,α}(p) \text{ for all } (x,α) \in \text{epi } f. \]

Theorem 53.1 For any function \(f \colon \reals^n \to \bar \reals\) with \(\text{con } f\) proper, both \(f^*\) and \(f^{**}\) are proper, lsc, convex and \[ f^{**} = \text{cl con } f. \] Thus, \(f^{**} \le f\) and when \(f\) is itself proper, lsc, and convex, one has \(f^{**} = f\).

53.3 Properties of Legendre-Fenchel transform

Thus, Legendre-Fenchel transform sets up a one-to-one correspondence in the class of proper, lsc, and convex functions: if \(f\) is conjugate to \(g\), then \(g\) is a conjugate to \(f\): \[ f \xleftrightarrow{\hskip0.5em*\hskip0.5em} g \text { when } \begin{cases} g(p) = \sup_{x \in \reals^n} \bigl\{ \langle p, x \rangle - f(x) \bigr\} \\ f(x) = \sup_{x \in \reals^n} \bigl\{ \langle p, x \rangle - g(p) \bigr\} \end{cases} \]

Given \(f \xleftrightarrow{\hskip0.5em*\hskip0.5em} g\), we have the following properties:

Scaling properties. For any \(λ > 0\), \[\begin{align*} λ f(x) &\xleftrightarrow{\hskip0.5em*\hskip0.5em} λg(λ^{-1} p), \\ f(λx) &\xleftrightarrow{\hskip0.5em*\hskip0.5em} g(λ^{-1} p), \end{align*}\]
Translation properties. \[\begin{align*} f(x) - \langle a, x \rangle &\xleftrightarrow{\hskip0.5em*\hskip0.5em} g(p + a) , \\ f(x + b) &\xleftrightarrow{\hskip0.5em*\hskip0.5em} g(p) - \langle p, b \rangle, \\ f(x) + c &\xleftrightarrow{\hskip0.5em*\hskip0.5em} g(p) - c, \\ \end{align*}\]

Proposition 53.1 Let \(f\) be a finite, coercive, twice differentiable, and strongly convex function, then the conjugate \(g = f^*\) is also finite, coercive, twice differentiable, and strongly convex. Moreover,

the gradient mapping \(\GRAD f\) is one-to-one from \(\reals^n\) to \(\reals^n\), and its inverse is \(\GRAD g\); one has \[\begin{align*} g(p) &= \bigl\langle (\GRAD f)^{-1}(p), p \bigr\rangle - f\bigl((\GRAD f)^{-1}(p)\bigr), \\ f(x) &= \bigl\langle (\GRAD g)^{-1}(x), x \bigr\rangle - g\bigl((\GRAD g)^{-1}(x)\bigr), \end{align*}\]
The matrices \(\GRAD^2 f(x)\) and \(\GRAD^2 g(p)\) are inverse to one another when \(p = \GRAD f(x)\) or, equivalently, \(x = \GRAD g(p)\).

Strongly convex functions on a simplex have the following properties.

Proposition 53.2 Let \(Δ\) be the simplex in \(\reals^n\) and \(f \colon Δ \to \reals\) be twice differentiable and strongly convex. Let \(g \colon \reals^n \to \reals\) be the Legendre-Fenchel conjugate of \(f\). Then,

Unique maximizing argument: \(\GRAD g\) is Lipschitz and satisfies \[\GRAD g(p) = \arg\max_{x \in Δ}\bigl\{ \langle x, p \rangle - f(x) \bigr\}.\]
Boundedness: If there are constants \(L\) and \(U\) such that for all \(x \in Δ\), we have \(L \le f(x) \le U\), then \[ \| p \|_{∞} - U \le g(p) \le \| p \|_{∞} - L. \]
Distributivity: For any \(c \in \reals\), \[ g(p + c \ONES) = g(p) + c. \]
Monotonicity: If \(p_1 \le p_2\), then \(g(p_1) \le g(p_2)\).

Notes

The material on the intuition behind Legendre-Fenchel transform is adapted from Kennerly (2011). The example for Young-Fenchel inequality and Figure 53.2 is taken from Ellis (1985). The material on general definition and properties is adapted from Rockafellar and Wets (2009). Proposition 53.1 is stated as Example 11.9 in Rockafellar and Wets (2009). Proposition 53.2 is from Geist et al. (2019).