Moment Generating Functions

Updated

November 11, 2024

1 Moment Generating Functions

The moment generating function (MGF) of a random variable \(X\) is defined as \[ M_X(s) = \EXP[e^{sX}] \] provided the expectation exists.

  • When \(X\) is discrete, we have \[ M_X(s) = \sum_{x \in \text{range}(X)} e^{sx} p_X(x). \]

  • When \(X\) is continuous, we have \[ M_X(s) = \int_{-∞}^∞ e^{sx} f_X(x) \, dx. \]

Relationship to Laplace transforms

Although most texts (including the textbook) restrict \(s\) to be real, my personal view is that one should really interpret \(s\) as a complex number. If we do so, then we have the following:

  • \(M_X(-s)\) is the Laplace transform of the PDF.
  • \(M_X(-j ω)\) is the Fourier transform of the PDF; equivalently, \(M_X(j ω) = \EXP[e^{jωX}]\) is the characteristic function of \(X\).

Therefore, we can recover the PDF by taking the inverse Laplace transform of the MGF. Thus, specifying the MGF of a random variable is equivalent to specifying the PDF.

Historically, the MGF is defined for \(s \in \reals\), and there are distributions (e.g., Cauchy) for which the MGF does not exist for any \(s \neq 0\). To avoid such situations, one uses the characteristic function, which always exists. However, if we view the domain of the MGF to be \(\mathbb{C}\), then there is no need for a distinction between the MGF and the characteristic function.

Example 1 Suppose \(X\) is a random variable which takes values \(\{0, 1, 2\}\) with probabilities \(\{\frac 12, \frac 13, \frac 16\}\). Then, \[\begin{align*} M_X(s) &= \EXP[e^{sX}] \\ &= \frac 12 e^{s 0} + \frac 13 e^{s 1} + \frac 16 e^{s 2} \\ &= \frac 12 + \frac 13 e^{s} + \frac 16 e^{2s}. \end{align*}\]

Example 2 Find the MGF of a Poisson random variable with parameter \(λ\).

\[\begin{align*} M_X(s) &= \EXP[e^{sX}] = \sum_{k=0}^{∞}e^{ks} \frac{λ^k e^{-λ}}{k!} \\ &= e^{-λ} \sum_{k=0}^{∞} \frac{(λe^s)^k}{k!} \\ &= e^{-λ} e^{λ e^{s}} = e^{λ(e^s - 1)}. \end{align*}\]
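As a quick sanity check of this sum (not from the notes; a minimal sketch using sympy with variable names of my choosing), we can evaluate the series symbolically:

```python
import sympy as sp

s = sp.symbols('s', real=True)
lam = sp.symbols('lambda', positive=True)
k = sp.symbols('k', integer=True, nonnegative=True)

# M_X(s) = e^{-lambda} * sum_{k >= 0} (lambda e^s)^k / k!
mgf = sp.exp(-lam) * sp.summation((lam * sp.exp(s))**k / sp.factorial(k), (k, 0, sp.oo))
print(sp.simplify(mgf))   # equivalent to exp(lambda*(exp(s) - 1))
```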

Example 3 Find the MGF of an exponential random variable with parameter \(λ\).

\[\begin{align*} M_X(s) &= \EXP[e^{sX}] \\ &= \int_{0}^∞ e^{sx} λ e^{-λx} \, dx \\ &= λ \int_{0}^∞ e^{(s-λ)x} \, dx \\ &= \frac{λ}{λ-s}, \qquad \text{for } s < λ. \end{align*}\]

Note that we could have looked up this result in Laplace transform tables, which show that \[ e^{at} \xleftrightarrow{\hskip 0.5em \mathcal{L}\hskip 0.5em } \frac{1}{s-a}. \] Taking \(a = -λ\) and evaluating the transform at \(-s\) (recall that \(M_X(-s)\) is the Laplace transform of the PDF) recovers \(M_X(s) = λ/(λ-s)\).
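The same integral can also be checked symbolically. Below is a minimal sketch (my own, using sympy); to get a clean closed form without convergence caveats, \(s\) is restricted to be negative, which lies inside the region \(s < λ\) where the MGF exists:

```python
import sympy as sp

x = sp.symbols('x', positive=True)
lam = sp.symbols('lambda', positive=True)
s = sp.symbols('s', negative=True)   # s < 0 is inside the region s < lambda

# M_X(s) = integral_0^infty e^{sx} * lambda * e^{-lambda x} dx
M = sp.integrate(sp.exp(s * x) * lam * sp.exp(-lam * x), (x, 0, sp.oo))
print(sp.simplify(M))   # lambda/(lambda - s)
```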

The MGF of common random variables is shown in Table 1.

Table 1: MGF of common random variables
Random variable Parameter(s) MGF
Bernoulli \(p\) \(1 - p + p e^s\)
Binomial \((n,p)\) \((1-p + p e^s)^n\)
Geometric \(p\) \(\dfrac{p e^s}{1 - (1-p)e^s}\)
Poisson \(λ\) \(\exp(λ (e^s - 1))\)
Uniform \((a,b)\) \(\dfrac{e^{sb} - e^{sa}}{s(b-a)}\)
Exponential \(λ\) \(\dfrac{λ}{λ-s}\)
Gaussian \((μ,σ)\) \(\exp\bigl(μ s + \frac 12 σ^2 s^2 \bigr)\)

If there exists a neighborhood of the origin where \(M_X(s)\) is well defined, then we can use the MGF to “generate the moments” of \(X\) as follows:

  1. \(M_X(0) = 1\)

  2. \(\dfrac{d}{ds} M_X(s) \biggr|_{s=0} = \EXP[X]\).

  3. \(\dfrac{d^2}{ds^2} M_X(s) \biggr|_{s=0} = \EXP[X^2]\).

  4. and in general \(\dfrac{d^k}{ds^k} M_X(s) \biggr|_{s=0} = \EXP[X^k]\).

  5. Hence, by Taylor series expansion of \(M_X(s)\) within the radius of convergence, we get \[ M_X(s) = \sum_{k=0}^∞ \frac{\EXP[X^k]}{k!} s^k. \] Thus, if the MGF is well defined in a neighborhood of the origin, knowing all the moments of a distribution is sufficient to construct the MGF. We already saw that the MGF is sufficient to construct the PDF/PMF of the distribution. Thus, the distribution of a random variable is completely characterized by its moments.

The first property follows from definition: \[ M_X(0) = \EXP[e^{0 X}] = \EXP[1] = 1. \]

For the general derivative, exchanging differentiation and integration (which is justified when the MGF is well defined in a neighborhood of the origin), we have \[\begin{align*} \frac{d^k}{ds^k} M_X(s) &= \int_{-∞}^∞ \frac{d^k}{ds^k} e^{sx} f_X(x) \, dx \\ &= \int_{-∞}^∞ x^k e^{sx} f_X(x) \, dx. \end{align*}\]

Therefore \[ \frac{d^k}{ds^k} M_X(s) \biggr|_{s=0} = \int_{-∞}^∞ x^k f_X(x) \, dx. \]

Example 4 Use the MGF of a Bernoulli random variable \(X\) to find all of its moments.

From Table 1, we see that \[ M_X(s) = 1 - p + p e^{s}. \] Therefore,

  • \(\dfrac{d}{ds} M_X(s) = p e^s\).

  • \(\dfrac{d^2}{ds^2} M_X(s) = p e^s\).

  • and in general \(\dfrac{d^k}{ds^k} M_X(s) = p e^s\).

Thus,

  • \(\EXP[X] = \dfrac{d}{ds} M_X(s) \biggr|_{s=0} = p\).

  • \(\EXP[X^2] = \dfrac{d^2}{ds^2} M_X(s) \biggr|_{s=0} = p\).

  • and in general \(\EXP[X^k] = \dfrac{d^k}{ds^k} M_X(s) \biggr|_{s=0} = p\).
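The differentiation in Example 4 is easy to script. The following sketch (my own, sympy-based) differentiates the Bernoulli MGF from Table 1 at \(s = 0\), and, as a second illustration, does the same for the exponential MGF:

```python
import sympy as sp

s = sp.symbols('s', real=True)
p = sp.symbols('p', positive=True)
lam = sp.symbols('lambda', positive=True)

M_bern = 1 - p + p * sp.exp(s)   # Bernoulli MGF from Table 1
M_exp = lam / (lam - s)          # Exponential MGF from Table 1

for k in range(1, 5):
    m_bern = sp.diff(M_bern, s, k).subs(s, 0)
    m_exp = sp.simplify(sp.diff(M_exp, s, k).subs(s, 0))
    print(f"E[X^{k}]: Bernoulli = {m_bern}, Exponential = {m_exp}")
# Bernoulli: every moment equals p, as in Example 4.
# Exponential: the k-th derivative at 0 evaluates to k!/lambda^k.
```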

1.1 Moment generating functions and sums of independent random variables

Theorem 1 Suppose \(X_1, X_2, \dots, X_n\) are independent random variables defined on the same probability space. Let \[ Z = X_1 + X_2 + \dots + X_n \] Then, \[ M_Z(s) = M_{X_1}(s) M_{X_2}(s) \cdots M_{X_n}(s). \]

Furthermore, if the variables are identically distributed, then \[ M_Z(s) = (M_{X_1}(s))^n. \]

The proof follows immediately from properties of independent random variables. We prove the result for \(n=2\): \[ M_Z(s) = \EXP[e^{s(X_1 + X_2)}] = \EXP[e^{sX_1} e^{sX_2}] = \EXP[e^{sX_1}]\, \EXP[e^{sX_2}] = M_{X_1}(s) M_{X_2}(s), \] where the third equality uses independence.

Theorem 1 is a very useful result. An immediate implication of the result is the following:

  1. Sum of i.i.d. Bernoulli random variables is a Binomial random variable.

    Let \(X_i \sim \text{Ber}(p)\). Then \(M_{X_i}(s) = (1 - p + pe^s)\).

    Let \(Z = \sum_{i=1}^n X_i\). Then \(M_Z(s) = (1 - p + pe^s)^n\).

  2. Sum of independent Binomial random variables with the same \(p\) is a Binomial random variable.

    Let \(X_i \sim \text{Binom}(m_i,p)\). Then \(M_{X_i}(s) = (1 - p + p e^s)^{m_i}\).

    Let \(Z = \sum_{i=1}^n X_i\). Then \(M_Z(s) = (1 - p + p e^s)^M\), where \(M = \sum_{i=1}^n m_i\).

  3. Sum of independent Poisson random variables is Poisson.

    Let \(X_i \sim \text{Pois}(λ_i)\). Then \(M_{X_i}(s) = e^{λ_i(e^s - 1)}\).

    Let \(Z = \sum_{i=1}^n X_i\). Then \(M_Z(s) = e^{λ(e^s - 1)}\), where \(λ = \sum_{i=1}^n λ_i\).

  4. Sum of independent Gaussian random variables is Gaussian.

    Let \(X_i \sim \mathcal N(μ_i, σ_i)\). Then, \(M_{X_i}(s) = \exp( μ_i s + \frac 12 σ_i^2 s^2)\).

    Let \(Z = \sum_{i=1}^n X_i\). Then \(M_Z(s) = \exp(μ s + \frac 12 σ^2 s^2)\), where \[ μ = \sum_{i=1}^n μ_i \quad\text{and}\quad σ^2 = \sum_{i=1}^n σ_i^2. \]
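As a numerical illustration of Theorem 1 (here, of implication 3 above), the following Monte Carlo sketch (my own; the parameter values are arbitrary) checks that the empirical PMF of a sum of independent Poisson samples matches the \(\text{Pois}(λ_1 + \cdots + λ_n)\) PMF:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lams = [1.0, 2.5, 0.5]        # lambda_1, lambda_2, lambda_3 (arbitrary choices)
n_samples = 200_000

# Sample Z = X_1 + X_2 + X_3 with independent X_i ~ Pois(lambda_i)
Z = sum(rng.poisson(lam, size=n_samples) for lam in lams)

# Compare the empirical PMF of Z with the Pois(lambda_1 + lambda_2 + lambda_3) PMF
ks = np.arange(15)
empirical = np.array([(Z == k).mean() for k in ks])
theoretical = stats.poisson.pmf(ks, mu=sum(lams))
print(np.abs(empirical - theoretical).max())   # should be small (around 1e-3)
```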

2 The Central Limit Theorem

One of the ways in which MGFs are useful is that they allow us to understand the limiting behavior of sums of i.i.d. random variables.

Convergence in distribution

A sequence of random variables \(\{X_n\}_{n \ge 1}\) is said to converge in distribution to a random variable \(X\) (denoted by \(X_n \xrightarrow{D} X\)) if \[ \lim_{n \to \infty} F_{X_n}(x) = F_X(x), \] for all \(x\) where \(F_X\) is continuous.

Example 5 Consider a sequence \(\{X_n\}_{n \ge 1}\) of random variables where \(X_n \sim \mathcal{N}(0,1/n)\). Show that \(X_n \xrightarrow{D} 0\).

Let \(F\) denote the CDF of the constant random variable \(X=0\), i.e., \[ F(x) = \begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases} \]

Let \(Z\) denote the standard Gaussian random variable. Then \[ F_{X_n}(x) = \PR(X_n \le x) = \PR(Z \le \sqrt{n} x) \to \begin{cases} 1, & x > 0 \\ 0, & x < 0 \end{cases} \] Thus, \(F_{X_n}(x)\) converges to \(F(x)\) for all \(x \neq 0\). Recall that the definition of convergence in distribution does not require convergence of \(F_{X_n}(x)\) at points of discontinuity of \(F\). So, \(X_n \xrightarrow{D} 0\).

An important implication of convergence in distribution is that for any continuous bounded function \(g\) \[ \EXP[g(X_n)] \to \EXP[g(X)]. \] For this reason, convergence in distribution is sometimes called weak convergence.

The relationship between PDFs and MGFs implies the following continuity theorem:

Continuity Theorem

Consider a sequence of random variables \(\{X_n\}_{n \ge 1}\). For ease of notation, we use \((F_n, M_n)\) to denote the CDF and MGF of \(X_n\).

  1. If \(F_n \to F\) for some CDF \(F\) with MGF \(M\), then \(M_n(s) \to M(s)\) for all \(s\).
  2. Conversely, if \(M(s) = \lim_{n\to ∞} M_n(s)\) exists and is continuous at \(s = 0\), then \(M\) is the MGF of some CDF \(F\) and \(F_n \to F\).

The proof is a bit involved and technical but makes sense from a signal processing point of view: we know that if a sequence of signals \(x_n \to x\), then \(\mathcal L(x_n) \to \mathcal L(x)\).

2.1 An illustrative example

Let \(\{X_i\}_{i \ge 1}\) be a sequence of i.i.d. \(\text{Ber}(p)\) random variables. Define \[ S_n = X_1 + \cdots + X_n. \] As we saw above, \[ M_{S_n}(s) = (1 - p + p e^s)^n \] which is the MGF of a \(\text{Binom}(n,p)\) random variable.

Write \(q\) for \(1-p\) and \(σ_n^2\) for \(npq\) (which is the variance of \(S_n\)). Define \[ Z_n = \frac{S_n - np}{σ_n}. \] We know that \[\begin{align*} M_{Z_n}(s) &= \EXP[ e^{s (S_n - np)/σ_n} ] = e^{-s n p/σ_n} \EXP[ e^{s S_n/σ_n}] = e^{-s n p/σ_n} M_{S_n}(s/σ_n) \\ &= e^{-s n p/σ_n} (q + p e^{s/σ_n})^n = (q e^{-s p/σ_n} + p e^{s q / σ_n})^n. \end{align*}\] Let’s consider the power series expansion of \(q e^{-sp/σ_n} + pe^{sq/σ_n}\): \[\begin{align*} &q\left(1 - \frac{ps}{σ_n} + \frac{p^2 s^2}{2! σ_n^2} - \frac{p^3 s^3}{3! σ_n^3} + \cdots \right) + p\left(1 + \frac{qs}{σ_n} + \frac{q^2 s^2}{2! σ_n^2} + \frac{q^3 s^3}{3! σ_n^3} + \cdots \right) \\ & \quad= 1 + \frac{pq s^2}{2 σ_n^2} + \frac{pq(q-p) s^3}{6 σ_n^3} + \cdots \end{align*}\] Now, using \(\log(1 + z)^n = n \log(1+z) = n(z - z^2/2 + \cdots)\) with \(z\) denoting the higher-order terms above, we get \[ \log M_{Z_n}(s) = \frac{s^2}{2} + \frac{(q-p) s^3}{6 \sqrt{n p q}} + \text{terms of order $\dfrac 1n$ or smaller}. \] Thus, the logarithm of the MGF of \(Z_n\) matches that of the standard normal (namely \(s^2/2\)) up to terms that tend to zero as \(n\) tends to infinity.

This convergence of \(M_{Z_n}(s)\) to \(e^{s^2/2}\) can be used to rigorously prove that the distribution of the standardized Binomial converges to the standard normal as \(n\) tends to infinity.
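The convergence can also be seen numerically. The sketch below (my own; the values of \(p\) and \(s\) are arbitrary choices) evaluates \(\log M_{Z_n}(s) = n \log(q e^{-sp/σ_n} + p e^{sq/σ_n})\) for increasing \(n\) and compares it with \(s^2/2\):

```python
import numpy as np

p, s = 0.3, 1.0               # arbitrary choices
q = 1 - p
for n in [10, 100, 1_000, 10_000, 100_000]:
    sigma_n = np.sqrt(n * p * q)
    log_mgf = n * np.log(q * np.exp(-s * p / sigma_n) + p * np.exp(s * q / sigma_n))
    print(n, log_mgf)         # approaches s**2 / 2 = 0.5
```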

2.2 Two limit theorems

Building on the above example, we prove a form of the law of large numbers.1

1 Traditionally, these results are proved via the characteristic function and not the MGF. However, as discussed above, my personal view is that we can do everything with MGFs if we define them over complex numbers (as is done in Signals and Systems) rather than real numbers (as is done in probability theory).

Theorem 2 (A law of large numbers) Let \(\{X_n\}_{n \ge 1}\) be a sequence of i.i.d. random variables with finite mean \(μ\). Then, their partial sums \(S_n = X_1 + \cdots + X_n\) satisfy \[ \frac{1}{n} S_n \xrightarrow{D} μ \]

Proof

The theorem asserts that, as \(n \to ∞\), \[ \PR(n^{-1}S_n \le x) \to \begin{cases} 0 & \hbox{if } x < μ, \\ 1 & \hbox{if } x > μ, \end{cases} \] where, as before, convergence is not required at the discontinuity point \(x = μ\).

Let \(M_X\) denote the MGF of \(X_n\). From Theorem 1, we know that \[ M_{n^{-1} S_n}(s) = M_X(s/n)^n. \]

From the Taylor series expansion of \(M_X\), we know that \(M_X(s) = 1 + μ s + o(s)\), so \[ M_X(s/n) = 1 + \frac{μ s}{n} + o\left(\frac{s}{n}\right). \] Therefore, \[M_{n^{-1} S_n}(s) = \left(1 + \frac{μs}{n} + o\left(\frac{s}{n}\right) \right)^n \to \exp(μ s) \quad \text{as} \quad n \to ∞, \] which is the MGF of the constant random variable \(μ\).

We will show later that convergence in distribution to a constant implies convergence in probability. Therefore, the above implies the weak law of large numbers.
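A quick simulation illustrates the law of large numbers. In the sketch below (my own; exponential samples with mean \(μ = 2\) and the tolerance \(\varepsilon\) are arbitrary choices), the fraction of realizations of \(n^{-1}S_n\) falling outside \([μ - \varepsilon, μ + \varepsilon]\) shrinks as \(n\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, eps = 2.0, 0.05           # mean of the samples and an arbitrary tolerance
for n in [10, 100, 1_000, 10_000]:
    # 1000 independent realizations of S_n / n, using exponential samples with mean mu
    sample_means = rng.exponential(scale=mu, size=(1000, n)).mean(axis=1)
    print(n, np.mean(np.abs(sample_means - mu) > eps))   # fraction far from mu shrinks to 0
```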

The above result shows that when \(n\) is large, the sum \(S_n\) is approximately \(n μ\). But how large are the fluctuations of \(S_n\) around \(n μ\)? The answer is provided by the Central Limit Theorem, which asserts that when \(X_n\) has finite variance \(σ^2\):

  • \(S_n - n μ\) is about as big as \(\sqrt{n}\).
  • Irrespective of the distribution of \(X_n\), \((S_n - n μ)/\sqrt{n}\) converges in distribution to a normal distribution with variance \(σ^2\).

We prove the result below.

Theorem 3 (Central Limit Theorem) Let \(\{X_n\}_{n \ge 1}\) be i.i.d. random variables with mean \(μ\) and finite non-zero variance \(σ^2\). Then, their partial sums \(S_n = X_1 + \cdots + X_n\) satisfy \[ \frac{S_n - n μ}{\sqrt{n}\, σ} \xrightarrow{D} {\cal N}(0,1). \]

Proof

The proof idea is similar to that of Theorem 2. Consider the “normalized” sequence of random variables \(Y_n = (X_n - μ)/σ\), which have mean zero and unit variance. Let \(M_Y\) denote their MGF. Then, the Taylor series expansion of \(M_Y\) gives \[ M_Y(s) = 1 + \frac{1}{2} s^2 + o(s^2). \] Moreover, observe that \[ Z_n \coloneqq \frac{S_n - n μ }{\sqrt{n}\, σ} = \frac{1}{\sqrt{n}} \sum_{i=1}^n Y_i. \] Therefore, by Theorem 1, we have \[ M_{Z_n}(s) = M_Y(s/\sqrt{n})^n = \left(1 + \frac {s^2}{2n} + o\left( \frac {s^2}{n} \right) \right)^n \to \exp(\tfrac 12 s^2), \] which is the MGF of \({\cal N}(0,1)\).
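The CLT can likewise be illustrated by simulation. The sketch below (my own; exponential samples and the chosen evaluation points are arbitrary) compares the empirical CDF of \(Z_n\) with the standard normal CDF:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 100, 50_000
mu, sigma = 1.0, 1.0          # mean and standard deviation of Exp(1)

# Z_n = (S_n - n*mu) / (sqrt(n)*sigma), with S_n a sum of n i.i.d. Exp(1) samples
S_n = rng.exponential(scale=1.0, size=(reps, n)).sum(axis=1)
Z_n = (S_n - n * mu) / (np.sqrt(n) * sigma)

# Compare the empirical CDF of Z_n with the standard normal CDF at a few points
for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(x, (Z_n <= x).mean(), stats.norm.cdf(x))
```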

The central limit theorem is one of the cornerstones of probability theory. The earliest statement of the result goes back to de Moivre (1733), but there was little follow-up until Laplace’s Théorie analytique des probabilités (1812). The term “central limit theorem” is due to Pólya (1920), who called it such because the limit theorem plays a “central role in probability theory”.

3 Moment generating function of random vectors

For a random vector \(X \in \reals^n\), the MGF is a function \(M_X \colon \mathbb{C}^n \to \mathbb{C}\), given for any \(s \in \mathbb{C}^n\) by

\[M_X(s) = \EXP[e^{ \langle s, X \rangle }]. \]
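As an illustration (my own sketch; the point \(s\), taken to be real, and the choice of a two-dimensional standard Gaussian are arbitrary), we can estimate \(M_X(s)\) by Monte Carlo and compare it with the value obtained by factorizing over the independent components and using Table 1:

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.array([0.3, -0.5])                 # an arbitrary (real) point s in R^2

# X has two independent N(0, 1) components. The expectation factorizes over the
# components (the same independence property used in the proof of Theorem 1), and
# each factor is the Gaussian MGF from Table 1 with mu = 0, sigma = 1:
#   M_X(s) = exp(s_1^2 / 2) * exp(s_2^2 / 2) = exp(||s||^2 / 2)
X = rng.standard_normal(size=(500_000, 2))
print(np.mean(np.exp(X @ s)), np.exp(0.5 * s @ s))   # the two values should be close
```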

TODO: Multivariate Gaussian. Relationship with correlation and covariance.