Introduction to Probability

Updated

October 7, 2024

1 Background

This is a graduate course on probability and random signals. I assume that everyone is familiar with the basics of undergraduate probability. For example, you should be able to answer the following questions:

  • A fair 6-sided die is rolled twice. What is the probability that the sum of the rolls equals 7?
  • A biased coin with \(\PR({\rm heads}) = 3/4\) is tossed 10 times. What is the probability of obtaining 3 consecutive heads?
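If you want to sanity-check your answers to these warm-up questions, a brute-force enumeration is often enough. Below is a minimal Python sketch (not part of the course material); it reads "3 consecutive heads" as "at least three heads in a row somewhere in the 10 tosses", which is one possible interpretation of the question.

```python
from fractions import Fraction
from itertools import product

# Question 1: probability that two fair six-sided dice sum to 7.
rolls = list(product(range(1, 7), repeat=2))
p_sum_7 = Fraction(sum(1 for a, b in rolls if a + b == 7), len(rolls))
print(p_sum_7)  # 1/6

# Question 2: probability of at least 3 consecutive heads in 10 tosses
# of a biased coin with P(heads) = 3/4, by enumerating all 2^10 sequences.
p_heads = Fraction(3, 4)
answer = Fraction(0)
for seq in product("HT", repeat=10):
    if "HHH" in "".join(seq):
        prob = Fraction(1)
        for s in seq:
            prob *= p_heads if s == "H" else 1 - p_heads
        answer += prob
print(answer, float(answer))
```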

You should also be familiar with the following concepts:

  • Random variables, probability distributions, and expectations
  • Conditional distributions and independent random variables

Some of you might have also seen the following concepts:

  • Law of large numbers
  • Central limit theorem

In this course, we will revisit these topics with a more formal approach. We start with a review of the basic concepts.

2 Review of Set Theory

2.1 Basic set operations

A set is a collection of objects. We say that a set \(B\) is a subset of set \(A\) (written as \(B \subseteq A\)) if all elements of \(B\) are also elements of \(A\). We say that \(B\) is a proper subset (written \(B \subsetneq A\)) if \(B \subseteq A\) and \(B \neq A\).

Exercise 1 Let \(A = \{1, 2, 3\}\). Find all subsets of \(A\).

The set of all subsets of \(A\) is also called the power set of \(A\) (denoted by \(2^A\)). The notation \(2^A\) reflects the fact that the power set of \(A\) contains \(2^{|A|}\) elements. For example, your answer to Exercise 1 must have \(2^3 = 8\) elements.
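If you want to check your answer programmatically, here is a small sketch (in Python, purely for illustration) that enumerates the power set of a finite set:

```python
from itertools import chain, combinations

def power_set(s):
    """Return all subsets of s as a list of frozensets."""
    s = list(s)
    return [frozenset(c)
            for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

A = {1, 2, 3}
subsets = power_set(A)
print(len(subsets))                                   # 2**3 = 8
for S in sorted(subsets, key=lambda S: (len(S), sorted(S))):
    print(set(S) if S else "∅")                       # ∅, {1}, {2}, {3}, ..., {1, 2, 3}
```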

Given two sets \(A\) and \(B\), we define the set difference \(A\setminus B\) to be all elements of \(A\) not in \(B\). Note that mathematically \(A \setminus B\) is well defined even if \(B \not\subseteq A\). In particular \[ A \setminus B = A \setminus (A \cap B). \]
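The identity \(A \setminus B = A \setminus (A \cap B)\) is easy to verify with Python's built-in set operations; the particular sets below are arbitrary:

```python
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}              # note: B is not a subset of A

lhs = A - B                   # set difference A \ B
rhs = A - (A & B)             # difference with the intersection only
print(lhs, rhs, lhs == rhs)   # {1, 2} {1, 2} True
```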

Exercise 2 Compute \(A \setminus B\) for the following:

  • \(A = \{1,2,3,4\}\) and \(B = \{1, 2\}\).
  • \(A = \{1,2,3,4\}\) and \(B = \{1, 2, 5\}\).

Given a collection \(\{A_1, A_2, \dots, A_n\}\) of sets, we define two operations:

  • Union \(A_1 \cup A_2 \cup \cdots \cup A_n\) as follows \[ \bigcup_{i=1}^n A_i = \{ a: a \in A_i \text{ for some } i \}. \] This means that an element belongs to \(A_1 \cup A_2 \cup \cdots \cup A_n\) if it belongs to at least one of \(A_1\), \(A_2\), \(\ldots\), \(A_n\).

  • Intersection \(A_1 \cap A_2 \cap \cdots \cap A_n\) as follows \[ \bigcap_{i=1}^n A_i = \{ a: a \in A_i \text{ for all } i \} \] This means that an element belongs to \(A_1 \cap A_2 \cap \cdots \cap A_n\) if it belongs to all of \(A_1\), \(A_2\), \(\ldots\), \(A_n\).

A collection \(\{A_1, A_2, \dots, A_n\}\) is disjoint if for every \(i \neq j\), \(A_i \cap A_j = \emptyset\), where \(\emptyset\) denotes the empty set.

Given a universal set \(Ω\) and a collection \(\{B_1, B_2, \dots, B_m\}\) of subsets of \(Ω\), we say that \(\{B_1, B_2, \dots, B_m\}\) is a partition of \(Ω\) if \(\{B_1, B_2, \dots, B_m\}\) are disjoint and their union equals \(Ω\).

Figure 1: Example of a partition

Example 1 Let \(Ω = \{1,2,3,4\}\). The following are partitions of \(Ω\):

  • \(\{ \{1\}, \{2\}, \{3\}, \{4\} \}\).
  • \(\{ \{1, 2\}, \{3, 4\} \}\).
  • \(\{ \{1\}, \{2, 3\}, \{4\} \}\).

The following are not partitions of \(Ω\). [Explain why.]

  • \(\{ \{1\}, \{2\}, \{3\} \}\).
  • \(\{ \{1, 2, 3\}, \{3, 4\} \}\).
  • \(\{ \{1\}, \{2, 3\}, \{4, 5\} \}\).

In most of our discussion, we will work with a pre-specified universal set \(Ω\). In this setting, we use \(A^c\) (read as the complement of \(A\)) as shorthand for \(Ω\setminus A\).

Partitions are useful because they allow breaking up a set into disjoint pieces. In particular, suppose \(\{B_1, \dots, B_m\}\) is a partition and \(A\) is any subset of \(Ω\).

Then, \[ A = (A \cap B_1) \cup (A \cap B_2) \cup \cdots \cup (A \cap B_m) \] where the components \(A \cap B_1, \dots, A \cap B_m\) are pairwise disjoint.
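This decomposition can be checked mechanically on a small example. The following Python sketch (the choice of \(Ω\), the partition, and \(A\) is arbitrary) recovers \(A\) from its disjoint pieces:

```python
Omega = {1, 2, 3, 4, 5, 6}
partition = [{1, 2}, {3, 4}, {5, 6}]      # a partition of Omega
A = {2, 3, 5, 6}                          # an arbitrary subset of Omega

pieces = [A & B for B in partition]       # the sets A ∩ B_i
print(pieces)
print(set().union(*pieces) == A)          # their union recovers A
print(all(not (pieces[i] & pieces[j])     # and they are pairwise disjoint
          for i in range(len(pieces)) for j in range(i + 1, len(pieces))))
```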

2.2 Properties of set operations

  • Commutative \[A \cup B = B \cup A \quad\text{and}\quad A \cap B = B \cap A\]

  • Associative \[A \cup (B \cup C)= (A \cup B) \cup C \quad\text{and}\quad A \cap (B \cap C)= (A \cap B) \cap C\]

  • Distributive \[A \cup (B \cap C) = (A \cup B) \cap (A \cup C) \quad\text{and}\quad A \cap (B \cup C) = (A \cap B) \cup (A \cap C)\]

  • De Morgan’s Law \[(A \cup B)^c = A^c \cap B^c \quad\text{and}\quad (A \cap B)^c = A^c \cup B^c\]
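These identities are easy to verify on randomly generated subsets of a small universal set. The sketch below (Python, purely illustrative) checks all four on a thousand random triples:

```python
import random

Omega = set(range(10))
rng = random.Random(0)

def random_subset():
    return {x for x in Omega if rng.random() < 0.5}

for _ in range(1000):
    A, B, C = random_subset(), random_subset(), random_subset()
    assert A | (B & C) == (A | B) & (A | C)                # distributive laws
    assert A & (B | C) == (A & B) | (A & C)
    assert Omega - (A | B) == (Omega - A) & (Omega - B)    # De Morgan's laws
    assert Omega - (A & B) == (Omega - A) | (Omega - B)
print("all identities hold on the sampled sets")
```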

Exercise 3 Use distributive property to simplify:

  • \([1,4] \cap ([0,2] \cup [3,5])\).
  • \([2,4] \cup ([3,5] \cap [1,4])\).

2.3 An algebra (or field) on sets

Given a universal set \(Ω\), a collection \(\ALPHABET F = \{F_1, F_2, \dots, F_m\}\) of subsets of \(Ω\) is called an algebra if it satisfies the following properties:

  • \(\emptyset \in \ALPHABET F\) and \(Ω \in \ALPHABET F\).
  • Closed under complements: if \(A \in \ALPHABET F\) then \(A^c \in \ALPHABET F\).
  • Closed under finite unions and finite intersections: if \(A_1, \dots, A_n \in \ALPHABET F\), then \[ A_1 \cup A_2 \cup \cdots \cup A_n \in \ALPHABET F \quad\text{and}\quad A_1 \cap A_2 \cap \cdots \cap A_n \in \ALPHABET F \]

We will sometimes use the notation “\((Ω,\ALPHABET F)\) is an algebra of sets” or “\(\ALPHABET F\) is an algebra on \(Ω\)”. Some examples of algebras are as follows:

  • The smallest algebra associated with \(Ω\) is \(\{\emptyset, Ω\}\).
  • If \(A\) is any subset of \(Ω\), then \(\{\emptyset, A, A^c, Ω\}\) is an algebra.
  • For any set \(Ω\), the power-set \(2^Ω\) is an algebra on \(Ω\). As an illustration, check that the power set defined in Exercise 1 is an algebra.

These are all examples of a general principle: the power-set of any partition of a set is an algebra (where each collection of blocks is identified with the union of its elements).

  • If the partition is \(\{Ω\}\), then the power-set \(\{ \emptyset, Ω \}\) is an algebra.
  • If the partition is \(\{A, A^c\}\), then the power-set \(\{ \emptyset, A, A^c, Ω\}\) is an algebra.
  • If the partition is the collection of all singleton subsets of a set, then the power-set \(2^Ω\) is an algebra.
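For a finite \(Ω\), this construction can be carried out explicitly. The following Python sketch (the function name and the particular partition are chosen only for illustration) forms all unions of partition blocks and checks the defining properties of an algebra:

```python
from itertools import chain, combinations

def algebra_from_partition(blocks):
    """All unions of blocks of a partition (the empty union gives ∅)."""
    blocks = list(blocks)
    return {frozenset().union(*combo)
            for combo in chain.from_iterable(combinations(blocks, r)
                                             for r in range(len(blocks) + 1))}

partition = [frozenset({1}), frozenset({2, 3}), frozenset({4})]
F = algebra_from_partition(partition)
Omega = frozenset({1, 2, 3, 4})

print(len(F))                                  # 8 = 2**(number of blocks)
print(frozenset() in F and Omega in F)         # contains ∅ and Ω
print(all(Omega - A in F for A in F))          # closed under complements
print(all(A | B in F for A in F for B in F))   # closed under (finite) unions
```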

3 Probability Space

Probability is a measure of uncertainty or of our belief that a particular statement is true. In this course, we will not concern ourselves with how such a measure of uncertainty is constructed; rather, we focus on the mathematical properties that it should satisfy and the implications of these properties.

Many everyday statements take the form: “The chance (or probability) of \(A\) is \(p\)”, where \(A\) is some event (such as “sun shining tomorrow”, “Team A winning a hockey game”, etc.) and \(p\) is a number (e.g., \(1/8\)) or an adjective describing quantity (e.g., “low”).

To mathematically model such statements, we need to model the sequence of events that may lead to the occurrence of \(A\): this is called a random experiment; the result of an experiment is called an outcome.

In general, the outcome of an experiment is not certain. We can only talk about the collection of possible outcomes. The collection of possible outcomes of an experiment is called the sample space and denoted by \(Ω\).

Exercise 4 What is the sample space for the toss of a coin?

\(Ω = \{ H, T \}\).

Exercise 5 What is the sample space for the roll of a (6-sided) die?

\(Ω = \{ 1,2,3,4,5,6 \}\).

An event is any subset of the sample space. If the outcome of the random experiment belongs to the event \(A\), we say that “the event \(A\) has occurred”. Some examples of events are:

  • Head occurs in Exercise 4 (\(A = \{H\}\))
  • Both head and tail occur in Exercise 4 (\(A = \emptyset\); this is an event that cannot happen, sometimes called the impossible event)
  • An even number is thrown in Exercise 5 (\(A = \{2,4,6\}\)).

Note that events are subsets of the sample space, but not all subsets of a sample space may be events. The reasons are too technical to explain here, but the high-level explanation is that everything is okay for discrete sample spaces, while weird things can happen in continuous sample spaces.

Probability (denoted by \(\PR\)) is a function which assigns a number between \(0\) and \(1\) to every event. This number indicates the chance that the event occurs. Such a function should satisfy some axioms, which we explain below.

First, to define a function, we need to specify its domain and range. Let’s denote the domain (i.e., the set of all events to which we can assign a probability) by \(\ALPHABET F\). We expect probability to satisfy certain properties, which imposes constraints on the domain:

  • Probability of an impossible event (e.g., getting both heads and tails when we toss a coin) should be zero. Thus, \(\emptyset \in \ALPHABET F\).

  • Probability of something happening (e.g., getting either a head or a tail when we toss a coin) should be one. Thus, \(Ω \in \ALPHABET F\).

  • If we assign probability to an event \(A\) then we should be able to assign probability to “\(A\) does not occur” i.e., \(A^c\). Thus, if \(A \in \ALPHABET F\) then \(A^c \in \ALPHABET F\).

  • If we can talk about probability of \(A\) and \(B\), then we should be able to talk about probability that either \(A\) or \(B\) occurs and both \(A\) and \(B\) occur. Thus, if \(A, B \in \ALPHABET F\), then \(A \cup B \in \ALPHABET F\) and \(A \cap B \in \ALPHABET F\).

Thus, the domain of \(\PR\) must be an algebra! However, when we go beyond finite sample spaces, being an algebra is not sufficient. But we first provide some examples of probability for finite sample spaces.

Example 2 Consider a six-sided die where \(Ω = \{1, 2, \dots, 6 \}\), \(\ALPHABET F = 2^Ω\) and \(\PR\) is given by \[ \PR(A) = \sum_{ω \in A} q(ω), \quad \forall A \in \ALPHABET F \] where \[ q(1) = q(2) = q(3) = q(4) = q(5) = \frac 2{15} \quad\text{and}\quad q(6) = \frac{1}{3}. \]

Verify that

  • \(\PR(Ω) = 1\)
  • \(\PR(\{3,4,5\}) = \frac{6}{15}\).
  • \(\PR(\{1,3,4,5\}) = \frac{8}{15}\).
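These values can be checked mechanically. A minimal Python sketch of Example 2 using exact rational arithmetic:

```python
from fractions import Fraction

Omega = {1, 2, 3, 4, 5, 6}
q = {w: Fraction(2, 15) for w in range(1, 6)}   # q(1) = ... = q(5) = 2/15
q[6] = Fraction(1, 3)

def P(A):
    return sum(q[w] for w in A)                 # P(A) = Σ_{ω ∈ A} q(ω)

print(P(Omega))            # 1
print(P({3, 4, 5}))        # 6/15 = 2/5
print(P({1, 3, 4, 5}))     # 8/15
```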

We now come back to the fact that restricting the domain of \(\PR\) to be an algebra is not sufficient as is illustrated by the following example.

Example 3 A coin is tossed repeatedly until a head turns up. The sample space is \(Ω = \{ω_1, ω_2, \dots\}\) where \(ω_n\) denotes the outcome in which the first \(n-1\) tosses are tails followed by a head.

Suppose we are interested in finding the probability of the event that the coin is tossed an even number of times, i.e., \(A = \{ω_2, ω_4, \dots\}\). Note that the singleton events \(\{ω_2\}, \{ω_4\}, \dots\) belong to \(\ALPHABET F\). However, \(A\) is a countably infinite union of these events. If we want to assign probability to \(A\) in terms of the probabilities of the \(ω_n\), we require \(\ALPHABET F\) to be closed under countable unions. This motivates the following definition.

\(σ\)-algebra

Given a universal set \(Ω\), a collection \(\ALPHABET F = \{F_1, F_2, \dots\}\) of subsets of \(Ω\) is called a \(\boldsymbol{σ}\)-algebra if it satisfies the following properties:

  • \(\emptyset \in \ALPHABET F\) and \(Ω \in \ALPHABET F\).
  • Closed under complements: if \(A \in \ALPHABET F\) then \(A^c \in \ALPHABET F\).
  • Closed under countable unions: if \(A_1, A_2, \dots \in \ALPHABET F\), then \[ \bigcup_{n=1}^∞ A_n \in \ALPHABET F \]

The distinction between algebras and \(σ\)-algebras is technical. The reason that we need to consider \(σ\)-algebras has to do with the definition of probability on continuous sample spaces. Take \(Ω = [0,1]\) and consider a random experiment where “any outcome is equally likely”. Intuitively, we capture this feature by assuming that for any interval \([a,b]\) with \(0 \le a \le b \le 1\), we have \[\begin{equation}\label{eq:uniform} \PR([a,b]) = b - a. \end{equation}\]

We have seen that if we want \(\PR\) to be a meaningful measure, the domain \(\ALPHABET F\) must at least be an algebra. We have also seen that the power-set \(2^Ω\) is always an algebra. So, it is tempting to take \(\ALPHABET F = 2^{[0,1]}\). However, it turns out that \(2^{[0,1]}\) includes some weird sets (technically, non-measurable sets) due to which we cannot define a function \(\PR\) on \(2^{[0,1]}\) that satisfies \(\eqref{eq:uniform}\).

To work around this technical limitation, we revisit the minimum requirements that we need from the domain of \(\PR\). Since we are interested in \(\PR([a,b])\), \(\ALPHABET F\) must contain intervals (and therefore all finite unions and intersections of intervals). Since we are working with continuous sample spaces, we also want \(\PR\) to be continuous, i.e., for any monotone sequence of sets \(\{A_n\}_{n \ge 1}\), we want \(\PR(\lim_{n \to ∞} A_n) = \lim_{n \to ∞} \PR(A_n)\). It turns out that the additional requirement of continuity implies that \(\ALPHABET F\) must be closed under countable unions as well. Thus, the domain \(\ALPHABET F\) must at least be a \(σ\)-algebra.

So, we restrict to the simplest choice of the domain \(\ALPHABET F\) needed for \(\eqref{eq:uniform}\) and continuity to hold. For technical reasons, we need another property known as completeness. See Sec. 1.6 of the textbook.

\(σ\)-algebra generated by a collection and Borel \(σ\)-algebra

Given a collection \(\ALPHABET S\) of subsets of \(Ω\), we have the following:

  • The power-set \(2^Ω\) contains \(\ALPHABET S\). Therefore, there is at least one \(σ\)-algebra containing \(\ALPHABET S\).
  • If \(\ALPHABET F_1\) and \(\ALPHABET F_2\) are \(σ\)-algebras containing \(\ALPHABET S\), then \(\ALPHABET F_1 \cap \ALPHABET F_2\) also contains \(\ALPHABET S\).

Thus, if we take the intersection of all \(σ\)-algebras containing \(\ALPHABET S\), we get the smallest \(σ\)-algebra containing \(\ALPHABET S\), which is sometimes denoted by \(σ(\ALPHABET S)\).
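For a finite \(Ω\) (where \(σ\)-algebras and algebras coincide), the smallest \(σ\)-algebra containing a collection can be constructed by brute force. Here is a Python sketch under that finiteness assumption; the function name and the example collection are just for illustration:

```python
def generated_sigma_algebra(Omega, S):
    """Smallest σ-algebra on a finite Ω containing the collection S,
    obtained by repeatedly closing under complements, unions, and intersections."""
    Omega = frozenset(Omega)
    F = {frozenset(), Omega} | {frozenset(A) for A in S}
    changed = True
    while changed:
        changed = False
        for A in list(F):
            for B in list(F):
                for new in (Omega - A, A | B, A & B):
                    if new not in F:
                        F.add(new)
                        changed = True
    return F

F = generated_sigma_algebra({1, 2, 3, 4}, [{1}])
for A in sorted(F, key=lambda A: (len(A), sorted(A))):
    print(set(A) if A else "∅")   # ∅, {1}, {2, 3, 4}, Ω — i.e. σ(A) = {∅, A, A^c, Ω}
```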

One commonly used \(σ\)-algebra is the Borel \(σ\)-algebra, which is defined as follows.1 Let \(Ω\) be a subset of \(\reals\) and \(\ALPHABET S\) be the collection of all open intervals in \(Ω\). Then \(σ(\ALPHABET S)\) is called the “Borel \(σ\)-algebra on Ω” and often denoted by \(\mathscr{B}(Ω)\).

1 Borel \(σ\)-algebra is usually defined for any topological space. We restrict our definition to subsets of reals.

Definition 1 (Probability space) A probability space is a tuple \((Ω, \ALPHABET F, \PR)\) consisting of a set \(Ω\), a \(σ\)-algebra \(\ALPHABET F\) on \(Ω\), and a function \(\PR \colon \ALPHABET F \to [0,1]\) that satisfies the following axioms of probability:

  1. Non-negativity. \(\PR(A) \ge 0\).
  2. Normalization. \(\PR(Ω) = 1\).
  3. Countable additivity. If \(A_1, A_2, \dots \in \ALPHABET F\) is a countable collection of disjoint events, then \[ \PR\biggl( \bigcup_{n=1}^∞ A_n \biggr) = \sum_{n=1}^∞ \PR(A_n). \]

Some immediate implications of the axioms of probability are the following.

Lemma 1 (Properties of probability measures)  

  1. Probability of complement. \(\PR(A^c) = 1 - \PR(A)\).

  2. Monotonicity. If \(A \subseteq B\), then \(\PR(B) = \PR(A) + \PR(B \setminus A) \ge \PR(A)\).

  3. Inclusion-exclusion. Given two events \(A\) and \(B\), \[ \PR(A \cup B) = \PR(A) + \PR(B) - \PR(A \cap B). \]

  4. Continuity. Let \(A_1, A_2, \dots\) be a (weakly) increasing sequence of events, i.e., \(A_1 \subseteq A_2 \subseteq A_3 \subseteq \cdots\). Define \[ A = \lim_{n \to ∞} A_n = \bigcup_{n=1}^∞ A_n. \] Then, \[ \PR(A) = \lim_{n \to ∞} \PR(A_n). \]

    Similarly, let \(B_1, B_2, \dots\) be a (weakly) decreasing sequence of events, i.e., \(B_1 \supseteq B_2 \supseteq B_3 \supseteq \cdots\). Define \[ B = \lim_{n \to ∞} B_n = \bigcap_{n=1}^∞ B_n. \] Then, \[ \PR(B) = \lim_{n \to ∞} \PR(B_n). \]

  5. Union bound. For any sequence of events \(\{A_n\}_{n \ge 1}\), we have \[ \PR\biggl( \bigcup_{n=1}^{∞} A_n \biggr) \le \sum_{n=1}^{∞} \PR(A_n). \]

The proof of parts 1–3 is elementary and left as an exercise. Part 4 is more technical and is essentially equivalent to countable additivity. See the textbook for a proof. The union bound (part 5) is an immediate consequence of inclusion-exclusion and continuity.
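Parts 1, 3, and 5 can also be verified numerically on a toy finite probability space (continuity genuinely needs infinite sequences, so it is not checked here). A Python sketch with arbitrarily chosen events:

```python
from fractions import Fraction
from itertools import product

# Ω = {1,...,6}² with the uniform measure (two fair dice).
Omega = list(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(sum(1 for w in Omega if event(w)), len(Omega))

A = lambda w: w[0] + w[1] <= 5        # an arbitrary event
B = lambda w: w[0] == w[1]            # another arbitrary event

print(P(lambda w: not A(w)) == 1 - P(A))                        # complement rule
print(P(lambda w: A(w) or B(w))
      == P(A) + P(B) - P(lambda w: A(w) and B(w)))              # inclusion-exclusion
print(P(lambda w: A(w) or B(w)) <= P(A) + P(B))                 # union bound
```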

Example 4 Let \(Ω = [0,1]\), \(\ALPHABET F = \mathscr B[0,1]\), and \(\PR\) be any probability measure on \((Ω, \ALPHABET F)\). Take any \(a \in (0,1)\).

  • Consider \(A_n = \bigl[0, a - \frac 1n \bigr)\). Then \(A = \lim_{n \to ∞} A_n = [0, a)\).
  • Consider \(A_n = \bigl[0, a - \frac 1n \bigr]\). Then \(A = \lim_{n \to ∞} A_n = [0, a)\).
  • Consider \(B_n = \bigl[0, a + \frac 1n \bigr)\). Then \(B = \lim_{n \to ∞} B_n = [0,a]\).
  • Consider \(B_n = \bigl[0, a + \frac 1n \bigr]\). Then \(B = \lim_{n \to ∞} B_n = [0,a]\).

In these examples, continuity implies that \(\PR(A) = \lim_{n \to ∞} \PR(A_n)\) and \(\PR(B) = \lim_{n \to ∞} \PR(B_n)\).

Some terminology
  • An event \(A\) is called null if \(\PR(A) = 0\). A null event should not be confused with the impossible event \(\emptyset\): a null event need not be empty (see Example 5 below).

  • We say that \(A\) occurs almost surely (abbreviated to a.s.) if \(\PR(A) = 1\).

Example 5 Consider \(Ω = [0,1]\), \(\ALPHABET F = \mathscr B([0,1])\), and \(\PR\) to be the uniform probability distribution on \(Ω\). Consider the event \(A\) that the outcome is a rational number. \(A\) is a countable set (because the set of rational numbers is countable). For any \(x \in A\), \(\{x\} \in \ALPHABET F\), and \(\PR(\{x\}) = 0\) (we can infer this from Example 4 by thinking of \(\{x\}\) as the limit of intervals \(\bigl[x, x+ \frac 1n\bigr]\)). Thus, by countable additivity, \(\PR(A) = 0\). Hence, \(A\) is null.

The above analysis implies that \(\PR(A^c) = 1\), thus the event that the outcome is irrational occurs almost surely.

4 Conditional Probability

Conditional probabilities quantify the uncertainty of an event when it is known that another event has occurred.

Definition 2 Let \((Ω,\ALPHABET F, \PR)\) be a probability space and \(A, B \in \ALPHABET F\) such that \(\PR(B) > 0\). Then, the conditional probability that \(A\) occurs given that \(B\) occurs is defined as \[ \PR(A | B) = \dfrac{ \PR(A \cap B) }{ \PR(B) }. \]

The notation \(\PR(A | B)\) is read as “probability of \(A\) given \(B\)” or “probability of \(A\) conditioned on \(B\)”.

Exercise 6 Suppose we roll a fair six-sided die (a fair die means that all outcomes are equally likely). Consider the events \(A\) that the outcome is prime and \(B\) that the outcome is a multiple of \(3\). Compute \(\PR(A | B)\) and \(\PR(B | A)\).

We have \(Ω = \{1, 2, 3, 4, 5, 6\}\), \(A = \{2, 3, 5\}\), and \(B = \{3, 6\}\). Note that \(\PR(A) = \frac 12\) and \(\PR(B) = \frac 13\).

Thus, \[ \PR(A | B) = \frac{ \PR(A \cap B) }{ \PR(B) } = \frac{ \PR(\{3\}) }{ \PR(\{3,6\}) } = \frac{ \ABS{\{3\}} }{ \ABS{\{3,6\}} } = \frac {1}{2}. \] Similarly, \[ \PR(B | A) = \frac{ \PR(B \cap A) }{ \PR(A) } = \frac{ \PR(\{3\}) }{ \PR(\{2,3,5\}) } = \frac{ \ABS{\{3\}} }{ \ABS{\{2,3,5\}} } = \frac {1}{3}. \]
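The same computation by direct counting, as a Python sketch:

```python
from fractions import Fraction

Omega = {1, 2, 3, 4, 5, 6}
A = {2, 3, 5}            # outcome is prime
B = {3, 6}               # outcome is a multiple of 3

def P(E):                # uniform measure on Ω
    return Fraction(len(E), len(Omega))

print(P(A & B) / P(B))   # P(A | B) = 1/2
print(P(A & B) / P(A))   # P(B | A) = 1/3
```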

Exercise 7 Suppose we roll two fair six-sided dice. Consider the event \(A\) that the sum of the two rolls is less than or equal to \(8\) and the event \(B\) that the sum of the two rolls is greater than or equal to \(6\). Compute \(\PR(A|B)\) and \(\PR(B|A)\).

Note that \(Ω = \{1, 2, 3, 4, 5, 6\}^2\) and \(\PR\) is the uniform probability on all outcomes. The sets \(A\), \(B\), and \(A \cap B\) are shown in Figure 2. Note that \(\PR(A) = \PR(B) = \frac{26}{36} = \frac{13}{18}\).

Figure 2: The different events in Exercise 7

Thus, we have \[ \PR(A|B) = \frac{ \PR(A \cap B) } { \PR(B) } = \frac{ \ABS{ A \cap B} }{ \ABS{B} } = \frac{16}{26} = \frac{8}{13} \] and \[ \PR(B|A) = \frac{ \PR(B \cap A) } { \PR(A) } = \frac{ \ABS{ B \cap A} }{ \ABS{A} } = \frac{16}{26} = \frac{8}{13} \]
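The counts above can be reproduced by enumerating the \(36\) outcomes; a Python sketch:

```python
from fractions import Fraction
from itertools import product

Omega = list(product(range(1, 7), repeat=2))
A = {w for w in Omega if w[0] + w[1] <= 8}    # sum of the two rolls ≤ 8
B = {w for w in Omega if w[0] + w[1] >= 6}    # sum of the two rolls ≥ 6

def P(E):                                     # uniform measure on Ω
    return Fraction(len(E), len(Omega))

print(P(A), P(B))            # both 13/18
print(P(A & B) / P(B))       # P(A | B) = 8/13
print(P(A & B) / P(A))       # P(B | A) = 8/13
```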

Conditional probabilities are probabilities

Conditional probabilities are legitimate probability measures on \((Ω, \ALPHABET F)\). In particular, fix event \(B\) with \(\PR(B) > 0\). Then

  • \(\PR(A \mid B) \ge 0\).
  • \(\PR(Ω \mid B) = \dfrac{\PR(Ω \cap B)}{\PR(B)} = 1\).
  • For disjoint events \(A_1, A_2 \in \ALPHABET F\), \[\PR(A_1 \cup A_2 \mid B) = \frac{ \PR( (A_1 \cup A_2) \cap B) }{ \PR(B) } = \frac{ \PR( (A_1 \cap B) \cup (A_2 \cap B) ) }{ \PR(B) } = \frac{ \PR(A_1 \cap B) + \PR (A_2 \cap B) }{ \PR(B) } = \PR(A_1 \mid B) + \PR(A_2 \mid B)\] where we have used the fact that \((A_1 \cap B)\) and \((A_2 \cap B)\) are disjoint.

Exercise 8 Given an event \(B\) with \(\PR(B) > 0\), show that

  • \(\PR(A^c | B) = 1 - \PR(A|B)\).
  • \(\PR(A_1 \cup A_2 | B) = \PR(A_1 | B) + \PR(A_2 | B) - \PR(A_1 \cap A_2 | B)\).
  • If \(A_1 \subset A_2\) then \(\PR(A_1 | B) \le \PR(A_2 | B)\).

The definition of conditional probability gives rise to the chain rule.

Lemma 2 (Chain rule of probability) Let \(A\) and \(B\) be events in a probability space \((Ω, \ALPHABET F, \PR)\).

  • If \(\PR(B) > 0\), then \(\PR(A \cap B) = \PR(A | B) \PR(B)\).
  • If \(\PR(A) > 0\), then \(\PR(A \cap B) = \PR(B | A) \PR(A)\).

Combining the chain rule with the basic properties of partitions, we get the law of total probability.

Lemma 3 (Law of total probability) Let \(\{B_1, B_2, \dots, B_m\}\) be a partition of \(Ω\) such that \(\PR(B_i) > 0\) for all \(i\). Then, \[ \PR(A) = \sum_{i=1}^m \PR(A \cap B_i) = \sum_{i=1}^m \PR(A | B_i) \PR(B_i). \]

Figure 3: Illustration of Law of total probability

Consider \(m=2\) with the partition \(\{B, B^c\}\), in which case the result simplifies to \[\begin{equation}\label{eq:two-step} \PR(A) = \PR(A|B)\PR(B) + \PR(A|B^c) \PR(B^c). \end{equation}\]

To prove this, observe that \[ A = A \cap (B \cup B^c) = (A \cap B) \cup (A \cap B^c). \] The events \(A \cap B\) and \(A \cap B^c\) are disjoint. Therefore, by additivity, we have \[ \PR(A) = \PR(A \cap B) + \PR(A \cap B^c). \] Then, by the definition of conditional probability, we have \(\PR(A \cap B) = \PR(A|B) \PR(B)\) and \(\PR(A \cap B^c) = \PR(A|B^c) \PR(B^c)\). Substituting in the above, we get \(\eqref{eq:two-step}\).

The argument for the general case is similar.

Exercise 9 There are two routes for a packet to be transmitted from a source to the destination.

  • The packet takes route \(R_1\) with probability \(\frac 34\) and takes route \(R_2\) with probability \(\frac 14\).
  • On route \(R_1\), the packet is dropped with probability \(\frac 13\).
  • On route \(R_2\), the packet is dropped with probability \(\frac 14\).

Find the probability that the packet reaches the destination.

We start by defining some events: Let \(R_1\) denote the event that the packet took route \(R_1\) and \(R_2\) denote the event that the packet took route \(R_2\). Let \(D\) denote the event that the packet was dropped.

Then, the information given in the question can be written as:

  • \(\PR(R_1) = \frac 34\) and \(\PR(R_2) = \frac 14\).
  • \(\PR(D | R_1) = \frac 13\). Thus, \(\PR(D^c | R_1) = 1 - \PR(D | R_1) = \frac 23\).
  • \(\PR(D | R_2) = \frac 14\). Thus, \(\PR(D^c | R_2) = 1 - \PR(D | R_2) = \frac 34\).

Then, by the law of total probability, we have \[\begin{align*} \PR(D^c) &= \PR(D^c | R_1) \PR(R_1) + \PR(D^c | R_2) \PR(R_2) \\ &= \frac 34 \frac 23 + \frac 14 \frac 34 = \frac {11}{16}. \end{align*}\]

Lemma 4 (Bayes rule) For any events \(A, B \in \ALPHABET F\) such that \(\PR(A), \PR(B) > 0\), we have \[ \PR(B|A) = \dfrac{\PR(A|B)\PR(B)}{\PR(A)}. \]

In general, if \(\{B_1, B_2, \dots, B_m\}\) is a partition of \(Ω\) such that \(\PR(B_i) > 0\) for all \(i\), then \[ \PR(B_i|A) = \dfrac{ \PR(A|B_i) \PR(B_i) } {\displaystyle \sum_{j=1}^m \PR(A|B_j) \PR(B_j)} \] where we have used the law of total probability (Lemma 3) in the denominator.

Exercise 10 Consider the model of Exercise 9. Suppose we know that the packet was dropped. What is the probability that it was transmitted via route \(R_1\)?

Recall the events \(R_1\), \(R_2\), and \(D\) defined in the solution of Exercise 9. We were given that \[\PR(R_1) = \frac 34, \quad \PR(R_2) = \frac 14, \quad \PR(D|R_1) = \frac 13, \quad \PR(D|R_2) = \frac 14.\] We had computed that \[ \PR(D) = 1 - \PR(D^c) = \frac 5{16}. \]

Thus, by Bayes rule, we have \[ \PR(R_1 | D) = \frac{ \PR(D | R_1) \PR(R_1) }{ \PR(D) } = \frac{ \frac 13 \frac 34 } { \frac 5{16} } = \frac 45. \]
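Both Exercise 9 and Exercise 10 reduce to a few lines of exact arithmetic; a Python sketch using rational numbers:

```python
from fractions import Fraction

P_R1, P_R2 = Fraction(3, 4), Fraction(1, 4)                   # route probabilities
P_D_given_R1, P_D_given_R2 = Fraction(1, 3), Fraction(1, 4)   # drop probabilities

# Law of total probability: probability that the packet is delivered (D^c).
P_Dc = (1 - P_D_given_R1) * P_R1 + (1 - P_D_given_R2) * P_R2
print(P_Dc)                                                   # 11/16

# Bayes rule: probability the packet took route R1 given that it was dropped.
P_D = 1 - P_Dc
print(P_D_given_R1 * P_R1 / P_D)                              # 4/5
```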

5 Independence

In general, the knowledge that an event \(B\) has occurred changes the probability of event \(A\), since \(\PR(A)\) is replaced by \(\PR(A|B)\). If the knowledge that \(B\) has occurred does not change our belief about \(A\), i.e., when \(\PR(A|B) = \PR(A)\), we say “\(A\) and \(B\) are independent”. This leads to the following definition.

Definition 3 The events \(A, B \in \ALPHABET F\) are called independent if \[ \PR(A|B) = \PR(A) \quad\text{or}\quad \PR(B|A) = \PR(B), \] whenever the conditional probabilities are defined. An alternative and equivalent definition, which also makes sense when \(\PR(A)\) or \(\PR(B)\) is zero, is \[ \PR(A \cap B) = \PR(A) \PR(B). \]

We will use the notation \(A \independent B\) to denote that the events \(A\) and \(B\) are independent.

It is common for students to make the mistake of thinking that independence means \(A \cap B = \emptyset\). This is not true! In fact, two disjoint events with positive probability are never independent.

Example 6 The events \(A\) and \(B\) defined in Exercise 6 are independent.

Example 7 The events \(A\) and \(B\) defined in Exercise 7 are not independent.

Exercise 11 \(A \independent B\) implies the following:

  • \(A \independent B^c\).
  • \(A^c \independent B\).
  • \(A^c \independent B^c\).
Independence of \(σ\)-algebras

In the discussion below, we assume that the probability space \((Ω, \ALPHABET F, \PR)\) is fixed.

  1. Two sub-\(σ\)-algebras \(\ALPHABET F_1\) and \(\ALPHABET F_2\) of \(\ALPHABET F\) are said to be independent if every event \(A_1 \in \ALPHABET F_1\) is independent of every event \(A_2 \in \ALPHABET F_2\).

  2. For any event \(A\), let \(σ(A)\) denote the smallest \(σ\)-algebra containing \(A\), i.e., \(σ(A) = \{\emptyset, A, A^c, Ω\}\).

  3. Exercise 11 implies that independence of \(A\) and \(B\) implies the independence of \(σ(A)\) and \(σ(B)\). The reverse implication is trivially true. Thus, independence of events is equivalent to the independence of the smallest \(σ\)-algebra containing those events.

Definition 4 A family of events \(\{A_1, A_2, \dots, A_n\}\) is called independent (sometimes called mutually independent) if for every non-empty subset of indices \(\{k_1, \dots, k_m\} \subseteq \{1,\dots,n\}\), we have \[ \PR\bigl( A_{k_1} \cap A_{k_2} \cap \cdots \cap A_{k_m} \bigr) = \PR(A_{k_1}) \PR(A_{k_2}) \cdots \PR(A_{k_m}). \]

Exercise 12 Three bits are transmitted over a noisy channel. For each bit, the probability of correct reception is \(λ\). The error events for the three transmissions are mutually independent. Find the probability that exactly two of the three bits are received correctly.

For \(i \in \{1, 2, 3\}\), let

  • \(E_i\) denote the event that bit \(i\) is received incorrectly
  • \(C_i\) denote the event that bit \(i\) is received correctly

Moreover, let \(S\) denote the event that exactly two bits are received correctly. Then, \[ S = (C_1 \cap C_2 \cap E_3) \cup (C_1 \cap E_2 \cap C_3) \cup (E_1 \cap C_2 \cap C_3). \] Note that the three events on the right-hand side are disjoint. Thus, \[\begin{align*} \PR(S) &= \PR(C_1 \cap C_2 \cap E_3) + \PR(C_1 \cap E_2 \cap C_3) + \PR(E_1 \cap C_2 \cap C_3) \\ &= \PR(C_1)\PR(C_2)\PR(E_3) + \PR(C_1)\PR(E_2)\PR(C_3) + \PR(E_1)\PR(C_2)\PR(C_3) \\ &= 3 (1-λ) λ^2. \end{align*}\]
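The closed-form answer can be checked against a direct enumeration over the \(2^3\) correct/incorrect patterns. A Python sketch (the specific value of \(λ\) is arbitrary):

```python
from fractions import Fraction
from itertools import product

lam = Fraction(9, 10)          # any value of λ would do; 9/10 is just for illustration

P_exactly_two = Fraction(0)
for pattern in product([True, False], repeat=3):   # True = bit received correctly
    if sum(pattern) == 2:                          # exactly two correct receptions
        p = Fraction(1)
        for correct in pattern:
            p *= lam if correct else 1 - lam       # mutual independence ⇒ multiply
        P_exactly_two += p

print(P_exactly_two == 3 * (1 - lam) * lam**2)     # True
```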

Pairwise independence vs independence

A family of events \(\{A_1, A_2, \dots, A_n\}\) is pairwise independent if for every \(i,j \in \{1, \dots, n\}\), \(i \neq j\), we have \[ \PR(A_i \cap A_j) = \PR(A_i) \PR(A_j). \]

Pairwise independence is weaker than independence. For instance, three events \(A\), \(B\), and \(C\) are pairwise independent if \[ \PR(A \cap B) = \PR(A) \PR(B), \quad \PR(B \cap C) = \PR(B) \PR(C), \quad\text{and}\quad \PR(C \cap A) = \PR(C) \PR(A). \] For independence, in addition to the above, we also need \[ \PR(A \cap B \cap C) = \PR(A)\PR(B) \PR(C). \]

The following example shows that independence is strictly stronger than pairwise independence. Consider an urn with \(M\) red balls and \(M\) blue balls. Two balls are drawn at random, one at a time, with replacement. Consider the following events:

  • \(A\) is the event that the first ball is red.
  • \(B\) is the event that the second ball is blue.
  • \(C\) is the event that both balls are of the same color.

Observe that

  • \(A \cap B\) is the event that the first is red and second is blue.
  • \(B \cap C\) is the event that both balls are blue.
  • \(C \cap A\) is the event that both balls are red.
  • \(A \cap B \cap C = \emptyset\)

Therefore,

  • \(\PR(A) = \PR(B) = \PR(C) = \frac 12\).
  • \(\PR(A \cap B) = \PR(B \cap C) = \PR(C \cap A) = \frac 14\).
  • \(\PR(A \cap B \cap C) = 0\).

Thus, \(A\), \(B\), \(C\) are pairwise independent but not independent, since \(\PR(A \cap B \cap C) = 0 \neq \frac 18 = \PR(A)\PR(B)\PR(C)\).
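Since the draws are with replacement, the number \(M\) of balls of each colour cancels out, and the experiment reduces to two independent, equally likely colours. Under that reduction, the claims above can be checked by enumeration; a Python sketch:

```python
from fractions import Fraction
from itertools import product

Omega = list(product(["red", "blue"], repeat=2))   # 4 equally likely outcomes

def P(event):
    return Fraction(sum(1 for w in Omega if event(w)), len(Omega))

A = lambda w: w[0] == "red"        # first ball is red
B = lambda w: w[1] == "blue"       # second ball is blue
C = lambda w: w[0] == w[1]         # both balls have the same colour

pairs = [(A, B), (B, C), (C, A)]
print(all(P(lambda w, X=X, Y=Y: X(w) and Y(w)) == P(X) * P(Y)
          for X, Y in pairs))                      # pairwise independent: True
print(P(lambda w: A(w) and B(w) and C(w)))         # 0
print(P(A) * P(B) * P(C))                          # 1/8, so not mutually independent
```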

6 Product spaces

So far, we have restricted attention to the outcome of one experiment. It is also possible to construct probability models which combine the outcomes of two independent experiments, e.g., suppose we toss a coin and also roll a die. Let \((Ω_1, \ALPHABET F_1, \PR_1)\) and \((Ω_2, \ALPHABET F_2, \PR_2)\) be the probability spaces associated with the two experiments. What is the probability space \((Ω, \ALPHABET F, \PR)\) of the joint experiment?

The sample space should obviously be \(Ω = Ω_1 \times Ω_2\). When \(\ALPHABET F_1\) and \(\ALPHABET F_2\) are finite, we can simply define \(\ALPHABET F = \ALPHABET F_1 \times \ALPHABET F_2\) and, for any \(A = A_1 \times A_2 \in \ALPHABET F\), set \(\PR(A) = \PR_1(A_1) \PR_2(A_2)\). You would have implicitly constructed such product spaces when dealing with joint experiments (like the coin toss and die roll example above) in your undergrad courses.
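For the coin-and-die example, the finite construction described above looks as follows (a Python sketch: the joint measure is defined on singleton outcomes, and "rectangle" events \(A_1 \times A_2\) factor as claimed):

```python
from fractions import Fraction
from itertools import product

# Experiment 1: a fair coin.  Experiment 2: a fair six-sided die.
Omega1, P1 = ["H", "T"], {"H": Fraction(1, 2), "T": Fraction(1, 2)}
Omega2, P2 = list(range(1, 7)), {k: Fraction(1, 6) for k in range(1, 7)}

# Joint sample space and the product measure on singleton outcomes.
Omega = list(product(Omega1, Omega2))
P = {(w1, w2): P1[w1] * P2[w2] for (w1, w2) in Omega}

# A "rectangle" event A1 × A2 gets probability P1(A1) · P2(A2).
A1, A2 = {"H"}, {2, 4, 6}
P_rect = sum(P[w] for w in Omega if w[0] in A1 and w[1] in A2)
print(P_rect, sum(P1[w] for w in A1) * sum(P2[w] for w in A2))   # both 1/4
```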

However, things are a bit more complicated when \(\ALPHABET F_1\) and \(\ALPHABET F_2\) are not finite. The difficulty is that \(\ALPHABET F_1 \times \ALPHABET F_2\) is not a \(σ\)-algebra. So, we take \(\ALPHABET F\) to be \(σ(\ALPHABET F_1 \times \ALPHABET F_2)\) (which is the smallest \(σ\)-algebra containing \(\ALPHABET F_1 \times \ALPHABET F_2\)) and define \(\PR\) to be the extension of \(\PR_1 \times \PR_2\) from \(\ALPHABET F_1 \times \ALPHABET F_2\) to \(σ(\ALPHABET F_1 \times \ALPHABET F_2)\) (one can show that such an extension exists). Such a product space is often written as \[ (Ω, \ALPHABET F, \PR) = (Ω_1 \times Ω_2, \ALPHABET F_1 \otimes \ALPHABET F_2, \PR_1 \otimes \PR_2). \]

We will not worry too much about the technical details of such product spaces, but will use the above notation at times in the course.