1  Probability spaces

Updated

September 13, 2025

1.1 Background

This is a graduate course on probability and random signals. I am going to assume that everyone is familiar with the basics of undergraduate probability. For example, you should be able to answer the following questions:

  • A fair 6-sided die is rolled twice. What is the probability that the sum of the rolls equals 7?
  • A biased coin with \(\PR({\rm heads}) = 3/4\) is tossed 10 times. What is the probability of obtaining 3 consecutive heads?

You should also be familiar with the following concepts:

  • Random variables, probability distributions, and expectations
  • Conditional distributions and independent random variables

Some of you might also have seen the following concepts:

  • The law of large numbers
  • The central limit theorem

In this course, we will revisit these topics with a more formal approach.

1.2 Why do we care about probability theory?

Probability theory is used to model, analyze, and design various engineering systems. Broadly speaking, there are three forms of random phenomena that arise in systems: noise, uncertainty, and randomness.

  • Noise. Noise refers to effects beyond our control. For example, in a digital communication system, when a waveform indicating a binary \(0\) is transmitted, background radiation may distort the signal, causing the receiver to incorrectly decode it as \(1\). This interference can be modeled as noise in the channel. Similar examples arise in computer networks, photonics, and other areas.

  • Uncertainty. Uncertainty refers to effects that arise due to lack of information. For example, to bid in an electricity market, a wind farm may need to know how much energy it can generate in the next hour, but this depends on the wind speed, which in turn depends on many factors and is difficult to predict precisely. This lack of knowledge can be modeled as uncertainty, and the system modeler may have a quantified belief about that uncertainty based on historical data. Similar examples arise in circuits (where the quality of a chip may depend on the vagaries of the manufacturing process).

  • Randomness. Sometimes introducing randomness can improve system performance. For example, the power-of-two-choices algorithm in multi-server load balancing randomly selects two servers and assigns the task to the one with the lighter load. This simple strategy effectively reduces the peak load across the servers without needing to query all of them. Similar algorithms are used for bandwidth allocation in ad-hoc networks and randomized routing in networks on chips.

In this course, we will not focus on how such probabilistic models are constructed. Instead we will study the mathematical properties these models should satisfy and explore the implications of those properties. For the modeling and application-specific details, consider the following courses:

  • ECSE 506: Stochastic Control and Decision Theory
  • ECSE 508: Multi-agent systems
  • ECSE 510: Filtering and Prediction for Stochastic Systems
  • ECSE 511: Introduction to Digital Communication
  • ECSE 515: Optical Fibre Communications
  • ECSE 518: Telecommunication Network Analysis
  • ECSE 521: Digital Communication 1
  • ECSE 541: Design of Multiprocessor Systems-on-Chip
  • ECSE 551: Machine Learning for Engineers
  • ECSE 554: Applied Robotics
  • ECSE 608: Machine Learning
  • ECSE 610: Wireless Communication
  • ECSE 620: Information Theory
  • ECSE 621: Statistical Detection and Estimation
  • ECSE 623: Digital Communications 2
  • ECSE 626: Statistical Computer Vision

As you can see, probability theory is a foundational tool for Electrical Engineering.

1.3 Review of Set Theory

  1. A set is a collection of objects. We say that a set \(B\) is a subset of set \(A\) (written as \(B \subseteq A\)) if all elements of \(B\) are also elements of \(A\). We say that \(B\) is a proper subset (written \(B \subsetneq A\)) if \(B \subseteq A\) and \(B \neq A\).

    Exercise 1.1 Let \(A = \{1, 2, 3\}\). Find all subsets of \(A\).

  2. The set of all subsets of \(A\) is also called the power set of \(A\) (denoted by \(2^A\)). The notation \(2^A\) reflects the fact that the power set of \(A\) contains \(2^{|A|}\) elements. For example, your answer to Exercise 1.1 must have \(2^3 = 8\) elements.
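
    As a quick illustration, the power set can be enumerated programmatically. This is a minimal Python sketch using the set \(A\) from Exercise 1.1; it only verifies the count \(2^{|A|}\).

    ```python
    # Enumerate the power set of A = {1, 2, 3} and check it has 2**|A| elements.
    from itertools import chain, combinations

    A = {1, 2, 3}
    power_set = [set(s) for s in chain.from_iterable(
        combinations(A, r) for r in range(len(A) + 1))]

    print(power_set)                        # [set(), {1}, {2}, {3}, {1, 2}, ...]
    assert len(power_set) == 2 ** len(A)    # 8 subsets
    ```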

  3. Given two sets \(A\) and \(B\), we define the set difference \(A\setminus B\) to be all elements of \(A\) not in \(B\). Note that mathematically \(A \setminus B\) is well defined even if \(B \not\subseteq A\). In particular \[ A \setminus B = A \setminus (A \cap B). \]

    Exercise 1.2 Compute \(A \setminus B\) for the following:

    • \(A = \{1,2,3,4\}\) and \(B = \{1, 2\}\).
    • \(A = \{1,2,3,4\}\) and \(B = \{1, 2, 5\}\).
  4. Given a collection \(\{A_1, A_2, \dots, A_n\}\) of sets, we define two operations:

    • Union \(A_1 \cup A_2 \cup \cdots \cup A_n\) as follows \[ \bigcup_{i=1}^n A_i = \{ a: a \in A_i \text{ for some } i \}. \] This means that an element belongs to \(A_1 \cup A_2 \cup \cdots \cup A_n\) if it belongs to at least one of \(A_1\), \(A_2\), \(\ldots\), \(A_n\).

    • Intersection \(A_1 \cap A_2 \cap \cdots \cap A_n\) as follows \[ \bigcap_{i=1}^n A_i = \{ a: a \in A_i \text{ for all } i \} \] This means that an element belongs to \(A_1 \cap A_2 \cap \cdots \cap A_n\) if it belongs to all of \(A_1\), \(A_2\), \(\ldots\), \(A_n\).

  5. A collection \(\{A_1, A_2, \dots, A_n\}\) is disjoint if for every \(i \neq j\), \(A_i \cap A_j = \emptyset\), where \(\emptyset\) denotes the empty set.

  6. Given a universal set \(Ω\) and a collection \(\{B_1, B_2, \dots, B_m\}\) of subsets of \(Ω\), we say that \(\{B_1, B_2, \dots, B_m\}\) is a partition of \(Ω\) if \(\{B_1, B_2, \dots, B_m\}\) are pairwise disjoint and their union equals the universal set \(Ω\).

    Figure 1.1: Example of a partition

    Example 1.1 Let \(Ω = \{1,2,3,4\}\). The following are partitions of \(Ω\):

    • \(\{ \{1\}, \{2\}, \{3\}, \{4\} \}\).
    • \(\{ \{1, 2\}, \{3, 4\} \}\).
    • \(\{ \{1\}, \{2, 3\}, \{4\} \}\).

    The following are not partitions of \(Ω\). [Explain why?]

    • \(\{ \{1\}, \{2\}, \{3\} \}\).
    • \(\{ \{1, 2, 3\}, \{3, 4\} \}\).
    • \(\{ \{1\}, \{2, 3\}, \{4, 5\} \}\).
  7. In most of our discussion, we will work with a pre-specified universal set \(Ω\). In this setting we use \(A^c\) (read as the complement of \(A\)) as a short hand for \(Ω\setminus A\).

  8. Partitions are useful because they allow breaking up a set into disjoint pieces. In particular, suppose \(\{B_1, \dots, B_m\}\) is a partition and \(A\) is any subset of \(Ω\).

    Then, \[ A = (A \cap B_1) \cup (A \cap B_2) \cup \cdots \cup (A \cap B_m) \] where each of the components is disjoint.
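
    A minimal numerical check of this decomposition, with a made-up partition of \(Ω = \{1, \dots, 8\}\) and an arbitrary set \(A\) (both chosen only for illustration):

    ```python
    # Verify A = (A ∩ B_1) ∪ ... ∪ (A ∩ B_m) for an illustrative partition.
    Omega = set(range(1, 9))
    partition = [{1, 2}, {3, 4, 5}, {6}, {7, 8}]   # a partition of Omega
    A = {2, 3, 6, 7}

    pieces = [A & B for B in partition]
    assert set().union(*pieces) == A               # the pieces recover A
    assert all(not (pieces[i] & pieces[j])         # and they are pairwise disjoint
               for i in range(len(pieces)) for j in range(i + 1, len(pieces)))
    ```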

  9. Properties of set operations

    • Commutative \[A \cup B = B \cup A \quad\text{and}\quad A \cap B = B \cap A\]

    • Associative \[A \cup (B \cup C)= (A \cup B) \cup C \quad\text{and}\quad A \cap (B \cap C)= (A \cap B) \cap C\]

    • Distributive \[A \cup (B \cap C) = (A \cup B) \cap (A \cup C) \quad\text{and}\quad A \cap (B \cup C) = (A \cap B) \cup (A \cap C)\]

    • De Morgan’s Law \[(A \cup B)^c = A^c \cap B^c \quad\text{and}\quad (A \cap B)^c = A^c \cup B^c\]

    Exercise 1.3 Use distributive property to simplify:

    • \([1,4] \cap ([0,2] \cup [3,5])\).
    • \([2,4] \cup ([3,5] \cap [1,4])\).
  10. Given a universal set \(Ω\), a collection \(\ALPHABET F = \{F_1, F_2, \dots, F_m\}\) of subsets of \(Ω\) is called an algebra if it satisfies the following properties:

    • \(\emptyset \in \ALPHABET F\) and \(Ω \in \ALPHABET F\).
    • Closed under complements: if \(A \in \ALPHABET F\) then \(A^c \in \ALPHABET F\).
    • Closed under finite unions and finite intersections: if \(A_1, \dots, A_n \in \ALPHABET F\), then \[ A_1 \cup A_2 \cup \cdots \cup A_n \in \ALPHABET F \quad\text{and}\quad A_1 \cap A_2 \cap \cdots \cap A_n \in \ALPHABET F \]
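
    For a finite \(Ω\), these closure properties can be checked mechanically. The sketch below is only illustrative (the helper name `is_algebra` is made up); it tests a collection of subsets against the three properties.

    ```python
    # Check whether a collection F of subsets of a finite Omega is an algebra.
    from itertools import combinations

    def is_algebra(Omega, F):
        F = {frozenset(s) for s in F}
        if frozenset() not in F or frozenset(Omega) not in F:
            return False
        if any(frozenset(Omega - s) not in F for s in F):      # complements
            return False
        # closure under pairwise unions/intersections suffices for finite ones
        return all(a | b in F and a & b in F for a, b in combinations(F, 2))

    Omega = {1, 2, 3, 4}
    A = {1, 2}
    print(is_algebra(Omega, [set(), A, Omega - A, Omega]))     # True
    print(is_algebra(Omega, [set(), A, Omega]))                # False (A^c missing)
    ```
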
  11. We will sometimes use the notation “\((Ω,\ALPHABET F)\) is an algebra of sets” or “\(\ALPHABET F\) is an algebra on \(Ω\)”. Some examples of algebras are as follows:

    • The smallest algebra associated with \(Ω\) is \(\{\emptyset, Ω\}\).
    • If \(A\) is any subset of \(Ω\), then \(\{\emptyset, A, A^c, Ω\}\) is an algebra.
    • For any set \(Ω\), the power-set \(2^Ω\) is an algebra on \(Ω\). As an illustration, check that the power set defined in Exercise 1.1 is an algebra.
  12. These are all examples of a general principle: The power-set of any partition of a set is an algebra.

    • If the partition is \(\{Ω\}\), then the power-set \(\{ \emptyset, Ω \}\) is an algebra.
    • If the partition is \(\{A, A^c\}\), then the power-set \(\{ \emptyset, A, A^c, Ω\}\) is an algebra.
    • If the partition is the collection of all singleton elements of a set, then the power-set \(2^Ω\) is an algebra.

    We will use the notation \(σ(\ALPHABET D)\) to denote the algebra generated by a partition \(\ALPHABET D\).

  13. The reverse is also true. If \(\ALPHABET F\) is an algebra on a finite space \(Ω\), then there is a unique partition \(\ALPHABET D = \{D_1, \dots, D_m\}\) such that \(\ALPHABET F = 2^{\ALPHABET D}\). The elements \(D_i\) are called atoms of \(\ALPHABET F\).

  14. Let \(\ALPHABET D_1\) and \(\ALPHABET D_2\) be two partitions of \(Ω\). We say \(\ALPHABET D_1\) is finer than \(\ALPHABET D_2\) if \(σ(\ALPHABET D_1) \supseteq σ(\ALPHABET D_2)\), i.e., every set in \(\ALPHABET D_2\) can be written as a union of sets in \(\ALPHABET D_1\). For instance, suppose \(Ω = \{1, 2, 3, 4\}\) and \[ \ALPHABET D_1 = \{ \{1,2,3\}, \{4\} \} \quad\text{and}\quad \ALPHABET D_2 = \{ \{1,2\}, \{3\}, \{4\} \} \] then \(\ALPHABET D_2\) is finer than \(\ALPHABET D_1\).

  15. If a partition \(\ALPHABET D_1\) is finer than \(\ALPHABET D_2\), we also say that \(\ALPHABET D_2\) is coarser than \(\ALPHABET D_1\).

1.4 Functions

  1. A function \(f\) is a rule that assign each input from a set \(\ALPHABET X\) (called the domain) to exactly one output in another set \(\ALPHABET Y\) (called the co-domain). Symbolically, this is written as \[ f \colon \ALPHABET X \to \ALPHABET Y \] and we say \(f\) maps \(\ALPHABET X\) to \(\ALPHABET Y\).

  2. The set of values \(\{ f(x) : x \in \ALPHABET X\}\) is called the range. By definition, the range is a subset of the co-domain \(\ALPHABET Y\), but the range may or may not be equal to \(\ALPHABET Y\). For example, consider the function \(f \colon \reals \to \reals\) defined by \(f(x) = x^2\). The range of the function is \(\reals_{\ge 0}\) which is a strict subset of the co-domain \(\reals\).

  3. A function is called onto or surjective if its range is equal to its co-domain.

  4. A function is called one-to-one or injective if different inputs map to different outputs.

  5. A function which is both onto and one-to-one is called bijective. A bijective function is invertible because for every \(y \in \ALPHABET Y\), there is a unique \(x\) such that \(f(x) = y\).

  6. For a set \(B \subset \ALPHABET Y\), the preimage or inverse image of \(B\) under \(f\) is defined as \[ f^{-1}(B) = \{ x \in \ALPHABET X : f(x) \in B \}, \] which is a subset of \(\ALPHABET X\).
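
    A small sketch of the preimage operation for a function on a finite domain (the sets here are made up for illustration):

    ```python
    # Preimage f^{-1}(B) = {x in X : f(x) in B} for f(x) = x**2 on a finite domain.
    X = {-3, -2, -1, 0, 1, 2, 3}
    f = lambda x: x ** 2

    def preimage(B):
        return {x for x in X if f(x) in B}

    print(preimage({1, 4}))    # {-2, -1, 1, 2}
    print(preimage({5}))       # set(): no point of X maps into {5}
    ```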

Example 1.2 Consider the following functions:

  • \(f_1 \colon \reals \to \reals\)
  • \(f_2 \colon \reals \to \reals_{\ge 0}\)
  • \(f_3 \colon \reals_{\ge 0} \to \reals\)
  • \(f_4 \colon \reals_{\ge 0} \to \reals_{\ge 0}\)

all given by \(f_i(x) = x^2\), \(i \in \{1, \dots, 4\}\).

Explain whether each of these functions is onto, one-to-one, or bijective.

Example 1.3 Let \(f \colon \reals \to \reals\) be given by \(f(x) = 2x + 1\). Find \(f^{-1}([3,7])\).

We need to solve for \(2x + 1 \in [3,7]\), which is equivalent to \[ 3 \le 2x + 1 \le 7 \] which is equivalent to \[ 1 \le x \le 3. \]

Thus, \(f^{-1}([3,7]) = [1,3]\).

1.5 Mathematical model of probability

  1. To mathematically model probability statements, we need to model the sequence of events that may lead to the occurrence of an event of interest: this is called a random experiment; the result of an experiment is called an outcome.

  2. In general, the outcome of an experiment is not certain. We can only talk about the collection of possible outcomes. The collection of possible outcomes of an experiment is called the sample space and denoted by \(Ω\).

    Exercise 1.4 What is the sample space for the toss of a coin?

    \(Ω = \{ H, T \}\).

    Exercise 1.5 What is the sample space for the roll of a (6-sided) die?

    \(Ω = \{ 1,2,3,4,5,6 \}\).

  3. An event is any subset of the sample space. If the outcome of the random experiment belongs to the event \(A\), we say that “the event \(A\) has occurred”. Some examples of events are:

    • Head occurs in Exercise 1.4 (\(A = \{H\}\))
    • Both head and tail occur in Exercise 1.4 (\(A = \emptyset\); this is an event that cannot happen, sometimes called the impossible event)
    • An even number is thrown in Exercise 1.5 (\(A = \{2,4,6\}\)).
  4. Note that events are subsets of the sample space, but not all subsets of a sample space may be events. The reasons are too complicated to explain here, but the high-level explanation is that everything is okay for discrete sample spaces, while weird things can happen in continuous sample spaces.

  5. Probability (denoted by \(\PR\)) is a function which assigns a number between \(0\) and \(1\) to every event. This number indicates the chance that the event occurs. Such a function should satisfy some axioms, which we explain below.

  6. First, to define a function, we need to specify its domain and co-domain. Let’s denote the domain (i.e., the set of all events to which we can assign a probability) by \(\ALPHABET F\). We expect probability to satisfy certain properties, which imposes constraints on the domain:

    • Probability of an impossible event (e.g., getting both heads and tails when we toss a coin) should be zero. Thus, \(\emptyset \in \ALPHABET F\).

    • Probability of something happening (e.g., getting either a head or a tail when we toss a coin) should be one. Thus, \(Ω \in \ALPHABET F\).

    • If we assign probability to an event \(A\) then we should be able to assign probability to “\(A\) does not occur” i.e., \(A^c\). Thus, if \(A \in \ALPHABET F\) then \(A^c \in \ALPHABET F\).

    • If we can talk about probability of \(A\) and \(B\), then we should be able to talk about probability that either \(A\) or \(B\) occurs and both \(A\) and \(B\) occur. Thus, if \(A, B \in \ALPHABET F\), then \(A \cup B \in \ALPHABET F\) and \(A \cap B \in \ALPHABET F\).

  7. Thus, the domain of \(\PR\) must be an algebra! However, when we go beyond finite sample spaces, being an algebra is not sufficient. But we first provide some examples of probability for finite sample spaces.

    Example 1.4 (Uniform probability) Consider a finite set \(Ω\) with \(\ALPHABET F = 2^Ω\). The uniform probability \(\PR\) on \(Ω\) is given by \[ \PR(A) = \frac{\ABS{A}}{\ABS{Ω}}, \quad \forall A \in \ALPHABET F \]

    An illustration of the uniform probability distribution on the outcomes of a fair coin or a fair die.

  8. In general, when \(Ω = \{ω_1, \dots, ω_n\}\) is finite and \(\ALPHABET F = 2^{Ω}\), we can think of probability as an assignment of a weight \(p(ω_i)\) to each outcome \(ω_i\) such that

    • \(0 \le p(ω_i) \le 1\)
    • \(p(ω_1) + \cdots + p(ω_n) = 1\).

    Then, for any event \(A \in \ALPHABET F\) (i.e., any \(A \subseteq Ω\)), we define \[ \PR(A) = \sum_{i : ω_i \in A } p(ω_i). \] This is effectively how probability was defined in undergraduate probability.

    Example 1.5 Consider a six-sided die where \(Ω = \{1, 2, \dots, 6 \}\), \(\ALPHABET F = 2^Ω\) and \(\PR\) is given by \[ \PR(A) = \sum_{ω \in A} p(ω), \quad \forall A \in \ALPHABET F \] where \[ p(1) = p(2) = p(3) = p(4) = p(5) = \frac 2{15} \quad\text{and}\quad p(6) = \frac{1}{3}. \]

    Verify that

    • \(\PR(Ω) = 1\)
    • \(\PR(\{2,3,5\}) = \frac{6}{15}\).
    • \(\PR(\{1,3,4,5\}) = \frac{8}{15}\).
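
    These values can be verified numerically. A minimal sketch, using exact fractions and the weights from Example 1.5:

    ```python
    # Probability of an event as the sum of outcome weights (Example 1.5).
    from fractions import Fraction

    p = {w: Fraction(2, 15) for w in range(1, 6)}
    p[6] = Fraction(1, 3)

    def PR(A):
        """PR(A) = sum of p(w) over outcomes w in A."""
        return sum(p[w] for w in A)

    assert PR(range(1, 7)) == 1                      # PR(Omega) = 1
    assert PR({2, 3, 5}) == Fraction(6, 15)
    assert PR({1, 3, 4, 5}) == Fraction(8, 15)
    ```
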
  9. Example 1.5 can be visualized as follows. First we visualize the probability space as in Figure 1.2.

    Figure 1.2: Probability space of Example 1.5

    To compute \(\PR(\{2, 3, 5\})\), we look at the event \(A = \{2, 3, 5\}\) and count the probability of all the cells inside \(A\), as shown in Figure 1.3.

    Figure 1.3: Computing \(\PR(A)\)
  10. We now present an example to illustrate that restricting the domain of \(\PR\) to be an algebra is not sufficient.

    Example 1.6 A coin is tossed repeatedly until a head turns up. The sample space is \(Ω = \{ω_1, ω_2, \dots\}\) where \(ω_n\) denotes the outcome that the first \(n-1\) tosses are tails and the \(n\)-th toss is a head.

    Suppose we are interested in finding the probability of the event that the coin is tossed an even number of times, i.e., \(A = \{ω_2, ω_4, \dots\}\). Each singleton \(\{ω_n\}\) belongs to \(\ALPHABET F\); however, \(A\) is a countably infinite union of such singletons. If we want to assign a probability to \(A\) in terms of the probabilities of the individual outcomes, we require \(\ALPHABET F\) to be closed under countable unions. This motivates the following definition.

  11. Given a universal set \(Ω\), a collection \(\ALPHABET F = \{F_1, F_2, \dots\}\) of subsets of \(Ω\) is called a \(\boldsymbol{σ}\)-algebra if it satisfies the following properties:

    • \(\emptyset \in \ALPHABET F\) and \(Ω \in \ALPHABET F\).
    • Closed under complements: if \(A \in \ALPHABET F\) then \(A^c \in \ALPHABET F\).
    • Closed under countable unions: if \(A_1, A_2, \dots \in \ALPHABET F\), then \[ \bigcup_{n=1}^∞ A_n \in \ALPHABET F \]
  12. The distinction between algebras and \(σ\)-algebras is technical. The reason that we need to consider \(σ\)-algebras has to do with the definition of probability on continuous sample spaces. Take \(Ω = [0,1]\) and consider a random experiment where “any outcome is equally likely”. Intuitively, we capture this feature by assuming that for any interval \([a,b]\) with \(0 \le a \le b \le 1\), we have \[\begin{equation}\label{eq:uniform} \PR([a,b]) = b - a. \end{equation}\]

    We have seen that if we want \(\PR\) to be a meaningful measure, the domain \(\ALPHABET F\) must at least be an algebra. We have also seen that the power-set \(2^Ω\) is always an algebra. So, it is tempting to take \(\ALPHABET F = 2^{[0,1]}\). However, it turns out that \(2^{[0,1]}\) includes some weird sets (technically, non-measurable sets) due to which we cannot define a function \(\PR\) on \(2^{[0,1]}\) that satisfies \(\eqref{eq:uniform}\).

    To work around this technical limitation, we revisit the minimum requirements that we need from the domain of \(\PR\). Since we are interested in \(\PR([a,b])\), \(\ALPHABET F\) must contain intervals (and therefore all finite unions and intersections of intervals). Since we are working with continuous sample spaces, we also want \(\PR\) to be continuous, i.e., for any monotone sequence of sets \(\{A_n\}_{n \ge 1}\), we want \(\PR(\lim_{n \to ∞} A_n) = \lim_{n \to ∞} \PR(A_n)\). It turns out that the additional requirement of continuity implies that \(\ALPHABET F\) must be closed under countable unions as well. Thus, the domain \(\ALPHABET F\) must at least be a \(σ\)-algebra.

    So, we restrict to the simplest choice of the domain \(\ALPHABET F\) needed for \(\eqref{eq:uniform}\) and continuity to hold. For technical reasons, we need another property known as completeness. See Sec. 1.6 of the textbook.

  13. Tip: \(σ\)-algebra generated by a collection and the Borel \(σ\)-algebra

    Given a collection \(\ALPHABET S\) of subsets of \(Ω\), we have the following:

    • The power-set \(2^Ω\) contains \(\ALPHABET S\). Therefore, there is at least one \(σ\)-algebra containing \(\ALPHABET S\).
    • If \(\ALPHABET F_1\) and \(\ALPHABET F_2\) are \(σ\)-algebras containing \(\ALPHABET S\), then \(\ALPHABET F_1 \cap \ALPHABET F_2\) also contains \(\ALPHABET S\).

    Thus, if we take the intersection of all \(σ\)-algebras containing \(\ALPHABET S\), we get the smallest \(σ\)-algebra containing \(\ALPHABET S\), which is sometimes denoted by \(σ(\ALPHABET S)\).

    One commonly used \(σ\)-algebra is the Borel \(σ\)-algebra, which is defined as follows. Let \(Ω\) be a subset of \(\reals\) and \(\ALPHABET S\) be the collection of all open intervals in \(Ω\). Then \(σ(\ALPHABET S)\) is called the “Borel \(σ\)-algebra on Ω” and often denoted by \(\mathscr{B}(Ω)\).

    Note: Borel \(σ\)-algebra is usually defined for any topological space. We restrict our definition to subsets of reals.
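
    On a finite \(Ω\) (where every algebra is automatically a \(σ\)-algebra), \(σ(\ALPHABET S)\) can be computed by repeatedly closing \(\ALPHABET S\) under complements and unions. A minimal sketch; the function name `generated_sigma_algebra` is made up for illustration.

    ```python
    # Smallest sigma-algebra on a finite Omega containing a collection S.
    def generated_sigma_algebra(Omega, S):
        Omega = frozenset(Omega)
        F = {frozenset(), Omega} | {frozenset(s) for s in S}
        while True:
            new = {Omega - a for a in F} | {a | b for a in F for b in F}
            if new <= F:        # nothing new: F is closed, hence a sigma-algebra
                return F
            F |= new

    F = generated_sigma_algebra({1, 2, 3, 4}, [{1}])
    print(sorted(map(sorted, F)))   # [[], [1], [1, 2, 3, 4], [2, 3, 4]]
    ```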

  14. A pair \((Ω, \ALPHABET F)\) where \(Ω\) is a set and \(\ALPHABET F\) is a \(σ\)-algebra on \(Ω\) is called a measurable space.

  15. Definition 1.1 (Probability) Given a measurable space \((Ω, \ALPHABET F)\), a probability \(\PR \colon \ALPHABET F \to [0,1]\) is a function that satisfies the following axioms of probability:

    1. Non-negativity. \(\PR(A) \ge 0\).
    2. Normalization. \(\PR(Ω) = 1\).
    3. Countable additivity. If \(A_1, A_2, \dots\) is a countable collection of pairwise disjoint events in \(\ALPHABET F\), then \[ \PR\biggl( \bigcup_{n=1}^∞ A_n \biggr) = \sum_{n=1}^∞ \PR(A_n). \]

    The collection \((Ω, \ALPHABET F, \PR)\) is called a probability space.

  16. Some immediate implications of the axioms of probability are the following.

    Lemma 1.1 (Properties of probability measures)  

    1. Probability of complement. \(\PR(A^c) = 1 - \PR(A)\).

    2. Monotonicity. If \(A \subset B\), then \(\PR(B) = \PR(A) + \PR(B \setminus A) \ge \PR(A)\).

    3. Inclusion-exclusion. Given two events \(A\) and \(B\), \[ \PR(A \cup B) = \PR(A) + \PR(B) - \PR(A \cap B). \]

    4. Continuity. Let \(A_1, A_2, \dots\) be a (weakly) increasing sequence of events, i.e., \(A_1 \subseteq A_2 \subseteq A_3 \subseteq \cdots\). Define \[ A = \lim_{n \to ∞} A_n = \bigcup_{n=1}^∞ A_n. \] Then, \[ \PR(A) = \lim_{n \to ∞} \PR(A_n). \]

      Similarly, let \(B_1, B_2, \dots\) be a (weakly) decreasing sequence of events, i.e., \(B_1 \supseteq B_2 \supseteq B_3 \supseteq \cdots\). Define \[ B = \lim_{n \to ∞} B_n = \bigcap_{n=1}^∞ B_n. \] Then, \[ \PR(B) = \lim_{n \to ∞} \PR(B_n). \]

    5. Union bound. For any sequence of events \(\{A_n\}_{n \ge 1}\), we have \[ \PR\biggl( \bigcup_{n=1}^{∞} A_n \biggr) \le \sum_{n=1}^{∞} \PR(A_n). \]

    The proof of parts 1–3 is elementary and left as an exercise. Part 4 is more technical and is essentially equivalent to countable additivity; see the textbook for a proof. The union bound is an immediate consequence of inclusion-exclusion and continuity.
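
    A quick numerical check of parts 1–3 and the union bound, reusing the biased die from Example 1.5 (illustrative only):

    ```python
    # Check complement, monotonicity, and inclusion-exclusion for the biased die.
    from fractions import Fraction

    p = {w: Fraction(2, 15) for w in range(1, 6)}
    p[6] = Fraction(1, 3)
    PR = lambda A: sum(p[w] for w in A)

    A, B = {2, 3, 5}, {3, 6}
    assert PR(set(range(1, 7)) - A) == 1 - PR(A)            # complement
    assert PR(A | B) >= PR(A)                               # monotonicity
    assert PR(A | B) == PR(A) + PR(B) - PR(A & B)           # inclusion-exclusion
    assert PR(A | B) <= PR(A) + PR(B)                       # union bound
    ```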

  17. Let \(Ω = [0,1]\), \(\ALPHABET F = \mathscr B[0,1]\), and \(\PR\) be any probability measure on \((Ω, \ALPHABET F)\). Take any \(a \in (0,1)\).

    • Consider \(A_n = \bigl[0, a - \frac 1n \bigr)\). Then \(A = \lim_{n \to ∞} A_n = [0, a)\).
    • Consider \(A_n = \bigl[0, a - \frac 1n \bigr]\). Then \(A = \lim_{n \to ∞} A_n = [0, a)\).
    • Consider \(B_n = \bigl[0, a + \frac 1n \bigr)\). Then \(B = \lim_{n \to ∞} B_n = [0,a]\).
    • Consider \(B_n = \bigl[0, a + \frac 1n \bigr]\). Then \(B = \lim_{n \to ∞} B_n = [0,a]\).

    In these examples, continuity implies that \(\PR(A) = \lim_{n \to ∞} \PR(A_n)\) and \(\PR(B) = \lim_{n \to ∞} \PR(B_n)\).

  18. Tip: Some terminology
    • An event \(A\) is called null if \(\PR(A) = 0\). A null event should not be confused with the impossible event \(\emptyset\).

    • We say that \(A\) occurs almost surely (abbreviated to a.s.) if \(\PR(A) = 1\).

  19. Consider \(Ω = [0,1]\), \(\ALPHABET F = \mathscr B([0,1])\), and \(\PR\) to be the uniform probability distribution on \(Ω\). Consider the event \(A\) that the outcome is a rational number. \(A\) is a countable set (because the set of rational numbers is countable). For any \(x \in A\), \(\{x\} \in \ALPHABET F\), and \(\PR(\{x\}) = 0\) (we can infer this from the previous item by thinking of \(\{x\}\) as the limit of the intervals \(\bigl[x, x+ \frac 1n\bigr]\)). Thus, by countable additivity, \(\PR(A) = 0\). Hence, \(A\) is null.

    The above analysis implies that \(\PR(A^c) = 1\), thus the event that the outcome is irrational occurs almost surely.

Exercise 1.6 Consider a probability space \((Ω, \ALPHABET F, \PR)\) and events \(A, B \in \ALPHABET F\). Using axioms of probability, show that:

  • If \(B \subseteq A\), then \(\PR(A \setminus B) = \PR(A) - \PR(B)\). Hence, argue that \(\PR(A) \ge \PR(B)\).

  • \(\PR(A \cup B) = \PR(A) + \PR(B) - \PR(A \cap B)\).

  • \(\PR(A) = \PR(A \cap B) + \PR(A \cap B^c)\).

Exercise 1.7 Consider a probability space \((Ω, \ALPHABET F, \PR)\) and events \(A, B, C \in \ALPHABET F\). Prove that \[ \PR(A \cup B \cup C) = 1 - \PR(A^c \mid B^c \cap C^c) \PR(B^c \mid C^c) \PR(C^c). \]

1.6 Conditional Probability

  1. Conditional probabilities quantify the uncertainty of an event when it is known that another event has occurred.

    Definition 1.2 Let \((Ω,\ALPHABET F, \PR)\) be a probability space and \(A, B \in \ALPHABET F\) such that \(\PR(B) > 0\). Then, the conditional probability that \(A\) occurs given that \(B\) occurs is defined as \[ \PR(A | B) = \dfrac{ \PR(A \cap B) }{ \PR(B) }. \]

    The notation \(\PR(A | B)\) is read as “probability of \(A\) given \(B\)” or “probability of \(A\) conditioned on \(B\)”.

    Exercise 1.8 Suppose we roll a fair six-sided die (a fair die means that all outcomes are equally likely). Consider the event \(A\) that the outcome is prime and the event \(B\) that the outcome is a multiple of \(3\). Compute \(\PR(A | B)\) and \(\PR(B | A)\).

    We have \(Ω = \{1, 2, 3, 4, 5, 6\}\), \(A = \{2, 3, 5\}\), and \(B = \{3, 6\}\). Note that \(\PR(A) = \frac 12\) and \(\PR(B) = \frac 13\).

    Thus, \[ \PR(A | B) = \frac{ \PR(A \cap B) }{ \PR(B) } = \frac{ \PR(\{3\}) }{ \PR(\{3,6\}) } = \frac{ \ABS{\{3\}} }{ \ABS{\{3,6\}} } = \frac {1}{2}. \] Similarly, \[ \PR(B | A) = \frac{ \PR(B \cap A) }{ \PR(A) } = \frac{ \PR(\{3\}) }{ \PR(\{2,3,5\}) } = \frac{ \ABS{\{3\}} }{ \ABS{\{2,3,5\}} } = \frac {1}{3}. \]
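
    The same computation by direct counting; a minimal sketch under the uniform probability of a fair die.

    ```python
    # Conditional probability by counting outcomes of a fair six-sided die.
    from fractions import Fraction

    Omega = set(range(1, 7))
    A = {2, 3, 5}     # outcome is prime
    B = {3, 6}        # outcome is a multiple of 3

    def PR(E, given=None):
        base = Omega if given is None else given
        return Fraction(len(E & base), len(base))

    print(PR(A, given=B))   # 1/2
    print(PR(B, given=A))   # 1/3
    ```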

    Exercise 1.9 Suppose we roll two fair six-sided dice. Consider the event \(A\) that the sum of the two rolls is less than or equal to \(8\) and the event \(B\) that the sum of the two rolls is greater than or equal to \(6\). Compute \(\PR(A|B)\) and \(\PR(B|A)\).

    Note that \(Ω = \{ 1, 2,3, 4, 5, 6\}^2\) and \(\PR\) is uniform probability on all outcomes. The sets \(A\), \(B\), and \(A \cap B\) are shown in Figure 1.4. Note that \(\PR(A) = \PR(B) = \frac{26}{36} = \frac{13}{18}\).

    Figure 1.4: The different events in Exercise 1.9

    Thus, we have \[ \PR(A|B) = \frac{ \PR(A \cap B) } { \PR(B) } = \frac{ \ABS{ A \cap B} }{ \ABS{B} } = \frac{16}{26} = \frac{8}{13} \] and \[ \PR(B|A) = \frac{ \PR(B \cap A) } { \PR(A) } = \frac{ \ABS{ B \cap A} }{ \ABS{A} } = \frac{16}{26} = \frac{8}{13} \]
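
    The counts above can be verified by enumerating all 36 outcomes; a minimal sketch matching the events in Exercise 1.9.

    ```python
    # Enumerate two fair dice and verify PR(A|B) = PR(B|A) = 8/13.
    from fractions import Fraction
    from itertools import product

    Omega = list(product(range(1, 7), repeat=2))      # 36 equally likely outcomes
    A = {w for w in Omega if sum(w) <= 8}             # sum of the rolls <= 8
    B = {w for w in Omega if sum(w) >= 6}             # sum of the rolls >= 6

    assert len(A) == len(B) == 26 and len(A & B) == 16
    print(Fraction(len(A & B), len(B)))               # 8/13
    print(Fraction(len(B & A), len(A)))               # 8/13
    ```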

  2. Tip: Conditional probabilities are probabilities

    Conditional probabilities are legitimate probability measures on \((Ω, \ALPHABET F)\). In particular, fix event \(B\) with \(\PR(B) > 0\). Then

    • \(\PR(A \mid B) \ge 0\).
    • \(\PR(Ω \mid B) = \dfrac{\PR(Ω \cap B)}{\PR(B)} = 1\).
    • For disjoint events \(A_1, A_2 \in \ALPHABET F\), \[\PR(A_1 \cup A_2 \mid B) = \frac{ \PR( (A_1 \cup A_2) \cap B) }{ \PR(B) } = \frac{ \PR( (A_1 \cap B) \cup (A_2 \cap B) ) }{ \PR(B) } = \frac{ \PR(A_1 \cap B) + \PR (A_2 \cap B) }{ \PR(B) } = \PR(A_1 \mid B) + \PR(A_2 \mid B)\] where we have used the fact that \((A_1 \cap B)\) and \((A_2 \cap B)\) are disjoint.

    Exercise 1.10 Given an event \(B\) with \(\PR(B) > 0\), show that

    • \(\PR(A^c | B) = 1 - \PR(A|B)\).
    • \(\PR(A_1 \cup A_2 | B) = \PR(A_1 | B) + \PR(A_2 | B) - \PR(A_1 \cap A_2 | B)\).
    • If \(A_1 \subset A_2\) then \(\PR(A_1 | B) \le \PR(A_2 | B)\).
  3. The definition of conditional probability gives rise to the chain rule.

    Lemma 1.2 (Chain rule of probability) Let \(A\) and \(B\) be events in a probability space \((Ω, \ALPHABET F, \PR)\).

    • If \(\PR(B) > 0\), then \(\PR(A \cap B) = \PR(A | B) \PR(B)\).
    • If \(\PR(A) > 0\), then \(\PR(A \cap B) = \PR(B | A) \PR(A)\).
  4. Combining the chain rule with the basic properties of partitions, we get the law of total probability.

    Lemma 1.3 (Law of total probability) Let \(\{B_1, B_2, \dots, B_m\}\) be a partition of \(Ω\) such that \(\PR(B_i) > 0\) for all \(i\). Then, \[ \PR(A) = \sum_{i=1}^m \PR(A \cap B_i) = \sum_{i=1}^m \PR(A | B_i) \PR(B_i). \]

    Figure 1.5: Illustration of Law of total probability

    Consider \(m=2\), in which case the result can be simplified as \[\PR(A) = \PR(A|B)\PR(B) + \PR(A|B^c) \PR(B^c).\]

    To prove this, observe that \[ A = A \cap (B \cup B^c) = (A \cap B) \cup (A \cap B^c). \] The events \(A \cap B\) and \(A \cap B^c\) are disjoint. Therefore, by additivity, we have \[ \PR(A) = \PR(A \cap B) + \PR(A \cap B^c). \] Then, by the definition of conditional probability, we have \(\PR(A \cap B) = \PR(A|B) \PR(B)\) and \(\PR(A \cap B^c) = \PR(A|B^c) \PR(B^c)\). Substituting these in the above, we obtain the desired result.

    The argument for the general case is similar.

    Exercise 1.11 There are two routes for a packet to be transmitted from a source to the destination.

    • The packet takes route \(R_1\) with probability \(\frac 34\) and takes route \(R_2\) with probability \(\frac 14\).
    • On route \(R_1\), the packet is dropped with probability \(\frac 13\).
    • On route \(R_2\), the packet is dropped with probability \(\frac 14\).

    Find the probability that the packet reaches the destination.

    We start by defining some events: Let \(R_1\) denote the event that the packet took route \(R_1\) and \(R_2\) denote the event that the packet took route \(R_2\). Let \(D\) denote the event that the packet was dropped.

    Then, the information given in the question can be written as:

    • \(\PR(R_1) = \frac 34\) and \(\PR(R_2) = \frac 14\).
    • \(\PR(D | R_1) = \frac 13\). Thus, \(\PR(D^c | R_1) = 1 - \PR(D | R_1) = \frac 23\).
    • \(\PR(D | R_2) = \frac 14\). Thus, \(\PR(D^c | R_2) = 1 - \PR(D | R_2) = \frac 34\).

    Then, by the law of total probability, we have \[\begin{align*} \PR(D^c) &= \PR(D^c | R_1) \PR(R_1) + \PR(D^c | R_2) \PR(R_2) \\ &= \frac 34 \frac 23 + \frac 14 \frac 34 = \frac {11}{16}. \end{align*}\]

  5. Lemma 1.4 (Bayes rule) For any events \(A, B \in \ALPHABET F\) such that \(\PR(A), \PR(B) > 0\), we have \[ \PR(B|A) = \dfrac{\PR(A|B)\PR(B)}{\PR(A)}. \]

    In general, if \(\{B_1, B_2, \dots, B_m\}\) is a partition of \(Ω\) such that \(\PR(B_i) > 0\) for all \(i\), then \[ \PR(B_i|A) = \dfrac{ \PR(A|B_i) \PR(B_i) } {\displaystyle \sum_{j=1}^m \PR(A|B_j) \PR(B_j)}, \] where we have used the law of total probability (Lemma 1.3) in the denominator.

    Exercise 1.12 Consider the model of Exercise 1.11. Suppose we know that the packet was dropped. What is the probability that it was transmitted via route \(R_1\)?

    Recall the events \(R_1\), \(R_2\), and \(D\) defined in the solution of Exercise 1.11. We were given that \[\PR(R_1) = \frac 34, \quad \PR(R_2) = \frac 14, \quad \PR(D|R_1) = \frac 13, \quad \PR(D|R_2) = \frac 14.\] We had computed that \[ \PR(D) = 1 - \PR(D^c) = \frac 5{16}. \]

    Thus, by Bayes rule, we have \[ \PR(R_1 | D) = \frac{ \PR(D | R_1) \PR(R_1) }{ \PR(D) } = \frac{ \frac 13 \frac 34 } { \frac 5{16} } = \frac 45. \]
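
    Both answers can be sanity-checked with a short Monte Carlo simulation of the model in Exercise 1.11; this is only a sketch, and the estimates are approximate.

    ```python
    # Simulate the routing model: route choice, then possible drop on that route.
    import random

    random.seed(0)
    N = 200_000
    delivered = dropped_via_R1 = dropped = 0
    for _ in range(N):
        route = "R1" if random.random() < 3/4 else "R2"
        drop_prob = 1/3 if route == "R1" else 1/4
        if random.random() < drop_prob:
            dropped += 1
            dropped_via_R1 += route == "R1"
        else:
            delivered += 1

    print(delivered / N)              # ~ 11/16 = 0.6875  (law of total probability)
    print(dropped_via_R1 / dropped)   # ~ 4/5  = 0.8      (Bayes rule)
    ```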

1.7 Independence

  1. In general, the knowledge that an event \(B\) has occurred changes the probability of event \(A\), since \(\PR(A)\) is replaced by \(\PR(A|B)\). If the knowledge that \(B\) has occurred does not change our belief about \(A\), i.e., when \(\PR(A|B) = \PR(A)\), we say “\(A\) and \(B\) are independent”. This leads to the following definition.

    Definition 1.3 The events \(A, B \in \ALPHABET F\) are called independent if \[ \PR(A \cap B) = \PR(A) \PR(B). \] When \(\PR(B) > 0\), this is equivalent to \(\PR(A|B) = \PR(A)\); similarly, when \(\PR(A) > 0\), it is equivalent to \(\PR(B|A) = \PR(B)\).

    We will use the notation \(A \independent B\) to denote that the events \(A\) and \(B\) are independent.

    Example 1.7 The events \(A\) and \(B\) defined in Exercise 1.8 are independent.

    Example 1.8 The events \(A\) and \(B\) defined in Exercise 1.9 are not independent.

    Exercise 1.13 Show that \(A \independent B\) implies the following:

    • \(A \independent B^c\).
    • \(A^c \independent B\).
    • \(A^c \independent B^c\).
  2. Independence is what separates probability theory from the more general measure theory.

  3. It is common for students to make the mistake of thinking that independence means \(A \cap B = \emptyset\). This is not true!

  4. Tip: Independence of \(σ\)-algebras

    In the discussion below, we assume that the probability space \((Ω, \ALPHABET F, \PR)\) is fixed.

    1. Two sub-\(σ\)-algebras \(\ALPHABET F_1\) and \(\ALPHABET F_2\) of \(\ALPHABET F\) are said to be independent if every event \(A_1 \in \ALPHABET F_1\) is independent of every event \(A_2 \in \ALPHABET F_2\).

    2. For any event \(A\), let \(σ(A)\) denote the smallest \(σ\)-algebra containing \(A\), i.e., \(σ(A) = \{\emptyset, A, A^c, Ω\}\).

    3. Exercise 1.13 implies that independence of \(A\) and \(B\) implies the independence of \(σ(A)\) and \(σ(B)\). The reverse implication is trivially true. Thus, independence of events is equivalent to the independence of the smallest \(σ\)-algebra containing those events.

  5. Definition 1.4 A family of events \(\{A_1, A_2, \dots, A_n\}\) is called independent (sometimes mutually independent) if for every non-empty subset of indices \(\{k_1, \dots, k_m\} \subseteq \{1,\dots,n\}\), we have \[ \PR\bigl( A_{k_1} \cap A_{k_2} \cap \cdots \cap A_{k_m} \bigr) = \PR(A_{k_1}) \PR(A_{k_2}) \cdots \PR(A_{k_m}). \]

    Exercise 1.14 Three bits are transmitted over a noisy channel. For each bit, the probability of correct reception is \(λ\). The error events for the three transmissions are mutually independent. Find the probability that exactly two bits are received correctly.

    For \(i \in \{1, 2, 3\}\), let

    • \(E_i\) denote the event that bit \(i\) is received incorrectly
    • \(C_i\) denote the event that bit \(i\) is received correctly

    Moreover, let \(S\) denote the event that exactly two bits are received correctly. Then, \[ S = (C_1 \cap C_2 \cap E_3) \cup (C_1 \cap E_2 \cap C_3) \cup (E_1 \cap C_2 \cap C_3). \] Note that the three events on the right-hand side are disjoint. Thus, \[\begin{align*} \PR(S) &= \PR(C_1 \cap C_2 \cap E_3) + \PR(C_1 \cap E_2 \cap C_3) + \PR(E_1 \cap C_2 \cap C_3) \\ &= \PR(C_1)\PR(C_2)\PR(E_3) + \PR(C_1)\PR(E_2)\PR(C_3) + \PR(E_1)\PR(C_2)\PR(C_3) \\ &= 3 (1-λ) λ^2. \end{align*}\]
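
    A small numerical check of this computation: the sketch below enumerates all \(2^3\) reception patterns for an arbitrarily chosen value of \(λ\) (any value in \([0,1]\) would do).

    ```python
    # Enumerate the 2^3 reception patterns and check PR(exactly two correct).
    from itertools import product

    lam = 0.9                                        # illustrative value of lambda
    total = 0.0
    for pattern in product([True, False], repeat=3): # True = received correctly
        prob = 1.0
        for correct in pattern:
            prob *= lam if correct else (1 - lam)
        if sum(pattern) == 2:
            total += prob

    assert abs(total - 3 * (1 - lam) * lam**2) < 1e-12
    print(total)                                     # 0.243 for lam = 0.9
    ```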

  6. Warning: Pairwise independence vs. independence

    A family of events \(\{A_1, A_2, \dots, A_n\}\) is pairwise independent if for every \(i,j \in \{1, \dots, n\}\), \(i \neq j\), we have \[ \PR(A_i \cap A_j) = \PR(A_i) \PR(A_j). \]

    Pairwise independence is weaker than independence. For instance, three events \(A\), \(B\), and \(C\) are pairwise independent if \[ \PR(A \cap B) = \PR(A) \PR(B), \quad \PR(B \cap C) = \PR(B) \PR(C), \quad\text{and}\quad \PR(C \cap A) = \PR(C) \PR(A). \] For independence, in addition to the above, we also need \[ \PR(A \cap B \cap C) = \PR(A)\PR(B) \PR(C). \]

    The following example shows that independence is strictly stronger than pairwise independence. Consider an urn with \(M\) red balls and \(M\) blue balls. Two balls are drawn at random, one at a time, with replacement. Consider the following events:

    • \(A\) is the event that the first ball is red.
    • \(B\) is the event that the second ball is blue.
    • \(C\) is the event that both balls are of the same color.

    Observe that

    • \(A \cap B\) is the event that the first is red and second is blue.
    • \(B \cap C\) is the event that both balls are blue.
    • \(C \cap A\) is the event that both balls are red.
    • \(A \cap B \cap C = \emptyset\)

    Therefore,

    • \(\PR(A) = \PR(B) = \PR(C) = \frac 12\).
    • \(\PR(A \cap B) = \PR(B \cap C) = \PR(C \cap A) = \frac 14\).
    • \(\PR(A \cap B \cap C) = 0\).

    Thus, \(A\), \(B\), \(C\) are pairwise independent but not independent.
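
    The claim can be confirmed by enumeration: drawing with replacement makes all four color pairs equally likely. A minimal sketch; the ball colors stand in for the events above.

    ```python
    # Two draws with replacement from an urn with equally many red and blue balls.
    from fractions import Fraction
    from itertools import product

    Omega = list(product(["red", "blue"], repeat=2))  # 4 equally likely outcomes
    PR = lambda E: Fraction(len(E), len(Omega))

    A = {w for w in Omega if w[0] == "red"}           # first ball red
    B = {w for w in Omega if w[1] == "blue"}          # second ball blue
    C = {w for w in Omega if w[0] == w[1]}            # both balls the same color

    assert PR(A & B) == PR(A) * PR(B)                 # pairwise independent
    assert PR(B & C) == PR(B) * PR(C)
    assert PR(C & A) == PR(C) * PR(A)
    assert PR(A & B & C) != PR(A) * PR(B) * PR(C)     # but not independent
    ```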

1.8 Product spaces

  1. So far, we have restricted attention to the outcome of one experiment. It is also possible to construct probability models which combine the outcomes of two independent experiments, e.g., suppose we toss a coin and also roll a die. Let \((Ω_1, \ALPHABET F_1, \PR_1)\) and \((Ω_2, \ALPHABET F_2, \PR_2)\) be the probability spaces associated with the two experiments. What is the probability space \((Ω, \ALPHABET F, \PR)\) of the joint experiment?

  2. The sample space should obviously be \(Ω = Ω_1 \times Ω_2\). When \(\ALPHABET F_1\) and \(\ALPHABET F_2\) are finite, we can simply define \(\ALPHABET F = \ALPHABET F_1 \times \ALPHABET F_2\) and, for any \(A = A_1 \times A_2 \in \ALPHABET F\) with \(A_1 \in \ALPHABET F_1\) and \(A_2 \in \ALPHABET F_2\), set \(\PR(A) = \PR_1(A_1) \PR_2(A_2)\). Note that \(\PR(Ω) = \PR_1(Ω_1)\PR_2(Ω_2) = 1\), so \((Ω, \ALPHABET F, \PR)\) is a valid probability space. This space is called the direct product of the probability spaces \((Ω_1, \ALPHABET F_1, \PR_1)\) and \((Ω_2, \ALPHABET F_2, \PR_2)\).

  3. You would have implicitly constructed such product spaces when dealing with joint experiments (like the coin toss and die roll example above) in your undergrad courses. Such constructions were correct because of the following.

  4. Given \(A_1 \in \ALPHABET F_1\) and \(A_2 \in \ALPHABET F_2\), define the events \[ B_1 = \{ ω = (ω_1, ω_2) \in Ω : ω_1 \in A_1 \} \quad\text{and}\quad B_2 = \{ ω = (ω_1, ω_2) \in Ω : ω_2 \in A_2 \}. \] Then, \[ \PR(B_1 \cap B_2) = \PR_1(A_1) \PR_2(A_2) = \PR(B_1) \PR(B_2). \] Thus, events \(B_1\) and \(B_2\) are independent.
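
    For the finite case, this construction is easy to carry out explicitly. A minimal sketch with a fair coin and a fair die (illustrative only):

    ```python
    # Direct product of a fair coin and a fair die; B1 and B2 are independent.
    from fractions import Fraction
    from itertools import product

    Omega1, Omega2 = ["H", "T"], range(1, 7)
    Omega = list(product(Omega1, Omega2))             # 12 equally likely pairs
    PR = lambda E: Fraction(len(E), len(Omega))

    A1, A2 = {"H"}, {2, 4, 6}                         # events in each experiment
    B1 = {w for w in Omega if w[0] in A1}             # "coin outcome lies in A1"
    B2 = {w for w in Omega if w[1] in A2}             # "die outcome lies in A2"

    assert PR(B1 & B2) == PR(B1) * PR(B2)             # B1 and B2 are independent
    print(PR(B1 & B2))                                # 1/4
    ```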

  5. It is a bit more complicated when \(\ALPHABET F_1\) and \(\ALPHABET F_2\) are not finite. The difficulty is that \(\ALPHABET F_1 \times \ALPHABET F_2\) is not a \(σ\)-algebra. So, we take \(\ALPHABET F\) to be \(σ(\ALPHABET F_1 \times \ALPHABET F_2)\) (the smallest \(σ\)-algebra containing \(\ALPHABET F_1 \times \ALPHABET F_2\)) and define \(\PR\) to be the extension of \(\PR_1 \times \PR_2\) from \(\ALPHABET F_1 \times \ALPHABET F_2\) to \(σ(\ALPHABET F_1 \times \ALPHABET F_2)\) (one can show that such an extension exists). Such a product space is often written as \[ (Ω, \ALPHABET F, \PR) = (Ω_1 \times Ω_2, \ALPHABET F_1 \otimes \ALPHABET F_2, \PR_1 \otimes \PR_2). \]

We will not worry too much about the technical details of such product spaces, but will use the above notation at times in the course.

Notes

The material for this section is fairly standard and adapted from various sources. Both Grimmett and Stirzaker and Gubner have excellent coverage of this material. The idea of categorizing random phenomena as noise, uncertainty, or randomness is borrowed from Maxim Raginsky’s course notes. See the book Against the Gods, which provides a fascinating historical account of the development of probability theory and why a conceptual leap was needed to recognize that uncertainty could be modeled and measured.