Conditional probability and conditional expectation

Updated

November 11, 2024

Conditional probability is perhaps the most important aspect of probability theory as it explains how to incorporate new information into a probability model. However, formally defining conditional probability is a bit intricate. In these notes, I will first provide an intuitive high-level explanation of conditional probability. We will then do a deeper dive, trying to develop a bit more intuition about what is actually going on.

1 Conditioning on events

  1. Recall that conditional probability for events is defined as follows: given a probability space $(\Omega, \mathcal{F}, P)$ and events $A, B \in \mathcal{F}$ such that $P(B) > 0$, we have $$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

  2. Building on this definition, we can define the conditional CDF of a random variable $X$ conditioned on an event $C$ (such that $P(C) > 0$) as follows: $$F_{X|C}(x \mid C) = P(X \le x \mid C) = \frac{P(\{X \le x\} \cap C)}{P(C)}.$$

  3. Since, as we pointed out, conditional probabilities are probabilities, the conditional CDF defined above satisfies the properties of regular CDFs. In particular:

    • $0 \le F_{X|C}(x \mid C) \le 1$
    • $\lim_{x \to -\infty} F_{X|C}(x \mid C) = 0$
    • $\lim_{x \to +\infty} F_{X|C}(x \mid C) = 1$
    • $F_{X|C}(x \mid C)$ is a non-decreasing function.
    • $F_{X|C}(x \mid C)$ is a right-continuous function.
  4. Since $F_{X|C}$ is a CDF, we can classify random variables conditioned on an event as discrete or continuous in the usual way. In particular

    • If $F_{X|C}$ is piecewise constant, then $X$ conditioned on $C$ is a discrete random variable which takes values in a finite or countable subset $\operatorname{range}(X \mid C) = \{x_1, x_2, \dots\}$ of $\mathbb{R}$. Furthermore, $X$ conditioned on $C$ has a conditional PMF $p_{X|C} \colon \mathbb{R} \to [0,1]$ defined as $$p_{X|C}(x \mid C) = F_{X|C}(x \mid C) - F_{X|C}(x^- \mid C).$$

    • If $F_{X|C}$ is continuous, then $X$ conditioned on $C$ is a continuous random variable which has a conditional PDF $f_{X|C}$ given by $$f_{X|C}(x \mid C) = \frac{d}{dx} F_{X|C}(x \mid C).$$

    • If $F_{X|C}$ is neither piecewise constant nor continuous, then $X$ conditioned on $C$ is a mixed random variable.

    Therefore, a random variable conditioned on an event behaves exactly like a regular random variable. We can define the conditional expectation $E[X \mid C]$ and conditional variance $\operatorname{var}(X \mid C)$ in the obvious manner.

  5. An immediate implication of the law of total probability is the following.

    If $C_1, C_2, \dots, C_n$ is a partition of $\Omega$, then $$F_X(x) = \sum_{i=1}^{n} F_{X|C_i}(x \mid C_i)\, P(C_i).$$ Furthermore, if $X$ and $X$ conditioned on each $C_i$ are discrete, we have $$p_X(x) = \sum_{i=1}^{n} p_{X|C_i}(x \mid C_i)\, P(C_i),$$ and if $X$ and $X$ conditioned on each $C_i$ are continuous, we have $$f_X(x) = \sum_{i=1}^{n} f_{X|C_i}(x \mid C_i)\, P(C_i).$$
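The decomposition above can also be checked numerically. Below is a minimal simulation sketch (the two-event partition and the two conditional exponential distributions are illustrative choices, not from the notes): we estimate $F_X(t)$ empirically and compare it with $\sum_i F_{X|C_i}(t \mid C_i) P(C_i)$.

```python
# A minimal simulation sketch of the law of total probability for CDFs:
# F_X(t) = sum_i F_{X|Ci}(t) P(Ci).  The partition and conditional laws are
# illustrative choices, not taken from the notes.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Partition: C1 occurs with probability 0.3, C2 with probability 0.7.
p_C = np.array([0.3, 0.7])
which = rng.choice(2, size=n, p=p_C)

# Conditional laws: X | C1 ~ Exponential(rate 1), X | C2 ~ Exponential(rate 3).
x = np.where(which == 0,
             rng.exponential(scale=1.0, size=n),
             rng.exponential(scale=1 / 3, size=n))

t = 0.5
empirical = np.mean(x <= t)                                     # estimate of F_X(t)
mixture = 0.3 * (1 - np.exp(-t)) + 0.7 * (1 - np.exp(-3 * t))   # sum_i F_{X|Ci}(t) P(Ci)
print(empirical, mixture)   # the two numbers agree up to Monte Carlo error
```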

Exercise 1 Consider the following experiment. A fair coin is tossed. If the outcome is heads, $X$ is a Uniform$[0,1]$ random variable. If the outcome is tails, $X$ is a Bernoulli$(p)$ random variable. Find $F_X(x)$.

Example 1 (Memoryless property of geometric random variable) Let $X \sim \text{Geometric}(p)$ and let $m, n$ be positive integers. Compute $P(X > n + m \mid X > m)$.

Recall that the PMF of a geometric random variable is $p_X(k) = p(1-p)^{k-1}$, $k \in \mathbb{N}$. Therefore, $$P(X > m) = \sum_{k=m+1}^{\infty} p_X(k) = \sum_{k=m+1}^{\infty} p(1-p)^{k-1} = (1-p)^m.$$

Now consider $$P(X > m+n \mid X > m) = \frac{P(\{X > m+n\} \cap \{X > m\})}{P(X > m)} = \frac{P(X > m+n)}{P(X > m)} = \frac{(1-p)^{m+n}}{(1-p)^m} = (1-p)^n = P(X > n).$$

This is called the memoryless property of a geometric random variable.
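A quick Monte Carlo sanity check of the memoryless property (the parameter values below are arbitrary illustrative choices): we estimate $P(X > m+n \mid X > m)$ and $P(X > n)$ from samples and compare both with $(1-p)^n$.

```python
# Monte Carlo check of the memoryless property for X ~ Geometric(p).
# The values of p, m, n below are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
p, m, n = 0.3, 4, 6
samples = rng.geometric(p, size=500_000)        # support {1, 2, 3, ...}

cond = np.mean(samples[samples > m] > m + n)    # estimate of P(X > m+n | X > m)
uncond = np.mean(samples > n)                   # estimate of P(X > n)
print(cond, uncond, (1 - p) ** n)               # all three should be close
```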

Example 2 (Memoryless property of exponential random variable) Let $X \sim \text{Exponential}(\lambda)$ and let $t, s$ be positive reals. Compute $P(X > t + s \mid X > t)$.

Recall that the PDF of an exponential random variable is $f_X(x) = \lambda e^{-\lambda x}$, $x \ge 0$. Therefore, $$P(X > t) = \int_t^{\infty} f_X(x)\, dx = e^{-\lambda t}.$$

Now consider $$P(X > t+s \mid X > t) = \frac{P(\{X > t+s\} \cap \{X > t\})}{P(X > t)} = \frac{P(X > t+s)}{P(X > t)} = \frac{e^{-\lambda(t+s)}}{e^{-\lambda t}} = e^{-\lambda s} = P(X > s).$$

This is called the memoryless property of an exponential random variable.
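The same kind of check works in the exponential case; again the parameter values below are arbitrary illustrative choices.

```python
# Monte Carlo check of the memoryless property for X ~ Exponential(lambda).
import numpy as np

rng = np.random.default_rng(2)
lam, t, s = 0.8, 1.0, 1.5
samples = rng.exponential(scale=1 / lam, size=500_000)

cond = np.mean(samples[samples > t] > t + s)    # estimate of P(X > t+s | X > t)
uncond = np.mean(samples > s)                   # estimate of P(X > s)
print(cond, uncond, np.exp(-lam * s))           # all three should be close
```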

2 Conditioning on random variables

We start with the case where we are conditioning on discrete random variables.

  1. If $X$ and $Y$ are random variables defined on a common probability space and $Y$ is discrete, then $F_{X|Y}(x \mid y) = P(X \le x \mid Y = y)$ for any $y$ such that $P(Y = y) > 0$.

  2. If $X$ is also discrete, the conditional PMF $p_{X|Y}$ is defined as $$p_{X|Y}(x \mid y) = P(X = x \mid Y = y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}$$ for any $y$ such that $P(Y = y) > 0$.

    Moreover, we have that $p_{X|Y}(x \mid y) = F_{X|Y}(x \mid y) - F_{X|Y}(x^- \mid y)$.

  3. The above expression can be written differently to give the chain rule for random variables: $p_{X,Y}(x,y) = p_Y(y)\, p_{X|Y}(x \mid y)$.

  4. For any event $B \in \mathcal{B}(\mathbb{R})$, the law of total probability may be written as $$P(X \in B) = \sum_{y \in \operatorname{range}(Y)} P(X \in B \mid Y = y)\, p_Y(y)$$ (this is illustrated in the sketch after this list).

  5. If $X$ is independent of $Y$, we have $p_{X|Y}(x \mid y) = p_X(x)$ for all $x, y \in \mathbb{R}$.
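The relations in items 2–4 are easy to see on a small joint PMF table. The sketch below uses a made-up $3 \times 2$ joint PMF and verifies the conditional PMF formula, the chain rule, and the law of total probability numerically.

```python
# A small sketch (with an arbitrary made-up joint PMF) showing how the
# conditional PMF, the chain rule, and the law of total probability fit together.
import numpy as np

# Joint PMF p_{X,Y}(x, y): rows index x in {0, 1, 2}, columns index y in {0, 1}.
p_XY = np.array([[0.10, 0.15],
                 [0.20, 0.25],
                 [0.05, 0.25]])

p_Y = p_XY.sum(axis=0)                 # marginal PMF of Y
p_X_given_Y = p_XY / p_Y               # p_{X|Y}(x|y) = p_{X,Y}(x,y) / p_Y(y)

# Chain rule: p_{X,Y}(x,y) = p_Y(y) * p_{X|Y}(x|y)
assert np.allclose(p_XY, p_X_given_Y * p_Y)

# Law of total probability: p_X(x) = sum_y p_{X|Y}(x|y) p_Y(y)
p_X = (p_X_given_Y * p_Y).sum(axis=1)
print(p_X, p_XY.sum(axis=1))           # the two marginals agree
```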

We now consider the case when we are conditioning on a continuous random variable.

  1. If $Y$ is continuous, $P(Y = y) = 0$ for all $y$. We may think of $$F_{X|Y}(x \mid y) = \lim_{\delta \downarrow 0} P(X \le x \mid y \le Y \le y + \delta).$$

  2. When $X$ and $Y$ are jointly continuous, we define the conditional PDF $$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}.$$

    Note that the conditional PDF cannot be interpreted in the same manner as the conditional PMF because it gives the impression that we are conditioning on a zero-probability event. However, we can view it as a limit as follows:

    $$F_{X|Y}(x \mid y) = \lim_{\delta \downarrow 0} P(X \le x \mid y \le Y \le y+\delta) = \lim_{\delta \downarrow 0} \frac{P(X \le x,\ y \le Y \le y+\delta)}{P(y \le Y \le y+\delta)} = \lim_{\delta \downarrow 0} \frac{\int_{-\infty}^{x} \int_{y}^{y+\delta} f_{X,Y}(u,v)\, dv\, du}{\int_{y}^{y+\delta} f_Y(v)\, dv}.$$ If $\delta$ is small, we can approximate $\int_y^{y+\delta} f_Y(v)\, dv \approx f_Y(y)\,\delta$ and $\int_y^{y+\delta} f_{X,Y}(u,v)\, dv \approx f_{X,Y}(u,y)\,\delta$. Substituting in the above equation, we get $$F_{X|Y}(x \mid y) \approx \lim_{\delta \downarrow 0} \frac{\int_{-\infty}^{x} f_{X,Y}(u,y)\,\delta\, du}{f_Y(y)\,\delta} = \int_{-\infty}^{x} \left[\frac{f_{X,Y}(u,y)}{f_Y(y)}\right] du.$$ Thus, when $X$ and $Y$ are jointly continuous, we have $$f_{X|Y}(x \mid y) = \frac{d}{dx} F_{X|Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}.$$

    The formal definition of conditional densities requires some ideas from advanced probability theory, which we will not cover in this course. Nonetheless, I will try to explain the intuition behind the formal definitions in the next section.

  3. The above expression may be written differently to give the chain rule for random variables: $f_{X,Y}(x,y) = f_Y(y)\, f_{X|Y}(x \mid y)$.

  4. For any event $B \in \mathcal{B}(\mathbb{R})$, the law of total probability may be written as $$P(X \in B) = \int_{-\infty}^{\infty} P(X \in B \mid Y = y)\, f_Y(y)\, dy.$$ An immediate implication of this is $$F_X(x) = P(X \le x) = \int_{-\infty}^{\infty} P(X \le x \mid Y = y)\, f_Y(y)\, dy = \int_{-\infty}^{\infty} F_{X|Y}(x \mid y)\, f_Y(y)\, dy.$$

  5. If $X$ is independent of $Y$, we have $f_{X|Y}(x \mid y) = f_X(x)$ for all $x, y \in \mathbb{R}$.

  6. We can show that the conditional PMF and conditional PDF satisfy all the properties of PMFs and PDFs. Therefore, we can define the conditional expectation $E[g(X) \mid Y = y]$ in terms of $p_{X|Y}$ or $f_{X|Y}$. We can similarly define the conditional variance $\operatorname{var}(X \mid Y = y)$.

Example 3 Suppose $X$ and $Y$ are jointly continuous random variables with the joint PDF $$f_{X,Y}(x,y) = \frac{e^{-x/y}\, e^{-y}}{y}, \quad 0 < x < \infty,\ 0 < y < \infty.$$ Find $f_{X|Y}$.

We first compute the marginal $f_Y(y)$:

$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dx = \int_0^{\infty} \frac{e^{-x/y}\, e^{-y}}{y}\, dx = \frac{e^{-y}}{y} \int_0^{\infty} e^{-x/y}\, dx = e^{-y}.$$ Thus, $$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = \frac{e^{-x/y}}{y}, \quad 0 < x < \infty,\ 0 < y < \infty.$$
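As a sanity check (a sketch that assumes scipy is available), we can integrate the joint PDF numerically at a fixed $y$ and confirm that the marginal equals $e^{-y}$ and that the resulting conditional PDF integrates to one.

```python
# Numerical sanity check of Example 3: f_Y(y) = e^{-y}, and the conditional
# PDF f_{X|Y}(x|y) integrates to 1.  The value y0 is an arbitrary test point.
import numpy as np
from scipy.integrate import quad

def f_XY(x, y):
    return np.exp(-x / y) * np.exp(-y) / y

y0 = 1.7
marginal, _ = quad(lambda x: f_XY(x, y0), 0, np.inf)
print(marginal, np.exp(-y0))                        # f_Y(y0) vs. e^{-y0}

cond_mass, _ = quad(lambda x: f_XY(x, y0) / marginal, 0, np.inf)
print(cond_mass)                                    # should be 1
```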

Example 4 Suppose $X \sim \text{Uniform}[0,1]$ and, given $X = x$, $Y$ is uniformly distributed on $(0, x)$. Find the PDF of $Y$.

We will use the law of total probability. $$F_Y(y) = \int_{-\infty}^{\infty} F_{Y|X}(y \mid x)\, f_X(x)\, dx = \int_0^1 F_{Y|X}(y \mid x)\, dx,$$ where we have used the fact that $f_X(x) = 1$ for $x \in [0,1]$. Now, we know that given $X = x$, $Y \sim \text{Uniform}(0,x)$. Therefore, $f_{Y|X}(y \mid x) = \frac{1}{x}$ for $0 < y < x$, and hence $$F_{Y|X}(y \mid x) = \begin{cases} 0 & y \le 0 \\ \frac{y}{x} & 0 < y < x \\ 1 & y \ge x. \end{cases}$$

We will compute $F_Y(y)$ for the three cases separately.

  • For $y < 0$, $F_Y(y) = \int_0^1 F_{Y|X}(y \mid x)\, dx = 0$.

  • For $0 < y < 1$, $F_Y(y) = \int_0^y 1\, dx + \int_y^1 \frac{y}{x}\, dx = y - y\ln y$.

  • For $y > 1$, $F_Y(y) = \int_0^1 1\, dx = 1$.

Thus, $$F_Y(y) = \begin{cases} 0 & y < 0 \\ y - y\ln y & 0 \le y < 1 \\ 1 & y \ge 1. \end{cases}$$

Hence, $$f_Y(y) = \frac{dF_Y(y)}{dy} = -\ln y, \quad 0 < y < 1.$$
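A short simulation sketch confirms the answer: sampling $X \sim \text{Uniform}[0,1]$ and then $Y \mid X = x \sim \text{Uniform}(0, x)$, the empirical density of $Y$ matches $-\ln y$.

```python
# Simulation sketch for Example 4: compare the empirical density of Y
# with -ln(y) at a few test points in (0, 1).
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=500_000)
y = rng.uniform(0, x)                      # Y | X = x is Uniform(0, x)

for y0 in (0.1, 0.3, 0.7):
    h = 0.01                               # small bin around y0
    density_est = np.mean(np.abs(y - y0) < h / 2) / h
    print(y0, density_est, -np.log(y0))    # empirical vs. -ln(y0)
```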

3 Conditional expectation

Define $\psi(y) = E[X \mid Y = y]$. In particular, if $X$ and $Y$ are discrete, then $$\psi(y) = \sum_{x \in \operatorname{range}(X)} x\, p_{X|Y}(x \mid y),$$ and, if $X$ and $Y$ are jointly continuous, then $$\psi(y) = \int_{-\infty}^{\infty} x\, f_{X|Y}(x \mid y)\, dx.$$ Let $Z = \psi(Y)$. Then $Z$ is a random variable! This is a subtle point, and we will spend some time developing an intuition of what this means.
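The following discrete sketch (with a made-up joint PMF) illustrates the point: we first compute the deterministic function $\psi(y)$, and then observe that $Z = \psi(Y)$ is itself a random variable, taking the value $\psi(y)$ with probability $p_Y(y)$.

```python
# A discrete sketch of the point that E[X | Y] is a random variable:
# compute psi(y) = E[X | Y = y], then form Z = psi(Y).
import numpy as np

x_vals = np.array([0, 1, 2])
y_vals = np.array([0, 1])
p_XY = np.array([[0.10, 0.15],     # rows: x, columns: y (arbitrary joint PMF)
                 [0.20, 0.25],
                 [0.05, 0.25]])

p_Y = p_XY.sum(axis=0)
psi = (x_vals[:, None] * p_XY / p_Y).sum(axis=0)   # psi(y) = sum_x x p_{X|Y}(x|y)
print(dict(zip(y_vals, psi)))                      # psi as a function of y

# Z = psi(Y) is itself a (discrete) random variable: it takes the value psi(y)
# with probability p_Y(y).
print(list(zip(psi, p_Y)))
```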

3.1 Conditioning on a σ-algebra

The key idea is conditioning on a σ-algebra. To avoid technical subtleties, we restrict to the discrete case.

  1. Consider a probability space $(\Omega, \mathcal{F}, P)$ where $\mathcal{F}$ is a finite σ-algebra. Let $\mathcal{G}$ be a sub-σ-algebra of $\mathcal{F}$. In particular, we assume that there is a partition $\{D_1, D_2, \dots, D_m\}$ of $\Omega$ such that $\mathcal{G} = \sigma(\{D_1, \dots, D_m\})$, i.e., $\mathcal{G}$ consists of all (possibly empty) unions of the $D_i$. The elements $D_1, \dots, D_m$ are called the atoms of the σ-algebra $\mathcal{G}$.

TODO: Add example. 4x4 grid. partition for G.

  2. We define $P(A \mid \mathcal{G})$ (which we will write as $E[\mathbf{1}_A \mid \mathcal{G}]$) as $$E[\mathbf{1}_A \mid \mathcal{G}](\omega) = \sum_{i=1}^{m} E[\mathbf{1}_A \mid D_i]\, \mathbf{1}_{D_i}(\omega).$$ Thus, for each $\omega \in D_i$, the value of $E[\mathbf{1}_A \mid \mathcal{G}]$ is equal to $E[\mathbf{1}_A \mid D_i]$.

  3. This idea can be extended to any random variable instead of $\mathbf{1}_A$; that is, for any random variable $X$, $$E[X \mid \mathcal{G}](\omega) = \sum_{i=1}^{m} E[X \mid D_i]\, \mathbf{1}_{D_i}(\omega).$$ Thus, for each $\omega \in D_i$, the value of $E[X \mid \mathcal{G}]$ is equal to $E[X \mid D_i]$ (see the sketch at the end of this subsection).

  4. When $\mathcal{G} = \{\emptyset, \Omega\}$ is the trivial σ-algebra, $E[X \mid \{\emptyset, \Omega\}] = E[X]$.

  5. When $X = \mathbf{1}_A$, $E[\mathbf{1}_A \mid \mathcal{G}] = P(A \mid \mathcal{G})$.

  6. If $X_1$ and $X_2$ are random variables on the same probability space and $a_1$ and $a_2$ are constants, then $$E[a_1 X_1 + a_2 X_2 \mid \mathcal{G}] = a_1 E[X_1 \mid \mathcal{G}] + a_2 E[X_2 \mid \mathcal{G}].$$

  7. If $Y$ is another random variable which is $\mathcal{G}$-measurable (i.e., $Y$ takes constant values on the atoms of $\mathcal{G}$), then $$E[XY \mid \mathcal{G}] = Y\, E[X \mid \mathcal{G}].$$

    [The result can be proved pictorially.]
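The following sketch makes the atom-averaging concrete on a small example (a uniform measure on a $4 \times 4$ grid of outcomes, with $\mathcal{G}$ generated by the rows; the specific random variables are arbitrary choices). It also checks the last property in the list above, $E[XY \mid \mathcal{G}] = Y\, E[X \mid \mathcal{G}]$, for a $\mathcal{G}$-measurable $Y$.

```python
# Illustrative sketch: Omega is a 4x4 grid of outcomes with uniform measure,
# G is generated by the partition into rows, and E[X | G] is obtained by
# averaging X over each row (so it is constant on each atom).
import numpy as np

omega = np.arange(16).reshape(4, 4)     # Omega = {0, ..., 15}, laid out as a 4x4 grid
P = np.full((4, 4), 1 / 16)             # uniform probability measure
X = (omega % 5).astype(float)           # an arbitrary random variable on Omega

E_X_given_G = np.empty_like(X)
for i in range(4):                      # atoms D_1, ..., D_4 of G: the rows
    E_X_given_G[i, :] = (X[i] * P[i]).sum() / P[i].sum()   # E[X | D_i]
print(E_X_given_G)                      # constant on each row, i.e., G-measurable

# If Y is G-measurable (constant on each row), then E[XY | G] = Y E[X | G].
Y = np.repeat(np.array([1.0, 2.0, 3.0, 4.0])[:, None], 4, axis=1)
E_XY_given_G = np.empty_like(X)
for i in range(4):
    E_XY_given_G[i, :] = (X[i] * Y[i] * P[i]).sum() / P[i].sum()
print(np.allclose(E_XY_given_G, Y * E_X_given_G))   # True
```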

3.2 Smoothing property of conditional expectation

Let $\mathcal{H} \subseteq \mathcal{G} \subseteq \mathcal{F}$, where $\subseteq$ means sub-σ-algebra. Let $\{E_1, \dots, E_k\}$ denote the partition corresponding to $\mathcal{H}$ and $\{D_1, \dots, D_m\}$ denote the partition corresponding to $\mathcal{G}$. The fact that $\mathcal{H}$ is a sub-σ-algebra of $\mathcal{G}$ means that each atom $E_i$ of $\mathcal{H}$ is a union of atoms of $\mathcal{G}$ (i.e., $\{D_1, \dots, D_m\}$ is a refinement of the partition $\{E_1, \dots, E_k\}$). Therefore, $$E\bigl[E[X \mid \mathcal{G}] \,\big|\, \mathcal{H}\bigr] = E[X \mid \mathcal{H}].$$ This is known as the smoothing property of conditional expectation.

A special case of the above property is $$E\bigl[E[X \mid \mathcal{G}]\bigr] = E[X],$$ where we have taken $\mathcal{H} = \{\emptyset, \Omega\}$ to be the trivial σ-algebra. Observe that this identity is equivalent to the law of total probability!
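Here is a minimal numerical check of the smoothing property on a toy example (uniform measure on eight outcomes, with $\mathcal{G}$ generated by pairs and $\mathcal{H}$ by a coarser two-block partition; the values of $X$ are arbitrary).

```python
# Numerical check of the smoothing property E[ E[X|G] | H ] = E[X|H]
# for nested finite sigma-algebras described by their atoms.
import numpy as np

P = np.full(8, 1 / 8)                                     # uniform measure on Omega
X = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])    # arbitrary random variable

def cond_exp(Z, atoms):
    """Return E[Z | sigma(atoms)] as a vector over Omega (constant on each atom)."""
    out = np.empty_like(Z)
    for atom in atoms:
        idx = np.array(atom)
        out[idx] = (Z[idx] * P[idx]).sum() / P[idx].sum()
    return out

G_atoms = [[0, 1], [2, 3], [4, 5], [6, 7]]                # finer partition
H_atoms = [[0, 1, 2, 3], [4, 5, 6, 7]]                    # coarser partition

lhs = cond_exp(cond_exp(X, G_atoms), H_atoms)             # E[ E[X|G] | H ]
rhs = cond_exp(X, H_atoms)                                # E[X | H]
print(np.allclose(lhs, rhs))                              # True: smoothing property

# Special case H = {emptyset, Omega}: E[ E[X|G] ] = E[X].
print((cond_exp(X, G_atoms) * P).sum(), (X * P).sum())
```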

3.3 Conditioning on random variables

  1. Now suppose $Y$ is a discrete random variable. Then $P(A \mid Y)$ and $E[X \mid Y]$ may be viewed as shorthand notation for $P(A \mid \sigma(Y))$ and $E[X \mid \sigma(Y)]$. Similar interpretations hold for conditioning on multiple random variables (or, equivalently, conditioning on random vectors).

  2. The smoothing property of conditional expectation can then be stated as $$E\bigl[E[X \mid Y_1, Y_2] \,\big|\, Y_1\bigr] = E[X \mid Y_1].$$

  3. An implication of the smoothing property is the following: for any (measurable) function $g \colon \mathbb{R} \to \mathbb{R}$, $$E\bigl[g(Y)\, E[X \mid Y]\bigr] = E[X g(Y)]$$ (a numerical check of this identity appears after this list).

  4. The previous property is used for generalizing the definition of conditional expectation to continuous random variables. First, we consider conditioning with respect to a σ-algebra $\mathcal{G} \subseteq \mathcal{F}$, which is not necessarily finite (or countable).

    Then, for any non-negative random variable $X$,¹ $E[X \mid \mathcal{G}]$ is defined as a $\mathcal{G}$-measurable random variable that satisfies $$E\bigl[\mathbf{1}_A\, E[X \mid \mathcal{G}]\bigr] = E[X \mathbf{1}_A]$$ for every $A \in \mathcal{G}$.

¹ We start with non-negative random variables just to avoid concerns about the existence of the expectation due to $\infty - \infty$ terms. A similar definition works in general as long as we can rule out the $\infty - \infty$ case.

  5. It can be shown that $E[X \mid \mathcal{G}]$ exists and is unique up to sets of measure zero. Formally, one talks about a “version” of conditional expectation.

  6. Then $E[X \mid Y]$ for $Y$ continuous may be viewed as $E[X \mid \sigma(Y)]$.

  7. The formal definition of conditional expectation implies that if we take any Borel subset $B_X$ of $\mathbb{R}$, then $P(X \in B_X \mid Y)$ is a (measurable) function $m(y)$ that satisfies $$P(X \in B_X, Y \in B_Y) = \int_{B_Y} m(y)\, f_Y(y)\, dy \tag{1}$$ for all Borel subsets $B_Y$ of $\mathbb{R}$.

    We will show that $$m(y) = \int_{B_X} \frac{f_{X,Y}(x,y)}{f_Y(y)}\, dx$$ satisfies (1). In particular, the RHS of (1) is $$\int_{B_Y} \left[\int_{B_X} \frac{f_{X,Y}(x,y)}{f_Y(y)}\, dx\right] f_Y(y)\, dy = \int_{B_Y} \int_{B_X} f_{X,Y}(x,y)\, dx\, dy = P(X \in B_X, Y \in B_Y),$$ which equals the LHS of (1). This is why the conditional density is defined the way it is!

  8. Finally, it can be shown that $P(A \mid Y) := E[\mathbf{1}_A \mid \sigma(Y)]$, $A \in \mathcal{F}$, satisfies the axioms of probability. Therefore, conditional probability satisfies all the properties of probability (and consequently, conditional expectations satisfy all the properties of expectations).

  9. Note that the definition of conditional expectation generalizes Bayes' rule. In particular, for any (measurable) function $g \colon \mathbb{R} \to \mathbb{R}$ we have $$E[g(X) \mid Y = y] = \int_{-\infty}^{\infty} g(x)\, f_{X|Y}(x \mid y)\, dx = \int_{-\infty}^{\infty} g(x)\, \frac{f_{X,Y}(x,y)}{f_Y(y)}\, dx = \frac{\int_{-\infty}^{\infty} g(x)\, f_{X,Y}(x,y)\, dx}{f_Y(y)} = \frac{\int_{-\infty}^{\infty} g(x)\, f_{X,Y}(x,y)\, dx}{\int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dx} = \frac{\int_{-\infty}^{\infty} g(x)\, f_{Y|X}(y \mid x)\, f_X(x)\, dx}{\int_{-\infty}^{\infty} f_{Y|X}(y \mid x)\, f_X(x)\, dx}.$$
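The identity $E[g(Y)\, E[X \mid Y]] = E[X g(Y)]$ from item 3 can be checked by simulation. The sketch below reuses the joint density of Example 3, for which $X \mid Y = y$ is exponential with mean $y$ (so $E[X \mid Y] = Y$) and $Y \sim \text{Exponential}(1)$; the choice of $g$ is arbitrary.

```python
# Monte Carlo check of E[ g(Y) E[X|Y] ] = E[ X g(Y) ] using the joint density
# of Example 3: Y ~ Exponential(1) and X | Y = y ~ Exponential with mean y,
# so that E[X | Y] = Y.
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
y = rng.exponential(scale=1.0, size=n)        # Y ~ Exponential(1), f_Y(y) = e^{-y}
x = rng.exponential(scale=y)                  # X | Y = y ~ Exponential with mean y

g = lambda t: np.cos(t)                       # any bounded measurable g works
lhs = np.mean(g(y) * y)                       # E[ g(Y) E[X|Y] ], using E[X|Y] = Y
rhs = np.mean(x * g(y))                       # E[ X g(Y) ]
print(lhs, rhs)                               # agree up to Monte Carlo error
```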

Exercise 2 Let X and Y be independent and identically distributed Bernoulli(p) random variables.

  1. Consider the events $A_k = \{\omega : X(\omega) + Y(\omega) = k\}$, $k \in \{0, 1, 2\}$. Find $P(A_k \mid X)$.

  2. Compute $E[X + Y \mid X]$.
