2  Dealing with observations

Updated: January 2, 2026

Keywords: stochastic optimization, principle of irrelevant information

A key feature that differentiates stochastic optimization problems from deterministic ones is that the decision maker may have some information about the state of the world before making a decision. In the previous chapter, we assumed that this was not the case, but in many situations the decision maker does observe some data before making a decision. In this chapter, we present various models for capturing such situations.

2.1 Stochastic optimization with partial observations

A general model of stochastic optimization with partial observations is as follows. Let \(W \in \ALPHABET W\) denote the state of the world. The decision maker makes an observation \(S \in \ALPHABET S\) and then chooses an action \(A \in \ALPHABET A\) as a function of the observation according to a decision rule or policy \(π \colon \ALPHABET S \to \ALPHABET A\), i.e., \[ A = π(S). \]

For the ease of notation, we will use \(Π\) to denote the set of all deterministic (and measurable) policies \(π \colon \ALPHABET S \to \ALPHABET A\).

Upon choosing the action \(A\), the decision maker incurs a cost \(c(S,A,W)\). We assume that the primitive random variables \((W,S)\) are defined on a common probability space and have a known joint distribution. Assume that the decision maker is risk neutral and, therefore, wants to minimize \(\EXP[ c(S, π(S), W)]\), where the expectation is taken with respect to the joint probability distribution of \((W,S)\).

The temporal order in which the system variables are generated is as follows. First, nature acts and generates all the primitive random variables \(ω = (W,S)\) according to a given probability distribution. The agent then observes \(S(ω)\) and takes an action \(A\) according to a policy \(π \colon \ALPHABET S \to \ALPHABET A\). The temporal order can be visualized according to the following timing diagram.

Figure 2.1: Timing diagram for modeling as a one-step optimization problem with partial observation

Formally, the above optimization problem may be written as \[\begin{equation} \label{eq:obs} \tag{P1} \min_{π \in Π} \EXP[ c(S, π(S), W) ]. \end{equation}\]

Define \[J(π) = \EXP[ c(S, π(S), W) ].\] Then, a policy \(π^*\) is optimal if \[ J(π^*) \le J(π), \quad \forall π \in Π. \] When \(\ALPHABET S\) and \(\ALPHABET A\) are finite sets, an optimal policy always exists and can be found by brute force search. When \(\ALPHABET S\) and/or \(\ALPHABET A\) are continuous, we need some regularity assumptions on the model to ensure the existence of an optimal policy.

Problem \(\eqref{eq:obs}\) is conceptually the same as the stochastic optimization problems considered in the previous chapter, but with one difference: the basic stochastic optimization problem is a parameter optimization problem where the minimization is over a parameter \(a\), while Problem \(\eqref{eq:obs}\) is a functional optimization problem where the minimization is over a function \(π\).

When \(\ALPHABET S\) and \(\ALPHABET A\) are finite sets, the optimal policy can be obtained by an exhaustive search over all policies as follows: for each policy \(π\) compute the performance \(J(π)\) and then pick the policy \(π\) with the smallest expected cost.
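The following minimal Python sketch illustrates this exhaustive search for a generic finite model. The names are illustrative (they are not part of the notes), and the model is assumed to be specified by a joint pmf `P[(s, w)]` and a cost function `c(s, a, w)`.

```python
# A minimal sketch of the exhaustive search (illustrative names only).
# The model is assumed to be given as a joint pmf P[(s, w)] and a cost function c(s, a, w).
from itertools import product

def brute_force_search(S, A, P, c):
    """Enumerate all |A|^|S| policies pi: S -> A and return the one minimizing J(pi).

    S and A are lists of observations and actions; P is a dict mapping (s, w) to a probability.
    """
    best_policy, best_cost = None, float("inf")
    for actions in product(A, repeat=len(S)):      # one action per observation
        policy = dict(zip(S, actions))
        J = sum(p * c(s, policy[s], w) for (s, w), p in P.items())
        if J < best_cost:
            best_policy, best_cost = policy, J
    return best_policy, best_cost
```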

Such an exhaustive search is not satisfactory for two reasons. First, it has a high computational cost. There are \(| \ALPHABET A |^{| \ALPHABET S |}\) policies and, for each policy, we have to evaluate an expectation, which can be expensive. Second, the above enumeration procedure does not work when \(\ALPHABET S\) or \(\ALPHABET A\) are continuous sets.

There is an alternative way of viewing the problem that simplifies it considerably. Before presenting this alternative, we present a way of modeling such optimization problems using trees when all variables \((W,S,A)\) are finite valued. Such a tree formulation has two benefits:

  • it provides a geometric view, which can be useful in visualizing the result;
  • for general, multi-stage problems, it is possible to exploit the tree structure to obtain efficient computation algorithms, see, e.g., Fu (2018) for an historical overview.

2.2 A tree representation of stochastic optimization with partial information

To fix ideas, we will work with the following simple example.

Example 2.1 (A simple card game) Consider a gambler playing a stylized version of a card game played with a perfectly shuffled deck of \(4\) cards: \(\{1,2,3,4\}\). A dealer deals two cards: one to himself and one to the gambler. The gambler looks at his card and can decide to take one of two actions: fold or challenge. If the gambler folds, the game is over; if he challenges and has a higher card than the dealer, he wins \$1; if he challenges but has a lower card than the dealer, he loses \$1. What is the optimal policy?

To map Example 2.1 to the general model, let \(W\) denote the cards dealt, \(S\) denote gambler’s card, and \(A=0\) denote the fold action and \(A=1\) denote the challenge action. Then: \[ \ALPHABET W = \{(1,2), (1,3), (1,4), (2,1), (2,3), (2,4), (3,1), (3,2), (3,4), (4,1), (4,2), (4,3) \}, \] where the first index denotes the card dealt to the dealer and the second denotes the card dealt to the gambler. We have assumed that each of these outcomes is equally likely.

Note that \(\ALPHABET S = \{1,2,3,4\}\) and \(\ALPHABET A = \{0,1\}\). So, a policy \(π\) is a mapping from \(\ALPHABET S\) to \(\ALPHABET A\), i.e., for each card that could be dealt to the gambler, he needs to decide whether to fold or challenge.

We can view the sequence of events in Example 2.1 as a tree, shown in Figure 2.2, where the root node corresponds to the moves of nature to generate \((W,S)\), and the nodes at depth 1 correspond to the moves of the agent to generate \(A\).

Figure 2.2: Tree representation of the one-step optimization problem. Nodes represent the “agent”: nature or decision maker; the edges represent the moves of the agents; and the numbers at the leaf nodes represent the cost \(c(s,a,w)\). The shaded edges represent a policy.

In this case, the observation \(S\) gives the decision maker some partial information about the move of nature. For example, when \(S=1\), the decision maker knows that \(W \in \{ (2,1), (3,1), (4,1) \}\), but cannot further distinguish between them. This information is indicated in the tree diagram by drawing an information set around these three nodes, and ensuring that the action taken at these nodes is identical.

2.3 The main simplifying idea

The key simplifying idea is that instead of viewing the optimization problem before the system starts running (i.e., the ex ante view), imagine that the decision maker waits until they see the realization \(s\) of \(S\) (i.e., the interim view). In other words, we think of the decision maker at each information set as a separate agent and determine the optimal action for that agent.

For instance, in Example 2.1, after making an observation \(S=1\), the gambler forms a posterior belief on the possible states of nature using Bayes rule. In particular, \[ \PR(W | S = 1) = \begin{cases} \tfrac 13, & \hbox{if } W \in \{(2,1), (3,1), (4,1) \} \\ 0, & \hbox{otherwise} \end{cases} \]

Now consider the sub-tree corresponding to observation \(S=1\). In this sub-tree, the agent has two actions: \(A = 0\) or \(A = 1\). We compute the expected cost corresponding to each action, i.e., compute \[ Q(s,a) = \EXP[ c(s,a,W) \mid S = s ] = \sum_{w \in \ALPHABET W} \PR(w|s) c(s,a,w). \] where the expectation is with respect to the posterior belief \(P_{W|S}\) computed above. The function \(Q\) is called the action-value function.

For Example 2.1, \(Q(s,0)\) is always \(0\) (folding costs nothing). Recalling that the cost is the negative of the gambler's winnings (\(-1\) for a winning challenge, \(+1\) for a losing one), we explicitly compute \(Q(s,1)\) for all values of \(s\):

  • \(Q(1,1) = \tfrac 13 \bigl[ 1 + 1 + 1 \bigr] = 1 > 0\).
  • \(Q(2,1) = \tfrac 13[-1 + 1 + 1] = \tfrac 13 > 0\).
  • \(Q(3,1) = \tfrac 13[-1 - 1 + 1] = -\tfrac 13 < 0\).
  • \(Q(4,1) = \tfrac 13 [ -1 - 1 - 1] = -1 < 0\).

Thus, the interim view suggests that the gambler should fold when his card is 1 or 2 and challenge when his card is 3 or 4; these values are verified numerically in the sketch below.
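The following Python sketch reproduces these action values by brute force. The data structures (a list of ordered deals and a uniform prior) are illustrative, and the sign convention for the cost (the negative of the gambler's winnings) is the one assumed above.

```python
# Reproducing the action values for Example 2.1 (illustrative sketch).
# Convention: cost = -(gambler's winnings), i.e., -1 for a winning challenge, +1 for a losing one.
from fractions import Fraction
from itertools import permutations

deals = list(permutations([1, 2, 3, 4], 2))            # w = (dealer's card, gambler's card)
prob = {w: Fraction(1, len(deals)) for w in deals}     # uniform distribution over the 12 deals

def cost(s, a, w):
    """a = 0: fold (cost 0); a = 1: challenge (cost -1 on a win, +1 on a loss)."""
    dealer, gambler = w
    return 0 if a == 0 else (-1 if gambler > dealer else 1)

def Q(s, a):
    """Action-value function: expected cost under the posterior P(w | S = s)."""
    posterior = {w: p for w, p in prob.items() if w[1] == s}   # deals consistent with s
    total = sum(posterior.values())
    return sum((p / total) * cost(s, a, w) for w, p in posterior.items())

for s in [1, 2, 3, 4]:
    print(s, Q(s, 0), Q(s, 1))   # Q(s,1) evaluates to 1, 1/3, -1/3, -1 for s = 1, 2, 3, 4
```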
Main claim

The action value functions are sufficient to compute the optimal policy as well as the performance of any policy.

In particular, for any (deterministic) policy \(\pi\), we define the value function of the policy as \[ V^{π}(s) = Q(s, π(s)). \] Then, \(J(π) = \EXP[ V^{π}(S) ]\).

Define the optimal value function as \[ V^*(s) = \min_{a \in \ALPHABET A} Q(s,a), \quad \forall s \in \ALPHABET S. \] Note that to compute the optimal value function, we need to solve a collection of parameter optimization problems, one for each value of \(s\).

Now define \[\begin{equation} \label{eq:cond} \tag{P2-policy} π^∘(s) \in \arg \min_{a \in \ALPHABET A} \EXP[ c(s,a, W) | S = s] \end{equation}\] where ties (in the minimization) are broken arbitrarily.
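As a sketch of how this works computationally, the function below solves a collection of interim problems, one per observation, for a generic finite model specified (as in the earlier sketches) by a joint pmf `P[(s, w)]` and a cost function `c(s, a, w)`; all names are illustrative.

```python
# A sketch of the interim-view computation for a generic finite model (illustrative names).
# Assumes P(S = s) > 0 for every s in S.
def interim_solution(S, A, P, c):
    """Return pi_opt with pi_opt(s) in argmin_a Q(s, a), V_opt(s) = min_a Q(s, a),
    and the performance J(pi_opt) = sum_s P(s) V_opt(s)."""
    pi_opt, V_opt, J = {}, {}, 0
    for s in S:
        conditional = {w: p for (s_, w), p in P.items() if s_ == s}
        P_s = sum(conditional.values())                  # marginal probability P(S = s)
        Q_s = {a: sum((p / P_s) * c(s, a, w) for w, p in conditional.items()) for a in A}
        pi_opt[s] = min(Q_s, key=Q_s.get)                # ties broken by the first minimizer
        V_opt[s] = Q_s[pi_opt[s]]
        J += P_s * V_opt[s]
    return pi_opt, V_opt, J
```

Applied to the card-game data from the earlier sketch (with `P = {(w[1], w): p for w, p in prob.items()}` and the same `cost`), this should recover the policy "fold on 1 or 2, challenge on 3 or 4" with \(J(π^∘) = -\tfrac 13\), and it requires only \(|\ALPHABET S|\,|\ALPHABET A|\) interim evaluations instead of \(|\ALPHABET A|^{|\ALPHABET S|}\) policy evaluations.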

Theorem 2.1 The decision rule \(π^∘\) defined in \(\eqref{eq:cond}\) is optimal for Problem \(\eqref{eq:obs}\), i.e., \[ \min_{π \in Π} \EXP[ c(S, π(S), W) ] = \EXP \biggl[ \min_{a \in \ALPHABET A} \EXP[ c(S,a,W) | S ] \biggr] . \]

Remark

We restricted the proof to finite \(\ALPHABET S\), \(\ALPHABET A\), \(\ALPHABET W\). This is to avoid any measurability issues. If \(\ALPHABET S\) and \(\ALPHABET A\) are continuous sets, we need to restrict to measurable \(π\) in Problem \(\eqref{eq:obs}\) (otherwise the expectation is not well defined; of course the cost \(c\) also has to be measurable). However, it is not immediately obvious that \(π^∘\) defined in \(\eqref{eq:cond}\) is measurable. Conditions that ensure this are known as measurable selection theorems.

To prove Theorem 2.1, let \(π\) be an arbitrary policy. Then, \[ V^{π}(s) \stackrel{(a)}\ge V^*(s) \stackrel{(b)}= V^{π^∘}(s) \] where \((a)\) follows from the definition of \(V^*\) and \((b)\) follows from the definition of \(π^∘\) in \(\eqref{eq:cond}\). Taking expectations of both sides, we get \[ J(π) \ge J(π^∘). \] Since \(π\) was arbitrary, this implies that \(π^∘\) is optimal.

We can also provide a partial converse of Theorem 2.1.

Theorem 2.2 If \(\PR(S = s) > 0\) for all \(s\), then any optimal policy \(π^∘\) for Problem \(\eqref{eq:obs}\) must satisfy \(\eqref{eq:cond}\).

We prove this by contradiction. Suppose \(π^*\) is an optimal policy that does not satisfy \(\eqref{eq:cond}\). By definition of \(π^∘\), it must be the case that for all states \[\begin{equation} V^{π^∘}(s) \le V^{π^*}(s) \label{eq:ineq:1} \end{equation}\] Now, since \(π^*\) does not satisfy \(\eqref{eq:cond}\), there exists some state \(s^∘ \in \ALPHABET S\) such that \[\begin{equation} V^{π^*}(s ^∘) > V^{π^∘}(s ^∘). \label{eq:ineq:2} \end{equation}\] Therefore, \[\begin{align*} J(π^*) &= \sum_{s \in \ALPHABET S} \PR(S = s) V^{π^*}(s) \\ & \stackrel{(a)}> \sum_{s \in \ALPHABET S} \PR(S = s) V^{π^∘}(s) \\ &= J(π^∘) \end{align*}\] where \((a)\) follows from \(\eqref{eq:ineq:1}\) and \(\eqref{eq:ineq:2}\) and the inequality is strict because \(\PR(S = s^∘) > 0\). Thus, \(J(π^*) > J(π^∘)\) and, hence, \(π^*\) cannot be an optimal policy.

2.4 Policy evaluation

A brute force evaluation of a policy is tedious. For instance, consider a specific policy for Example 2.1: \(π = [0, 0, 1, 1]\), which means that the gambler folds if he gets card 1 or 2 and challenges if he gets card 3 or 4. Recalling that the cost equals \(-1\) when the gambler challenges and wins, \(+1\) when he challenges and loses, and \(0\) when he folds, we have \[\begin{align*} J(π) &= \underbrace{\tfrac 1{12} \cdot 0}_{(d,s) = (1,2)} + \underbrace{\tfrac 1{12} \cdot (-1)}_{(d,s) = (1,3)} + \underbrace{\tfrac 1{12} \cdot (-1)}_{(d,s) = (1,4)} \\ & \quad + \underbrace{\tfrac 1{12} \cdot 0}_{(d,s) = (2,1)} + \underbrace{\tfrac 1{12} \cdot (-1)}_{(d,s) = (2,3)} + \underbrace{\tfrac 1{12} \cdot (-1)}_{(d,s) = (2,4)} \\ & \quad + \underbrace{\tfrac 1{12} \cdot 0}_{(d,s) = (3,1)} + \underbrace{\tfrac 1{12} \cdot 0}_{(d,s) = (3,2)} + \underbrace{\tfrac 1{12} \cdot (-1)}_{(d,s) = (3,4)} \\ & \quad + \underbrace{\tfrac 1{12} \cdot 0}_{(d,s) = (4,1)} + \underbrace{\tfrac 1{12} \cdot 0}_{(d,s) = (4,2)} + \underbrace{\tfrac 1{12} \cdot 1}_{(d,s) = (4,3)} \\ &= -\tfrac 4{12} = -\tfrac 13. \end{align*}\]

We can evaluate the performance of a policy using the action value functions as well. Note that the policy \(π\) is shown by the shaded edges in Figure 2.2. The action value functions capture the performance of the policy in the sub-trees corresponding to each information set. So, the overall performance is obtained by averaging over the probability of each information set, i.e., \[ J(π) = \sum_{s \in \ALPHABET S} \PR(s) Q(s,π(s)) = \tfrac 14 \bigl[ 0 + 0 - \tfrac 13 - 1 \bigr] = -\tfrac 13 \] which is the same value that we had obtained by direct computation above.
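The Python sketch below carries out both computations (direct enumeration over all twelve deals, and averaging the action values over the information sets). The names are illustrative and the sign convention is, as before, cost equal to the negative of the gambler's winnings.

```python
# Evaluating the policy pi = [fold, fold, challenge, challenge] for Example 2.1 in two ways
# (illustrative sketch; cost = -(gambler's winnings), as above).
from fractions import Fraction
from itertools import permutations

deals = list(permutations([1, 2, 3, 4], 2))                  # w = (dealer's card, gambler's card)
pi = {1: 0, 2: 0, 3: 1, 4: 1}                                # the policy evaluated above
cost = lambda s, a, w: 0 if a == 0 else (-1 if w[1] > w[0] else 1)

# Direct computation: average the cost over all 12 equally likely deals.
J_direct = sum(Fraction(1, 12) * cost(w[1], pi[w[1]], w) for w in deals)

# Via the action-value function: J(pi) = sum_s P(s) Q(s, pi(s)).
def Q(s, a):
    consistent = [w for w in deals if w[1] == s]             # deals in the information set of s
    return Fraction(sum(cost(s, a, w) for w in consistent), len(consistent))

J_via_Q = sum(Fraction(1, 4) * Q(s, pi[s]) for s in [1, 2, 3, 4])

assert J_direct == J_via_Q == Fraction(-1, 3)
```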

2.5 Blackwell’s principle of irrelevant information

In many scenarios, the decision maker may observe data which is irrelevant for evaluating performance. In such instances, the decision maker may ignore such information without affecting performance. Formally, we have the following result, which is known as Blackwell’s principle of irrelevant information.

Theorem 2.3 (Blackwell’s principle of irrelevant information) Let \(\ALPHABET S\), \(\ALPHABET Y\), \(\ALPHABET W\), and \(\ALPHABET A\) be standard Borel spaces and \(S \in \ALPHABET S\), \(Y \in \ALPHABET Y\), \(W \in \ALPHABET W\) be random variables defined on a common probability space.

A decision maker observes \((S,Y)\) and chooses \(A = π(S,Y)\) to minimize \(\EXP[c(S,A,W)]\), where \(c \colon \ALPHABET S \times \ALPHABET A \times \ALPHABET W \to \reals\) is a measurable function.

If \(Y\) is conditionally independent of \(W\) given \(S\), then there is no loss of optimality in choosing \(A\) only as a function of \(S\).

Formally, there exists a \(π^* \colon \ALPHABET S \to \ALPHABET A\) such that for all \(π \colon \ALPHABET S \times \ALPHABET Y \to \ALPHABET A\), \[ \EXP[c(S, π^*(S), W)] \le \EXP[ c(S, π(S,Y), W) ]. \]

We prove the result for the case when \(\ALPHABET S\), \(\ALPHABET Y\), \(\ALPHABET W\), \(\ALPHABET A\) are finite.

Define \[π^*(s) \in \arg \min_{a \in \ALPHABET A} \EXP[ c(s,a, W) | S = s], \] where ties (in the minimization) are broken arbitrarily. Then, by construction, for any \(s \in \ALPHABET S\) and \(a \in \ALPHABET A\), we have that \[ \EXP[ c(s, π^*(s), W ) | S = s] \le \EXP[ c(s,a,W) | S = s]. \] Hence, for any \(π \colon \ALPHABET S \times \ALPHABET Y \to \ALPHABET A\), and for any \(s \in \ALPHABET S\) and \(y \in \ALPHABET Y\), we have \[ \begin{equation} \label{eq:opt} \EXP[ c(s, π^*(s), W) | S = s, Y = y] \le \EXP[ c(s, π(s,y),W) | S = s, Y = y], \end{equation} \] where we have used the fact that \(W\) is conditionally independent of \(Y\) given \(S\), so that \(\EXP[ c(s,a,W) | S = s, Y = y ] = \EXP[ c(s,a,W) | S = s ]\) for every fixed \(a \in \ALPHABET A\). The result follows by taking the expectation of both sides of \(\eqref{eq:opt}\) with respect to the joint distribution of \((S,Y)\).

The above proof doesn’t work for general Borel spaces because \(π^*\) defined above may not exist (inf vs min) or may not be measurable. See Blackwell (1964) for a formal proof.
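For finite models, the theorem is easy to check numerically. The sketch below builds an arbitrary made-up joint distribution in which \(W\) is conditionally independent of \(Y\) given \(S\), and verifies by brute force that policies of the form \(π(S,Y)\) achieve no lower expected cost than policies of the form \(π(S)\); all names and numbers are illustrative.

```python
# A numerical sanity check of Theorem 2.3 on a made-up finite model (all names and
# numbers are illustrative). W is conditionally independent of Y given S by construction.
from itertools import product
import random

random.seed(0)
S, Y, A, W = [0, 1], [0, 1, 2], [0, 1], [0, 1]

def random_pmf(n):
    x = [random.random() for _ in range(n)]
    return [v / sum(x) for v in x]

# P(s, y, w) = P(s) P(y | s) P(w | s), which enforces the conditional independence.
Ps = random_pmf(len(S))
Py_s = {s: random_pmf(len(Y)) for s in S}
Pw_s = {s: random_pmf(len(W)) for s in S}
P = {(s, y, w): Ps[s] * Py_s[s][y] * Pw_s[s][w] for s, y, w in product(S, Y, W)}
cost = {(s, a, w): random.random() for s, a, w in product(S, A, W)}

def best_value(observations, observe):
    """Brute force over policies pi: observations -> A, where observe(s, y) is what pi sees."""
    best = float("inf")
    for actions in product(A, repeat=len(observations)):
        pi = dict(zip(observations, actions))
        J = sum(p * cost[s, pi[observe(s, y)], w] for (s, y, w), p in P.items())
        best = min(best, J)
    return best

J_sy = best_value(list(product(S, Y)), lambda s, y: (s, y))   # policies that use (S, Y)
J_s = best_value(S, lambda s, y: s)                           # policies that use only S
assert abs(J_sy - J_s) < 1e-9                                  # no loss from ignoring Y
```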

Exercises

Exercise 2.1 (Computing optimal policies) Suppose \(\ALPHABET S = \{1, 2 \}\), \(\ALPHABET A = \{1, 2, 3\}\), and \(\ALPHABET W = \{1, 2, 3\}\). Let \((S,W)\) be random variables taking values in \(\ALPHABET S × \ALPHABET W\) with joint distribution \(P\) shown below.

\[ P = \MATRIX{ 0.25 & 0.15 & 0.05 \\ 0.30 & 0.10 & 0.15 } \]

Here the row corresponds to the value of \(s\) and the column corresponds to the value of \(w\). For example \(\PR(S=2, W=1) = P_{21} = 0.30\).

The cost function \(c \colon \ALPHABET S \times \ALPHABET A \times \ALPHABET W \to \reals\) is shown below

\[ c(\cdot,\cdot,1) = \MATRIX{3 & 5 & 1 \\ 2 & 3 & 1 }, \quad c(\cdot,\cdot,2) = \MATRIX{4 & 3 & 1 \\ 1 & 2 & 8 }, \quad c(\cdot,\cdot,3) = \MATRIX{1 & 2 & 2 \\ 4 & 1 & 3 }. \]

Here the row corresponds to the value of \(s\) and the column corresponds to the value of \(a\). For example \(c(s=1,a=2,w=1) = 5\).

Find the policy \(π \colon \ALPHABET S \to \ALPHABET A\) that minimizes \(\EXP[ c(S, π(S), W) ]\).

Exercise 2.2 (Blackwell’s principle) Suppose \(\ALPHABET S = \{1, 2\}\), \(\ALPHABET Y = \{1, 2\}\), \(\ALPHABET A = \{1, 2, 3\}\), and \(\ALPHABET W = \{1, 2, 3\}\). Let \((S,Y,W)\) be random variables taking values in \(\ALPHABET S × \ALPHABET Y × \ALPHABET W\), with joint distribution \(P\) shown below. \[ P_{Y = 1} = \MATRIX{0.15 & 0.10 & 0.00 \\ 0.15 & 0.05 & 0.10} \qquad P_{Y = 2} = \MATRIX{0.10 & 0.05 & 0.05 \\ 0.15 & 0.05 & 0.05} \] For a fixed value of \(y\), the row corresponds to the value of \(s\) and the column corresponds to the value of \(w\). For example \(\PR(S = 1, Y = 1, W = 3) = 0\). Note that the joint distribution of \((S,Y,W)\) is such that the marginal distribution of \((S,W)\) is the same as the distribution of \((S,W)\) given in the previous exercise (Exercise 2.1).

The cost function \(c \colon \ALPHABET S × \ALPHABET A × \ALPHABET W \to \reals\) is the same as the previous exercise.

  1. Find the policy \(π \colon \ALPHABET S × \ALPHABET Y \to \ALPHABET A\) that minimizes \(\EXP[c(S, π(S,Y), W)]\).

  2. Compare the solution with the solution of the previous exercise (Exercise 2.1) in view of Blackwell’s principle of irrelevant information. Clearly explain your observations.

  3. Repeat the above exercise with the following joint distribution: \[ P_{Y = 1} = \MATRIX{0.20 & 0.12 & 0.04 \\ 0.24 & 0.08 & 0.12} \qquad P_{Y = 2} = \MATRIX{0.05 & 0.03 & 0.01 \\ 0.06 & 0.02 & 0.03} \]

Exercise 2.3 (Pollution monitoring) Consider the problem of monitoring the pollution level of a river. The river can have a high pollution level if there is a catastrophic failure of a factory upstream. There are then two “pollution states” indicating whether such a failure has occurred. We denote them by \(W = 0\) (indicating no failure) and \(W = 1\) (indicating catastrophic failure). Let \([p, 1-p]\) denote the prior probability mass function of \(W\).

The pollution monitoring system has a sensor which takes a measurement \(s\) of the pollution level. Let \(f_w(s)\) denote the probability density of the observation \(s\) conditional on the value of \(w\), \(w \in \{0, 1\}\). Two actions are available at the monitoring system: raise an alarm or not raise an alarm. The cost of raising the alarm is \(C_0\) if the state \(W\) is \(0\) or zero if the state \(W\) is \(1\); the cost of not raising the alarm is zero if the state \(W\) is \(0\) or \(C_1\) if the state \(W\) is \(1\).

Show that it is optimal to raise the alarm if \[ p f_0(s) C_0 < (1 - p) f_1(s) C_1. \] That is, it is optimal to raise the alarm if the likelihood ratio \(f_1(s)/f_0(s)\) exceeds the threshold value \(p C_0/\bigl((1-p) C_1\bigr)\).

Exercise 2.4 (Pollution monitoring, continued) The decision rule obtained in Exercise 2.3 is of the form: raise the alarm if \[ \frac{f_1(s)}{f_0(s)} > τ, \quad \hbox{where } τ = \frac{p C_0}{(1-p)C_1}. \] When the observation density belongs to the exponential family, it is more convenient to work with the log-likelihood ratio \(\ALPHABET L(s) \coloneqq \log (f_1(s)/f_0(s))\) and use the test: raise the alarm if \[ \ALPHABET L(s) > \log τ. \] This is called the log-likelihood ratio test (LLRT).

Suppose that the observation density is conditionally Gaussian and is given by: \[ f_w(s) = \frac{1}{\sqrt{2 π σ^2}} \exp\biggl( - \frac{ (s-w)^2 }{ 2 σ^2 }\biggr) \] where \(σ > 0\) is known. Simplify the LLRT. Does the decision rule intuitively make sense?

Exercise 2.5 (Stochastic/Randomized policies) In the discussion above, we have assumed that the decision rule \(π\) is deterministic. It is also possible to consider stochastic/randomized decision rules \(π \colon \ALPHABET S \to Δ(\ALPHABET A)\), where the performance of a policy is given by \[ J(π) = \sum_{(s,w) \in \ALPHABET S × \ALPHABET W} \sum_{a \in \ALPHABET A} P(s,w) π(a \mid s) c(s,a,w). \] Now consider the problem \(\min_{π \colon \ALPHABET S \to Δ(\ALPHABET A)} J(π)\). Show that the result of Theorem 2.1 remains valid even when we allow stochastic/randomized policies.

Note: This fact is sometimes stated as “randomization does not improve performance”, but it is only true for unconstrained problems. When constraints are involved, randomization may improve performance.

Notes

Historically, the tree representation of optimization problems is due to Kuhn (1950; 1953), who used them to model multi-player multi-stage games.

Theorem 2.3 is due to Blackwell (1964) in a short 2.5 page paper. A similar result was used by Witsenhausen (1979) to show the structure of optimal coding strategies in real-time communication. Also see the blog post by Maxim Raginsky.

Exercise 2.3 is adapted from Whittle (1996). It is a special instance of a Bayesian hypothesis testing problem. We will study a generalization of this model later, in sequential hypothesis testing.