1 Introduction
stochastic optimization, principle of irrelevant information
Let’s start with the simplest optimization problem. A decision maker has to choose an action \(a \in \ALPHABET A\). Upon choosing the action \(a\), the decision maker incurs a cost \(c(a)\). What action should the decision maker pick to minimize the cost?
Formally, the above optimization problem may be written as \[\begin{equation} \label{eq:basic} \min_{a \in \ALPHABET A} c(a). \end{equation}\]
When the action space \(\ALPHABET A\) is finite, say \(\ALPHABET A = \{1, \dots, m\}\), solving the optimization problem \(\eqref{eq:basic}\) is conceptually straightforward: enumerate the cost of all possible actions, i.e., enumerate the set \(C = \{ c(a) : a \in \ALPHABET A \}\), and find the smallest element.
When the action space \(\ALPHABET A\) is continuous, say a compact subset of a Euclidean space, solving the optimization problem \eqref{eq:basic} is conceptually straightforward only when the cost function \(c\) satisfies some regularity conditions. For example, when \(c\) is convex and differentiable, any minimizer in the interior of \(\ALPHABET A\) can be obtained by solving \[ \dfrac {d c(a) }{ da } = 0. \]
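As a simple illustration (a generic quadratic cost, not tied to any model in these notes), suppose \(\ALPHABET A\) is an interval and \[ c(a) = (a - θ)^2 \quad\implies\quad \frac{d c(a)}{d a} = 2(a - θ) = 0 \quad\implies\quad a^* = θ, \] provided \(θ \in \ALPHABET A\); otherwise, convexity implies that the minimum is attained at the endpoint of \(\ALPHABET A\) closest to \(θ\).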
In the absence of appropriate regularity conditions, there is no general procedure for solving an optimization problem over continuous action spaces.
1.1 The stochastic optimization problem
Now consider the simplest stochastic optimization problem. A decision maker has to choose an action \(a \in \ALPHABET A\). Upon choosing the action \(a\), the decision maker incurs a cost \(c(a,W)\), where \(W \in \ALPHABET W\) is a random variable with known probability distribution. Assume that the decision maker is risk neutral and, therefore, wants to minimize \(\EXP[ c(a, W) ]\), where the expectation is with respect to the random variable \(W\).
Formally, the above optimization problem may be written as \[\begin{equation} \label{eq:stochastic} \min_{a \in \ALPHABET A} \EXP[ c(a, W) ]. \end{equation}\]
Define \(J(a) = \EXP[ c(a, W) ]\). Then Problem \eqref{eq:stochastic} is conceptually the same as Problem \eqref{eq:basic} with the cost function \(J(a)\). Numerically, Problem \eqref{eq:stochastic} is more difficult because computing \(J(a)\) involves evaluating an expectation, but we ignore the computational complexity for the time being.
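When \(\ALPHABET A\) and \(\ALPHABET W\) are finite, the expectation in \(J(a)\) is a finite sum and Problem \eqref{eq:stochastic} can be solved by direct enumeration. A minimal sketch in Python (the cost table and the distribution of \(W\) below are made up purely for illustration):

```python
import numpy as np

# Illustrative data: 3 actions, 4 realizations of W with a known pmf.
# c[a, w] is the cost of action a when the disturbance takes value w.
c = np.array([[1.0, 4.0, 2.0, 0.0],
              [3.0, 1.0, 1.0, 2.0],
              [2.0, 2.0, 3.0, 1.0]])
p_W = np.array([0.1, 0.4, 0.3, 0.2])   # distribution of W

J = c @ p_W                  # J[a] = E[c(a, W)] = sum_w p_W[w] * c[a, w]
a_opt = int(np.argmin(J))    # action with the smallest expected cost

print(J)        # [2.3 1.4 2.1]
print(a_opt)    # 1
```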
1.2 Key simplifying idea
In the stochastic optimization problems considered above, the decision maker does not observe any data before making a decision. In many situations, the decision maker does observe some data, which is captured by the following model. Suppose a decision maker observes a random variable \(S \in \ALPHABET S\) and then chooses an action \(A \in \ALPHABET A\) as a function of his observation according to a decision rule \(π\), i.e., \[ A = π(S). \]
Upon choosing the action \(A\), the decision maker incurs a cost \(c(S,A,W)\), where \(W \in \ALPHABET W\) is a random variable. We assume that the primitive random variables \((S,W)\) are defined on a common probability space and have a known joint distribution. Assume that the decision maker is risk neutral and, therefore, wants to minimize \(\EXP[ c(S, π(S), W)]\), where the expectation is taken with respect to the joint probability distribution of \((S,W)\).
Formally, the above optimization problem may be written as \[\begin{equation} \label{eq:obs} \tag{P1} \min_{π \colon \ALPHABET S \to \ALPHABET A} \EXP[ c(S, π(S), W) ]. \end{equation}\]
Define \(J(π) = \EXP[ c(S, π(S), W) ]\). Then, Problem \eqref{eq:obs} is conceptually the same as Problem \eqref{eq:basic} with one difference: In Problem \eqref{eq:basic}, the minimization is over a parameter \(a\), while in Problem \eqref{eq:obs}, the minimization is over a function \(π\).
When \(\ALPHABET S\) and \(\ALPHABET A\) are finite sets, the optimal policy can be obtained by an exhaustive search over all policies as follows: for each policy \(π\) compute the performance \(J(π)\) and then pick the policy \(π\) with the smallest expected cost.
Such an exhaustive search is not satisfying for two reasons. First, it has a high computational cost. There are \(| \ALPHABET A |^{| \ALPHABET S |}\) policies and, for each policy, we have to evaluate an expectation, which can be expensive. Second, the above enumeration procedure does not work when \(\ALPHABET S\) or \(\ALPHABET A\) are continuous sets.
There is an alternative way of viewing the problem that simplifies it considerably. Instead of viewing the optimization problem before the system starts running (i.e., the ex ante view), imagine that the decision maker waits until they see the realization \(s\) of \(S\) (i.e., the interim view). They then ask what action \(a\) they should take to minimize the expected conditional cost \(Q(s,a) := \EXP[ c(s,a, W) | S = s]\), i.e., they consider the problem
\[\begin{equation} \label{eq:cond-1} \tag{P2} \min_{a \in \ALPHABET A} \EXP[ c(s,a,W) | S = s], \quad \forall s \in \ALPHABET S. \end{equation}\]
Thus, Problem \eqref{eq:obs}, which is a functional optimization problem, has been reduced to a collection of parameter optimization problems (Problem \eqref{eq:cond-1}), one for each possible value of \(s\).
Now define \[ \begin{equation} \label{eq:cond} \tag{P2-policy} π^∘(s) = \arg \min_{a \in \ALPHABET A} \EXP[ c(s,a, W) | S = s] \end{equation} \] where ties (in the minimization) are broken arbitrarily.
Theorem 1.1 The decision rule \(π^∘\) defined in \eqref{eq:cond} is optimal for Problem \(\eqref{eq:obs}\), i.e., \[ \min_{π \colon \ALPHABET S \to \ALPHABET A} \EXP[ c(S, π(S), W) ] = \EXP \biggl[ \min_{a \in \ALPHABET A} \EXP[ c(S,a,W) | S ] \biggr] . \]
We restrict the proof to finite \(\ALPHABET S\), \(\ALPHABET A\), \(\ALPHABET W\) to avoid any measurability issues. If \(\ALPHABET S\) and \(\ALPHABET A\) are continuous sets, we need to restrict attention to measurable \(π\) in Problem \(\eqref{eq:obs}\) (otherwise the expectation is not well defined; of course, the cost \(c\) also has to be measurable). However, it is not immediately obvious that \(π^∘\) defined in \eqref{eq:cond} is measurable. Conditions that ensure this are known as measurable selection theorems.
Let \(π\) be any other decision rule. Then, \[ \begin{align*} \EXP[ c(S, π(S), W) ] &\stackrel{(a)}= \EXP[ \EXP[c(S, π(S), W) | S ] ] \\ &\stackrel{(b)}\ge \EXP[\EXP[ c(S, π^∘(S), W) | S ] ] \\ &\stackrel{(c)}= \EXP[ c(S, π^∘(S), W) ], \end{align*} \] where \((a)\) and \((c)\) follow from the law of iterated expectations and \((b)\) follows from the definition of \(π^∘\) in \eqref{eq:cond}.
We can also provide a partial converse of Theorem 1.1.
Theorem 1.2 If \(\PR(S = s) > 0\) for all \(s\), then any optimal policy \(π^∘\) for Problem \(\eqref{eq:obs}\) must satisfy \(\eqref{eq:cond}\).
We prove this by contradiction. Suppose \(π^*\) is an optimal policy that does not satisfy \eqref{eq:cond}. By definition of \(π^∘\), it must be the case that for all \(s \in \ALPHABET S\), \[\begin{equation} \EXP[ c(s, π^∘(s), W) | S = s ] \le \EXP[ c(s, π^*(s), W) | S = s ] . \label{eq:ineq:1} \end{equation}\] Now, since \(π^*\) does not satisfy \eqref{eq:cond}, there exists some state \(s^∘ \in \ALPHABET S\) such that \[\begin{equation} \EXP[ c(s^∘, π^*(s^∘), W) | S = s^∘ ] > \EXP[ c(s^∘, π^∘(s^∘), W) | S = s^∘ ] . \label{eq:ineq:2} \end{equation}\] Therefore, \[\begin{align*} \EXP[ c(S, π^*(S), W) ] &= \sum_{s \in \ALPHABET S} \PR(S = s) \EXP[ c(s, π^*(s), W) | S = s ] \\ & \stackrel{(a)}> \sum_{s \in \ALPHABET S} \PR(S = s) \EXP[ c(s, π^∘(s), W) | S = s ] \\ &= \EXP[ c(S, π^∘(S), W) ] \end{align*}\] where \((a)\) follows from \eqref{eq:ineq:1} and \eqref{eq:ineq:2}, and the inequality is strict because \(\PR(S = s^∘) > 0\). Thus, \(J(π^*) > J(π^∘)\) and, hence, \(π^*\) cannot be an optimal policy.
To build up to the notation used in MDPs, define \[ Q(s,a) \coloneqq \EXP[ c(s,a,W) \mid S = s], \] which is called the action-value function. For any policy \(π\), define \[ V^π(s) = Q(s,π(s)), \] which is called the value function of policy \(π\). The value function of the optimal policy is often denoted by \(V^*\) and, by definition of optimality, it has the property that for any policy \(π\): \[ V^*(s) \le V^π(s), \quad \forall s \in \ALPHABET S. \]
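For finite \(\ALPHABET S\), \(\ALPHABET A\), and \(\ALPHABET W\), the quantities \(Q\), \(V^π\), \(V^*\), and an optimal policy can be computed directly from the joint pmf of \((S,W)\) and the cost array. The sketch below is illustrative only; the array names `P_SW` and `cost` are assumptions about how the data is stored.

```python
import numpy as np

def solve_one_shot(P_SW, cost):
    """One-shot stochastic optimization with an observed S.

    P_SW[s, w]    = Pr(S = s, W = w)   (assumes every Pr(S = s) > 0)
    cost[s, a, w] = c(s, a, w)
    Returns Q, V*, an optimal deterministic policy, and the marginal of S.
    """
    P_S = P_SW.sum(axis=1)                          # marginal pmf of S
    P_W_given_S = P_SW / P_S[:, None]               # conditional pmf of W given S
    Q = np.einsum('sw,saw->sa', P_W_given_S, cost)  # Q(s,a) = E[c(s,a,W) | S = s]
    V_star = Q.min(axis=1)                          # V*(s) = min_a Q(s, a)
    pi_star = Q.argmin(axis=1)                      # pi*(s) in argmin_a Q(s, a)
    return Q, V_star, pi_star, P_S

def policy_value(P_SW, cost, pi):
    """J(pi) = E[c(S, pi(S), W)] for a deterministic policy given as an array."""
    n_S, n_W = P_SW.shape
    return sum(P_SW[s, w] * cost[s, pi[s], w]
               for s in range(n_S) for w in range(n_W))
```

With these, the identity \(J(π) = \sum_{s \in \ALPHABET S} \PR(S = s)\, Q(s, π(s))\) can be checked numerically against `policy_value`.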
1.3 An example: optimal policy in a card game
Consider a gambler playing a stylized version of a card game played with a deck of \(4\) cards: \(\{1,2,3,4\}\). A dealer deals two cards: one to himself and one to the gambler. The gambler looks at his card, and can decide to take one of two actions: fold or challenge. If the gambler folds, the game is over; if he challenges and has a higher card than the dealer, he wins \$1; if he challenges but has a lower card than the dealer, he loses \$1. What is the optimal policy?
We can model this as a stochastic optimization problem. Let \(W\) denote the cards dealt, \(S\) denote the gambler’s card, and \(A=0\) denote the fold action and \(A=1\) denote the challenge action. Then: \[ \ALPHABET W = \{(1,2), (1,3), (1,4), (2,1), (2,3), (2,4), (3,1), (3,2), (3,4), (4,1), (4,2), (4,3) \}, \] where the first index denotes the card dealt to the dealer and the second denotes the card dealt to the gambler. We will assume that each of these outcomes is equally likely.
Note that \(\ALPHABET S = \{1,2,3,4\}\) and \(\ALPHABET A = \{0,1\}\). So, a policy \(π\) is a mapping from \(\ALPHABET S\) to \(\ALPHABET A\), i.e., for each card that could be dealt to the gambler, he needs to decide whether to fold or challenge.
There are \(|\ALPHABET A|^{|\ALPHABET S|} = 2^4 = 16\) policies. The performance of a policy \(π\) is: \[ J(π) = \sum_{(d,s) \in \ALPHABET W} \PR(d,s) c(s, π(s), (d,s)) \] where we are using the variable \(d\) to denote the card dealt to the dealer. Since the payoff in this example is a reward rather than a cost, the gambler wants to pick the policy with the largest \(J(π)\).
Now consider one specific policy: \(π = [0, 0, 1, 1]\), which means that the gambler folds if he gets card 1 or 2 and challenges if he gets card 3 or 4. Then, we have \[\begin{align*} J(π) &= \underbrace{\tfrac 1{12} \cdot 0}_{(d,s) = (1,2)} + \underbrace{\tfrac 1{12} \cdot 1}_{(d,s) = (1,3)} + \underbrace{\tfrac 1{12} \cdot 1}_{(d,s) = (1,4)} \\ & \quad + \underbrace{\tfrac 1{12} \cdot 0}_{(d,s) = (2,1)} + \underbrace{\tfrac 1{12} \cdot 1}_{(d,s) = (2,3)} + \underbrace{\tfrac 1{12} \cdot 1}_{(d,s) = (2,4)} \\ & \quad + \underbrace{\tfrac 1{12} \cdot 0}_{(d,s) = (3,1)} + \underbrace{\tfrac 1{12} \cdot 0}_{(d,s) = (3,2)} + \underbrace{\tfrac 1{12} \cdot 1}_{(d,s) = (3,4)} \\ & \quad + \underbrace{\tfrac 1{12} \cdot 0}_{(d,s) = (4,1)} + \underbrace{\tfrac 1{12} \cdot 0}_{(d,s) = (4,2)} + \underbrace{\tfrac 1{12} \cdot (-1)}_{(d,s) = (4,3)} \\ &= \tfrac 4{12} = \tfrac 13. \end{align*}\]
A brute force search corresponds to computing the performance of all \(16\) policies and picking the one with the best performance.
The alternative way to compute the optimal policy via \(\eqref{eq:cond}\) proceeds as follows. First observe that \[ Q(s,a) \coloneqq \EXP[ c(s,a,W) \mid S = s] = \sum_{d \in \{1,\dots,4\},\, d \neq s} \PR\bigl( (d,s) \bigm| S = s \bigr)\, c\bigl(s,a,(d,s)\bigr). \] Moreover, \(Q(s,0)\) is always \(0\). So, we just need to compute \(Q(s,1)\) and check whether it is positive or negative.
- \(Q(1,1) = \tfrac 13 \bigl[ -1 -1 -1 \bigr] = -1 < 0\).
- \(Q(2,1) = \tfrac 13[1 -1 -1] = -\tfrac 13 < 0\).
- \(Q(3,1) = \tfrac 13[1 + 1 -1] = \tfrac 13 > 0\).
- \(Q(4,1) = \tfrac 13 [ 1 + 1 + 1] = 1 > 0\).
Thus, the optimal policy is \(π = [0, 0, 1, 1]\). We can also verify that \[ J(π) = \sum_{s \in \ALPHABET S} \PR(s) Q(s,π(s)) = \tfrac 14 [ 0 + 0 + \tfrac 13 + 1 ] = \tfrac 13 \] which is the same value that we had obtained by direct computation above.
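The card-game numbers above are easy to check numerically. The sketch below (using the payoff convention of this example, i.e., the gambler maximizes expected reward) enumerates all \(2^4 = 16\) policies and also computes the values \(Q(s,1)\):

```python
from itertools import product

cards = [1, 2, 3, 4]
# Outcome w = (d, s): card d to the dealer, card s to the gambler; all 12 equally likely.
outcomes = [(d, s) for d in cards for s in cards if d != s]

def reward(s, a, w):
    """Payoff to the gambler: 0 on fold (a = 0), +1/-1 on challenge (a = 1)."""
    d, _ = w
    return 0 if a == 0 else (1 if s > d else -1)

def J(pi):
    """Expected reward of policy pi, where pi[s - 1] is the action on card s."""
    return sum(reward(s, pi[s - 1], (d, s)) for (d, s) in outcomes) / len(outcomes)

# Brute-force search over all |A|^|S| = 2^4 = 16 policies.
best = max(product([0, 1], repeat=4), key=J)
print(best, J(best))            # (0, 0, 1, 1) with expected reward 1/3

# Conditional-expectation approach: Q(s, 0) = 0, so challenge iff Q(s, 1) > 0.
for s in cards:
    dealer_cards = [d for d in cards if d != s]
    Q_s1 = sum(reward(s, 1, (d, s)) for d in dealer_cards) / len(dealer_cards)
    print(s, Q_s1)              # -1.0, -1/3, 1/3, 1.0
```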
1.4 A tree representation of the optimization problem
When all variables \((S,W,A)\) are finite valued, the stochastic optimization problem defined above can also be modeled as a tree. Such a tree formulation has two benefits:
- it provides a geometric view, which can be useful in visualizing the result;
- for general, multi-stage problems, it is possible to exploit the tree structure to obtain efficient computational algorithms; see, e.g., Fu (2018) for a historical overview.
There are two ways to construct the tree representation: the first as a one-stage optimization problem with perfect observation and the second as a one-stage optimization problem with imperfect observation. The first is simpler and the second is more general. We explain both formulations below.
1.4.1 One-step optimization with perfect observation
We will view the model according to the following timing diagram:
Thus, nature first generates the “state” variable \(S\) according to some distribution \(P_S\), the agent takes an action \(A\) according to the policy \(π \colon \ALPHABET S \to \ALPHABET A\), and then nature generates the “disturbance” \(W\) according to some distribution \(P_{W|S}\).
We can view this as a tree, shown in Figure 1.2, where the root node corresponds to the move of nature (denoted by \(c\) for chance) that generates \(S\), nodes at depth 1 correspond to the moves of the agent (denoted by \(d\) for decision maker) that generate \(A\), and nodes at depth 2 correspond to the moves of nature that generate \(W\).[^simplify]
[^simplify]: For simplicity, we set \(W = (D,S)\) and only show \(D\) (the card dealt to the dealer) on the tree.
We can carry out the computation of the optimal policy directly on the tree. The calculations proceed by starting at the last level and moving upwards. In particular, at the nodes of depth \(2\), we compute \(Q(s,a)\) using: \[ Q(s,a) = \EXP[ c(s,a,W) \mid S = s] = \sum_{w \in \ALPHABET W} P_{W|S}(w|s)\, c(s,a,w). \] If we are given a policy \(π\) (like the policy shown by yellow shaded edges in Figure 1.2), we can compute \(V^π(s)\) using: \[ V^π(s) = Q(s,π(s)). \] We can also compute the optimal value function \(V^*(s)\) using \[ V^*(s) = \min_{a \in \ALPHABET A} Q(s,a), \] where the optimal policy is given by \[ π^*(s) \in \arg\min_{a \in \ALPHABET A} Q(s,a). \]
These calculations are the same as we did in the previous section; the tree simply helps in visualizing what is going on.
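The backward pass described above (start at the last level and move upwards) is ordinary backward induction on the tree and can be coded directly. The sketch below uses a hand-rolled tree representation (nested tuples) with made-up numbers, and follows the minimization convention of these notes:

```python
# Backward induction on a (chance -> decision -> chance -> leaf) tree.
# A node is either a number (terminal cost), ('decision', {action: child}),
# or ('chance', [(prob, child), ...]).

def value(node):
    """Optimal (minimal expected) cost-to-go of a node."""
    if isinstance(node, (int, float)):
        return node
    kind, children = node
    if kind == 'decision':
        return min(value(child) for child in children.values())
    if kind == 'chance':
        return sum(p * value(child) for p, child in children)
    raise ValueError(f"unknown node type: {kind}")

# Toy tree: nature draws S uniformly from {1, 2}, the agent picks a in {0, 1},
# then nature draws W and a terminal cost is incurred (numbers are illustrative).
tree = ('chance', [
    (0.5, ('decision', {0: ('chance', [(0.5, 2.0), (0.5, 4.0)]),
                        1: ('chance', [(0.5, 1.0), (0.5, 5.0)])})),
    (0.5, ('decision', {0: ('chance', [(0.5, 0.0), (0.5, 6.0)]),
                        1: ('chance', [(0.5, 2.0), (0.5, 2.0)])})),
])
print(value(tree))   # 0.5 * min(3.0, 3.0) + 0.5 * min(3.0, 2.0) = 2.5
```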
1.4.2 One-step optimization with imperfect observation
In this case, we will view the model according to the following timing diagram:
We can view this as a tree, shown in Figure 1.4, where the root node corresponds to the move of nature that generates \((S,W)\), and nodes at depth 1 correspond to the moves of the agent that generate \(A\).
In this case, the decision maker cannot distinguish between multiple moves of nature. For example, for \(W \in \{ (2,1), (3,1), (4,1) \}\), the decision maker gets the same observation \(S=1\). This is indicated in the tree diagram by drawing an information set around these three nodes, and ensuring that the action taken at these nodes is identical.
In this formulation, the decision maker has an imperfect observation of nature’s move. After making an observation, they form a posterior belief on the state of nature. For example, \[ \PR(W | S = 1) = \begin{cases} \tfrac 13, & \hbox{if } W \in \{(2,1), (3,1), (4,1) \} \\ 0, & \hbox{otherwise.} \end{cases} \] The agent then computes \(Q(s,a)\) by averaging over this belief: \[ Q(s,a) = \sum_{w \in \ALPHABET W} \PR(w|s) c(s,a,w). \] The value functions \(V^π(s)\) and \(V^*(s)\) are then computed exactly as before.
The difference between the two approaches is conceptual. For the one-step optimization problem, the actual mathematical calculations done in both formulations are the same. When we move to multi-stage optimization problems, we will see that perfect and imperfect observation models behave very differently.
1.5 Blackwell’s principle of irrelevant information
In many scenarios, the decision maker may observe data which is irrelevant for evaluating performance. In such instances, the decision maker may ignore such information without affecting performance. Formally, we have the following result, which is known as Blackwell’s principle of irrelevant information.
Theorem 1.3 (Blackwell’s principle of irrelevant information) Let \(\ALPHABET S\), \(\ALPHABET Y\), \(\ALPHABET W\), and \(\ALPHABET A\) be standard Borel spaces and \(S \in \ALPHABET S\), \(Y \in \ALPHABET Y\), \(W \in \ALPHABET W\) be random variables defined on a common probability space.
A decision maker observes \((S,Y)\) and chooses \(A = π(S,Y)\) to minimize \(\EXP[c(S,A,W)]\), where \(c \colon \ALPHABET S \times \ALPHABET A \times \ALPHABET W \to \reals\) is a measurable function.
If \(W\) is conditionally independent of \(Y\) given \(S\), then there is no loss of optimality in choosing \(A\) only as a function of \(S\).
Formally, there exists a \(π^* \colon \ALPHABET S \to \ALPHABET A\) such that for all \(π \colon \ALPHABET S \times \ALPHABET Y \to \ALPHABET A\), \[ \EXP[c(S, π^*(S), W)] \le \EXP[ c(S, π(S,Y), W) ]. \]
We prove the result for the case when \(\ALPHABET S\), \(\ALPHABET Y\), \(\ALPHABET W\), \(\ALPHABET A\) are finite.
Define \[π^*(s) = \arg \min_{a \in \ALPHABET A} \EXP[ c(s,a, W) | S = s]. \] Then, by construction, for any \(s \in \ALPHABET S\) and \(a \in \ALPHABET A\), we have that \[ \EXP[ c(s, π^*(s), W ) | S = s] \le \EXP[ c(s,a,W) | S = s]. \] Hence, for any \(π \colon \ALPHABET S \times \ALPHABET Y \to \ALPHABET A\), and for any \(s \in \ALPHABET S\) and \(y \in \ALPHABET Y\), we have \[ \begin{equation} \label{eq:opt} \EXP[ c(s, π^*(s), W) | S = s, Y = y] = \EXP[ c(s, π^*(s), W) | S = s] \le \EXP[ c(s, π(s,y),W) | S = s] = \EXP[ c(s, π(s,y),W) | S = s, Y = y], \end{equation} \] where both equalities use the fact that \(W\) is conditionally independent of \(Y\) given \(S\). The result follows by taking the expectation of both sides of \eqref{eq:opt} with respect to the joint distribution of \((S,Y)\).
The above proof doesn’t work for general Borel spaces because \(π^*\) defined above may not exist (inf vs min) or may not be measurable. See Blackwell (1964) for a formal proof.
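Theorem 1.3 can be sanity-checked numerically on a small example: build a joint pmf in which \(W\) is conditionally independent of \(Y\) given \(S\), and verify that the best policy that uses \((S,Y)\) does no better than the best policy that uses \(S\) alone. The sketch below uses made-up distributions and random costs; it is not tied to the exercises that follow.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
nS, nY, nW, nA = 2, 2, 3, 3

# P(s, y, w) = P(s) P(y | s) P(w | s), so W is conditionally independent of Y given S.
P_S = np.array([0.4, 0.6])
P_Y_given_S = np.array([[0.3, 0.7], [0.5, 0.5]])
P_W_given_S = np.array([[0.2, 0.5, 0.3], [0.6, 0.1, 0.3]])
P = P_S[:, None, None] * P_Y_given_S[:, :, None] * P_W_given_S[:, None, :]

cost = rng.uniform(size=(nS, nA, nW))   # arbitrary cost c(s, a, w)

def J(pi):
    """Expected cost; pi is a flat tuple and pi[s * nY + y] is the action at (s, y)."""
    return sum(P[s, y, w] * cost[s, pi[s * nY + y], w]
               for s in range(nS) for y in range(nY) for w in range(nW))

best_SY = min(product(range(nA), repeat=nS * nY), key=J)            # may depend on Y
best_S  = min((tuple(a for a in p for _ in range(nY))               # ignores Y
               for p in product(range(nA), repeat=nS)), key=J)
print(J(best_SY), J(best_S))   # the two values coincide (up to floating point)
```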
Exercises
Exercise 1.1 (Computing optimal policies) Suppose \(\ALPHABET S = \{1, 2 \}\), \(\ALPHABET A = \{1, 2, 3\}\), and \(\ALPHABET W = \{1, 2, 3\}\). Let \((S,W)\) be random variables taking values in \(\ALPHABET S × \ALPHABET W\) with joint distribution \(P\) shown below.
\[ P = \MATRIX{ 0.25 & 0.15 & 0.05 \\ 0.30 & 0.10 & 0.15 } \]
Here the row corresponds to the value of \(s\) and the column corresponds to the value of \(w\). For example \(\PR(S=2, W=1) = P_{21} = 0.30\).
The cost function \(c \colon \ALPHABET S \times \ALPHABET A \times \ALPHABET W \to \reals\) is shown below
\[ c(\cdot,\cdot,1) = \MATRIX{3 & 5 & 1 \\ 2 & 3 & 1 }, \quad c(\cdot,\cdot,2) = \MATRIX{4 & 3 & 1 \\ 1 & 2 & 8 }, \quad c(\cdot,\cdot,3) = \MATRIX{1 & 2 & 2 \\ 4 & 1 & 3 }. \]
Here the row corresponds to the value of \(s\) and the column corresponds to the value of \(a\). For example \(c(s=1,a=2,w=1) = 5\).
Find the policy \(π \colon \ALPHABET S \to \ALPHABET A\) that minimizes \(\EXP[ c(S, π(S), W) ]\).
Exercise 1.2 (Blackwell’s principle) Suppose \(\ALPHABET S = \{1, 2\}\), \(\ALPHABET Y = \{1, 2\}\), \(\ALPHABET A = \{1, 2, 3\}\), and \(\ALPHABET W = \{1, 2, 3\}\). Let \((S,Y,W)\) be random variables taking values in \(\ALPHABET S × \ALPHABET Y × \ALPHABET W\), with joint distribution \(P\) shown below. \[ P_{Y = 1} = \MATRIX{0.15 & 0.10 & 0.00 \\ 0.15 & 0.05 & 0.10} \qquad P_{Y = 2} = \MATRIX{0.10 & 0.05 & 0.05 \\ 0.15 & 0.05 & 0.05} \] For a fixed value of \(y\), the row corresponds to the value of \(s\) and the column corresponds to the value of \(w\). For example \(\PR(S = 1, Y = 1, W = 3) = 0\). Note that the joint distribution of \((S,Y,W)\) is such that the marginal on \((S,W)\) is the same as the distribution of \((S,W)\) given in the previous exercise (Exercise 1.1).
The cost function \(c \colon \ALPHABET S × \ALPHABET A × \ALPHABET W \to \reals\) is the same as the previous exercise.
Find the policy \(π \colon \ALPHABET S × \ALPHABET Y \to \ALPHABET A\) that minimizes \(\EXP[c(S, π(S,Y), W)]\).
Compare the solution with the solution of the previous exercise (Exercise 1.1) in view of Blackwell’s principle of irrelevant information. Clearly explain your observations.
Repeat the above exercise with the following joint distribution: \[ P_{Y = 1} = \MATRIX{0.20 & 0.12 & 0.04 \\ 0.24 & 0.08 & 0.12} \qquad P_{Y = 2} = \MATRIX{0.05 & 0.03 & 0.01 \\ 0.06 & 0.02 & 0.03} \]
Exercise 1.3 (Pollution monitoring) Consider the problem of monitoring the pollution level of a river. The river can have a high pollution level if there is a catastrophic failure of a factory upstream. There are then two “pollution states” indicating whether such a failure has occurred. We denote them by \(W = 0\) (indicating no failure) and \(W = 1\) (indicating catastrophic failure). Let \([p, 1-p]\) denote the prior probability mass function of \(W\).
The pollution monitoring system has a sensor which takes a measurement \(s\) of the pollution level. Let \(f_w(s)\) denote the probability density of the observation \(s\) conditional on the value of \(w\), \(w \in \{0, 1\}\). Two actions are available at the monitoring system: raise an alarm or not raise an alarm. The cost of raising the alarm is \(C_0\) if the state \(W\) is \(0\) or zero if the state \(W\) is \(1\); the cost of not raising the alarm is zero if the state \(W\) is \(0\) or \(C_1\) if the state \(W\) is \(1\).
Show that it is optimal to raise the alarm if \[ p f_0(s) C_0 < (1 - p) f_1(s) C_1. \] That is, it is optimal to raise the alarm if the likelihood ratio \(f_1(s)/f_0(s)\) exceeds the threshold value \(p C_0/\bigl((1-p) C_1\bigr)\).
Exercise 1.4 (Pollution monitoring, continued) The decision rule obtained in Exercise 1.3 is of the form: raise the alarm if \[ \frac{f_1(s)}{f_0(s)} > τ, \quad \hbox{where } τ = \frac{p C_0}{(1-p)C_1}. \] When the observation density belongs to the exponential family, it is more convenient to work with the log-likelihood ratio \(\ALPHABET L(s) \coloneqq \log (f_1(s)/f_0(s))\) and use the test: raise the alarm if \[ \ALPHABET L(s) > \log τ. \] This is called the log-likelihood ratio test (LLRT).
Suppose that the observation density is conditionally Gaussian and is given by: \[ f_w(s) = \frac{1}{\sqrt{2π}\,σ} \exp\biggl( - \frac{ (s-w)^2 }{ 2 σ^2 }\biggr), \] where \(σ > 0\) is known. Simplify the LLRT. Does the decision rule intuitively make sense?
Exercise 1.5 (Stochastic/Randomized policies) In the discussion above, we have assumed that the decision rule \(π\) is deterministic. It is also possible to consider stochastic/randomized decision rules \(π \colon \ALPHABET S \to Δ(\ALPHABET A)\), where the performance of a policy is given by \[ J(π) = \sum_{(s,w) \in \ALPHABET S × \ALPHABET W} \sum_{a \in \ALPHABET A} P(s,w) π(a \mid s) c(s,a,w). \] Now consider the problem \(\min_{π \colon \ALPHABET S \to Δ(\ALPHABET A)} J(π)\). Show that the result of Theorem 1.1 remains valid even when we allow stochastic/randomized policies.
Note: This fact is sometimes stated as “randomization does not improve performance”, but this is only true for unconstrained problems. When constraints are involved, randomization may improve performance.
Notes
Theorem 1.3 is due to Blackwell (1964), proved in a short 2.5 page paper. A similar result was used by Witsenhausen (1979) to show the structure of optimal coding strategies in real-time communication. Also see the blog post by Maxim Raginsky.
Exercise 1.3 is adapted from Whittle (1996). It is a special instance of a Bayesian hypothesis testing problem. We will study a generalization of this model later, in sequential hypothesis testing.
Historically, the tree representation of optimization problems is due to Kuhn (1950; 1953), who used them to model multi-player multi-stage games.