ECSE 506: Stochastic Control and Decision Theory
Aditya Mahajan
Winter 2022
About  Lectures  Notes  Coursework
So far, we have considered a setup where the decision maker perfectly observes the state of the system. In many applications, the decision maker may not directly observe the state of the system but only observe a noisy version of it. Such systems are modeled as partially observable Markov decision processes (POMDPs). We will describe the simplest model of POMDPs, which builds upon the model of MDPs descibed earlier.
We assume that the system has a state \(S_t \in \ALPHABET S\), control input \(A_t \in \ALPHABET A\), and process noise \(W_t \in \ALPHABET W\). The state evolves as \[\begin{equation} \label{eq:state} S_{t+1} = f_t(S_t, A_t, W_t) \end{equation}\] However, unlike the MDP setup, the assumption is that the decision maker does not observe \(S_t\); rather, the observation of the decision maker at time \(t\) is given by \[\begin{equation} \label{eq:obs} Y_t = \ell_t(S_t, N_t) \end{equation}\] where \(Y_t \in \ALPHABET Y\) is the observation and \(N_t \in \ALPHABET N\) is called the observation noise. As in the case of MDPs, we assume that the primitive random varaibles \((S_1, W_1, \dots, W_T\), \(N_1, \dots, N_T)\) are defined on a common probability space and are mutually independent. This assumption is critical for the results to go through.
As in the case of MDPs, we assume that the controller can be as sophisticated as we want. It can analyze the entire history of observations and control actions to choose the current control action. Thus, the control action can be written as \[ A_t = π_t(Y_{1:t}, A_{1:t1}). \]
At each time, the system incurs a cost \(c_t(S_t, A_t)\) which depends on the current state and the current action. The system operates for a finite horizon \(T\) and incurs a total cost \[ \sum_{t=1}^T c_t(S_t, A_t). \]
Given the above system model, we want to choose a control strategy \(π = (π_1, \dots, π_T)\) to minimize the expected total cost \[ J(π) := \EXP\Bigl[ \sum_{t=1}^T c_t(S_t, A_t) \Bigr]. \] How should we proceed?
Note that the only difference from the MDP model is decision maker observes \(Y_t\) instead of \(S_t\). Apart from this, the other modeling assumptions are the same. So, the conceptual difficulties of the model are the same as that of MDPs:
The data \((Y_{1:t}, A_{1:t1})\) available at the controller is increasing with time. Therefore, the number of possible control laws at time \(t\) are increasing exponentially with time. How can we search for efficiently search for optimal control strategies?
Recall that for MDPs, we first showed that there is no loss of optimality in restricting attention to Markov strategies. That structural result was instrumental in developing an efficient search algorithm (dynamic programming). So, what is the equivalent result for POMDPs?
1 History dependent dynamic program
Our first step to develop an efficient dynamic programming decomposition is to simply ignore efficiency and develop a dynamic programming decomposition. We start by deriving a recursive formula to compute the performance of a generic history dependent strategy \(π = (π_1, \dots, π_T)\).
1.1 Performace of historydependent strategies
Let \(H_t = (Y_{1:t}, A_{1:t1})\) denote all the information available to the decision maker at time \(t\). Thus, given any history dependent strategy \(π\), we can write \(A_t = π_t(H_t)\). Define the costtogo functions as follows: \[ J_t(h_t; π) = \EXP^π\biggl[ \sum_{s=t}^T c_s(S_s, A_s) \biggm H_t = h_t \biggr]. \] Note that \(J_t(h_t; π)\) only depends on the future strategy \((π_t, \dots, π_T)\). These functions can be computed recursively as follows: \[\begin{align*} J_t(h_t; π) &= \EXP^π\biggl[ \sum_{s=t}^T c_s(H_s, π_s(H_s)) \biggm H_t = h_t \biggr] \\ &\stackrel{(a)}= \EXP^π \biggl[ c_t(h_t, π_t(h_t)) + \EXP^π\biggl[ \sum_{s=t+1}^T c_s(S_s, π_s(S_s)) \biggm H_{t+1} \biggr] \biggm H_t = h_t \biggr] \\ &= \EXP^π[ c_t(h_t, π_t(h_t)) + J_{t+1}(H_{t+1}; π) \mid H_t = h_t ], \end{align*}\] where \((a)\) follows from the towering property of conditional expectation and the fact that \(H_t \subseteq H_{t+1}\).
Thus, we can use the following dynamic program to recursively compute the performance of a historydependent strategy: \(J_{T+1}(h_{T+1}) = 0\) and for \(t \in \{T, \dots, 1\}\), \[ J_t(h_t; π) = \EXP^π [ c_t(h_t, π_t(h_t)) + J_{t+1}(H_{t+1}; π) \mid H_t = h_t ]. \]
1.2 Historydependent dynamic programming decomposition
We can use the above recursive formulation for performance evaluation to derive a historydependent dynamic program.
 Theorem 1

Recursively define _value functions \(\{V_t\}_{t = 1}^{T+1}\), where \(V_t \colon \ALPHABET H_t \to \reals\) as follows: \[\begin{equation} V_{T+1}(h_{T+1}) = 0 \end{equation}\] and for \(t \in \{T, \dots, 1\}\): \[\begin{align} Q_t(h_t, a_t) &= \EXP[ c_t(S_t, a_t) + V_{t+1}(H_{t+1}) \mid H_t = h_t, A_t = a_t ] \\ V_t(h_t) &= \min_{a_t \in \ALPHABET A} Q_t(h_t, a_t) \end{align}\] Then, a historydependent policy \(π\) is optimal if and only if it satisfies \[\begin{equation} \label{eq:historyverification} π_t(h_t) \in \arg \min_{a_t \in \ALPHABET A} Q_t(h_t, a_t). \end{equation}\]
The proof idea is similar to the proof for MDPs. Instead of proving the above result, we prove a related result.
 Theorem 2

(The comparison principle) For any historydependent strategy \(π\) \[ J_t(h_t; π) \ge V_t(h_t) \] with equality at \(t\) if and only if the future straegy \(π_{t:T}\) satisfies the verification step \eqref{eq:historyverification}.
Note that the comparison principle immediately implies that the strategy obtained using dynamic program of Theorem 1 is optimal. The proof of the comparison principle is almost identical to the proof for MDPs.
Proof of the comparison principle
The proof proceeds by backward induction. Consider any history dependent policy \(π = (π_1, \dots, π_T)\). For \(t = T+1\), the comparison principle is satisfied by definition and this forms the basis of induction. We assume that the result holds for time \(t+1\), which is the induction hypothesis. Then for time \(t\), we have \[\begin{align*} V_t(h_t) &= \min_{a_t \in \ALPHABET A} Q_t(h_t, a_t) \\ &\stackrel{(a)}= \min_{a_t \in \ALPHABET A} \EXP^π[ c_t(S_t, π_t(h_t)) + V_{t+1}(H_{t+1}) \mid H_t = h_t, A_t = π_t(h_t) ] \\ &\stackrel{(b)}\le \EXP^π[ c_t(S_t, π_t(h_t)) + V_{t+1}(H_{t+1}) \mid H_t = h_t, A_t = π_t(h_t)] \\ &\stackrel{(c)}\le \EXP^π[ c_t(S_t, π_t(h_t)) + J_{t+1}(H_{t+1}; π) \mid H_t = h_t, A_t = π_t(h_t)] \\ &= J_t(h_t, π). \end{align*}\] where \((a)\) follows from the definition of the \(Q\)function; \((b)\) follows from the definition of minimization; and \((c)\) follows from the induction hyothesis. We have the equality at step \((b)\) iff \(π_t\) satisfies the verification step \eqref{eq:historyverification} and have the equality in step \((c)\) iff \(π_{t+1:T}\) is optimal (this is part of the induction hypothesis). Thus, the result is true for time \(t\) and, by the principle of induction, is true for all time. \(\Box\)
2 The notion of an information state
Now that we have obtained a dynamic programming decomposition, let’s try to simplify it. To do so, we define the notion of an information state.
A stochastic process \(\{Z_t\}_{t = 1}^T\), \(Z_t \in \ALPHABET Z\), is called an information state if \(Z_t\) be a function of \(H_t\) (which we denote by \(Z_t = φ_t(H_t)\)) and satisfies the following two properties:
P1. Sufficient for performance evaluation, i.e., \[ \EXP^π[ c_t(S_t, A_t) \mid H_t = h_t, A_t = a_t] = \EXP[ c_t(S_t, A_t) \mid Z_t = φ_t(h_t), A_t = a_t ] \]
P2. Sufficient to predict itself, i.e., for any Borel measurable subset \(B\) of \(\ALPHABET Z\), we have \[ \PR^π(Z_{t+1} \in B \mid H_t = h_t, A_t = a_t) = \PR(Z_{t+1} \in B \mid Z_t = φ_t(h_t), A_t = a_t). \]
Instead of (P2), the following sufficient conditions are easier to verify in some models:
P2a. Evolves in a statelike manner, i.e., there exist measurable functions \(\{ψ_t\}_{t=1}^T\) such that \[ Z_{t+1} = ψ_t(Z_t, Y_{t+1}, A_t). \]
P2b. Is sufficient for predicting future observations, i.e., for any Borel subset \(B\) of \(\ALPHABET Y\), \[ \PR^π(Y_{t+1} \in B  H_t = h_t, A_t = a_t) = \PR(Y_{t+1} \in B  Z_t = φ_t(h_t), A_t = a_t). \]
 Remark

The right hand sides of (P1) and (P2) as well as (P2a) and (P2b) do not depend on the choice of the policy \(π\).
 Proposition 1

(P2a) and (P2b) imply (P2).
Proof
For any Borel measurable subset \(B\) of \(\ALPHABET Z\), we have \[\begin{align*} \hskip 1em & \hskip 1em \PR(Z_{t+1} \in B \mid H_t = h_t, A_t = a_t) \stackrel{(a)}= \sum_{y_{t+1} \in \ALPHABET Y} \PR(Y_{t+1} = y_{t+1}, Z_{t+1} \in B \mid H_t = h_t, A_t = a_t ] \\ &\stackrel{(b)}= \sum_{y_{t+1} \in \ALPHABET Y} \IND\{ ψ_t(φ_t(h_t), y_{t+1}, a_t) \} \PR(Y_{t+1} = y_{t+1} \mid H_t = h_t, A_t = a_t) \\ &\stackrel{(c)}= \sum_{y_{t+1} \in \ALPHABET Y} \IND\{ ψ_t(φ_t(h_t), y_{t+1}, a_t) \} \PR(Y_{t+1} = y_{t+1} \mid Z_t = φ_t(h_t), A_t = a_t) \\ &\stackrel{(d)}= \PR(Z_{t+1} \in B \mid Z_t = φ_t(h_t), A_t = a_t) \end{align*}\] where \((a)\) follows from the law of total probability, \((b)\) follows from (P2a), \((c)\) follows from (P2b), and \((d)\) from the law of total probability.
2.1 Examples of an information state
We present three examples of information state here. See the Exercises for more examples.
 Example 1

The complete history \(H_t\) is an information state.
To establish this result, we define the belief state \(b_t \in Δ(\ALPHABET S)\) as follows: for any \(s \in \ALPHABET S\) \[ b_t(s) = \PR^π(S_t = s \mid H_t = h_t). \] The belief state is a function of the history \(h_t\). When we want to explicitly show the dependence of \(b_t\) on \(h_t\), we write it as \(b_t[h_t]\).
 Lemma 1

The belief state \(b_t\) does not depend on the policy \(π\).
This is an extremely important result which has wideranging implications in stochastic control. For a general discussion of this point, see Witsenhausen (1975).
Proof of Lemma 1
From the law of total probability and Bayes rule, we have \[\begin{equation} \label{eq:belief} \PR(s_t  y_{1:t}, a_{1:t1}) = \sum_{s_{1:t1}} \PR(s_{1:t}  y_{1:t}, a_{1:t1}) = \sum_{s_{1:t1}} \frac{\PR(s_{1:t}, y_{1:t}, a_{1:t1})} {\sum_{s'_{1:t}} \PR(s'_{1:t}, y_{1:t}, a_{1:t1})} \end{equation}\]
Now consider \[\begin{align*} \PR(s_{1:t}, y_{1:t}, a_{1:t1}) &= \PR(s_1) \PR(y_1  s_1) \IND\{ a_1 = π_1(y_1) \} \\ & \times \PR(s_2  s_1, a_1) \PR(y_2  s_2) \IND \{ a_2 = π_2(y_{1:2}, a_1)\} \\ & \times \cdots \\ & \times \PR(s_{t1}  s_{t2}, a_{t2}) \PR(y_{t1}  s_{t1}) \IND \{ a_{t1} = π_{t1}(y_{1:t1}, a_{1:t2}) \} \\ & \times \PR(s_{t}  s_{t1}, a_{t1}) \PR(y_{t}  s_{t}). \end{align*}\] Substitute the above expression in both the numerator and the denominator of \eqref{eq:belief}. Observe that the terms of the form \(\IND\{ a_s = π_s(y_{1:s}, a_{1:s1})\) are common to both the numerator and the denominator and cancel each other. Thus, \[\begin{equation} \label{eq:belieffn} \PR(s_t  y_{1:t}, a_{1:t1}) = \sum_{s_{1:t1}} \frac{ \prod_{s=1}^t \PR(s_s \mid s_{s1}, a_{s1}) \PR(y_s \mid s_s) } { \sum_{s'_{1:t}} \prod_{s=1}^t \PR(s'_s \mid s'_{s1}, a_{s1}) \PR(y_s \mid s'_s) }. \end{equation}\] None of the terms here depend on the policy \(π\). Hence, the belief state does not depend on the policy \(π\). \(\Box\)
Proof of Example 1
We will prove that \(Z_t = H_t\) satisfies properties (P1), (P2a), and (P2b).
P1. \(\displaystyle \EXP^π[ c_t(S_t, A_t)  H_t = h_t, A_t = a_t ] = \sum_{s_t \in \ALPHABET S} c_t(s_t, a_t) b_t[h_t](s_t)\).
P2a. \(H_{t+1} = (H_t, Y_{t+1}, A_t)\)
P2b. \(\displaystyle \PR^π(y_{t+1}  y_{1:t}, a_{1:t}) = \sum_{s_{t:t+1}} \PR(y_{t+1}  s_{t+1}) \PR( s_{t+1}  s_t, a_t) \PR(s_t  y_{1:t}, a_{1:t})\). Note that in the last term \(\PR^π(s_t  y_{1:t}, a_{1:t})\) we can drop \(a_t\) from the conditioning because it is a function of \((y_{1:t}, a_{1:t1})\). Thus, \[ \PR^π(s_t  y_{1:t}, a_{1:t}) = \PR^π(s_t  y_{1:t}, a_{1:t1}) = b_t[h_t](s_t).\] Thus, \(\displaystyle \PR^π(y_{t+1}  y_{1:t}, a_{1:t}) = \sum_{s_{t:t+1}} \PR(y_{t+1}  s_{t+1}) \PR( s_{t+1}  s_t, a_t) b_t[h_t](s_t)\).
 Example 2

The belief state \(b_t\) is an information state.
Proof
The belief state \(b_t\) is a function of the history \(h_t\). (The exact form of this function is given by \eqref{eq:belieffn}). In the proof of Example 1, we have already shown that \(b_t\) satisfies (P1) and (P2b). So, we only need to show that \(b_t\) satisfies (P2a). This can be shown as follows, which is a standard result in nonlinear filtering.
 Lemma 2

The belief state \(b_t\) updates in a state like manner. In particular, for any \(s_{t+1} \in \ALPHABET S\), we have \[ b_{t+1}(s_{t+1}) = \sum_{s_t \in \ALPHABET S} \frac{ \PR(y_{t+1}  s_{t+1}) \PR(s_{t+1}  s_t, a_t) b_t(s_t) } { \sum_{s'_{t:t+1}} \PR(y_{t+1}  s'_{t+1}) \PR(s'_{t+1}  s'_t, a_t) b_t(s'_t) }. \]
Proof of Lemma 2
For any \(s_{t+1} \in \ALPHABET S\), consider
\[\begin{align} b_{t+1}(s_{t+1}) &= \PR(s_{t+1}  y_{1:t+1}, a_{1:t}) \notag \\ &= \sum_{s_t} \PR(s_{t:t+1}  y_{1:t+1}, a_{1:t}) \notag \\ &= \sum_{s_t} \frac{ \PR(s_{t:t+1}, y_{t+1}, a_t  y_{1:t}, a_{1:t1}) } {\sum_{s'_{t:t+1}}\PR(s'_{t:t+1}, y_{t+1}, a_t  y_{1:t}, a_{1:t1}) }. \label{eq:update1} \end{align}\]
Now consider \[\begin{align} \hskip 1em & \hskip 1em \PR(s_{t:t+1}, y_{t+1}, a_t  y_{1:t}, a_{1:t1}) \notag \\ &= \PR(y_{t+1}  s_{t+1}) \PR(s_{t+1}  s_t, a_t) \IND\{ a_t = π_t(y_{1:t}, a_{1:t1}) \} \PR(s_t  y_{1:t}, a_{1_t1}) \notag \\ &= \PR(y_{t+1}  s_{t+1}) \PR(s_{t+1}  s_t, a_t) \IND\{ a_t = π_t(y_{1:t}, a_{1:t1}) \} b_t(s_t). \label{eq:belief2} \end{align}\] Substitute the above expression in both the numerator and the denominator of \eqref{eq:update1}. Observe that \(\IND\{ a_t = π_t(y_{1:t}, a_{1:t1}) \}\) is common to both the numerator and the denominator and cancels out. Thus, we get the result of the lemma. \(\Box\)
 Remark

Both the above information states are generic information states which work for all models. For specific models, it is possible to identify other information states as well. We present some examples of such an information state below.
 Example 3

An MDP is a special case of a POMDP where \(Y_t = S_t\). For an MDP \(Z_t = S_t\) is an information state.
Proof
We will show that \(Z_t = S_t\) satisfies (P1) and (P2).
(P1) is satisfied because the perstep cost is a function of the \((S_t, A_t)\). (P2) is equivalent to the control Markov property. \(\Box\)
3 Information state based dynamic program
The main feature of an information state is that one can always write a dynamic program based on an information state.
 Theorem 3

Let \(\{Z_t\}_{t=1}^T\) be any information state, where \(Z_t = φ_t(H_t)\). Recursively define value functions \(\{ \hat V_t \}_{t=1}^T\), where \(\hat V_t \colon \ALPHABET Z \to \reals\), as follows: \[ \hat V_{T+1}(z_{T+1}) = 0 \] and for \(t \in \{T, \dots, 1\}\): \[\begin{align} \hat Q_t(z_t, a_t) &= \EXP[ c_t(S_t, A_t) + \hat V_{t+1}(Z_{t+1}) \mid Z_t = z_t, A_t = a_t] \\ \hat V_t(z_t) &= \min_{a_t \in \ALPHABET A} \hat Q_t(z_t, a_t). \end{align}\] Then, we have the following: for any \(h_t\) and \(a_t\), \[\begin{equation} \label{eq:historyinfo} Q_t(h_t, a_t) = \hat Q_t(φ_t(h_t), a_t) \quad\text{and}\quad V_t(h_t) = \hat V_t(φ_t(h_t)). \end{equation}\] Any stategy \(\hat π = (\hat π_1, \dots, \hat π_T)\), where \(\hat π_t \colon \ALPHABET Z \to \ALPHABET A\), is optimal if and only if \[\begin{equation}\label{eq:infoverification} \hat π_t(z_t) \in \arg\min_{a_t \in \ALPHABET A} \hat Q_t(z_t, a_t). \end{equation}\]
Proof
As usual, we prove the result by backward induction. By constuction, Eq. \eqref{eq:historyinfo} is true at time \(T+1\). This forms the basis of induction. Now assume that \eqref{eq:historyinfo} is true at time \(t+1\) and consider the system at time \(t\). Then, \[\begin{align*} Q_t(h_t, a_t) &= \EXP[ c_t(S_t, A_t) + V_{t+1}(H_{t+1})  H_t = h_t, A_t = a_t ] \\ &\stackrel{(a)}= \EXP[ c_t(S_t, A_t) + \hat V_{t+1}( φ_t(H_{t+1}) )  H_t = h_t, A_t = a_t ] \\ &\stackrel{(b)}= \EXP[ c_t(S_t, A_t) + \hat V_{t+1}( φ_t(H_{t+1}) )  Z_t = φ_t(h_t), A_t = a_t ] \\ &\stackrel{(c)}= \hat Q_t(φ_t(h_t), a_t), \end{align*}\] where \((a)\) follows from the induction hypothesis, \((b)\) follows from the properties (P1) and (P2) of the information state, and \((c)\) follows from the definition of \(\hat Q_t\). This shows that the action value functions are equal. By minimizing over the actions, we get that the value functions are also equal. \(\Box\)
4 Belief state based dynamic program
As shown in Example 2, the belief state \(b_t\) is an information state. Therefore, Theorem 3 implies that we can write a dynamic program based on \(b_t\). This is an important and commonly used formulation, so we study it separately and present some properties of the value functions. The belief state based dynamic program is given by: \(V_{T+1}(b_{T+1}) = 0\) and for \(t \in \{T, \dots, 1\}\), \[ Q_t(b_t, a_t) = \EXP [ c_t(S_t, A_t) + V_{t+1}(B_{t+1}) \mid B_t = b_t, A_t = a_t ]. \] and \[ V_t(b_t) = \min_{a_t \in \ALPHABET A} Q_t(b_t, a_t). \]
Define \[ \PR(y_{t+1}  b_t, a_t) =
\sum_{s_{t:t+1}} \PR(y_{t+1}  s_{t+1}) \PR(s_{t+1}  s_t, a_t) b_t(s_t).
\] Then, the belief update expression in Lemma 2 can be written as: \[
b_{t+1}(s_{t+1}) =
\frac{ \PR(y_{t+1}  s_{t+1}) \sum_{s_t} \PR(s_{t+1}  s_t, a_t) b_t(s_t) }
{ \PR(y_{t+1}  b_t, a_t) }.
\] For the ease of notation, we write this expression as \(b_{t+1} = ψ(b_t, y_{t+1}, a_t)\).
\[\begin{align*}
Q_t(b_t, a_t) &= \sum_{s_t \in \ALPHABET S} c_t(s_t, a_t) b_t(s_t) \\
& \quad + \sum_{y_{t+1} \in \ALPHABET Y} \PR(y_{t+1}  b_t, a_t)
V_{t+1}( φ(b_t, y_{t+1}, a_t) ).
\end{align*}\]
A key property of the beliefstate based value functions is the following.
 Theorem 4

The beliefstate based value functions are piecewise linear and concave.
Proof
As usual, we prove the result using backward induction. For any \(a_T\), \[ Q_T(b_T, a_T) = \sum_{s_T \in \ALPHABET S} c_T(s_T, a_T) b_T(s_T) \] is linear in \(b_T\). Therefore, \[ V_T(b_T) = \min_{a_T \in \ALPHABET A} Q_T(b_T, a_T) \] is the minimum of a finite number of linear functions. Hence \(V_T(b_T)\) is piecewise linear and concave.
Now assume that \(V_{t+1}(b_{t+1})\) is piecewise linear and concave (PWLC). Any PWLC funcation can be represented as a minimum of a finite number of hyperplanes. Therefore, we can find a finite set of vectors \(\{ A_i \}_{i \in I}\) indexed by finite set \(I\) such that \[ V_{t+1}(b) = \min_{i \in I} \langle A_i, b \rangle. \]
We need to show that \(V_t(b_t)\) is piecewise linear and concave (PWLC). We first show that \(Q_t(b_t, a_t)\) is PWLC. For any fixed \(a_t\), the first term \(\sum_{s_t} c_t(s_t, a_t) b_t(s_t)\) is linear in \(b_t\). Now consider the second term: \[\begin{align*} \hskip 1em & \hskip 1em \sum_{y_{t+1} \in \ALPHABET Y} \PR(y_{t+1}  b_t, a_t) V_{t+1}( φ(b_t, y_{t+1}, a_t) ) \\ &= \sum_{y_{t+1} \in \ALPHABET Y} \PR(y_{t+1}  b_t, a_t) \min_{i \in I} \left\langle A_i, \frac{ \PR(y_{t+1}  s_{t+1}) \sum_{s_t} \PR(s_{t+1}  s_t, a_t) b_t(s_t) } { \PR(y_{t+1}  b_t, a_t) } \right\rangle \\ &= \sum_{y_{t+1} \in \ALPHABET Y} \min_{i \in I} \Big\langle A_i, \PR(y_{t+1}  s_{t+1}) \sum_{s_t} \PR(s_{t+1}  s_t, a_t) b_t(s_t) \Big\rangle \end{align*}\] which is the sum of PWLC functions of \(b_t\) and therefore PWLC in \(b_t\).
Thus, \(Q_t(b_t, a_t)\) is PWLC. Hence, \(V_t(b_t)\) which is the pointwise minimum of PWLC functions is PWLC. Hence, the result holds due to principle of induction. \(\Box\)
 Remark

Since the value function is PWLC, we can identify a finite index set \(I_t\), and a set of vectors \(\{ A^i_t \}_{i \in I_t}\) such that \[ V_t(b) = \min_{i \in I_t} \langle A^i_t, b \rangle. \] Smallwood and Sondik (1973) presented a “onepass” algorithm to recursively compute \(I_t\) and \(\{ A^i_t \}_{i \in I_t}\) which allows us to exactly compute the value function. Various efficient refinements of these algorithms have been presented in the literature, e.π., the linearsupport algorithm (Cheng 1988), the witness algorithm (Cassandra et al. 1994), incremental pruning (Zhang and Liu 1996; Cassandra et al. 1997), duality based approach (Zhang 2009), and others. See https://pomdp.org/ for an accessible introduction to these algorithms.
Exercises
Consider an MDP where the state space \(\ALPHABET S\) is a symmetric subset of integers of the form \(\{L, L + 1, \dots, L 1 , L\}\) and the action space \(\ALPHABET A\) is discrete. Suppose the transition matrix \(P(a)\) and the cost function \(c_t(s,a)\) satisfy properties (A1) and (A2) of Exercise 1 of Monotone MDPs. Show that \(Z_t = S_t\) is an information state.
Consider a linear system with state \(x_t \in \reals^n\), observations \(y_t \in \reals^p\), and action \(u_t \in \reals^m\). Note that we will follow the standard notation of linear systems and denote the system variables by lower case letters \((x,u)\) rather than upper case letter \((S,A)\). The dynamics of the system are given by \[\begin{align*} x_{t+1} &= A x_t + B u_t + w_t \\ y_t &= C x_t + n_t \end{align*}\] where \(A\), \(B\), and \(C\) are matrices of appropriate dimensions. The perstep cost is given by \[ c(x_t, u_t) = x_t^\TRANS Q x_t + u_t^\TRANS R u_t, \] where \(Q\) is a positive semidefinite matrix and \(R\) is a positive definite matrix. We make the standard assumption that the primitive random variables \(\{s_1, w_1, \dots, w_T, n_1, \dots, n_T \}\) are independent.
Show that if the primitive variables are Guassian, then the conditional estimate of the state \[ \hat x_t = \EXP[ x_t  y_{1:t}, u_{1:t1} ] \] is an information state.
Consider a machine which can be in one of \(n\) ordered state where the first state is the best and the last state is the worst. The production cost increases with the state of the machine. The state evolves in a Markovian manner. At each time, an agent has the option to either run the machine or stop and inspect it for a cost. After inspection, the agent may either repair the machine (at a cost that depends on the state) or replace it (at a fixed cost). The objective is to identify a maintenance policy to minimize the cost of production, inspection, repair, and replacement.
Let \(τ\) denote the time of last inspection and \(S_τ\) denote the state of the machine after inspection, repair, or replacement. Show that \((S_τ, tτ)\) is an information state.
References
The discussion in this section is taken from Subramanian and Mahajan (2019).
This entry was last updated on 17 Dec 2021 and posted in POMDP and tagged pomdp, belief state.