17 Introduction
So far, we have considered a setup where the decision maker perfectly observes the state of the system. In many applications, the decision maker may not directly observe the state of the system but only observe a noisy version of it. Such systems are modeled as partially observable Markov decision processes (POMDPs). We will describe the simplest model of POMDPs, which builds upon the model of MDPs described earlier.
We assume that the system has a state \(S_t \in \ALPHABET S\), control input \(A_t \in \ALPHABET A\), and process noise \(W_t \in \ALPHABET W\). The state evolves as \[\begin{equation} \label{eq:state} S_{t+1} = f_t(S_t, A_t, W_t) \end{equation}\] However, unlike the MDP setup, the assumption is that the decision maker does not observe \(S_t\); rather, the observation of the decision maker at time \(t\) is given by \[\begin{equation} \label{eq:obs} Y_t = \ell_t(S_t, N_t) \end{equation}\] where \(Y_t \in \ALPHABET Y\) is the observation and \(N_t \in \ALPHABET N\) is called the observation noise. As in the case of MDPs, we assume that the primitive random variables \((S_1, W_1, \dots, W_T, N_1, \dots, N_T)\) are defined on a common probability space and are mutually independent. This assumption is critical for the results to go through.
As in the case of MDPs, we assume that the controller can be as sophisticated as we want. It can analyze the entire history of observations and control actions to choose the current control action. Thus, the control action can be written as \[ A_t = π_t(Y_{1:t}, A_{1:t-1}). \]
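To make the setup concrete, here is a minimal simulation sketch in Python. Everything in it is a hypothetical placeholder: a two-state system where the process noise occasionally flips the state and the observation noise corrupts the observation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical example: S = A = Y = {0, 1}; W_t, N_t ~ Uniform[0, 1].
def f(s, a, w):
    """State evolution S_{t+1} = f_t(S_t, A_t, W_t): action 1 flips the
    state with probability 0.3; action 0 leaves it unchanged."""
    return (1 - s) if (a == 1 and w < 0.3) else s

def ell(s, n):
    """Observation Y_t = ell_t(S_t, N_t): correct with probability 0.8."""
    return s if n < 0.8 else 1 - s

s = int(rng.integers(2))               # initial state S_1
ys, acts = [], []
for t in range(1, 6):
    ys.append(ell(s, rng.random()))    # decision maker sees Y_t, not S_t
    a = ys[-1]                         # placeholder for pi_t(Y_{1:t}, A_{1:t-1})
    acts.append(a)
    s = f(s, a, rng.random())          # state evolves; never observed directly
```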
At each time, the system incurs a cost \(c_t(S_t, A_t)\) which depends on the current state and the current action. The system operates for a finite horizon \(T\) and incurs a total cost \[ \sum_{t=1}^T c_t(S_t, A_t). \]
Given the above system model, we want to choose a control strategy \(π = (π_1, \dots, π_T)\) to minimize the expected total cost \[ J(π) := \EXP\Bigl[ \sum_{t=1}^T c_t(S_t, A_t) \Bigr]. \] How should we proceed?
Note that the only difference from the MDP model is that the decision maker observes \(Y_t\) instead of \(S_t\). Apart from this, the other modeling assumptions are the same. So, the conceptual difficulties of the model are the same as those of MDPs.
Recall that for MDPs, we first showed that there is no loss of optimality in restricting attention to Markov strategies. That structural result was instrumental in developing an efficient search algorithm (dynamic programming). So, what is the equivalent result for POMDPs?
17.1 History-dependent dynamic program
Our first step in developing an efficient dynamic programming decomposition is to ignore efficiency and simply develop a dynamic programming decomposition. We start by deriving a recursive formula to compute the performance of a generic history-dependent strategy \(π = (π_1, \dots, π_T)\).
Performance of history-dependent strategies
Let \(H_t = (Y_{1:t}, A_{1:t-1})\) denote all the information available to the decision maker at time \(t\). Thus, given any history-dependent strategy \(π\), we can write \(A_t = π_t(H_t)\). Define the cost-to-go functions as follows: \[ J_t(h_t; π) = \EXP^π\biggl[ \sum_{s=t}^T c_s(S_s, A_s) \biggm| H_t = h_t \biggr]. \] Note that \(J_t(h_t; π)\) depends only on the future strategy \((π_t, \dots, π_T)\). These functions can be computed recursively as follows: \[\begin{align*} J_t(h_t; π) &= \EXP^π\biggl[ \sum_{s=t}^T c_s(S_s, π_s(H_s)) \biggm| H_t = h_t \biggr] \\ &\stackrel{(a)}= \EXP^π \biggl[ c_t(S_t, π_t(h_t)) + \EXP^π\biggl[ \sum_{s=t+1}^T c_s(S_s, π_s(H_s)) \biggm| H_{t+1} \biggr] \biggm| H_t = h_t \biggr] \\ &= \EXP^π[ c_t(S_t, π_t(h_t)) + J_{t+1}(H_{t+1}; π) \mid H_t = h_t ], \end{align*}\] where \((a)\) follows from the towering property of conditional expectation and the fact that \(H_t \subseteq H_{t+1}\).
Thus, we can use the following dynamic program to recursively compute the performance of a history-dependent strategy: \(J_{T+1}(h_{T+1}; π) = 0\) and for \(t \in \{T, \dots, 1\}\), \[ J_t(h_t; π) = \EXP^π [ c_t(S_t, π_t(h_t)) + J_{t+1}(H_{t+1}; π) \mid H_t = h_t ]. \]
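The backward recursion above evaluates a strategy via conditional expectations over histories. As a complementary sanity check, \(J(π)\) can also be estimated by forward Monte Carlo simulation. A sketch, reusing the hypothetical `f` and `ell` from the earlier snippet:

```python
def evaluate(pi, f, ell, cost, T, n_runs=10_000, seed=1):
    """Monte Carlo estimate of J(pi) = E[sum_{t=1}^T c_t(S_t, A_t)] for a
    history-dependent strategy pi(t, y_history, a_history) -> action."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_runs):
        s = int(rng.integers(2))                 # sample S_1
        ys, acts = [], []
        for t in range(1, T + 1):
            ys.append(ell(s, rng.random()))      # observe Y_t
            a = pi(t, tuple(ys), tuple(acts))    # A_t = pi_t(Y_{1:t}, A_{1:t-1})
            acts.append(a)
            total += cost(s, a)                  # incur c_t(S_t, A_t)
            s = f(s, a, rng.random())            # move to S_{t+1}
    return total / n_runs

# Example: act on the latest observation; cost 1 whenever the action
# misses the true state.
J = evaluate(lambda t, ys, acts: ys[-1], f, ell,
             cost=lambda s, a: float(s != a), T=5)
```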
History-dependent dynamic programming decomposition
We can use the above recursive formulation for performance evaluation to derive a history-dependent dynamic program.
Theorem 17.1 Recursively define value functions \(\{V_t\}_{t = 1}^{T+1}\), where \(V_t \colon \ALPHABET H_t \to \reals\), as follows: \[\begin{equation} V_{T+1}(h_{T+1}) = 0 \end{equation}\] and for \(t \in \{T, \dots, 1\}\): \[\begin{align} Q_t(h_t, a_t) &= \EXP[ c_t(S_t, a_t) + V_{t+1}(H_{t+1}) \mid H_t = h_t, A_t = a_t ] \\ V_t(h_t) &= \min_{a_t \in \ALPHABET A} Q_t(h_t, a_t) \end{align}\] Then, a history-dependent strategy \(π\) is optimal if and only if it satisfies \[\begin{equation} \label{eq:history-verification} π_t(h_t) \in \arg \min_{a_t \in \ALPHABET A} Q_t(h_t, a_t). \end{equation}\]
The proof idea is similar to the proof for MDPs. Instead of proving the above result, we prove a related result.
Theorem 17.2 (The comparison principle) For any history-dependent strategy \(π\), \[ J_t(h_t; π) \ge V_t(h_t) \] with equality at \(t\) if and only if the future strategy \(π_{t:T}\) satisfies the verification step \eqref{eq:history-verification}.
Note that the comparison principle immediately implies that the strategy obtained using the dynamic program of Theorem 17.1 is optimal. The proof of the comparison principle is almost identical to the proof for MDPs.
17.2 The notion of an information state
Now that we have obtained a dynamic programming decomposition, let’s try to simplify it. To do so, we define the notion of an information state.
Definition 17.1 A collection \(\{Z_t\}_{t=1}^T\) of functions of the history, \(Z_t = φ_t(H_t)\), taking values in a space \(\ALPHABET Z\), is called an information state if it satisfies the following properties:

(P1) Sufficient for performance evaluation: for any \(h_t\) and \(a_t\), \[ \EXP[ c_t(S_t, A_t) \mid H_t = h_t, A_t = a_t ] = \EXP[ c_t(S_t, A_t) \mid Z_t = φ_t(h_t), A_t = a_t ]. \]

(P2) Sufficient to predict itself: for any \(h_t\), \(a_t\), and any subset \(B\) of \(\ALPHABET Z\), \[ \PR(Z_{t+1} \in B \mid H_t = h_t, A_t = a_t) = \PR(Z_{t+1} \in B \mid Z_t = φ_t(h_t), A_t = a_t). \]

Instead of (P2), the following sufficient conditions are easier to verify in some models:

(P2a) Evolves in a state-like manner: there exist (deterministic) functions \(ψ_t\) such that \[ Z_{t+1} = ψ_t(Z_t, Y_{t+1}, A_t). \]

(P2b) Sufficient for predicting the next observation: for any \(h_t\), \(a_t\), and \(y_{t+1}\), \[ \PR(Y_{t+1} = y_{t+1} \mid H_t = h_t, A_t = a_t) = \PR(Y_{t+1} = y_{t+1} \mid Z_t = φ_t(h_t), A_t = a_t). \]
Proposition 17.1 (P2a) and (P2b) imply (P2).
17.3 Examples of an information state
We start by defining the belief state \(b_t \in Δ(\ALPHABET S)\) as follows: for any \(s \in \ALPHABET S\), \[ b_t(s) = \PR^π(S_t = s \mid H_t = h_t). \] The belief state is a function of the history \(h_t\). When we want to explicitly show the dependence of \(b_t\) on \(h_t\), we write it as \(b_t[h_t]\).
Lemma 17.1 The belief state \(b_t\) does not depend on the policy \(π\).
Lemma 17.2 The belief state \(b_t\) updates in a state-like manner. In particular, for any \(s_{t+1} \in \ALPHABET S\), we have \[ b_{t+1}(s_{t+1}) = \sum_{s_t \in \ALPHABET S} \frac{ \PR(y_{t+1} | s_{t+1}) \PR(s_{t+1} | s_t, a_t) b_t(s_t) } { \sum_{s'_{t:t+1}} \PR(y_{t+1} | s'_{t+1}) \PR(s'_{t+1} | s'_t, a_t) b_t(s'_t) }. \]
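In code, the update of Lemma 17.2 is short. The sketch below assumes a time-homogeneous model stored as hypothetical NumPy arrays `P`, with `P[a, s, s']` \(= \PR(s_{t+1} = s' \mid s_t = s, a_t = a)\), and `O`, with `O[s', y]` \(= \PR(y_{t+1} = y \mid s_{t+1} = s')\):

```python
import numpy as np

def belief_update(b, a, y, P, O):
    """psi(b_t, y_{t+1}, a_t) from Lemma 17.2.

    b: belief over states, shape (n_S,)
    P: transition kernels, shape (n_A, n_S, n_S)
    O: observation kernel, shape (n_S, n_Y)
    """
    # Pr(y | s') * sum_s Pr(s' | s, a) b(s), for every s'
    unnormalized = O[:, y] * (b @ P[a])
    return unnormalized / unnormalized.sum()   # normalize by Pr(y | b, a)
```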
We now present three examples of an information state. See the Exercises for more examples.
Example 17.1 The complete history \(H_t\) is an information state.
Example 17.2 The belief state \(b_t\) is an information state.
Example 17.3 An MDP is a special case of a POMDP where \(Y_t = S_t\). For an MDP, \(Z_t = S_t\) is an information state.
17.4 Information state based dynamic program
The main feature of an information state is that one can always write a dynamic program based on an information state.
Theorem 17.3 Let \(\{Z_t\}_{t=1}^T\) be any information state, where \(Z_t = φ_t(H_t)\). Recursively define value functions \(\{ \hat V_t \}_{t=1}^{T+1}\), where \(\hat V_t \colon \ALPHABET Z \to \reals\), as follows: \[ \hat V_{T+1}(z_{T+1}) = 0 \] and for \(t \in \{T, \dots, 1\}\): \[\begin{align} \hat Q_t(z_t, a_t) &= \EXP[ c_t(S_t, a_t) + \hat V_{t+1}(Z_{t+1}) \mid Z_t = z_t, A_t = a_t] \\ \hat V_t(z_t) &= \min_{a_t \in \ALPHABET A} \hat Q_t(z_t, a_t). \end{align}\] Then, we have the following: for any \(h_t\) and \(a_t\), \[\begin{equation} \label{eq:history-info} Q_t(h_t, a_t) = \hat Q_t(φ_t(h_t), a_t) \quad\text{and}\quad V_t(h_t) = \hat V_t(φ_t(h_t)). \end{equation}\] Moreover, any strategy \(\hat π = (\hat π_1, \dots, \hat π_T)\), where \(\hat π_t \colon \ALPHABET Z \to \ALPHABET A\), is optimal if and only if \[\begin{equation}\label{eq:info-verification} \hat π_t(z_t) \in \arg\min_{a_t \in \ALPHABET A} \hat Q_t(z_t, a_t). \end{equation}\]
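When the information state takes values in a finite set \(\ALPHABET Z\) and its transition kernel is known, the dynamic program of Theorem 17.3 can be computed directly; for the MDP special case of Example 17.3 it reduces to the standard dynamic program. A sketch, assuming a time-homogeneous model with hypothetical arrays `c_hat[z, a]` \(= \EXP[c_t(S_t, a_t) \mid Z_t = z, A_t = a]\) and `P_Z[a, z, z']` \(= \PR(z_{t+1} = z' \mid z_t = z, a_t = a)\):

```python
import numpy as np

def info_state_dp(c_hat, P_Z, T):
    """Backward induction over a finite information-state space.

    c_hat: expected per-step cost, shape (n_Z, n_A)
    P_Z:   information-state kernel, shape (n_A, n_Z, n_Z)
    Returns hat V_1 and the optimal strategy (hat pi_1, ..., hat pi_T).
    """
    n_A, n_Z, _ = P_Z.shape
    V = np.zeros(n_Z)                      # hat V_{T+1} = 0
    strategy = []
    for t in range(T, 0, -1):
        # hat Q_t(z, a) = c_hat(z, a) + sum_{z'} Pr(z'|z, a) hat V_{t+1}(z')
        Q = c_hat + np.stack([P_Z[a] @ V for a in range(n_A)], axis=1)
        strategy.append(Q.argmin(axis=1))  # verification step of Theorem 17.3
        V = Q.min(axis=1)
    return V, strategy[::-1]
```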
17.5 Belief state based dynamic program
As shown in Example 17.2, the belief state \(b_t\) is an information state. Therefore, Theorem 17.3 implies that we can write a dynamic program based on \(b_t\). This is an important and commonly used formulation, so we study it separately and present some properties of the value functions. The belief state based dynamic program is given by: \(V_{T+1}(b_{T+1}) = 0\) and for \(t \in \{T, \dots, 1\}\), \[ Q_t(b_t, a_t) = \EXP [ c_t(S_t, A_t) + V_{t+1}(B_{t+1}) \mid B_t = b_t, A_t = a_t ]. \] and \[ V_t(b_t) = \min_{a_t \in \ALPHABET A} Q_t(b_t, a_t). \]
Define \[ \PR(y_{t+1} | b_t, a_t) =
\sum_{s_{t:t+1}} \PR(y_{t+1} | s_{t+1}) \PR(s_{t+1} | s_t, a_t) b_t(s_t).
\] Then, the belief update expression in Lemma 17.2 can be written as: \[
b_{t+1}(s_{t+1}) =
\frac{ \PR(y_{t+1} | s_{t+1}) \sum_{s_t} \PR(s_{t+1} | s_t, a_t) b_t(s_t) }
{ \PR(y_{t+1} | b_t, a_t) }.
\] For the ease of notation, we write this expression as \(b_{t+1} = ψ(b_t, y_{t+1}, a_t)\).
Using this notation, the \(Q\)-function of the belief-state dynamic program can be written as \[\begin{align*}
Q_t(b_t, a_t) &= \sum_{s_t \in \ALPHABET S} c_t(s_t, a_t) b_t(s_t) \\
& \quad + \sum_{y_{t+1} \in \ALPHABET Y} \PR(y_{t+1} | b_t, a_t)
V_{t+1}( ψ(b_t, y_{t+1}, a_t) ).
\end{align*}\]
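As a sketch, this expression can be evaluated by one-step lookahead, reusing the `P`, `O` array conventions of the belief-update snippet; the per-step cost is a hypothetical array `c[s, a]` and \(V_{t+1}\) is passed in as a callable:

```python
def Q_step(b, a, c, P, O, V_next):
    """Q_t(b_t, a_t) of the belief-state dynamic program.

    c:      per-step cost, shape (n_S, n_A)
    V_next: callable returning V_{t+1}(b_{t+1}) for a belief b_{t+1}
    """
    q = b @ c[:, a]                       # sum_s c_t(s, a) b(s)
    pred = b @ P[a]                       # sum_s Pr(s'|s, a) b(s)
    for y in range(O.shape[1]):
        p_y = O[:, y] @ pred              # Pr(y | b, a)
        if p_y > 0:
            q += p_y * V_next(O[:, y] * pred / p_y)   # psi(b, y, a)
    return q
```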
A key property of the belief-state based value functions is the following.
Theorem 17.4 The belief-state based value functions are piecewise linear and concave.
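This structure underlies exact POMDP solvers: if \(V_{t+1}\) is the minimum of a finite family of linear functions \(b \mapsto \langle α, b \rangle\), the backup preserves this form, so \(V_t\) is again piecewise linear and concave. A minimal sketch of such an exact alpha-vector backup (without pruning dominated vectors), reusing the earlier `P`, `O`, `c` array conventions:

```python
from itertools import product

def backup(Gamma_next, c, P, O):
    """One backup step: given alpha-vectors representing
    V_{t+1}(b) = min over alpha in Gamma_next of <alpha, b>, return
    alpha-vectors representing V_t."""
    n_A = P.shape[0]
    n_Y = O.shape[1]
    Gamma_t = []
    for a in range(n_A):
        # alpha^{a,y}(s) = sum_{s'} Pr(y|s') Pr(s'|s,a) alpha(s')
        vecs = [[P[a] @ (O[:, y] * alpha) for alpha in Gamma_next]
                for y in range(n_Y)]
        # one candidate vector per choice of successor alpha for each y
        for choice in product(*vecs):
            Gamma_t.append(c[:, a] + sum(choice))
    return Gamma_t

def V(b, Gamma):
    """Evaluate a piecewise linear and concave value function at belief b."""
    return min(alpha @ b for alpha in Gamma)
```

Starting from \(Γ_{T+1} = \{\mathbf{0}\}\) and applying `backup` \(T\) times recovers \(V_1\). Each stage multiplies the number of vectors by \(|\ALPHABET A|\,|Γ_{t+1}|^{|\ALPHABET Y|-1}\) or more, which is why practical implementations prune dominated vectors.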
Exercises
Exercise 17.1 Consider an MDP where the state space \(\ALPHABET S\) is a symmetric subset of integers of the form \(\{-L, -L+1, \dots, L-1, L\}\) and the action space \(\ALPHABET A\) is discrete. Suppose the transition matrix \(P(a)\) and the cost function \(c_t(s,a)\) satisfy properties (A1) and (A2) of Exercise 7.6. Show that \(Z_t = |S_t|\) is an information state.
Exercise 17.2 Consider a linear system with state \(x_t \in \reals^n\), observations \(y_t \in \reals^p\), and action \(u_t \in \reals^m\). Note that we will follow the standard notation of linear systems and denote the system variables by lower case letters \((x,u)\) rather than upper case letters \((S,A)\). The dynamics of the system are given by \[\begin{align*} x_{t+1} &= A x_t + B u_t + w_t \\ y_t &= C x_t + n_t \end{align*}\] where \(A\), \(B\), and \(C\) are matrices of appropriate dimensions. The per-step cost is given by \[ c(x_t, u_t) = x_t^\TRANS Q x_t + u_t^\TRANS R u_t, \] where \(Q\) is a positive semi-definite matrix and \(R\) is a positive definite matrix. We make the standard assumption that the primitive random variables \(\{x_1, w_1, \dots, w_T, n_1, \dots, n_T \}\) are independent.
Show that if the primitive variables are Gaussian, then the conditional estimate of the state \[ \hat x_t = \EXP[ x_t | y_{1:t}, u_{1:t-1} ] \] is an information state.
Exercise 17.3 Consider a machine which can be in one of \(n\) ordered states, where the first state is the best and the last state is the worst. The production cost increases with the state of the machine. The state evolves in a Markovian manner. At each time, an agent has the option to either run the machine or stop and inspect it for a cost. After inspection, the agent may either repair the machine (at a cost that depends on the state) or replace it (at a fixed cost). The objective is to identify a maintenance policy to minimize the cost of production, inspection, repair, and replacement.
Let \(τ\) denote the time of last inspection and \(S_τ\) denote the state of the machine after inspection, repair, or replacement. Show that \((S_τ, t-τ)\) is an information state.
Notes
The discussion in this section is taken from Subramanian et al. (2022). The information state may be viewed as a generalization of the traditional notion of state (Nerode 1958), which is defined as a statistic (i.e., a function of the observations) sufficient for the input-output mapping. In contrast, we define an information state as a statistic sufficient for performance evaluation (and, therefore, for dynamic programming). Such a definition is hinted at in Witsenhausen (1976). The notion of an information state is also related to the sufficient statistics for optimal control defined in Striebel (1965) for systems with state space models.
As far as we are aware, the informal definition of information state was first proposed by Kwakernaak (1965) for adaptive control systems. Formal definitions for linear control systems were given by Bohlin (1970) for discrete time systems and by Davis and Varaiya (1972) for continuous time systems. Kumar and Varaiya (1986) define an information state as a compression of past history which satisfies property (P2a) but do not formally show that such an information state always leads to a dynamic programming decomposition.