8 Some simplifying structures
8.1 Post-decision state
In the inventory management example, we simplified the dynamic program using a post-decision state. This idea works in more general settings as long as the noise in the dynamics does not depend on the action. Consider, for example, the game of Tetris: the random shape that appears next does not depend on how the player moves the current piece.
We can model such situations with a post-decision state \(Z_t \in \ALPHABET Z\) such that the dynamics can be split into two parts: \[\begin{equation}\label{eq:pd-dynamics} f(s,a,w) = f^{\rm stoc} ( f^{\rm ctrl}(s,a), w ) \end{equation}\] where \(f^{\rm ctrl} \colon \ALPHABET S \times \ALPHABET A \to \ALPHABET Z\) is the _controlled_ part of the dynamics, which depends on the action but not the noise, and \(f^{\rm stoc} \colon \ALPHABET Z \times \ALPHABET W \to \ALPHABET S\) is the _(exogenous) stochastic_ part of the dynamics, which depends on the noise but not the action.
An example of such dynamics is inventory management, where the dynamics were \[ f(s,a,w) = s + a - w. \] A non-example is optimal gambling, where the dynamics \[ f(s,a,w) = s + a w \] cannot be written in the structured form described above (more precisely, the only way to do so is to take \(z = (s,a)\), which does not lead to any simplification).
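To make the decomposition concrete, here is a minimal Python sketch of the split for the inventory example; the function names `f_ctrl` and `f_stoc` are illustrative and not part of the notation above.

```python
def f_ctrl(s, a):
    # Controlled part: the post-decision state z is the stock level
    # after the order is placed but before demand is realized.
    return s + a

def f_stoc(z, w):
    # Stochastic part: the demand w is realized after ordering.
    return z - w

def f(s, a, w):
    # The original dynamics f(s, a, w) = s + a - w, recovered as a composition.
    return f_stoc(f_ctrl(s, a), w)
```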
If the model is such that the cost depends on the next state, then it is assumed that the per-step cost can also be decomposed as \[\begin{equation}\label{eq:pd-cost} c(s,a,s') = c^{\rm ctrl}(s,a) + c^{\rm stoc}( f^{\rm ctrl}(s,a), s'). \end{equation}\]
When a model satisfies \(\eqref{eq:pd-dynamics}\) and \(\eqref{eq:pd-cost}\), the dynamic programming equation can be simplified as follows. We initialize as usual: \[ V^{\rm ctrl}_{T+1}(s) = 0, \quad \forall s \in \ALPHABET S \] but simplify the recursive step as follows. For \(t \in \{T, \dots, 1\}\), \[\begin{align*} V^{\rm stoc}_t(z) &= \EXP[ c^{\rm stoc}(z, f^{\rm stoc}(z,W)) + V^{\rm ctrl}_{t+1}( f^{\rm stoc}(z,W) ) ], \quad \forall z \in \ALPHABET Z,\\ Q^{\rm ctrl}_t(s,a) &= c^{\rm ctrl}(s,a) + V^{\rm stoc}_t( f^{\rm ctrl}(s,a) ), \quad \forall (s,a) \in \ALPHABET S \times \ALPHABET A, \\ V^{\rm ctrl}_t(s) &= \min_{a \in \ALPHABET A} Q^{\rm ctrl}_t(s,a), \quad \forall s \in \ALPHABET S. \end{align*}\]
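As an illustration, the following Python sketch implements this backward recursion for a randomly generated finite model; all the arrays (`f_ctrl`, `f_stoc`, the costs, and the noise distribution) are placeholder assumptions, not part of the model above.

```python
import numpy as np

# Hypothetical finite model with |S| = 5, |A| = 3, |W| = 4, |Z| = 7.
nS, nA, nW, nZ = 5, 3, 4, 7
rng = np.random.default_rng(0)
f_ctrl = rng.integers(nZ, size=(nS, nA))   # z  = f_ctrl(s, a)
f_stoc = rng.integers(nS, size=(nZ, nW))   # s' = f_stoc(z, w)
P_W = np.full(nW, 1.0 / nW)                # pmf of the i.i.d. noise W
c_ctrl = rng.random((nS, nA))              # c_ctrl(s, a)
c_stoc = rng.random((nZ, nS))              # c_stoc(z, s')
T = 10

V_ctrl = np.zeros(nS)                      # V_ctrl_{T+1}(s) = 0
for t in range(T, 0, -1):
    # V_stoc(z) = E[ c_stoc(z, f_stoc(z, W)) + V_ctrl_{t+1}(f_stoc(z, W)) ]
    V_stoc = np.array([
        sum(P_W[w] * (c_stoc[z, f_stoc[z, w]] + V_ctrl[f_stoc[z, w]])
            for w in range(nW))
        for z in range(nZ)
    ])
    # Q_ctrl(s, a) = c_ctrl(s, a) + V_stoc(f_ctrl(s, a)): no expectation needed.
    Q_ctrl = c_ctrl + V_stoc[f_ctrl]
    V_ctrl = Q_ctrl.min(axis=1)            # V_ctrl(s) = min_a Q_ctrl(s, a)
```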
There are two advantages of post-decision state models:
1. The optimization step \[ \min_{a \in \ALPHABET A} \bigl\{ c^{\rm ctrl}(s,a) + V^{\rm stoc}_t( f^{\rm ctrl}(s,a) ) \bigr\} \] does not require any expectations, so it can be solved using standard optimization tools.
2. The complexity of computing the action-value function at time \(t\) is \(\mathcal{O}(\ABS{\ALPHABET Z} \ABS{\ALPHABET W} + \ABS{\ALPHABET S} \ABS{\ALPHABET A})\), which is typically much smaller than the standard complexity of \(\mathcal{O}(\ABS{\ALPHABET S}\ABS{\ALPHABET A}\ABS{\ALPHABET W})\). As long as the post-decision state space is small (typically \(\ABS{\ALPHABET Z} \approx \ABS{\ALPHABET S}\)), we avoid the multiplicative blow-up of \(\ABS{\ALPHABET S}\ABS{\ALPHABET A}\ABS{\ALPHABET W}\).
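For a rough sense of the savings, take illustrative sizes \(\ABS{\ALPHABET S} = \ABS{\ALPHABET Z} = 100\), \(\ABS{\ALPHABET A} = 10\), and \(\ABS{\ALPHABET W} = 50\): the post-decision recursion takes on the order of \(100 \cdot 50 + 100 \cdot 10 = 6{,}000\) operations per stage, versus \(100 \cdot 10 \cdot 50 = 50{,}000\) for the standard recursion.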
8.2 A slight generalization
There is an alternative way to look at the above decomposition (which slightly generalizes the model). We present this alternative view using the probabilistic model.
Let \(\ABS{\ALPHABET S} = n\) and \(\ABS{\ALPHABET A} = m\). Suppose there exists a post-decision state space \(\ALPHABET Z\) with \(\ABS{\ALPHABET Z} = k\) such that for all \(a \in \ALPHABET A\), \[ P(s'|s,a) = \sum_{z \in \ALPHABET Z} P^{\rm stoc}(s'|z) P^{\rm ctrl}(z|s,a) \] or, equivalently, in matrix form, \[ P(a) = P^{\rm ctrl}(a) P^{\rm stoc}, \] where \(P(a)\) is \(n \times n\), \(P^{\rm ctrl}(a)\) is \(n \times k\), and \(P^{\rm stoc}\) is \(k \times n\). The key assumption here is that \(P^{\rm stoc}\) is the same for all actions. In the previous section, we were effectively assuming that \(P^{\rm ctrl}(a)\) is deterministic (i.e., in each row of the matrix, one entry is one and all other entries are zero). As we see below, this assumption is not necessary.
With this notation, we can write the recursive step of the dynamic program as follows: \[\begin{align*} V^{\rm stoc}_t(z) &= \sum_{s' \in \ALPHABET S} P^{\rm stoc}(s'|z) \bigl[ c^{\rm stoc}(z, s') + V^{\rm ctrl}_{t+1}(s') \bigr], \quad \forall z \in \ALPHABET Z,\\ Q^{\rm ctrl}_t(s,a) &= c^{\rm ctrl}(s,a) + \sum_{z \in \ALPHABET Z} P^{\rm ctrl}(z|s,a) V^{\rm stoc}_t(z), \quad \forall (s,a) \in \ALPHABET S \times \ALPHABET A, \\ V^{\rm ctrl}_t(s) &= \min_{a \in \ALPHABET A} Q^{\rm ctrl}_t(s,a), \quad \forall s \in \ALPHABET S. \end{align*}\]
The complexity of this step is \(\mathcal{O}(nk + nmk)\), in contrast to the \(\mathcal{O}(n^2 m)\) complexity of standard DP, which is a significant saving when \(k \ll n\).
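In code, the recursive step reduces to a few matrix operations, as in the following numpy sketch; the transition matrices and costs are randomly generated placeholders.

```python
import numpy as np

n, m, k = 50, 4, 10                        # |S|, |A|, |Z| (illustrative sizes)
rng = np.random.default_rng(1)

def random_stochastic(rows, cols):
    # A random row-stochastic matrix (each row sums to one).
    M = rng.random((rows, cols))
    return M / M.sum(axis=1, keepdims=True)

P_stoc = random_stochastic(k, n)                         # P_stoc[z, s']
P_ctrl = random_stochastic(n * m, k).reshape(n, m, k)    # P_ctrl[s, a, z]
c_ctrl = rng.random((n, m))                              # c_ctrl(s, a)
c_stoc = rng.random((k, n))                              # c_stoc(z, s')
V_next = rng.random(n)                                   # V_ctrl at time t+1

# V_stoc(z) = sum_{s'} P_stoc(s'|z) [c_stoc(z, s') + V_ctrl_{t+1}(s')]: O(nk)
V_stoc = (P_stoc * (c_stoc + V_next)).sum(axis=1)
# Q_ctrl(s, a) = c_ctrl(s, a) + sum_z P_ctrl(z|s, a) V_stoc(z): O(nmk)
Q_ctrl = c_ctrl + P_ctrl @ V_stoc
V_t = Q_ctrl.min(axis=1)                                 # O(nm)
```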
8.3 MDPs with side information
In many scenarios, the decision maker observes some “side information” \(X_t\) that affects the current cost but does not influence the future evolution of the “core” state \(S_t\). For example, consider an inventory management model where the price fluctuates over time but the decision maker knows the price before placing an order.
We can model such scenarios as an MDP where the state is the pair \((S_t, X_t)\). We assume that the dynamics factorize as \[P(S_{t+1}, X_{t+1} | S_t, X_t, A_t) = P(S_{t+1} | S_t, A_t) P^{\rm info}(X_{t+1} | S_{t+1})\] where \(P^{\rm info}(x|s)\) is the probability of observing side information \(x\) when the core state is \(s\). The per-step cost is \(c(s_t, x_t, a_t)\).
The standard DP for this model is: \[V_{T+1}(s,x) = 0\] and for \(t \in \{T, \dots, 1\}\), \[V_t(s,x) = \min_{a \in \ALPHABET A} \left[ c(s,x,a) + \sum_{s_{+} \in \ALPHABET S} \sum_{x_{+} \in \ALPHABET X} P(s_{+}|s,a) P^{\rm info}(x_{+}|s_{+}) V_{t+1}(s_{+}, x_{+}) \right].\]
Since \(X_t\) does not affect the transition of \(S_t\), we can reduce the effective state space by defining an averaged value function \(V^{\rm info}_t(s)\), which represents the expected future cost at time \(t\) before \(X_t\) is observed: \[V^{\rm info}_t(s) = \sum_{x \in \ALPHABET X} P^{\rm info}(x|s) V_t(s,x).\]
With this notation, we can write the recursive step of the dynamic program as follows:
\[ V^{\rm info}_t(s) = \sum_{x \in \ALPHABET X} P^{\rm info}(x|s) \min_{a \in \ALPHABET A} \left[ c(s,x,a) + \sum_{s_{+} \in \ALPHABET S} P(s_{+}|s,a) V^{\rm info}_{t+1}(s_{+}) \right]. \]
If the inner optimization problem can be solved efficiently, then we only need to store \(V^{\rm info}_t(\cdot)\), which is an \(\ABS{\ALPHABET S}\)-dimensional vector rather than an \(\ABS{\ALPHABET S} \times \ABS{\ALPHABET X}\) table. This is particularly beneficial when \(\ALPHABET X\) is large or high-dimensional.
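The following Python sketch implements this simplified recursion for a randomly generated finite model; all arrays are illustrative placeholders.

```python
import numpy as np

nS, nX, nA = 6, 8, 3                       # |S|, |X|, |A| (illustrative sizes)
rng = np.random.default_rng(2)

def random_stochastic(rows, cols):
    # A random row-stochastic matrix (each row sums to one).
    M = rng.random((rows, cols))
    return M / M.sum(axis=1, keepdims=True)

P = random_stochastic(nS * nA, nS).reshape(nS, nA, nS)   # P[s, a, s+]
P_info = random_stochastic(nS, nX)                       # P_info[s, x]
c = rng.random((nS, nX, nA))                             # c(s, x, a)
T = 10

V_info = np.zeros(nS)                      # averaged value function at T+1
for t in range(T, 0, -1):
    # Bracketed term: c(s, x, a) + sum_{s+} P(s+|s, a) V_info_{t+1}(s+)
    Q = c + (P @ V_info)[:, None, :]       # shape (nS, nX, nA)
    # Minimize over a for each (s, x), then average over the side information x.
    V_info = (P_info * Q.min(axis=2)).sum(axis=1)
```

Note that only the \(\ABS{\ALPHABET S}\)-dimensional vector `V_info` is carried across stages; the \((s,x)\)-indexed values are never stored for more than one stage.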
Notes
The idea of post-decision state is due to Van Roy et al. (1997); it is sometimes referred to as _afterstates_ in the RL literature (Sutton and Barto 2018). Post-decision states are also useful in approximate dynamic programming; see Powell (2011).