31 Risk Sensitive MDPs
risk sensitive, dynamic programming
31.1 Finite horizon setup
Consider an MDP with state space \(\ALPHABET S\), action space \(\ALPHABET A\), per-step cost \(c \colon \ALPHABET S × \ALPHABET A \to \reals\), and controlled transition matrix \(P\). However, instead of the risk neutral optimization criteria that we consider previously, we consider a risk-sensitive objective with an entropic risk measure. For the ease of notation, we define the (risk-aware) certainty equivalent cost of a random return \(X\) as \[ \mathcal{R}[X] = \frac{1}{θ} \log \EXP [ \exp(θX) ] \] which corresponds to a risk-averse objective when \(θ > 0\) and a risk-seeking objective when \(θ < 0\).
Thus, the performance of any (possibly non-Markovian) strategy \(π = (π_1, \dots, π_T)\) is given by \[ J_θ(π) = \mathcal{R}\biggl[ \sum_{t=1}^T c(S_t, A_t) \biggr] = \frac{1}{θ} \log \EXP\Bigl[ \exp\Bigl( θ \sum_{t=1}^T c(S_t, A_t) \Bigr) \Bigr]. \]
Recall that this is the effective cost for an exponential disutility function. Note that \(\bar J_θ(π) = \exp(θ J_θ(π))\) may be viewed as a multiplicative cost. Based on the argument for multiplicative cost, we can write the dynamic program for \(\bar J_θ(π)\). as follows.
Initialize \(\bar V_{T+1}(s) = 1\) and recursively compute
\[\begin{align} \bar Q_t(s,a) &= \exp(θ c(s,a)) \sum_{s' \in \ALPHABET S} P_{ss'}(a) \bar V_{t+1}(s'), \label{eq:DP-1}\\ \bar V_t(s) &= \begin{cases} \min_{a \in \ALPHABET{A}} \bar Q_t(s,a), & θ > 0 \\ \max_{a \in \ALPHABET{A}} \bar Q_t(s,a), & θ < 0 \end{cases} \label{eq:DP-2} \end{align}\]
The dynamic program of \eqref{eq:DP-1}–\eqref{eq:DP-2} can be written in certainty-equivalent form with \[ V_t(s) = \frac{1}{θ} \log \bar V_t(s) \] Then the above dynamic program simplies as follows:
Initialize \(V_{T+1}(s) = 0\) and recursively compute
\[\begin{align} Q_t(s,a) &= c(s,a) + \frac{1}{θ} \log \sum_{s' \in \ALPHABET S} P(s'|s,a) \exp(θ V_{t+1}(s')), \label{eq:DP-3} \\ V_t(s) &= \min_{a \in \ALPHABET A} Q_t(s,a). \label{eq:DP-4} \end{align}\]
Observe that the two different cases in \(\eqref{eq:DP-2}\) simplies to a single expression in \(\eqref{eq:DP4}\), irrespective of the sign of \(θ\). This is because when \(θ < 0\), in the scaled recursion, we take the maximum of \(\bar Q_t(s,a)\), but \(\frac{1}{θ}\log(⋅)\) is a decreasing function, because \(\frac{1}{θ} < 0\), so we take the minimum of \(Q_t(s,a)\).
31.2 Example: inventory management with entropic risk
Let’s revisit the inventory management model with entropic risk. The solution for different values of risk-sensitivity parameter \(θ\) are shown in Figure 31.1. Recall that \(θ = 0\) corresponds to the risk neutral setting. As before, we assume that the deman is Binomial \((50, 0.4)\) with parameters \(c_h = 2\), \(c_s = 5\), \(p = 1\).
Observe that risk averse agents (\(θ > 0\)) act as if high shortage scenarios matter more and therefore order more (higher base-stock level) than the risk-neutral case (\(θ = 0\)). On the other hand, risk seeking agents (\(θ < 0\)) act as if shortage scenarios matter less and therefore order less (lower base-stock level) than risk-neutral case.
31.3 Infinite horizon average cost setup
The objective for infinite-horizon average cost setup is to minimize: \[ J^* = \inf_{π} \limsup_{T \to ∞} \frac{1}{θT} \EXP \Bigl[ \exp\Bigl( θ \sum_{t=1}^T c(S_t, A_t) \Bigr) \Bigr]. \]
When the matrix \(P(a)\) is irreducible and aperiodic for each \(a\), the dynamic program for average cost MDP can be written as follows:
Theorem 31.1 Suppose there exist constant \(J\) and a bounded function \(v \colon \ALPHABET S \to \reals\) that satisfy the following fixed point equation \[\begin{equation} \label{eq:avg} J + v(s) = \min_{a \in \ALPHABET A} \biggl\{ c(s,a) + \frac{1}{θ} \log \sum_{s' \in \ALPHABET S} P_{ss'}(a) \exp\bigl( θ\, v(s') \bigr) \biggr\}. \end{equation}\] Furthermore, let \(π^* \colon \ALPHABET S \to \ALPHABET A\) denote the policy that achieves the arg min in \(\eqref{eq:avg}\). Then, \(J\) is the optimal risk-sensitive performance and the time-homogeneous policy \(π^*\) achieves that performance.
The dynamic program of \(\eqref{eq:avg}\) can be written in an alternative form using a Legendre-type transformation and the duality relationship between relative entropy function and the log-sum-exp.
Let \(Δ(\ALPHABET S)\) denote the set of probability vectors on \(\ALPHABET S\). Then, for any \(ν \in Δ(\ALPHABET S)\), the relative entropy \(I(\cdot \| ν) \colon Δ(\ALPHABET S) \to \reals \cup \{+∞\}\) is given by \[ I(μ \| ν) = \begin{cases} \sum_{s \in \ALPHABET S} \log(λ(s)) μ(s), & \text{if } μ \ll ν, \\ +∞, & \text{otherwise}. \end{cases} \] where the likelihood ratio \[ λ(s) = \begin{cases} \frac{μ(s)}{ν(s)}, & \text{if } ν(s) \neq 0, \\ 1, & \text{otherwise}. \end{cases}\]
Lemma 31.1 Let \(w \colon \ALPHABET S \to \reals\) be a bounded function and \(ν \in Δ(\ALPHABET S)\). Then, \[ \log \sum_{s \in \ALPHABET S} ν(s) \exp( θw(s)) = \inf_{μ \in Δ(\ALPHABET S)} \Bigl\{ \sum_{s \in \ALPHABET S} μ(s) w(s) + \frac{1}{θ} I(μ \| ν) \Bigr\}, \] where the infimum is attained at the unique probability measure \(μ^*\) given by \[ μ^*(s) = \frac{e^{θw(s)}}{\sum_{s \in \ALPHABET S} e^{θw(s)}ν(s)} ν(s). \]
Such a dual-representation is a fundamental property of all coherent risk measures and not just the entropic risk measure that we are working with here. See, for example, Föllmer and Schied (2010).
Using Lemma 31.1, the dynamic program of \eqref{eq:avg} can be written as \[\begin{equation} J + v(s) = \min_{a \in \ALPHABET A} \inf_{μ \in Δ(\ALPHABET S)} \Bigl\{ c(s,a) + \sum_{s' \in \ALPHABET S} μ(s') v(s') + \frac{1}{θ} I(μ \| P(\cdot | s, a) ) \Bigr\}. \end{equation}\] This equation corresponds to the Issacs equation associated with a stochastic dynamic game with average cost-per unit time criterion.
31.4 Infinite horizon discounted cost setup
Whittle (2002) has a discussion on why discounted cost setup for risk-sensitive MDP is tricky and the solution depends on the interpretation of discounting. For a dynamic programming decomposition for discounted entropic risk, see Lin Hau et al. (2023).
Also see Huang et al. (2024) for different interpretations of discounting for general risk measures.
Notes
The basic risk-sensitive MDP was first considered in Howard and Matheson (1972). See Howard and Matheson (1972) for a policy iteration algorithm. It is also mentioned that a version of this result is presented in Bellman’s book. See Hernandez-Hernández and Marcus (1996) and Hernández-Hernández (1999) for a detailed treatment of the average cost case.