ECSE 506: Stochastic Control and Decision Theory

Aditya Mahajan
Winter 2022

About  |  Lectures  |  Notes  |  Coursework

Theory: Risk Sensitive MDP

1 Finite horizon setup

Consider the matrix formulation of an MDP with state space \(\ALPHABET X\), action space \(\ALPHABET U\), per-step cost \(c \colon \ALPHABET X × \ALPHABET U \to \reals\), and controlled transition matrix \(P\). However, instead of the risk neutral optimization criteria that we consider previously, we consider a risk-sensitive objective. In particular, the performance of any (possibly non-Markovian) strategy \(g = (g_1, \dots, g_T)\) is given by \[ \bar J_θ(g) = \frac{1}{θ} \log \EXP\Bigl[ \exp\Bigl( θ \sum_{t=1}^T c(X_t, U_t) \Bigr) \Bigr]. \]

Recall that this is the effective cost for an exponential disutility function. Note that \(J_θ(g) = \exp(θ \bar J(g))\) may be viewed as a multiplicative cost. Based on the argument for multiplicative cost, we can write the dynamic program for \(J_θ(g)\) as follows.

Dynamic program. Initialize \(V_{T+1}(x) = 1\) and recursively compute

\[ \begin{align} Q_t(x,u) &= \exp(θ c(x,u)) \sum_{y \in \ALPHABET X} P_{xy}(u) V_{t+1}(y), \label{eq:DP-1}\\ V_t(x) &= \min_{u \in \ALPHABET X} Q_t(x,u). \label{eq:DP-2} \end{align} \]

Or, equivalently, working with the effective cost value function:

\[ \begin{align} Q_t(x,u) &= \exp(θ c(x,u)) \sum_{y \in \ALPHABET X} P_{xy}(u) \exp(θ \bar V_{t+1}(y)), \\ \bar V_t(x) &= \frac{1}{θ} \log \bigl( \min_{u \in \ALPHABET X} Q_t(x,u) \bigr). \end{align} \]

The dynamic program of \eqref{eq:DP-1}–\eqref{eq:DP-2} can be made to resemble the standard dynamic program by defining the disutility matrix \[ D_{xy}(u) = \exp(θ c(x,u)) P_{xy}(u). \] Note that the elements of \(D\) are non-negative. The expression \eqref{eq:DP-1} can then be written in “standard” form: \[ Q_{t+1}(x) = \sum_{y \in \ALPHABET X} D_{xy}(u) V_{t+1}(y). \]

2 Infinite horizon average cost setup

The objective for infinite-horizon average cost setup is to minimize: \[ J^* = \inf_{g} \limsup_{T \to ∞} \frac{1}{θT} \EXP \Bigl[ \exp\Bigl( θ \sum_{t=1}^T c(X_t, U_t) \Bigr) \Bigr]. \]

When the matrix \(P(u)\) is irreducible and aperiodic for each \(u\), the dynamic program for average cost MDP can be written as follows:

Theorem 1

Suppose there exist constant \(J\) and a bounded function \(v \colon \ALPHABET X \to \reals\) such that \[ \begin{equation} \label{eq:avg} \exp(θ (J + v(x))) = \min_{u \in \ALPHABET U} \sum_{y \in \ALPHABET X} P_{xy}(u) \exp( θ( c(x, u) + v(y))) \end{equation} \] Furthermore, let \(g^* \colon \ALPHABET X \to \ALPHABET U\) denote the policy that achieves the arg min in the right hand side. Then, \(J\) is the optimal performance and the time-homogeneous policy \(g^*\) achieves that performance.

The dynamic program of \eqref{eq:avg} can be written in an alternative form using a Legendre-type transformation and the duality relationship between relative entropy function and the logarithmic moment generating function.

Let \(Δ(\ALPHABET X)\) denote the set of probability vectors on \(\ALPHABET X\). Then, for any \(ν \in Δ(\ALPHABET X)\), the relative entropy \(I(\cdot \| ν) \colon Δ(\ALPHABET X) \to \reals \cup \{+∞\}\) is by given by \[ I(μ \| ν) = \begin{cases} \sum_{x \in \ALPHABET X} \log(λ(x)) μ(x), & \text{if } μ \ll ν, \\ +∞, & \text{otherwise}. \end{cases} \] where \[ λ(x) = \begin{cases} \frac{μ(x)}{ν(x)}, & \text{if } ν(x) \neq 0, \\ 1, & \text{otherwise}. \end{cases}\]

Lemma 1

Let \(w \colon \ALPHABET X \to \reals\) be a bounded function and \(ν \in Δ(\ALPHABET X)\). Then, \[ \log \sum_{x \in \ALPHABET X} ν(x) \exp( w(x)) = \sup_{μ \in Δ(\ALPHABET X)} \Bigl\{ \sum_{x \in \ALPHABET X} μ(x) w(x) - I(μ \| ν) \Bigr\}, \] where the supremum is attained at the unique probability measure \(μ^*\) given by \[ μ^*(x) = \frac{e^{θv(x)}}{\int e^{θv(x)}ν(x) dx} ν(x). \]

Such a dual-representation is a fundamental property of all coherent risk measures and not just the entropic risk measure that we are working with here. See, for example, Föllmer and Schied (2010).

Using Lemma 1, the dynamic program of \eqref{eq:avg} can be written as \[ \begin{equation} J + v(x) = \min_{u \in \ALPHABET U} \sup_{μ \in Δ(\ALPHABET X)} \Bigl\{ c(x,u) + \sum_{y \in \ALPHABET X} μ(y) v(y) - \frac{1}{θ} I(μ \| P(\cdot | x, u) ) \Bigr\}. \end{equation} \] This equation corresponds to the Issacs equation associated with a stochastic dynamic game with average cost-per unit time criterion.

3 Infinite horizon discounted cost setup

Whittle (2002) has a discussion on why discounted cost setup for risk-sensitive MDP is tricky and the solution depends on the interpretation of discounting.


The basic risk-sensitive MDP was first considered in Howard and Matheson (1972). See Howard and Matheson (1972) for a policy iteration algorithm. It is also mentioned that a version of this result is presented in Bellman’s book. See Hernandez-Hernández and Marcus (1996) and Hernández-Hernández (1999) for a detailed treatment of the average cost case.

Föllmer, H. and Schied, A. 2010. Convex risk measures. In: Encyclopedia of quantitative finance. American Cancer Society. DOI: 10.1002/9780470061602.eqf15003.
Hernandez-Hernández, D. and Marcus, S.I. 1996. Risk sensitive control of markov processes in countable state space. Systems & Control Letters 29, 3, 147–155. DOI: 10.1016/s0167-6911(96)00051-5.
Hernández-Hernández, D. 1999. Existence of risk-sensitive optimal stationary policies for controlled markov processes. Applied Mathematics and Optimization 40, 3, 273–285. DOI: 10.1007/s002459900126.
Howard, R.A. and Matheson, J.E. 1972. Risk-sensitive markov decision processes. Management Science 18, 7, 356–369. DOI: 10.1287/mnsc.18.7.356.
Whittle, P. 2002. Risk sensitivity, A strangely pervasive concept. Macroeconomic Dynamics 6, 1, 5–18. DOI: 10.1017/s1365100502027025.

This entry was last updated on 15 Jun 2020