32 DP for dynamic risk measures
risk sensitive, optimized certainty equivalent, cvar, dynamic risk measures
32.1 Beyond Entropic Risk
In the previous section, we discussed a DP decomposition for entropic risk. Entropic risk has many nice properties, but it fails to satisfy an important property called positive homogeneity: for a constant \(a\) \[ \mathcal{R}[a X] \neq a \mathcal{R}[X]. \] This means that preferences based on entropic risk depend on the unitis used to measure the cost or reward (e.g., cents vs dollars). This motivates the need to study more general risk measures.
32.1.1 Coherent risk measures and CVaR
A risk measure \(\mathcal{R}\) is a function that maps a random variable \(X \in \ALPHABET X\) to \(\reals\). A risk measure is said to be coherent if is satisfies the following properties:
- Monotonicity: if \(X \le Y\), a.s., then \(\mathcal{R}[X] \le \mathcal{R}[Y]\), higher losses mean higher risk.
- Translation equivariance: for all \(a \in \reals\), \(\mathcal{R}[a + X] \le a + \mathcal{R}[X]\),
- Sub-additivity: \(\mathcal{R}[X + Y] \le \mathcal{R}[X] + \mathcal{R}[Y]\).
- Positive homogeneity: \(\mathcal{R}[aX] = a \mathcal{R}[X]\) for all \(a \in \reals_{\ge 0}\).
These properties can be interpreted as follows:
- Monotonicity means that higher losses lead to higher risk.
- Translation equivariance means that increasing the loss by a constant amount increases the risk by the same amount.
- Sub-attivity means that diversification reduces risk.
- Positive homogeneity means that the risk does not depend on the unit of measurement.
A commonly used coherent risk measure is conditional value at risk (CVaR), which is also sometimes referred to as expected shortfall. In order to define CVaR, we first define value at risk (VaR).
Given a parameter \(α \in (0,1)\), the \(α\)-VaR of \(X\) is the \((1-α)\)-quantile of the distribution of \(X\), i.e., \[ \text{VaR}_{α}[X] = \min\{ x : \PR(X \le x) \ge 1 - α \}. \]
Given a parameter \(α \in (0,1)\), the \(α\)-CVaR of \(X\) is defined as \[ \text{CVar}_{α}[X] = \EXP[ X \mid X \ge \text{VaR}_{α}[X] ]. \]
The following result is known as Acerbi’s integral formula \[ \text{CVaR}_{α}[X] = \frac{1}{1 - α} \int_{α}^{1} \text{VaR}_{β} d β. \] Thus, \(\text{CVaR}_{α}\) is the average of \(\text{VaR}_{β}\), \(β \in [α, 1]\).
32.1.2 Optimized Certainty Equivalents (OCE)
For our purposes, a useful property of CVaR is that it be represented as \[ \text{CVaR}_{α}[X] = \inf_{m \in \reals} \biggl\{ m + \frac{1}{1 - α} \EXP \bigl[ [X - m]^{+} \bigr] \biggr\}. \] Furthermore, the minimizer of the above problem is \(\text{VaR}_{α}[X]\).
CVaR is special case of optimized certainty equivalents (OCE) which is defined as follows. Given a convex, nondecreasing loss function \(ℓ \colon \reals \to \reals\), the OCE risk of a random variable \(X\) is defined as \[ \mathcal{R}[X] = \inf_{m \in \reals} \{ m + \EXP[ℓ(X - m)] \}. \]
Some examples of OCE include:
- Mean, where \(\ell(x) = x\)
- Entropic risk, where \(\ell(x) = \frac{1}{θ}[ e^{θx} - 1]\)
- CVaR, where \(\ell(x) = \frac{1}{1-α} [x]^{+}\)
32.1.3 Cost vs rewards
The above definitions measure risk for a loss random variable \(X\), where large values are bad. In a reward maximization setting \(X\) represents a reward, so the lower tail is the dangerous one (i.e., the VaR and CVaR of reward \(X\) are the VaR and CVaR of cost \(-X\)).
In particular, in reward maximization, VaR is defined as the \(α\)-quantile of \(X\), i.e., \[\text{VaR}_{α}[X] = \max\{ x : \PR(X \ge x) \ge 1 - α \}.\] Then, CVaR focuses on the worst \((1-α)\) fraction of reward outcomes, i.e., \[ \text{CVaR}_{α}[X] = \EXP[ X \mid X \le \text{VaR}_{α}[X] ], \]
Moreover, in the reward maximization setting, the dual representation of CVaR is given by \[ \text{CVaR}_{α}[X] = \sup_{m \in \reals} \biggl\{ m - \frac{1}{1 - α} \EXP \bigl[ [m - X]^{+} \bigr] \biggr\}. \] The maximizer is again \(\text{VaR}_{α}[X]\).
The general form of OCE is as follows: given a convex, nondecreasing loss function \(\ell \colon \reals \to \reals\), The OCE risk of a (reward) \(X\) is defined as \[ \mathcal{R}[X] = \sup_{m \in \reals} \{ m - \EXP[\ell(m - X)] \}. \]
32.2 Static vs dynamic risk
Consider a finite horizon MDP. Fix a policy \(π\) and let \(C_t\) denote \(c(S_t, π_t(S_t))\). There are two ways to evaluate the risk of the sequence of random variables \(C_1, C_2, \dots, C_T\):
- Static risk: \(\mathcal{R}[C_1 + C_2 + \cdots + C_T]\).
- Dynamic risk: \(\mathcal{R}\bigl[ C_1 + \mathcal{R}[ C_2 + \mathcal{R}[ C_3 + \cdots \mathcal{R}[C_T] \cdots ] \mid S_2] \mid S_1 \bigr]\).
When these two risks are equal (e.g., in the risk neutral case), the risk measure is called time consistent. It was shown in Kupper and Schachermayer (2009) that entropic risk is the only law invariant risk measure with this property (Recall that expectation and worst-case are both special case of entropic risk measure).
Thus, there are two ways to formulate risk-aware MDPs with general risk measures: minimize the static risk or minimize the dynamic risk. Dynamic risk can be directly optimized using dynamic programming, while obtaining a dynamic program for optimizing static risk requires state augmentation, with \(C_1 + \cdots + C_{t-1}\) as the auxiliary part of the state.
In this section, we focus on dynamic risk measures.
32.3 Dynamic programming for dynamic risk measures
Consider a finite-horizon MDP with state space \(\ALPHABET S\), action space \(\ALPHABET A\), per-step cost \(c \colon \ALPHABET S \times \ALPHABET A \to \reals\), and transition kernel \(P(\cdot \mid s,a)\).
Let \(π = (π_1, \dots, π_T)\) be a deterministic policy. Then, the value function of policy \(π\) is defined as follows. Initialize \(V^π_{T+1}(s) = 0\) and for \(t \in \{T, \dots, 1\}\), recursively define \[\begin{align*} Q^π_t(s,a) &= c(s,a) + \mathcal{R}\bigl[ V^π_{t+1}(S_{t+1}) \mid S_t = s, A_t = a \bigr] \\ V^π_t(s) &= Q^π_t(s, π_t(s)). \end{align*}\]
A policy \(π^\star = (π^\star_1, \dots, π^\star_T)\) is optimal if \[ V^{π^\star}_1(s) \le V^{π}_1(s), \quad \forall s \in \ALPHABET S, \forall π \in Π. \]
By following the standard DP approach, we have the following. Define the optimal value functions as follows. Initialize \(V^\star_{T+1}(s) = 0\) and for \(t \in \{T, \dots, 1 \}\), recursively define \[\begin{align} Q^\star_t(s,a) &= c(s,a) + \mathcal{R}\bigl[ V^\star_{t+1}(S_{t+1}) \mid S_t = s, A_t = a \bigr] \label{eq:RS-DP1}\\ V^\star_t(s) &= \min_{a \in \ALPHABET A} Q^\star_t(s, a). \label{eq:RS-DP2} \end{align}\] Then, a policy \(π^\star\) is optimal if and only if \(π^\star_t\) achieves the arg min of \(Q^\star_t(s,a)\).
32.3.1 DP for OCE risk measures
For OCE risk measures, we can simplify \(\eqref{eq:RS-DP1}\)–\(\eqref{eq:RS-DP2}\) as follows: Define \[ H^\star_t(s,a,m) = c(s,a) + m + \sum_{s' \in \ALPHABET S} P(s'|s,a) \ell( V^\star_{t+1}(s') - m ). \] Then, \[ V^\star_t(s) = \min_{a \in \ALPHABET A} \inf_{m \in \ALPHABET R} H^\star_t(s,a,m). \]
32.4 Example: inventory management with CVaR risk
Let’s revisit the inventory management model with entropic risk. The solution for different values of risk-sensitivity parameter \(α\) are shown in Figure 32.1. As before, we assume that the deman is Binomial \((50, 0.4)\) with parameters \(c_h = 2\), \(c_s = 5\), \(p = 1\).
As \(α\) increases, the criterion focuses more on high-cost tail outcomes. Thus, shortage scenarios matter more, and therefore the threshold of the optimal policy increases, but the effect is minimal.
Let’s change the arrival distribution to a heavy tail distribution, say Negative Binomial\((5, 0.2)\), which has the same mean as Binomial\((50, 0.4)\) but higher deviation (\(\approx 10\) vs \(\approx 3.5\)). The solution for different values of \(α\) are shown in Figure 32.2.
Exercises
Exercise 32.1 Let \(X\) be a discrete random variable that takes \(n\) values \(\{x_1, \dots, x_n\}\) with probabilities \(\{p_1, \dots, p_n\}\) (with \(p_i = \PR(X = x_i)\)) where \[ x_1 < x_2 < \cdots < x_n. \]
Let \(k\) be the smallest index such that \[ \sum_{i=1}^k p_i \ge α. \] Define \(q_k = F(x_k) - \alpha\), where \(F\) is the cumulative distribution function of \(X\).
Then:
\(\text{VaR}_{α}[X] = x_k\).
\(\text{CVaR}_{α}[X] = \dfrac{1}{1-α} \bigl[q_k x_k + \sum_{i=k+1}^n p_i x_i\bigr]\).
Notes
The notion of coherent risk measure is due to Artzner et al. (1999). The notiono of dynamic risk measures is due to Ruszczyński (2010).