32 DP for dynamic risk measures

Author

Affiliation

Aditya Mahajan

McGill University

Updated

May 11, 2026

Keywords

risk sensitive, optimized certainty equivalent, cvar, dynamic risk measures

32.1 Beyond Entropic Risk

In the previous section, we discussed a DP decomposition for entropic risk. Entropic risk has many nice properties, but it fails to satisfy an important property called positive homogeneity: for a constant \(a\) \[ \mathcal{R}[a X] \neq a \mathcal{R}[X]. \] This means that preferences based on entropic risk depend on the units used to measure the cost or reward (e.g., cents vs dollars). This motivates the need to study more general risk measures.

32.1.1 Coherent risk measures and CVaR

A risk measure \(\mathcal{R}\) is a function that maps a random variable \(X \in \ALPHABET X\) to \(\reals\). A risk measure is said to be coherent if it satisfies the following properties:

Monotonicity: if \(X \le Y\), a.s., then \(\mathcal{R}[X] \le \mathcal{R}[Y]\), higher losses mean higher risk.
Translation equivariance: for all \(a \in \reals\), \(\mathcal{R}[a + X] \le a + \mathcal{R}[X]\),
Sub-additivity: \(\mathcal{R}[X + Y] \le \mathcal{R}[X] + \mathcal{R}[Y]\).
Positive homogeneity: \(\mathcal{R}[aX] = a \mathcal{R}[X]\) for all \(a \in \reals_{\ge 0}\).

These properties can be interpreted as follows:

Monotonicity means that higher losses lead to higher risk.
Translation equivariance means that increasing the loss by a constant amount increases the risk by the same amount.
Sub-additivity means that diversification reduces risk.
Positive homogeneity means that the risk does not depend on the unit of measurement.

A commonly used coherent risk measure is conditional value at risk (CVaR), which is also sometimes referred to as expected shortfall. In order to define CVaR, we first define value at risk (VaR).

Given a parameter \(α \in (0,1)\), the \(α\)-VaR of \(X\) is the \((1-α)\)-quantile of the distribution of \(X\), i.e., \[ \text{VaR}_{α}[X] = \min\{ x : \PR(X \le x) \ge 1 - α \}. \]

Given a parameter \(α \in (0,1)\), the \(α\)-CVaR of \(X\) is defined as \[ \text{CVaR}_{α}[X] = \EXP[ X \mid X \ge \text{VaR}_{α}[X] ]. \]

The following result is known as Acerbi’s integral formula \[ \text{CVaR}_{α}[X] = \frac{1}{1 - α} \int_{α}^{1} \text{VaR}_{β} d β. \] Thus, \(\text{CVaR}_{α}\) is the average of \(\text{VaR}_{β}\), \(β \in [α, 1]\).

32.1.2 Optimized Certainty Equivalents (OCE)

For our purposes, a useful property of CVaR is that it be represented as \[ \text{CVaR}_{α}[X] = \inf_{m \in \reals} \biggl\{ m + \frac{1}{1 - α} \EXP \bigl[ [X - m]^{+} \bigr] \biggr\}. \] Furthermore, the minimizer of the above problem is \(\text{VaR}_{α}[X]\).

CVaR is a special case of optimized certainty equivalents (OCE) which is defined as follows. Given a convex, nondecreasing loss function \(ℓ \colon \reals \to \reals\), the OCE risk of a random variable \(X\) is defined as \[ \mathcal{R}[X] = \inf_{m \in \reals} \{ m + \EXP[ℓ(X - m)] \}. \]

Some examples of OCE include:

Mean, where \(\ell(x) = x\)
Entropic risk, where \(\ell(x) = \frac{1}{θ}[ e^{θx} - 1]\)
CVaR, where \(\ell(x) = \frac{1}{1-α} [x]^{+}\)

32.1.3 Cost vs rewards

The above definitions measure risk for a loss random variable \(X\), where large values are bad. In a reward maximization setting \(X\) represents a reward, so the lower tail is the dangerous one (i.e., the VaR and CVaR of reward \(X\) are the VaR and CVaR of cost \(-X\)).

In particular, in reward maximization, VaR is defined as the \(α\)-quantile of \(X\), i.e., \[\text{VaR}_{α}[X] = \max\{ x : \PR(X \ge x) \ge 1 - α \}.\] Then, CVaR focuses on the worst \((1-α)\) fraction of reward outcomes, i.e., \[ \text{CVaR}_{α}[X] = \EXP[ X \mid X \le \text{VaR}_{α}[X] ], \]

Moreover, in the reward maximization setting, the dual representation of CVaR is given by \[ \text{CVaR}_{α}[X] = \sup_{m \in \reals} \biggl\{ m - \frac{1}{1 - α} \EXP \bigl[ [m - X]^{+} \bigr] \biggr\}. \] The maximizer is again \(\text{VaR}_{α}[X]\).

The general form of OCE is as follows: given a convex, nondecreasing loss function \(\ell \colon \reals \to \reals\), the OCE risk of a reward random variable \(X\) is defined as \[ \mathcal{R}[X] = \sup_{m \in \reals} \{ m - \EXP[\ell(m - X)] \}. \]

32.2 Static vs dynamic risk

Consider a finite horizon MDP. Fix a policy \(π\) and let \(C_t\) denote \(c(S_t, π_t(S_t))\). There are two ways to evaluate the risk of the sequence of random variables \(C_1, C_2, \dots, C_T\):

Static risk: \(\mathcal{R}[C_1 + C_2 + \cdots + C_T]\).
Dynamic risk: \(\mathcal{R}\bigl[ C_1 + \mathcal{R}[ C_2 + \mathcal{R}[ C_3 + \cdots \mathcal{R}[C_T] \cdots ] \mid S_2] \mid S_1 \bigr]\).

When these two risks are equal (e.g., in the risk neutral case), the risk measure is called time consistent. It was shown in Kupper and Schachermayer (2009) that entropic risk is the only law invariant risk measure with this property (Recall that expectation and worst-case are both special cases of the entropic risk measure).

Thus, there are two ways to formulate risk-aware MDPs with general risk measures: minimize the static risk or minimize the dynamic risk. Dynamic risk can be directly optimized using dynamic programming, while obtaining a dynamic program for optimizing static risk requires state augmentation, with \(C_1 + \cdots + C_{t-1}\) as the auxiliary part of the state.

In this section, we focus on dynamic risk measures.

32.3 Dynamic programming for dynamic risk measures

Consider a finite-horizon MDP with state space \(\ALPHABET S\), action space \(\ALPHABET A\), per-step cost \(c \colon \ALPHABET S \times \ALPHABET A \to \reals\), and transition kernel \(P(\cdot \mid s,a)\).

Let \(π = (π_1, \dots, π_T)\) be a deterministic policy. Then, the value function of policy \(π\) is defined as follows. Initialize \(V^π_{T+1}(s) = 0\) and for \(t \in \{T, \dots, 1\}\), recursively define \[\begin{align*} Q^π_t(s,a) &= c(s,a) + \mathcal{R}\bigl[ V^π_{t+1}(S_{t+1}) \mid S_t = s, A_t = a \bigr] \\ V^π_t(s) &= Q^π_t(s, π_t(s)). \end{align*}\]

A policy \(π^\star = (π^\star_1, \dots, π^\star_T)\) is optimal if \[ V^{π^\star}_1(s) \le V^{π}_1(s), \quad \forall s \in \ALPHABET S, \forall π \in Π. \]

By following the standard DP approach, we have the following. Define the optimal value functions as follows. Initialize \(V^\star_{T+1}(s) = 0\) and for \(t \in \{T, \dots, 1 \}\), recursively define \[\begin{align} Q^\star_t(s,a) &= c(s,a) + \mathcal{R}\bigl[ V^\star_{t+1}(S_{t+1}) \mid S_t = s, A_t = a \bigr] \label{eq:RS-DP1}\\ V^\star_t(s) &= \min_{a \in \ALPHABET A} Q^\star_t(s, a). \label{eq:RS-DP2} \end{align}\] Then, a policy \(π^\star\) is optimal if and only if \(π^\star_t\) achieves the arg min of \(Q^\star_t(s,a)\).

32.3.1 DP for OCE risk measures

For OCE risk measures, we can simplify \(\eqref{eq:RS-DP1}\)–\(\eqref{eq:RS-DP2}\) as follows: Define \[ H^\star_t(s,a,m) = c(s,a) + m + \sum_{s' \in \ALPHABET S} P(s'|s,a) \ell( V^\star_{t+1}(s') - m ). \] Then, \[ V^\star_t(s) = \min_{a \in \ALPHABET A} \inf_{m \in \ALPHABET R} H^\star_t(s,a,m). \]

32.4 Example: inventory management with CVaR risk

Let’s revisit the inventory management model with CVaR risk. The solution for different values of risk-sensitivity parameter \(α\) are shown in Figure 32.1. As before, we assume that the demand is Binomial\((50, 0.4)\) with parameters \(c_h = 2\), \(c_s = 5\), \(p = 1\).

viewof OCE_alpha_slide = Object.assign(
  Inputs.range([OCE_alpha_min, OCE_alpha_max], {
    label: "α",
    step: OCE_alpha_step,
    value: 0.8
  }),
  { style: "--label-width:40px" }
)

OCE_val_plot = Plot.plot({
  grid: true,
  x: { domain: [OCE_xmin, OCE_xmax], label: "s" },
  y: {
    domain: OCE_y_value,
    label: `CVaR value (alpha = ${OCE_alpha_slide})`,
  },
  marks: [
    Plot.ruleX([0]),
    Plot.ruleY([0]),
    Plot.line(
      OCE_df.filter((d) => Math.abs(d.alpha - OCE_alpha_slide) < 1e-6),
      {
        x: "state",
        y: "value",
        curve: "step-after",
        strokeWidth: 2,
      }
    ),
  ],
})

OCE_pol_plot = Plot.plot({
  grid: true,
  x: { domain: [OCE_xmin, OCE_xmax], label: "s" },
  y: {
    domain: OCE_y_policy,
    label: `order (alpha = ${OCE_alpha_slide})`,
  },
  marks: [
    Plot.ruleX([0]),
    Plot.ruleY([0]),
    Plot.line(
      OCE_df.filter((d) => Math.abs(d.alpha - OCE_alpha_slide) < 1e-6),
      {
        x: "state",
        y: "policy",
        curve: "step-after",
        strokeWidth: 2,
      }
    ),
  ],
})

(a) CVaR value \(V_1(s)\) at \(t=1\).

(b) Optimal order quantity \(π_1(s)\) at \(t=1\).

Figure 32.1: Inventory DP under CVaR risk for Binomial demand.

As \(α\) increases, the criterion focuses more on high-cost tail outcomes. Thus, shortage scenarios matter more, and therefore the threshold of the optimal policy increases, but the effect is minimal.

Let’s change the demand distribution to a heavy-tailed distribution, say Negative Binomial\((5, 0.2)\), which has the same mean as Binomial\((50, 0.4)\) but a much higher standard deviation (\(\approx 10\) vs \(\approx 3.5\)). The solution for different values of \(α\) are shown in Figure 32.2.

viewof HT_alpha_slide = Object.assign(
  Inputs.range([HT_alpha_min, HT_alpha_max], {
    label: "α",
    step: HT_alpha_step,
    value: 0.0
  }),
  { style: "--label-width:40px" }
)

HT_val_plot = Plot.plot({
  grid: true,
  x: { domain: [HT_xmin, HT_xmax], label: "s" },
  y: {
    domain: HT_y_value,
    label: `CVaR value (alpha = ${HT_alpha_slide})`,
  },
  marks: [
    Plot.ruleX([0]),
    Plot.ruleY([0]),
    Plot.line(
      HT_df.filter((d) => Math.abs(d.alpha - HT_alpha_slide) < 1e-6),
      {
        x: "state",
        y: "value",
        curve: "step-after",
        strokeWidth: 2,
      }
    ),
  ],
})

HT_pol_plot = Plot.plot({
  grid: true,
  x: { domain: [HT_xmin, HT_xmax], label: "s" },
  y: {
    domain: HT_y_policy,
    label: `order (alpha = ${HT_alpha_slide})`,
  },
  marks: [
    Plot.ruleX([0]),
    Plot.ruleY([0]),
    Plot.line(
      HT_df.filter((d) => Math.abs(d.alpha - HT_alpha_slide) < 1e-6),
      {
        x: "state",
        y: "policy",
        curve: "step-after",
        strokeWidth: 2,
      }
    ),
  ],
})

(a) CVaR value \(V_1(s)\) at \(t=1\).

(b) Optimal order quantity \(π_1(s)\) at \(t=1\).

Figure 32.2: Inventory DP under CVaR risk for the Negative Binomial demand.

Exercises

Exercise 32.1 Let \(X\) be a discrete random variable that takes \(n\) values \(\{x_1, \dots, x_n\}\) with probabilities \(\{p_1, \dots, p_n\}\) (with \(p_i = \PR(X = x_i)\)) where \[ x_1 < x_2 < \cdots < x_n. \]

Let \(k\) be the smallest index such that \[ \sum_{i=1}^k p_i \ge α. \] Define \(q_k = F(x_k) - \alpha\), where \(F\) is the cumulative distribution function of \(X\).

Then:

\(\text{VaR}_{α}[X] = x_k\).
\(\text{CVaR}_{α}[X] = \dfrac{1}{1-α} \bigl[q_k x_k + \sum_{i=k+1}^n p_i x_i\bigr]\).

Notes

The notion of coherent risk measure is due to Artzner et al. (1999). The notion of dynamic risk measures is due to Ruszczyński (2010).