8 Monotonicity of value function and optimal policies

Author

Affiliation

Aditya Mahajan

McGill University

Updated

March 23, 2024

Keywords

MDPs, stochastic monotonicity, structural results

Summary

In many applications, it is useful to know if the value function and the optimal policy is weakly increasing (or weakly decreasing) in state. Sufficient conditions are presented under which such monotonicity properties hold. These properties are useful because they makes it easy to search and implement the optimal policy.

Consider the machine repair example from Exercise 5.4. We compute the optimal value function and optimal policy at $t=1$ for this model when $W \sim \text{Binom}(n,p)$ when $n = 10$, $T = 20$, $h(s) = 2s$, and $λ=20$ for different values of $p$.

T = 20
n = 10
c = 2
// K is lambda in the description
K = 20
h = function(s) { return c*s }

Pw = {
  const n = 10
  
  var points = new Array(n+1)
  var cumulative = 0
  var probability = 0

  for(var k = 0; k <= n; k++) {
    probability = binomial(n,k,p)
    cumulative += probability
    points[k] = { probability: probability, cumulative: cumulative }
  }

  return points
}

binomial = {
  // From https://stackoverflow.com/a/37715980/193149
  const logf = [0, 0, 0.6931471805599453, 1.791759469228055, 3.1780538303479458, 4.787491742782046, 6.579251212010101, 8.525161361065415, 10.60460290274525, 12.801827480081469, 15.104412573075516, 17.502307845873887, 19.987214495661885, 22.552163853123425, 25.19122118273868, 27.89927138384089, 30.671860106080672, 33.50507345013689, 36.39544520803305, 39.339884187199495, 42.335616460753485, 45.38013889847691, 48.47118135183523, 51.60667556776438, 54.78472939811232, 58.00360522298052, 61.261701761002, 64.55753862700634, 67.88974313718154, 71.25703896716801, 74.65823634883016, 78.0922235533153, 81.55795945611504, 85.05446701758152, 88.58082754219768, 92.1361756036871, 95.7196945421432, 99.33061245478743, 102.96819861451381, 106.63176026064346, 110.32063971475739, 114.0342117814617, 117.77188139974507, 121.53308151543864, 125.3172711493569, 129.12393363912722, 132.95257503561632, 136.80272263732635, 140.67392364823425, 144.5657439463449, 148.47776695177302, 152.40959258449735, 156.3608363030788, 160.3311282166309, 164.32011226319517, 168.32744544842765,  172.3527971391628, 176.39584840699735, 180.45629141754378, 184.53382886144948, 188.6281734236716, 192.7390472878449, 196.86618167289, 201.00931639928152, 205.1681994826412, 209.34258675253685, 213.53224149456327, 217.73693411395422, 221.95644181913033, 226.1905483237276, 230.43904356577696, 234.70172344281826, 238.97838956183432, 243.2688490029827, 247.57291409618688, 251.8904022097232, 256.22113555000954, 260.5649409718632, 264.9216497985528, 269.2910976510198, 273.6731242856937, 278.0675734403661, 282.4742926876304, 286.893133295427, 291.3239500942703, 295.76660135076065, 300.22094864701415, 304.6868567656687, 309.1641935801469, 313.65282994987905, 318.1526396202093, 322.66349912672615, 327.1852877037752, 331.7178871969285, 336.26118197919845, 340.815058870799, 345.37940706226686, 349.95411804077025, 354.5390855194408, 359.1342053695754, 363.73937555556347]

  return function(n, k, p) {
      return Math.exp(logf[n] - logf[n-k] - logf[k]) * p**k * (1-p)**(n-k)
  }
}

DP = {
  var DP = new Array()
  var idx = 0

  var V = new Array(n+1)
  var Q0 = new Array(n+1)
  var Q1 = new Array(n+1)

  var a = 0
  var val = 0

  // Initialize the terminal value function
  for (var s = 0; s <= n; s++) {
    V[1+s] = 0
  }

  // Dynamic Programming
  for (var t = T; t >= 1; t--) {
    //Q0[s] = h[s] + E[ V(s+W) ]
    //Q1[s] = K + E[ V(W) ]
    for (var s = 0; s <= n; s++) {
      Q0[1+s] = h(s)
      Q1[1+s] = K
      for (var w=0; w <= n; w++) {
        var s_next = Math.min(s+w, n)
        Q0[1+s] += V[ 1 + s_next ]*Pw[w].probability 
        Q1[1+s] += V[ 1 + w ]*Pw[w].probability 
      }
    }

    for (var s = 0; s <= n; s++) {
      if (Q0[1+s] <= Q1[1+s]) {
         a = 0
         val = Q0[1+s]
      } else {
         a = 1
         val = Q1[1+s]
      }
      DP[idx++] = { time: t, state: s, value: val, action: a }
      V[1+s] = val
    }
  }

  return DP;
}

viewof p = Inputs.range([0.1, 0.95], {label: "p", step: 0.1})
time = 1

value_plot = Plot.plot({
  grid: true,
  y: {domain: [120,380]},
  marks: [
    // Axes
    Plot.ruleX([0]),
    // Plot.ruleY([0]),
    // Data
    Plot.line(DP.filter(d => d.time == time), {x:"state", y:"value", curve:"step-after"})
  ]
})

action_plot = Plot.plot({
  grid: true,
  y : {ticks: 1 },
  marks: [
    // Axes
    Plot.ruleX([0]),
    Plot.ruleY([0]),
    // Data
    Plot.ruleX(DP.filter(d => d.time == time), {x: "state", y: "action", strokeWidth: 2}),
    Plot.dot(DP.filter(d => d.time == time), {x: "state", y: "action", fill: "currentColor", r: 4})
  ]
})

(a) Value function

(b) Optimal policy

Figure 8.1: Solution of the machine repair problem

Qualitative property of optimal policy

In the machine replacement problem shown in Figure 8.1, the optimal policy satisfies a qualitative property: the optimal policy is monotone, that is if $s < s'$ then $π^*_t(s) \le π^*_t(s')$. Intuitively, it makes sense. But how can we establish that this is the case for other choices of problem parameters as well?

8.1 Stochastic dominance

Let $\ALPHABET S$ be a totally ordered finite set, say $\{1, \dots, n\}$.

Stochastic dominance is a partial order on random variables defined on totally ordered sets

(First order) stochastic dominance

Suppose $S^1$ and $S^2$ are $\ALPHABET S$ valued random variables where $S^1 \sim \mu^1$ and $S^2 \sim \mu^2$. We say $S^1$ stochastically dominates $S^2$ if for any $s \in \ALPHABET S$, \[\begin{equation}\label{eq:inc-prob} \PR(S^1 \ge s) \ge \PR(S^2 \ge s). \end{equation}\]

Stochastic domination is denoted by $S^1 \succeq_s S^2$ or $\mu^1 \succeq_s \mu^2$.

Let ${\rm M}^1$ and ${\rm M}^2$ denote the CDF of $\mu^1$ and $\mu^2$. Then \eqref{eq:inc-prob} is equivalent to the following: \[\begin{equation}\label{eq:cdf} {\rm M}^1_s \le {\rm M}^2_s, \quad \forall s \in \ALPHABET S. \end{equation}\] Thus, visually, $S^1 \succeq_s S^2$ means that the CDF of $S^1$ lies below the CDF of $S^2$.

Examples

$[0, 0, \tfrac 14, \tfrac 34] \succeq_s [0, \tfrac 14, \tfrac 34, 0] \succeq_s [\tfrac 14, \tfrac 34, 0, 0]$
$[0,0,0,1] \succeq_s [0,0, \tfrac 12, \tfrac 12] \succeq_s [0, \tfrac 13, \tfrac 13, \tfrac 13] \succeq_s [\tfrac 14, \tfrac 14, \tfrac 14, \tfrac 14]$

Stochastic dominance is important due to the following property.

Theorem 8.1 Let $f \colon \ALPHABET S \to \reals$ be a (weakly) increasing function and $S^1 \sim \mu^1$ and $S^2 \sim \mu^2$ are random variables defined on $\ALPHABET S$. Then $S^1 \succeq_s S^2$ if and only if \[\begin{equation}\label{eq:inc-fun} \EXP[f(S^1)] \ge \EXP[f(S^2)]. \end{equation}\]

Abel’s lemma or summation by parts

For any two sequences $\{f_k\}_{k \ge 1}$ and $\{g_k\}_{k \ge 1}$, \[\sum_{k=m}^n f_k(g_{k+1} - g_{k}) = (f_n g_{n+1} - f_m g_m) + \sum_{k=m+1}^n g_k(f_{k+1} - f_k).\]

Summation by parts may be viewed as the discrete analog of integration by parts: \[ \int_a^b f(x) g'(x) dx = f(x) g(x)\Big|_{a}^{b} - \int_{a}^{b} f'(x)g(x)dx. \] An alternative form which is sometimes useful is: \[ f_n g_n - f_m g_m = \sum_{k=m}^{n-1} f_k \Delta g_k + \sum_{k=m}^{n-1} g_k \Delta f_k + \sum_{k=m}^{n-1} \Delta f_k \Delta g_k. \]

Proof (stochastic dominance implies monotone expectations)

For the ease of notation, let $f_i$ to denote $f(i)$ and define ${\rm M}^1_0 = {\rm M}^2_0 = 0$. Consider the following: \[\begin{align*} \sum_{i=1}^n f_i \mu^1_i &= \sum_{i=1}^n f_i ({\rm M}^1_i - {\rm M}^1_{i-1}) \\ &\stackrel{(a)}= \sum_{i=1}^n {\rm M}^1_{i-1} (f_{i-1} - f_{i}) + f_n {\rm M}^1_n \\ &\stackrel{(b)}{\ge} \sum_{i=1}^n {\rm M}^2_{i-1} (f_{i-1} - f_{i}) + f_n {\rm M}^2_n \\ &\stackrel{(a)}= \sum_{i=1}^n f_i ({\rm M}^2_i - {\rm M}^2_{i-1}) \\ &= \sum_{i=1}^n f_i \mu_i, \end{align*}\] which completes the proof. In the above equations, both steps marked $(a)$ use summation by parts and $(b)$ uses the following facts:

For any $i$, ${\rm M}^1_{i-1} \le {\rm M}^2_{i-1}$ (because of \eqref{eq:cdf}) and $f_{i-1} - f_{i} < 0$ (because $f$ is increasing function). Thus, \[{\rm M}^1_{i-1}(f_{i-1} - f_i) \ge {\rm M}^2_{i-1}(f_{i-1} - f_i). \]
${\rm M}^1_n = {\rm M}^2_n = 1$.

Proof (monotone expectations implies stochastic monotonicity)

Suppose for any increasing function $f$, \eqref{eq:inc-fun} holds. Given any $i \in \{1, \dots, n\}$, define the function $f_i(k) = \IND\{k > i\}$, which is an increasing function of $k$. Then, \[ \EXP[f_i(S)] = \sum_{k=1}^n f_i(k) \mu^1_k = \sum_{k > i} \mu^1_k = 1 - {\rm M}^1_i. \] By a similar argument, we have \[ \EXP[f_i(S^2)] = 1 - {\rm M}^2_i. \] Since $\EXP[f_i(S)] \ge \EXP[f_i(S^2)]$, we have that ${\rm M}^1_i \le {\rm M}^2_i$.

We now state some properties of stochastic monotonicity. See Pomatto et al. (2020) for details.

If $X \succeq_s Y$ and $Z$ is a random variable independent of $X$ and $Y$, then $X + Z \succeq_s Y + Z$.
Given a random variable $Z$, we say that $Z'$ is noisier than $Z$ if there exists an independent random variable $W$ such that $Z' = Z + W$. If $X + Z \succeq_s Y + Z$ for some $Z$ independent of $X$ and $Y$, then $X + Z' \succeq_s Y + Z'$ for any independent $Z'$ noisier than $Z$.
Suppose $X$ and $Y$ are random variables such that $\EXP[X] > \EXP[Y]$. Then, there exists a random variable $Z$, independent of $X$ and $Y$, such that \[ X + Z \succeq_s Y + Z. \]
A real-valued function $C$ defined on the space of random variables with finite moments that satisfies the following properties:
- Certainty: $C(1) = 1$
- Monotonicity: If $X \succeq_s Y$ then $C(X) \ge C(Y)$.
- Additivity: If $X$ and $Y$ are independent then $C(X+Y) = C(X) + C(Y)$
if and only if $C(X) = \EXP[X]$.

8.2 Stochastic monotonicity

Stochastic monotonicity extends the notion of stochastic dominance to Markov chains. Suppose $\ALPHABET S$ is a totally ordered set and $\{S_t\}_{t \ge 1}$ is a time-homogeneous Markov chain on $\ALPHABET S$ with transition probability matrix $P$. Let $P_i$ denote the $i$-th row of $P$. Note that $P_i$ is a PMF.

Stochastic monotonicity

A Markov chain with transition matrix $P$ is stochastically monotone if \[ P_i \succeq_s P_j, \quad \forall i > j. \]

As an example, a birth death Markov chain (without self loops) is stochastically monotone, e.g., for any $p + q = 1$, the following is stochastic monotone \[\MATRIX{ q & p & 0 & 0 & 0 \\ q & 0 & p & 0 & 0 \\ 0 & q & 0 & p & 0 \\ 0 & 0 & q & 0 & p \\ 0 & 0 & 0 & q & p }\]

An immediate implication of the definition of stochastic monotinicity is the following.

Theorem 8.2 Let $\{S_t\}_{t \ge 1}$ be a Markov chain with transition matrix $P$ and $f \colon \ALPHABET S \to \reals$ is a weakly increasing function. Then, for any $s^1, s^2 \in \ALPHABET S$ such that $s^1 > s^2$, \[ \EXP[f(S_{t+1}) | S_t = s^1] \ge \EXP[ f(S_{t+1}) | S_t = s^2], \] if and only if $P$ is stochatically monotone.

8.3 Monotonicity of value functions

Theorem 8.3 Consider an MDP where the state space $\ALPHABET S$ is totally ordered. Suppose the following conditions are satisfied.

C1. For every $a \in \ALPHABET A$, the per-step cost $c_t(s,a)$ is weakly inceasing in $s$.

C2. For every $a \in \ALPHABET A$, the transition matrix $P(a)$ is stochastically monotone.

Then, the value function $V^*_t(s)$ is weakly increasing in $s$.

Note

The result above also applies to models with continuous (and totally ordered) state space provided the measurable selection conditions hold so that the arg min at each step of the dynamic program is attained.

Proof

We proceed by backward induction. By definition, $V^*_{T+1}(s) = 0$, which is weakly increasing. This forms the basis of induction. Assume that $V^*_{t+1}(s)$ is weakly increasing. Now consider, \[Q^*_t(s,a) = c_t(s,a) + \EXP[V^*_{t+1}(S_{t+1}) | S_t = s, A_t = a].\] For any $a \in \ALPHABET A$, $Q^*_t(s,a)$ is a sum of two weakly increasing functions in $s$; hence $Q^*_t(s,a)$ is weakly increasing in $s$.

Now consider $s_1, s_2 \in \ALPHABET S$ such that $s_1 > s_2$. Suppose $a_1^*$ is the optimal action at state $s_1$. Then \[ V^*_t(s^1) = Q^*_t(s^1, a_1^*) \stackrel{(a)}\ge Q^*_t(s^2,a_1^*) \stackrel{(b)}\ge V^*_t(s_2), \] where $(a)$ follows because $Q^*_t(\cdot, u^*)$ is weakly increasing and $(b)$ follows from the definition of the value function.

8.4 Submodularity

Submodularity

Let $\ALPHABET X$ and $\ALPHABET Y$ be partially ordered sets. A function $f \colon \ALPHABET X \times \ALPHABET Y \to \reals$ is called submodular if for any $x^+ \ge x^-$ and $y^+ \ge y^-$, we have \[\begin{equation}\label{eq:submodular} f(x^+, y^+) + f(x^-, y^-) \le f(x^+, y^-) + f(x^-, y^+). \end{equation}\]

The function is called supermodular if the inequality in \eqref{eq:submodular} is reversed.

A continuous and differentiable function on $\reals^2$ is submodular iff \[ \frac{ \partial^2 f(x,y) }{ \partial x \partial y } \le 0, \quad \forall x,y. \] If the inequality is reversed, then the function is supermodular.

Submodularity is a useful property because it implies monotonicity of the arg min.

Theorem 8.4 Let $\ALPHABET X$ be a partially ordered set, $\ALPHABET Y$ be a totally ordered set, and $f \colon \ALPHABET X \times \ALPHABET Y \to \reals$ be a submodular function. Suppose that for all $x$, $\arg \min_{y \in \ALPHABET Y} f(x,y)$ exists. Then, \[ π^*(x) := \max \{ y^* \in \arg \min_{y \in \ALPHABET Y} f(x,y) \} \] is weakly increasing in $x$.

Proof

Consider $x^+, x^- \in \ALPHABET X$ such that $x^+ \ge x^-$. Since $f$ is submodular, for any $y \le π^*(x^-)$, we have \[\begin{equation}\label{eq:1} f(x^+, π^*(x^-)) - f(x^+, y) \le f(x^-, π^*(x^-)) - f(x^-, y) \le 0, \end{equation}\] where the last inequality follows because $π^*(x^-)$ is the arg min of $f(x^-, y)$. Eq. \eqref{eq:1} implies that for all $y \le π^*(x^-)$, \[ f(x^+, π^*(x^-)) \le f(x^+, y). \] Thus, $π^*(x^+) \ge π^*(x^-)$.

The analogue of Theorem 8.4 for supermodular functions is as follows.

Theorem 8.5 Let $\ALPHABET X$ be a partially ordered set, $\ALPHABET Y$ be a totally ordered set, and $f \colon \ALPHABET X \times \ALPHABET Y \to \reals$ be a supermodular function. Suppose that for all $x$, $\arg \min_{y \in \ALPHABET Y} f(x,y)$ exists. Then, \[ π^*(x) := \min \{ y^* \in \arg \min_{y \in \ALPHABET Y} f(x,y) \} \] is weakly decreasing in $x$.

Proof

The proof is similar to Theorem 8.4.

Consider $x^+, x^- \in \ALPHABET X$ such that $x^+ \ge x^-$. Since $f$ is supermodular, for any $y \ge π^*(x^-)$, we have \[\begin{equation}\label{eq:2} f(x^+, y) - f(x^+, π^*(x^-)) \ge f(x^-, y) - f(x^-, π^*(x^-)) \ge 0, \end{equation}\] where the last inequality follows because $π^*(x^-)$ is the arg min of $f(x^-, y)$. Eq. \eqref{eq:2} implies that for all $y \ge π^*(x^-)$, \[ f(x^+, y) \ge f(x^+, π^*(x^-)). \] Thus, $π^*(x^+) \le π^*(x^-)$.

8.5 Monotonicity of optimal policy

Theorem 8.6 Consider an MDP where the state space $\ALPHABET S$ and the action space $\ALPHABET A$ are totally ordered. Suppose that, in addition to (C1) and (C2), the following condition is satisfied.

C3. For any weakly increasing function $v$, \[ c_t(s,a) + \EXP[ v(S_{t+1}) | S_t = s, A_t = a]\] is submodular in $(s,a)$.

Let $π^*_t(s) = \max\{ a^* \in \arg \min_{a \in \ALPHABET A} Q^*_t(s,a) \}$. Then, $π^*(s)$ is weakly increasing in $s$.

Proof

Conditions (C1) and (C2) imply that the value function $V^*_{t+1}(s)$ is weakly increasing. Therefore, condition (C3) implies that $Q^*_t(s,a)$ is submodular in $(s,a)$. Therefore, the arg min is weakly increasing in $x$

It is difficult to verify condition (C3). The following conditions are sufficient for (C3).

Lemma 8.1 Consider an MDP with totally ordered state and action spaces. Suppose

$c_t(s,a)$ is submodular in $(s,a)$.
For all $s' \in \ALPHABET S$, $H(s' | s,a) = 1 - \sum_{z \le s'} P_{sz}(a)$ is submodular in $(s,a)$.

The condition (C3) of Theorem 8.6 holds.

Proof

Consider $s^+, s^- \in \ALPHABET S$ and $a^+, a^- \in \ALPHABET A$ such that $s^+ > s^-$ and $a^+ > a^-$. Define

\[\begin{align*} μ_1(s) &= \tfrac 12 P_{s^- s}(a^-) + \tfrac 12 P_{s^+ s}(a^+), \\ μ_2(s) &= \tfrac 12 P_{s^- s}(a^+) + \tfrac 12 P_{s^+ s}(a^-). \end{align*}\] Since $H(s' | s,a)$ is submodular, we have \[ H(s' | s^+, a^+) + H(s' | s^-, a^-) \le H(s' | s^+, a^-) + H(s' | s^-, a^+) \] or equivalently, \[\sum_{z \le s'} \big[ P_{s^+ z}(a^+) + P_{s^- z}(a^-) \big] \ge \sum_{z \le s'} \big[ P_{s^+ z}(a^-) + P_{s^- z}(a^+) \big]. \] which implies \[ M_1(s') \ge M_2(s')\] where $M_1$ and $M_2$ are the CDFs of $μ_1$ and $μ_2$. Thus, $μ_1 \preceq_s μ_2$.

Hence, for any weakly increasing function $v \colon \ALPHABET S \to \reals$, \[ \sum_{s' \in \ALPHABET S} μ_1(s') v(s') \le \sum_{s' \in \ALPHABET S} μ_2(s') v(s').\] Or, equivalently, \[H(s^+, a^+) + H(s^-, a^-) \le H(s^-, a^+) + H(s^+, a^-)\] where $H(s,a) = \EXP[ v(S_{t+1}) | S_t = s, A_t = a]$.

Therefore, $c_t(s,a) + H_t(s,a)$ is submodular in $(s,a)$.

The analogue of Theorem 8.6 for supermodular functions is as follows.

Theorem 8.7 Consider an MDP where the state space $\ALPHABET S$ and the action space $\ALPHABET A$ are totally ordered. Suppose that, in addition to (C1) and (C2), the following condition is satisfied.

C4. For any weakly increasing function $v$, \[ c_t(s,a) + \EXP[ v(S_{t+1}) | S_t = s, A_t = a]\] is supermodular in $(s,a)$.

Let $π^*_t(s) = \min\{ a^* \in \arg \min_{a \in \ALPHABET S} Q^*_t(s,a) \}$. Then, $π^*(s)$ is weakly decreasing in $s$.

Proof

Conditions (C1) and (C2) imply that the value function $V^*_{t+1}(s)$ is weakly increasing. Therefore, condition (C4) implies that $Q^*_t(s,a)$ is supermodular in $(s,a)$. Therefore, the arg min is decreasing in $s$

It is difficult to verify condition (C4). The following conditions are sufficient for (C4).

Lemma 8.2 Consider an MDP with totally ordered state and action spaces. Suppose

$c_t(s,a)$ is supermodular in $(s,a)$.
For all $s' \in \ALPHABET S$, $H(s' | s,a) = 1 - \sum_{z \le s'} P_{sz}(a)$ is supermodular in $(s,a)$.

The condition (C4) of Theorem 8.7 holds.

Proof

Consider $s^+, s^- \in \ALPHABET S$ and $a^+, a^- \in \ALPHABET A$ such that $s^+ > s^-$ and $a^+ > a^-$. Define

\[\begin{align*} μ_1(s) &= \tfrac 12 P_{s^- s}(a^-) + \tfrac 12 P_{s^+ s}(a^+), \\ μ_2(s) &= \tfrac 12 P_{s^- s}(a^+) + \tfrac 12 P_{s^+ s}(a^-). \end{align*}\] Since $H(s' | s,a)$ is supermodular, we have \[ H(s' | s^+, a^+) + H(s' | s^-, a^-) \ge H(s' | s^+, a^-) + H(s' | s^-, a^+) \] or equivalently, \[\sum_{s' \le s'} \big[ P_{s^+ s'}(a^+) + P_{s^- s'}(a^-) \big] \le \sum_{s' \le s'} \big[ P_{s^+ s'}(a^-) + P_{s^- s'}(a^+) \big]. \] which implies \[ M_1(s') \le M_2(s')\] where $M_1$ and $M_2$ are the CDFs of $μ_1$ and $μ_2$. Thus, $μ_1 \succeq_s μ_2$.

Hence, for any weakly increasing function $v \colon \ALPHABET S \to \reals$, \[ \sum_{s' \in \ALPHABET S} μ_1(s') v(s') \ge \sum_{s' \in \ALPHABET S} μ_2(s') v(s').\] Or, equivalently, \[H(s^+, a^+) + H(s^-, a^-) \ge H(s^-, a^+) + H(s^+, a^-)\] where $H(s,a) = \EXP[ v(S_{t+1}) | S_t = s, A_t = a]$.

Therefore, $c_t(s,a) + H_t(s,a)$ is supermodular in $(s,a)$.

8.6 Constraints on actions

In the results above, we have assumed that the action set $\ALPHABET A$ is the same for all states. The results also extend to the case when the action at state $s$ must belong to some set $\ALPHABET A(s)$ provided the following conditions are satisfied:

For any $s \ge s'$, $\ALPHABET A(s) \supseteq \ALPHABET A(s')$
For any $s \in \ALPHABET S$ and $a \in \ALPHABET A(s)$, $a' < a$ implies that $a' \in \ALPHABET A(s)$.

8.7 Monotone dynamic programming

If we can establish that the optimal policy is monontone, then we can use this structure to implement the dynamic program more efficient. Suppose $\ALPHABET S = \{1, \dots, n\}$ and $\ALPHABET A = \{1, \dots. m\}$. The main idea is as follows. Suppose $V^*_{t+1}(\cdot)$ has been caclulated. Insead of computing $Q^*_t(s,a)$ and $V^*_t(s)$, proceed as follows:

Set $s = 1$ and $α = 1$.
For all $u \in \{α, \dots, m\}$, compute $Q^*_t(s,a)$ as usual.
Compute

\[V^*_t(s) = \min_{ α \le a \le m } Q^*_t(s,a)\]

and set

\[π^*_t(s) = \max \{ a \in \{α, \dots, m\} : V^*_t(s) = Q^*_t(s,a) \}.\]
If $s = n$, then stop. Otherwise, set $α = π^*_t(s)$ and $s = s+1$ and go to step 2.

8.8 Example: A machine replacement model

Let’s revisit the machine replacement problem of Exercise 5.4. For simplicity, we’ll assume that $n = ∞$, i.e., the state space is countable. In this case, the transition matrices are given by \[ P_{sz}(0) = \begin{cases} 0, & z < s \\ μ_{z - s}, & z \ge s \end{cases} \quad\text{and}\quad P_sz(1) = μ_z. \] where $μ$ is the PMF of $W$.

Proposition 8.1 For the machine replacement problem, there exist a series of thresholds $\{s^*_t\}_{t = 1}^T$ such that the optimal policy at time $t$ is a threshold policy with threshold $s_t$, i.e., \[ π^*_t(s) = \begin{cases} 0 & \text{if $s < s_t^*$} \\ 1 & \text{otherwise} \end{cases}\]

Proof

We prove the result by verifying conditions (C1)–(C4) to establish that the optimal policy is monotone.

C1. For $a = 0$, $c(s,0) = h(s)$, which is weakly increasing by assumption. For $a = 1$, $c(s,1) = K$, which is trivially weakly increasing.

C2. For $a = 0$, $P(0)$ is stochastically monotone (because the CDF of $P(\cdot | s, 0)$ lies above the CDF of $P(\cdot | s+1, 0)$). For $a = 1$, all rows of $P(1)$ are the same; therefore $P(1)$ is stochastically monotone.

Since (C1) and (C2) are satisfied, by Theorem 8.3, we can assert that the value function is weakly increasing.

C3. $c(s,1) - c(s,0) = K - h(s)$, which is weakly decreasing in $s$. Therefore, $c(s,a)$ is submodular in $(s,a)$.

C4. Recall that $H(s'|s,a) = 1 - \sum_{z \le s'} P_{sz}(a).$ Therefore,

\[H(s'|s,0) = 1 - \sum_{z = s}^{s'} μ_{z -s} = 1 - \sum_{k = 0}^{s' - s} μ_k = 1 - M_{s' - s},\] where $M$ is the CMF of $μ$, and \[H(s'|s,1) = 1 - \sum_{z \le s'} μ_z = 1 - M_{s'},\]

Therefore, $H(s'|s,1) - H(s'|s,0) = M_{s'-s} - M_{s'}$. For any fixed $s'$, $H(s'|s,1) - H(s'|s,0)$ is weakly decreasing in $s$. There $H(s'|s,a)$ is submodular in $(s,a)$.

Since (C1)–(C4) are satisfied, the optimal policy is weakly increasing in~$s$. Since there are only two actions, it means that for every time, there exists a state $s^*_t$ with the property that if $s$ exceeds $s^*_t$, the optimal decision is to replace the machine; and if $s \le s^*_t$, then the optimal decision is to operate the machine for another period.

Exercises

Exercise 8.1 Let $T$ denote a upper triangular matrix with 1’s on or below the diagonal and 0’s above the diagonal. Then \[ T^{-1}_{ij} = \begin{cases} 1, & \text{if } i = j, \\ -1, & \text{if } i = j - 1, \\ 0, & \text{otherwise}. \end{cases}\]

For example, for a $4 \times 4$ matrix \[ T = \MATRIX{1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1}, \quad T^{-1} = \MATRIX{1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 1 }. \]

Show the following:

For any two PMFs $μ^1$ and $μ^2$, $\mu^1 \succeq_s \mu^2$ iff $\mu^1 T \ge \mu^2 T$.
A Markov transition matrix $P$ is stochastic monotone iff $T^{-1} P T \ge 0$.

Exercise 8.2 Show that the following are equivalent:

A transition matrix $P$ is stochastically monotone
For any two PMFs $μ^1$ and $μ^2$, if $\mu^1 \succeq_s \mu^2$ then $\mu^1P \succeq_s \mu^2P$.

Exercise 8.3 Show that if two transition matrices $P$ and $Q$ have the same dimensions and are stochastically monotone, then so are:

$\lambda P + (1 - \lambda) Q$, where $\lambda \in (0,1)$.
$P Q$
$P^k$, for $k \in \integers_{> 0}$.

Exercise 8.4 Let $\mu_t$ denote the distribution of a Markov chain at time $t$. Suppose $\mu_0 \succeq_s \mu_1$. Then $\mu_t \succeq_s \mu_{t+1}$.

Exercise 8.5 (Testing submodularity for functions defined on integers) Suppose a function $f \colon \integers \times \integers \to \reals$ satisfies \[ f(x+1, y+1) - f(x+1, y) \le f(x, y+1) - f(x, y) \] for all $x, y \in \integers$. Then, show that $f$ is a submodular function.

Exercise 8.6 Show that sum of submodular functions is submodular. Is the product of submodular functions submodular?

Exercise 8.7 Consider the example of machine repair presented in Example 5.2. Prove that the optimal policy for that model is weakly increasing.

Exercise 8.8 Suppose the state space $\ALPHABET S$ is a symmetric subset of integers of the form $\{-L, -L + 1, \dots, L-1, L\}$ and the action space $\ALPHABET A$ is discrete. Let $\ALPHABET S_{\ge 0}$ denote the set $\{0, \dots, L\}$.

Let $P(a)$ denote the controlled transition matrix and $c_t(s,a)$ denote the per-step cost. To avoid ambiguity, we define the optimal policy as \[ π^*_t(s) = \begin{cases} \max\bigl\{ a' \in \arg\min_{a \in \ALPHABET A} Q^*_t(s,a) \bigr\}, & \text{if } s \ge 0 \\ \min\bigl\{ a' \in \arg\min_{a \in \ALPHABET A} Q^*_t(s,a) \bigr\}, & \text{if } s < 0 \end{cases}\] The purpose of this exercise is to identify conditions under which the value function and the optimal policy are even and :quasi-convex. We do so using the following steps.

We say that the transition probability matrix $P(a)$ is even if for all $s, s' \in \ALPHABET S$, $P(s'|s,a) = P(-s'|-s,a)$. Prove the following result.

Proposition 8.2 Suppose the MDP satisfies the following properties:

(A1) For every $t$ and $a \in \ALPHABET A$, $c_t(s,a)$ is even function of $s$.

(A2) For every $a \in \ALPHABET A$, $P(a)$ is even.

Then, for all $t$, $V^*_t$ and $π^*_t$ are even functions.

Given any probability mass function $μ$ on $\ALPHABET S$, define the folded probability mass function $\tilde μ$ on $\ALPHABET S_{\ge 0}$ as follows: \[ \tilde μ(s) = \begin{cases} μ(0), & \text{if } s = 0 \\ μ(s) + μ(-s), & \text{if } s > 0. \end{cases} \]

For ease of notation, we use $\tilde μ = \mathcal F μ$ to denote this folding operation. Note that an immediate consequence of the definition is the following (you don’t have to prove this).

Lemma 8.3 If $f \colon \ALPHABET S \to \reals$ is even, then for any probability mass function $μ$ on $\ALPHABET S$ and $\tilde μ = \mathcal F μ$, we have \[ \sum_{s \in \ALPHABET S} f(s) μ(s) = \sum_{s \in \ALPHABET S_{\ge 0}} f(s) \tilde μ(s). \]

Thus, the expectation of the function $f \colon \ALPHABET S \to \reals$ with respect to the PMF $μ$ is equal to the expectation of the function $f \colon \ALPHABET S_{\ge 0} \to \reals$ with respect to the PMF $\tilde μ = \mathcal F μ$.

Now given any probability transition matrix $P$ on $\ALPHABET S$, we can define a probability transition matrix $\tilde P$ on $\ALPHABET S_{\ge 0}$ as follows: for any $s \in \ALPHABET S$, $\tilde P_s = \mathcal F P_s$, where $P_s$ denotes the $s$-th row of $P$. For ease of notation, we use $\tilde P = \mathcal F P$ to denote this relationship.

Now prove the following:

Proposition 8.3 Given the MDP $(\ALPHABET S, \ALPHABET A, P, \{c_t\})$, define the folded MDP as $(\ALPHABET S_{\ge 0}, \ALPHABET A, \tilde P, \{c_t\})$, where $\tilde P(a) = \mathcal F P(a)$ for all $a \in \ALPHABET A$. Let $\tilde Q^*_t \colon \ALPHABET S_{\ge 0} \times \ALPHABET A \to \reals$, $\tilde V^*_t \colon \ALPHABET S_{\ge 0} \to \reals$ and $\tilde π_t^* \colon \ALPHABET S_{\ge 0} \to \ALPHABET A$ denote the action-value function, value function and the policy of the folded MDP. Then, if the original MDP satisfies conditions (A1) and (A2) then, for any $s \in \ALPHABET S$ and $a \in \ALPHABET A$, \[ Q^*_t(s,a) = \tilde Q^*_t(|s|, a), \quad V^*_t(s) = \tilde V^*_t(|s|), \quad π_t^*(s) = \tilde π_t^*(|s|). \]

The result of the previous part implies that if the value function $\tilde V^*_t$ and the policy $\tilde π^*_t$ are monotone increasing, then the value function $V^*_t$ and the policy $π^*_t$ are even and quasi-convex. This gives us a method to verify if the value function and optimal policy are even and quasi-convex.

Now, recall the model of the Internet of Things presented in Exercise 5.3. The numerical experiments done in that exercise suggest that the value function and the optimal policy are even and quasi-convex. Prove that this is indeed the case.
Now suppose the distribution of $W_t$ is not Gaussian but is some general probability density $\varphi(\cdot)$ and the cost function is \[ c(e,a) = \lambda a + (1 - a) d(e). \] Find conditions on $\varphi$ and $d$ such that the value function and optimal policy are even and quasi-convex.

Notes

Stochastic dominance has been employed in various areas of economics, finance, and statistics since the 1930s. See Levy (1992) and Levy (2015) for detailed overviews. The notion of stochastic monotonicity for Markov chains is due to Daley (1968). For a generalization of stochastic monotonicity to continuous state spaces, see Serfozo (1976). The characterization of stochastic monotonicity in Exercise 8.1–Exercise 8.4 are due to Keilson and Kester (1977).

Ross (1974) has an early treatment of monotonicity of optimal policies. The general theory was developed by Topkis (1998). An alternative treatment for queueing models is presented in Koole (2006). The presentation here follows Puterman (2014).

The properties here are derived for finite horizon models. General conditions under which such properties extend to infinite horizon models are presented in Smith and McCardle (2002).

There are many recent papers which leverage the structural properties of value functions and optimal policy in reinforcement learning. For example Kunnumkal and Topaloglu (2008) and Fu and Schaar (2012) present variants of Q-learning which exploit properties of the value function and Roy et al. (2022) present a variant of policy-learning which exploits properties of the optimal policy.

Exercise 8.8 is from Chakravorty and Mahajan (2018). The idea of folded MDPs has also been used in Dutta and Singh (2024) for risk-sensitive version of the problem.

Chakravorty, J. and Mahajan, A. 2018. Sufficient conditions for the value function and optimal strategy to be even and quasi-convex. IEEE Transactions on Automatic Control 63, 11, 3858–3864. DOI: 10.1109/TAC.2018.2800796.

Daley, D.J. 1968. Stochastically monotone markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 10, 4, 305–317. DOI: 10.1007/BF00531852.

Dutta, M. and Singh, R. 2024. Optimal risk-sensitive scheduling policies for remote estimation of autoregressive markov processes. Available at: http://arxiv.org/pdf/2403.13898v1.

Fu, F. and Schaar, M. van der. 2012. Structure-aware stochastic control for transmission scheduling. IEEE Transactions on Vehicular Technology 61, 9, 3931–3945. DOI: 10.1109/tvt.2012.2213850.

Keilson, J. and Kester, A. 1977. Monotone matrices and monotone markov processes. Stochastic Processes and their Applications 5, 3, 231–241.

Koole, G. 2006. Monotonicity in markov reward and decision chains: Theory and applications. Foundations and Trends in Stochastic Systems 1, 1, 1–76. DOI: 10.1561/0900000002.

Kunnumkal, S. and Topaloglu, H. 2008. Exploiting the structural properties of the underlying markov decision problem in the q-learning algorithm. INFORMS Journal on Computing 20, 2, 288–301. DOI: 10.1287/ijoc.1070.0240.

Levy, H. 1992. Stochastic dominance and expected utility: Survey and analysis. Management Science 38, 4, 555–593. DOI: 10.1287/mnsc.38.4.555.

Levy, H. 2015. Stochastic dominance: Investment decision making under uncertainty. Springer. DOI: 10.1007/978-3-319-21708-6.

Pomatto, L., Strack, P., and Tamuz, O. 2020. Stochastic dominance under independent noise. Journal of Political Economy 128, 5, 1877–1900. DOI: 10.1086/705555.

Puterman, M.L. 2014. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons. DOI: 10.1002/9780470316887.

Ross, S.M. 1974. Dynamic programming and gambling models. Advances in Applied Probability 6, 3, 593–606. DOI: 10.2307/1426236.

Roy, A., Borkar, V., Karandikar, A., and Chaporkar, P. 2022. Online reinforcement learning of optimal threshold policies for Markov decision processes. IEEE Transactions on Automatic Control 67, 7, 3722–3729. DOI: 10.1109/tac.2021.3108121.

Serfozo, R.F. 1976. Monotone optimal policies for markov decision processes. In: Mathematical programming studies. Springer Berlin Heidelberg, 202–215. DOI: 10.1007/bfb0120752.

Smith, J.E. and McCardle, K.F. 2002. Structural properties of stochastic dynamic programs. Operations Research 50, 5, 796–809. DOI: 10.1287/opre.50.5.796.365.

Topkis, D.M. 1998. Supermodularity and complementarity. Princeton University Press.