T = 20
n = 10
c = 2
K = 20
h = function(s) { return c*s }
Pw = {
// PMF and CDF of W ~ Binom(n, p); n is defined above and p is set
// elsewhere in the notebook (e.g., by a slider cell).
var points = new Array(n+1)
var cumulative = 0
var probability = 0
for(var k = 0; k <= n; k++) {
probability = binomial(n,k,p)
cumulative += probability
points[k] = { probability: probability, cumulative: cumulative }
}
return points
}
binomial = {
// binomial(n, k, p): PMF of Binom(n, p) at k.
// Log-factorial table from https://stackoverflow.com/a/37715980/193149
const logf = [0, 0, 0.6931471805599453, 1.791759469228055, 3.1780538303479458, 4.787491742782046, 6.579251212010101, 8.525161361065415, 10.60460290274525, 12.801827480081469, 15.104412573075516, 17.502307845873887, 19.987214495661885, 22.552163853123425, 25.19122118273868, 27.89927138384089, 30.671860106080672, 33.50507345013689, 36.39544520803305, 39.339884187199495, 42.335616460753485, 45.38013889847691, 48.47118135183523, 51.60667556776438, 54.78472939811232, 58.00360522298052, 61.261701761002, 64.55753862700634, 67.88974313718154, 71.25703896716801, 74.65823634883016, 78.0922235533153, 81.55795945611504, 85.05446701758152, 88.58082754219768, 92.1361756036871, 95.7196945421432, 99.33061245478743, 102.96819861451381, 106.63176026064346, 110.32063971475739, 114.0342117814617, 117.77188139974507, 121.53308151543864, 125.3172711493569, 129.12393363912722, 132.95257503561632, 136.80272263732635, 140.67392364823425, 144.5657439463449, 148.47776695177302, 152.40959258449735, 156.3608363030788, 160.3311282166309, 164.32011226319517, 168.32744544842765, 172.3527971391628, 176.39584840699735, 180.45629141754378, 184.53382886144948, 188.6281734236716, 192.7390472878449, 196.86618167289, 201.00931639928152, 205.1681994826412, 209.34258675253685, 213.53224149456327, 217.73693411395422, 221.95644181913033, 226.1905483237276, 230.43904356577696, 234.70172344281826, 238.97838956183432, 243.2688490029827, 247.57291409618688, 251.8904022097232, 256.22113555000954, 260.5649409718632, 264.9216497985528, 269.2910976510198, 273.6731242856937, 278.0675734403661, 282.4742926876304, 286.893133295427, 291.3239500942703, 295.76660135076065, 300.22094864701415, 304.6868567656687, 309.1641935801469, 313.65282994987905, 318.1526396202093, 322.66349912672615, 327.1852877037752, 331.7178871969285, 336.26118197919845, 340.815058870799, 345.37940706226686, 349.95411804077025, 354.5390855194408, 359.1342053695754, 363.73937555556347]
return function(n, k, p) {
// C(n,k) computed as exp(log n! - log (n-k)! - log k!) to avoid overflow
return Math.exp(logf[n] - logf[n-k] - logf[k]) * p**k * (1-p)**(n-k)
}
}
DP = {
var DP = new Array()
var idx = 0
var V = new Array(n+2)   // indexed as V[1+s] for s = 0, ..., n
var Q0 = new Array(n+2)
var Q1 = new Array(n+2)
var a = 0
var val = 0
// Initialize the terminal value function
for (var s = 0; s <= n; s++) {
V[1+s] = 0
}
// Dynamic Programming
for (var t = T; t >= 1; t--) {
// Q0[s] = h(s) + E[ V(s+W) ]   (continue operating)
// Q1[s] = K + E[ V(W) ]        (replace)
for (var s = 0; s <= n; s++) {
Q0[1+s] = h(s)
Q1[1+s] = K
for (var w=0; w <= n; w++) {
var s_next = Math.min(s+w, n)
Q0[1+s] += V[ 1 + s_next ]*Pw[w].probability
Q1[1+s] += V[ 1 + w ]*Pw[w].probability
}
}
for (var s = 0; s <= n; s++) {
if (Q0[1+s] <= Q1[1+s]) {
a = 0
val = Q0[1+s]
} else {
a = 1
val = Q1[1+s]
}
DP[idx++] = { time: t, state: s, value: val, action: a }
V[1+s] = val
}
}
return DP;
}
7 Monotonicity of value function and optimal policies
Consider a manufacturing process, where the machine used for manufacturing deteriorates over time. Let \(\ALPHABET S = \{0, 1, \dots, n\}\) represent the condition of the machine. The higher the value of \(s\), the worse the condition of the equipment.
A decision maker observes the state of the machine and has two options: continue operating the machine or replace it with a new and identical piece of equipment. Operating the machine in state \(s\) costs \(h(s)\), where \(h(⋅)\) is a weakly increasing function; replacing the machine costs a constant amount \(K\).
When the machine is operated, its state deteriorates according to \[ S_{t+1} = \min( S_t + W_t , n) \] where \(\{W_t\}_{t \ge 1}\) is an i.i.d. process with PMF \(μ\).
We model the above system as an MDP with state space \(\ALPHABET S\) and action space \(\ALPHABET A = \{0, 1\}\), where \(0\) means operating the machine and \(1\) means replacing the machine.
We compute the optimal policy for this model when \(W \sim \text{Binom}(n,p)\), with \(n = 10\), \(T = 20\), \(h(s) = 2s\), and \(K=20\), for different values of \(p\).
7.1 Stochastic dominance
Let \(\ALPHABET S\) be a totally ordered finite set, say \(\{1, \dots, n\}\).
Stochastic dominance is a partial order on random variables defined on totally ordered sets. Let \(S^1 \sim \mu^1\) and \(S^2 \sim \mu^2\) be random variables defined on \(\ALPHABET S\). We say that \(S^1\) stochastically dominates \(S^2\), written \(S^1 \succeq_s S^2\), if \[\begin{equation}\label{eq:inc-prob} \sum_{z \ge s} \mu^1_z \ge \sum_{z \ge s} \mu^2_z, \quad \forall s \in \ALPHABET S. \end{equation}\]
Let \({\rm M}^1\) and \({\rm M}^2\) denote the CDFs of \(\mu^1\) and \(\mu^2\). Then \eqref{eq:inc-prob} is equivalent to the following: \[\begin{equation}\label{eq:cdf} {\rm M}^1_s \le {\rm M}^2_s, \quad \forall s \in \ALPHABET S. \end{equation}\] Thus, visually, \(S^1 \succeq_s S^2\) means that the CDF of \(S^1\) lies below the CDF of \(S^2\).
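The CDF characterization \eqref{eq:cdf} is easy to check numerically. Below is a small helper, written as plain JavaScript rather than as a notebook cell (the function name `dominates` is ours), that compares two PMFs given as arrays indexed by state:

```javascript
// Check whether mu1 stochastically dominates mu2 (mu1 >=_s mu2) by
// comparing CDFs pointwise: M1_s <= M2_s for every state s.
function dominates(mu1, mu2) {
  let M1 = 0, M2 = 0
  for (let s = 0; s < mu1.length; s++) {
    M1 += mu1[s]
    M2 += mu2[s]
    if (M1 > M2 + 1e-12) return false  // tolerance for floating-point round-off
  }
  return true
}
```

For example, a PMF that pushes mass toward higher states dominates one that concentrates mass on lower states.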
Stochastic dominance is important due to the following property.
Theorem 7.1 Let \(S^1 \sim \mu^1\) and \(S^2 \sim \mu^2\) be random variables defined on \(\ALPHABET S\). Then \(S^1 \succeq_s S^2\) if and only if, for every (weakly) increasing function \(f \colon \ALPHABET S \to \reals\), \[\begin{equation}\label{eq:inc-fun} \EXP[f(S^1)] \ge \EXP[f(S^2)]. \end{equation}\]
7.2 Stochastic monotonicity
Stochastic monotonicity extends the notion of stochastic dominance to Markov chains. Suppose \(\ALPHABET S\) is a totally ordered set and \(\{S_t\}_{t \ge 1}\) is a time-homogeneous Markov chain on \(\ALPHABET S\) with transition probability matrix \(P\). Let \(P_i\) denote the \(i\)-th row of \(P\). Note that \(P_i\) is a PMF. The transition matrix \(P\) is called stochastically monotone if \(P_i \succeq_s P_j\) whenever \(i > j\).
An immediate implication is the following.
Theorem 7.2 Let \(\{S_t\}_{t \ge 1}\) be a Markov chain with transition matrix \(P\). Then \(P\) is stochastically monotone if and only if, for every weakly increasing function \(f \colon \ALPHABET S \to \reals\) and all \(s^1, s^2 \in \ALPHABET S\) such that \(s^1 > s^2\), \[ \EXP[f(S_{t+1}) | S_t = s^1] \ge \EXP[ f(S_{t+1}) | S_t = s^2]. \]
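Stochastic monotonicity of a finite transition matrix can be verified row by row. A plain JavaScript sketch (the helper name is ours); since \(\succeq_s\) is transitive and the rows are totally ordered by their index, it suffices to compare adjacent rows:

```javascript
// P is stochastically monotone if P_i >=_s P_j for all i > j.
// By transitivity it is enough that each row dominates the previous one,
// i.e., the row-wise CDFs decrease pointwise as the row index grows.
function isStochasticallyMonotone(P) {
  for (let i = 1; i < P.length; i++) {
    let Mprev = 0, Mcur = 0
    for (let s = 0; s < P[i].length; s++) {
      Mprev += P[i - 1][s]
      Mcur += P[i][s]
      if (Mcur > Mprev + 1e-12) return false  // CDF of row i must lie below row i-1
    }
  }
  return true
}
```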
7.3 Monotonicity of value functions
Theorem 7.3 Consider an MDP where the state space \(\ALPHABET S\) is totally ordered. Suppose the following conditions are satisfied.
C1. For every \(a \in \ALPHABET A\), the per-step cost \(c_t(s,a)\) is weakly increasing in \(s\).
C2. For every \(a \in \ALPHABET A\), the transition matrix \(P(a)\) is stochastically monotone.
Then, the value function \(V_t(s)\) is weakly increasing in \(s\).
7.4 Submodularity
A twice continuously differentiable function \(f\) on \(\reals^2\) is submodular iff \[ \frac{ \partial^2 f(x,y) }{ \partial x \partial y } \le 0, \quad \forall x,y. \] If the inequality is reversed, then the function is supermodular.
Submodularity is a useful property because it implies monotonicity of the arg min.
Theorem 7.4 Let \(\ALPHABET X\) be a partially ordered set, \(\ALPHABET Y\) be a totally ordered set, and \(f \colon \ALPHABET X \times \ALPHABET Y \to \reals\) be a submodular function. Suppose that for all \(x\), \(\arg \min_{y \in \ALPHABET Y} f(x,y)\) exists. Then, \[ π(x) := \max \{ y^* \in \arg \min_{y \in \ALPHABET Y} f(x,y) \} \] is weakly increasing in \(x\).
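Theorem 7.4 can be illustrated numerically on a finite grid. The sketch below (plain JavaScript; the helper names are ours) computes the largest minimizer of \(f(x, \cdot)\) for the submodular function \(f(x,y) = (x-y)^2\), whose largest minimizer is weakly increasing in \(x\):

```javascript
// Largest minimizer of f(x, .) over y in {0, ..., m}.
function largestArgmin(f, x, m) {
  let best = Infinity, arg = 0
  for (let y = 0; y <= m; y++) {
    const v = f(x, y)
    if (v <= best) { best = v; arg = y }  // '<=' keeps the largest minimizer
  }
  return arg
}

// f(x,y) = (x - y)^2 is submodular: its cross partial derivative is -2 <= 0.
const f = (x, y) => (x - y) ** 2
```

Here \(\arg\min_y f(x,y) = x\) (clamped to the grid), which is indeed weakly increasing in \(x\).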
The analogue of Theorem 7.4 for supermodular functions is as follows.
Theorem 7.5 Let \(\ALPHABET X\) be a partially ordered set, \(\ALPHABET Y\) be a totally ordered set, and \(f \colon \ALPHABET X \times \ALPHABET Y \to \reals\) be a supermodular function. Suppose that for all \(x\), \(\arg \min_{y \in \ALPHABET Y} f(x,y)\) exists. Then, \[ π(x) := \min \{ y^* \in \arg \min_{y \in \ALPHABET Y} f(x,y) \} \] is weakly decreasing in \(x\).
7.5 Monotonicity of optimal policy
Theorem 7.6 Consider an MDP where the state space \(\ALPHABET S\) and the action space \(\ALPHABET A\) are totally ordered. Suppose that, in addition to (C1) and (C2), the following condition is satisfied.
C3. For any weakly increasing function \(v\), \[ c_t(s,a) + \EXP[ v(S_{t+1}) | S_t = s, A_t = a]\] is submodular in \((s,a)\).
Let \(π^*_t(s) = \max\{ a^* \in \arg \min_{a \in \ALPHABET A} Q_t(s,a) \}\). Then, \(π^*_t(s)\) is weakly increasing in \(s\).
It is difficult to verify condition (C3). The following conditions are sufficient for (C3).
Lemma 7.1 Consider an MDP with totally ordered state and action spaces. Suppose
- \(c_t(s,a)\) is submodular in \((s,a)\).
- For all \(s' \in \ALPHABET S\), \(H(s' | s,a) = 1 - \sum_{z \le s'} P_{sz}(a)\) is submodular in \((s,a)\).
Then condition (C3) of the previous theorem holds.
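On a finite grid, submodularity of a function \(f(s,a)\) reduces to the difference condition \(f(s+1,a+1) - f(s+1,a) \le f(s,a+1) - f(s,a)\) for all \(s,a\). The brute-force checker below (a plain JavaScript sketch with our own helper name) can be used to test the two conditions of Lemma 7.1 numerically:

```javascript
// Brute-force discrete submodularity check on {0,...,nS-1} x {0,...,nA-1}:
// all cross differences must be non-positive.
function isSubmodular(f, nS, nA) {
  for (let s = 0; s + 1 < nS; s++) {
    for (let a = 0; a + 1 < nA; a++) {
      if (f(s + 1, a + 1) - f(s + 1, a) > f(s, a + 1) - f(s, a) + 1e-12) {
        return false
      }
    }
  }
  return true
}
```

Supermodularity can be checked with the same routine by negating \(f\).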
The analogue of Theorem 7.6 for supermodular functions is as follows.
Theorem 7.7 Consider an MDP where the state space \(\ALPHABET S\) and the action space \(\ALPHABET A\) are totally ordered. Suppose that, in addition to (C1) and (C2), the following condition is satisfied.
C4. For any weakly increasing function \(v\), \[ c_t(s,a) + \EXP[ v(S_{t+1}) | S_t = s, A_t = a]\] is supermodular in \((s,a)\).
Let \(π^*_t(s) = \min\{ a^* \in \arg \min_{a \in \ALPHABET A} Q_t(s,a) \}\). Then, \(π^*_t(s)\) is weakly decreasing in \(s\).
It is difficult to verify condition (C4). The following conditions are sufficient for (C4).
Lemma 7.2 Consider an MDP with totally ordered state and action spaces. Suppose
- \(c_t(s,a)\) is supermodular in \((s,a)\).
- For all \(s' \in \ALPHABET S\), \(H(s' | s,a) = 1 - \sum_{z \le s'} P_{sz}(a)\) is supermodular in \((s,a)\).
Then condition (C4) of the previous theorem holds.
7.6 Constraints on actions
In the results above, we have assumed that the action set \(\ALPHABET A\) is the same for all states. The results also extend to the case when the action at state \(s\) must belong to some set \(\ALPHABET A(s)\) provided the following conditions are satisfied:
- For any \(s \ge s'\), \(\ALPHABET A(s) \supseteq \ALPHABET A(s')\)
- For any \(s \in \ALPHABET S\) and \(a \in \ALPHABET A(s)\), \(a' < a\) implies that \(a' \in \ALPHABET A(s)\).
7.7 Monotone dynamic programming
If we can establish that the optimal policy is monotone, then we can use this structure to implement the dynamic program more efficiently. Suppose \(\ALPHABET S = \{1, \dots, n\}\) and \(\ALPHABET A = \{1, \dots, m\}\). The main idea is as follows. Suppose \(V_{t+1}(\cdot)\) has been calculated. Instead of computing \(Q_t(s,a)\) for every state-action pair and then \(V_t(s)\), proceed as follows:
1. Set \(s = 1\) and \(α = 1\).
2. For all \(a \in \{α, \dots, m\}\), compute \(Q_t(s,a)\) as usual.
3. Compute \[V_t(s) = \min_{ α \le a \le m } Q_t(s,a)\] and set \[π_t^*(s) = \max \{ a \in \{α, \dots, m\} : V_t(s) = Q_t(s,a) \}.\]
4. If \(s = n\), then stop. Otherwise, set \(α = π_t^*(s)\) and \(s = s+1\) and go to step 2.
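The steps above can be sketched in plain JavaScript (this is not part of the notebook above; the names `monotoneStage` and `Q` are ours, and `Q(s, a)` is assumed to evaluate the stage-\(t\) action-value):

```javascript
// One stage of monotone dynamic programming. States are {1,...,n} and
// actions are {1,...,m}; the search for the minimizing action at state s
// starts at alpha, the optimal action at state s-1.
function monotoneStage(Q, n, m) {
  const V = new Array(n + 1), pi = new Array(n + 1)
  let alpha = 1
  for (let s = 1; s <= n; s++) {
    let best = Infinity, arg = alpha
    for (let a = alpha; a <= m; a++) {  // actions below alpha are never checked
      const q = Q(s, a)
      if (q <= best) { best = q; arg = a }  // '<=' keeps the largest minimizer
    }
    V[s] = best
    pi[s] = arg
    alpha = arg  // the policy is monotone, so the next search starts here
  }
  return { V: V, pi: pi }
}
```

In the worst case this is no faster than the standard recursion, but when the policy increases quickly the inner loop shrinks accordingly.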
7.8 Example: A machine replacement model
Let’s revisit the machine replacement problem presented at the beginning of this section. For simplicity, we’ll assume that \(n = ∞\), i.e., the state space is countable. In this case, the transition matrices are given by \[ P_{sz}(0) = \begin{cases} 0, & z < s \\ μ_{z - s}, & z \ge s \end{cases} \quad\text{and}\quad P_{sz}(1) = μ_z, \] where \(μ\) is the PMF of \(W\).
Proposition 7.1 For the machine replacement problem, there exists a sequence of thresholds \(\{s^*_t\}_{t = 1}^T\) such that the optimal policy at time \(t\) is a threshold policy with threshold \(s^*_t\), i.e., \[ π_t(s) = \begin{cases} 0 & \text{if $s < s_t^*$} \\ 1 & \text{otherwise.} \end{cases}\]
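Given the output of the DP cell at the top of these notes, the thresholds can be read off directly. A sketch (the helper name is ours; the records are assumed to have the `{ time, state, action }` shape produced by that cell):

```javascript
// Smallest state at which the replace action (a = 1) is optimal at
// time t, or Infinity if replacement is never optimal at that time.
function threshold(records, t) {
  let sStar = Infinity
  for (const rec of records) {
    if (rec.time === t && rec.action === 1 && rec.state < sStar) {
      sStar = rec.state
    }
  }
  return sStar
}
```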
Exercises
Exercise 7.1 Let \(T\) denote a lower triangular matrix with 1’s on or below the diagonal and 0’s above the diagonal. Then \[ T^{-1}_{ij} = \begin{cases} 1, & \text{if } i = j, \\ -1, & \text{if } i = j + 1, \\ 0, & \text{otherwise}. \end{cases}\]
For example, for a \(4 \times 4\) matrix \[ T = \MATRIX{1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1}, \quad T^{-1} = \MATRIX{1 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 }. \]
Show the following:
- For any two PMFs \(μ^1\) and \(μ^2\), \(\mu^1 \succeq_s \mu^2\) iff \(\mu^1 T \ge \mu^2 T\).
- A Markov transition matrix \(P\) is stochastically monotone iff \(T^{-1} P T \ge 0\).
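As a sanity check, the claims of Exercise 7.1 can be tested numerically. The sketch below (plain JavaScript; all helper names are ours) builds \(T\) with 1’s on and below the diagonal, so that the row vector \(\mu T\) collects the tail sums \(\sum_{i \ge j} \mu_i\), together with the claimed inverse, and verifies \(T T^{-1} = I\):

```javascript
// n x n matrix with 1's on and below the diagonal.
function matT(n) {
  return Array.from({ length: n }, (_, i) =>
    Array.from({ length: n }, (_, j) => (i >= j ? 1 : 0)))
}

// Claimed inverse: 1 on the diagonal, -1 just below it.
function matTinv(n) {
  return Array.from({ length: n }, (_, i) =>
    Array.from({ length: n }, (_, j) => (i === j ? 1 : i === j + 1 ? -1 : 0)))
}

// Matrix product A * B.
function matmul(A, B) {
  return A.map((row, i) =>
    B[0].map((_, j) => row.reduce((acc, aik, k) => acc + aik * B[k][j], 0)))
}

// Row vector times matrix: (mu T)_j = sum_i mu_i T_ij.
function vecmat(mu, A) {
  return A[0].map((_, j) => mu.reduce((acc, mi, i) => acc + mi * A[i][j], 0))
}
```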
Exercise 7.2 Show that the following are equivalent:
- A transition matrix \(P\) is stochastically monotone
- For any two PMFs \(μ^1\) and \(μ^2\), if \(\mu^1 \succeq_s \mu^2\) then \(\mu^1P \succeq_s \mu^2P\).
Exercise 7.3 Show that if two transition matrices \(P\) and \(Q\) have the same dimensions and are stochastically monotone, then so are:
- \(\lambda P + (1 - \lambda) Q\), where \(\lambda \in (0,1)\).
- \(P Q\)
- \(P^k\), for \(k \in \integers_{> 0}\).
Exercise 7.4 Let \(\mu_t\) denote the distribution at time \(t\) of a Markov chain with a stochastically monotone transition matrix. Suppose \(\mu_0 \succeq_s \mu_1\). Show that \(\mu_t \succeq_s \mu_{t+1}\) for all \(t\).
Exercise 7.5 Consider the example of machine repair presented in notes on matrix formulation of MDPs. Prove that the optimal policy for that model is weakly increasing.
Exercise 7.6 Suppose the state space \(\ALPHABET S\) is a symmetric subset of integers of the form \(\{-L, -L + 1, \dots, L-1, L\}\) and the action space \(\ALPHABET A\) is discrete. Let \(\ALPHABET X_{\ge 0}\) denote the set \(\{0, \dots, L\}\).
Let \(P(a)\) denote the controlled transition matrix and \(c_t(s,a)\) denote the per-step cost. To avoid ambiguity, we define the optimal policy as \[ π^*_t(s) = \begin{cases} \max\bigl\{ a' \in \arg\min_{a \in \ALPHABET A} Q_t(s,a) \bigr\}, & \text{if } s \ge 0 \\ \min\bigl\{ a' \in \arg\min_{a \in \ALPHABET A} Q_t(s,a) \bigr\}, & \text{if } s < 0 \end{cases}\] The purpose of this exercise is to identify conditions under which the value function and the optimal policy are even and quasi-convex. We do so using the following steps.
- We say that the transition probability matrix \(P(a)\) is even if for all \(s, s' \in \ALPHABET S\), \(P(s'|s,a) = P(-s'|-s,a)\). Prove the following result.
Proposition 7.2 Suppose the MDP satisfies the following properties:
(A1) For every \(t\) and \(a \in \ALPHABET A\), \(c_t(s,a)\) is an even function of \(s\).
(A2) For every \(a \in \ALPHABET A\), \(P(a)\) is even.
Then, for all \(t\), \(V_t\) and \(π_t\) are even functions.
- Given any probability mass function \(μ\) on \(\ALPHABET S\), define the folded probability mass function \(\tilde μ\) on \(\ALPHABET X_{\ge 0}\) as follows: \[ \tilde μ(s) = \begin{cases} μ(0), & \text{if } s = 0 \\ μ(s) + μ(-s), & \text{if } s > 0. \end{cases} \]
For ease of notation, we use \(\tilde μ = \mathcal F μ\) to denote this folding operation. Note that an immediate consequence of the definition is the following (you don’t have to prove this).
Lemma 7.3 If \(f \colon \ALPHABET S \to \reals\) is even, then for any probability mass function \(μ\) on \(\ALPHABET S\) and \(\tilde μ = \mathcal F μ\), we have \[ \sum_{s \in \ALPHABET S} f(s) μ(s) = \sum_{s \in \ALPHABET X_{\ge 0}} f(s) \tilde μ(s). \]
Thus, the expectation of the function \(f \colon \ALPHABET S \to \reals\) with respect to the PMF \(μ\) is equal to the expectation of the function \(f \colon \ALPHABET X_{\ge 0} \to \reals\) with respect to the PMF \(\tilde μ = \mathcal F μ\).
Now given any probability transition matrix \(P\) on \(\ALPHABET S\), we can define a probability transition matrix \(\tilde P\) on \(\ALPHABET X_{\ge 0}\) as follows: for any \(s \in \ALPHABET S\), \(\tilde P_s = \mathcal F P_s\), where \(P_s\) denotes the \(s\)-th row of \(P\). For ease of notation, we use \(\tilde P = \mathcal F P\) to denote this relationship.
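The folding operation \(\mathcal F\) is straightforward to implement. A plain JavaScript sketch (our own helper names; a PMF on \(\{-L, \dots, L\}\) is stored as an array of length \(2L+1\), with index \(i\) corresponding to state \(i - L\)):

```javascript
// Fold a PMF on {-L,...,L} into a PMF on {0,...,L}.
function fold(mu) {
  const L = (mu.length - 1) / 2
  const folded = new Array(L + 1)
  folded[0] = mu[L]  // state 0 keeps its own mass
  for (let s = 1; s <= L; s++) folded[s] = mu[L + s] + mu[L - s]
  return folded
}

// Fold each row of a transition matrix on {-L,...,L}. When P is even,
// the folded rows for s and -s coincide, so keeping rows s >= 0 suffices.
function foldMatrix(P) {
  const L = (P.length - 1) / 2
  const rows = []
  for (let s = 0; s <= L; s++) rows.push(fold(P[L + s]))
  return rows
}
```

By Lemma 7.3, expectations of even functions are preserved under the fold.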
Now prove the following:
Proposition 7.3 Given the MDP \((\ALPHABET S, \ALPHABET A, P, \{c_t\})\), define the folded MDP as \((\ALPHABET X_{\ge 0}, \ALPHABET A, \tilde P, \{c_t\})\), where \(\tilde P(a) = \mathcal F P(a)\) for all \(a \in \ALPHABET A\). Let \(\tilde Q_t \colon \ALPHABET X_{\ge 0} \times \ALPHABET A \to \reals\), \(\tilde V_t \colon \ALPHABET X_{\ge 0} \to \reals\) and \(\tilde π_t^* \colon \ALPHABET X_{\ge 0} \to \ALPHABET A\) denote the action-value function, value function and the policy of the folded MDP. If the original MDP satisfies conditions (A1) and (A2), then for any \(s \in \ALPHABET S\) and \(a \in \ALPHABET A\), \[ Q_t(s,a) = \tilde Q_t(|s|, a), \quad V_t(s) = \tilde V_t(|s|), \quad π_t^*(s) = \tilde π_t^*(|s|). \]
The result of the previous part implies that if the value function \(\tilde V_t\) and the policy \(\tilde π^*_t\) are monotone increasing, then the value function \(V_t\) and the policy \(π^*_t\) are even and quasi-convex. This gives us a method to verify if the value function and optimal policy are even and quasi-convex.
Now, recall the model of the Internet of Things presented in Q2 of Assignment 3. The numerical experiments that you did in Assignment 3 suggest that the value function and the optimal policy are even and quasi-convex. Prove that this is indeed the case.
Now suppose the distribution of \(W_t\) is not Gaussian but is some general probability density \(\varphi(\cdot)\) and the cost function is \[ c(e,a) = \lambda a + (1 - a) d(e). \] Find conditions on \(\varphi\) and \(d\) such that the value function and optimal policy are even and quasi-convex.
Notes
Stochastic dominance has been employed in various areas of economics, finance, and statistics since the 1930s. See Levy (1992) and Levy (2015) for detailed overviews. The notion of stochastic monotonicity for Markov chains is due to Daley (1968). For a generalization of stochastic monotonicity to continuous state spaces, see Serfozo (1976). The characterization of stochastic monotonicity in Exercise 7.1–Exercise 7.4 are due to Keilson and Kester (1977).
Ross (1974) has an early treatment of monotonicity of optimal policies. The general theory was developed by Topkis (1998). The presentation here follows Puterman (2014). Exercise 7.6 is from Chakravorty and Mahajan (2018).