T = 20
n = 10
c = 2
K = 20
h = function(s) { return c*s }
Pw = {
// PMF and CDF of W ~ Binom(n, p); n is defined above and p is set
// elsewhere in the notebook (e.g., by a slider cell).
var points = new Array(n+1)
var cumulative = 0
var probability = 0
for(var k = 0; k <= n; k++) {
probability = binomial(n,k,p)
cumulative += probability
points[k] = { probability: probability, cumulative: cumulative }
}
return points
}
binomial = {
// binomial(n, k, p): PMF of Binom(n, p) at k.
// Log-factorial table from https://stackoverflow.com/a/37715980/193149
const logf = [0, 0, 0.6931471805599453, 1.791759469228055, 3.1780538303479458, 4.787491742782046, 6.579251212010101, 8.525161361065415, 10.60460290274525, 12.801827480081469, 15.104412573075516, 17.502307845873887, 19.987214495661885, 22.552163853123425, 25.19122118273868, 27.89927138384089, 30.671860106080672, 33.50507345013689, 36.39544520803305, 39.339884187199495, 42.335616460753485, 45.38013889847691, 48.47118135183523, 51.60667556776438, 54.78472939811232, 58.00360522298052, 61.261701761002, 64.55753862700634, 67.88974313718154, 71.25703896716801, 74.65823634883016, 78.0922235533153, 81.55795945611504, 85.05446701758152, 88.58082754219768, 92.1361756036871, 95.7196945421432, 99.33061245478743, 102.96819861451381, 106.63176026064346, 110.32063971475739, 114.0342117814617, 117.77188139974507, 121.53308151543864, 125.3172711493569, 129.12393363912722, 132.95257503561632, 136.80272263732635, 140.67392364823425, 144.5657439463449, 148.47776695177302, 152.40959258449735, 156.3608363030788, 160.3311282166309, 164.32011226319517, 168.32744544842765, 172.3527971391628, 176.39584840699735, 180.45629141754378, 184.53382886144948, 188.6281734236716, 192.7390472878449, 196.86618167289, 201.00931639928152, 205.1681994826412, 209.34258675253685, 213.53224149456327, 217.73693411395422, 221.95644181913033, 226.1905483237276, 230.43904356577696, 234.70172344281826, 238.97838956183432, 243.2688490029827, 247.57291409618688, 251.8904022097232, 256.22113555000954, 260.5649409718632, 264.9216497985528, 269.2910976510198, 273.6731242856937, 278.0675734403661, 282.4742926876304, 286.893133295427, 291.3239500942703, 295.76660135076065, 300.22094864701415, 304.6868567656687, 309.1641935801469, 313.65282994987905, 318.1526396202093, 322.66349912672615, 327.1852877037752, 331.7178871969285, 336.26118197919845, 340.815058870799, 345.37940706226686, 349.95411804077025, 354.5390855194408, 359.1342053695754, 363.73937555556347]
return function(n, k, p) {
// C(n,k) computed as exp(log n! - log (n-k)! - log k!) to avoid overflow
return Math.exp(logf[n] - logf[n-k] - logf[k]) * p**k * (1-p)**(n-k)
}
}
DP = {
var DP = new Array()
var idx = 0
var V = new Array(n+2)   // indexed as V[1+s] for s = 0, ..., n
var Q0 = new Array(n+2)
var Q1 = new Array(n+2)
var a = 0
var val = 0
// Initialize the terminal value function
for (var s = 0; s <= n; s++) {
V[1+s] = 0
}
// Dynamic Programming
for (var t = T; t >= 1; t--) {
// Q0[s] = h(s) + E[ V(s+W) ]   (continue operating)
// Q1[s] = K + E[ V(W) ]        (replace)
for (var s = 0; s <= n; s++) {
Q0[1+s] = h(s)
Q1[1+s] = K
for (var w=0; w <= n; w++) {
var s_next = Math.min(s+w, n)
Q0[1+s] += V[ 1 + s_next ]*Pw[w].probability
Q1[1+s] += V[ 1 + w ]*Pw[w].probability
}
}
for (var s = 0; s <= n; s++) {
if (Q0[1+s] <= Q1[1+s]) {
a = 0
val = Q0[1+s]
} else {
a = 1
val = Q1[1+s]
}
DP[idx++] = { time: t, state: s, value: val, action: a }
V[1+s] = val
}
}
return DP;
}
7 Monotonicity of value function and optimal policies
Consider a manufacturing process, where the machine used for manufacturing deteriorates over time. Let \(\ALPHABET S = \{0, 1, \dots, n\}\) represent the condition of the machine. The higher the value of \(s\), the worse the condition of the equipment.
A decision maker observes the state of the machine and has two options: continue operating the machine or replace it with a new and identical piece of equipment. Operating the machine in state \(s\) costs \(h(s)\), where \(h(⋅)\) is a weakly increasing function; replacing the machine costs a constant amount \(K\).
When the machine is operated, its state deteriorates according to \[ S_{t+1} = \min( S_t + W_t , n) \] where \(\{W_t\}_{t \ge 1}\) is an i.i.d. process with PMF \(μ\).
We model the above system as an MDP with state space \(\ALPHABET S\) and action space \(\ALPHABET A = \{0, 1\}\), where \(0\) means operating the machine and \(1\) means replacing the machine.
We compute the optimal policy for this model when \(W \sim \text{Binom}(n,p)\), with \(n = 10\), \(T = 20\), \(h(s) = 2s\), and \(K=20\), for different values of \(p\).
7.1 Stochastic dominance
Let \(\ALPHABET S\) be a totally ordered finite set, say \(\{1, \dots, n\}\).
Stochastic dominance is a partial order on random variables defined on totally ordered sets. Let \(S^1 \sim \mu^1\) and \(S^2 \sim \mu^2\) be random variables defined on \(\ALPHABET S\). We say that \(S^1\) stochastically dominates \(S^2\), written \(S^1 \succeq_s S^2\), if \[\begin{equation}\label{eq:inc-prob} \sum_{z \ge s} \mu^1_z \ge \sum_{z \ge s} \mu^2_z, \quad \forall s \in \ALPHABET S. \end{equation}\]
Let \({\rm M}^1\) and \({\rm M}^2\) denote the CDFs of \(\mu^1\) and \(\mu^2\). Then \eqref{eq:inc-prob} is equivalent to the following: \[\begin{equation}\label{eq:cdf} {\rm M}^1_s \le {\rm M}^2_s, \quad \forall s \in \ALPHABET S. \end{equation}\] Thus, visually, \(S^1 \succeq_s S^2\) means that the CDF of \(S^1\) lies below the CDF of \(S^2\).
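The CDF characterization \eqref{eq:cdf} is easy to check numerically. Below is a small helper, written as plain JavaScript rather than as a notebook cell (the function name `dominates` is ours), that compares two PMFs given as arrays indexed by state:

```javascript
// Check whether mu1 stochastically dominates mu2 (mu1 >=_s mu2) by
// comparing CDFs pointwise: M1_s <= M2_s for every state s.
function dominates(mu1, mu2) {
  let M1 = 0, M2 = 0
  for (let s = 0; s < mu1.length; s++) {
    M1 += mu1[s]
    M2 += mu2[s]
    if (M1 > M2 + 1e-12) return false  // tolerance for floating-point round-off
  }
  return true
}
```

For example, a PMF that pushes mass toward higher states dominates one that concentrates mass on lower states.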
Stochastic dominance is important due to the following property.
Theorem 7.1 Let \(S^1 \sim \mu^1\) and \(S^2 \sim \mu^2\) be random variables defined on \(\ALPHABET S\). Then \(S^1 \succeq_s S^2\) if and only if, for every (weakly) increasing function \(f \colon \ALPHABET S \to \reals\), \[\begin{equation}\label{eq:inc-fun} \EXP[f(S^1)] \ge \EXP[f(S^2)]. \end{equation}\]
7.2 Stochastic monotonicity
Stochastic monotonicity extends the notion of stochastic dominance to Markov chains. Suppose \(\ALPHABET S\) is a totally ordered set and \(\{S_t\}_{t \ge 1}\) is a time-homogeneous Markov chain on \(\ALPHABET S\) with transition probability matrix \(P\). Let \(P_i\) denote the \(i\)-th row of \(P\). Note that \(P_i\) is a PMF. The transition matrix \(P\) is called stochastically monotone if \(P_i \succeq_s P_j\) whenever \(i > j\).
An immediate implication is the following.
Theorem 7.2 Let \(\{S_t\}_{t \ge 1}\) be a Markov chain with transition matrix \(P\). Then \(P\) is stochastically monotone if and only if, for every weakly increasing function \(f \colon \ALPHABET S \to \reals\) and all \(s^1, s^2 \in \ALPHABET S\) such that \(s^1 > s^2\), \[ \EXP[f(S_{t+1}) | S_t = s^1] \ge \EXP[ f(S_{t+1}) | S_t = s^2]. \]
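Stochastic monotonicity of a finite transition matrix can be verified row by row. A plain JavaScript sketch (the helper name is ours); since \(\succeq_s\) is transitive and the rows are totally ordered by their index, it suffices to compare adjacent rows:

```javascript
// P is stochastically monotone if P_i >=_s P_j for all i > j.
// By transitivity it is enough that each row dominates the previous one,
// i.e., the row-wise CDFs decrease pointwise as the row index grows.
function isStochasticallyMonotone(P) {
  for (let i = 1; i < P.length; i++) {
    let Mprev = 0, Mcur = 0
    for (let s = 0; s < P[i].length; s++) {
      Mprev += P[i - 1][s]
      Mcur += P[i][s]
      if (Mcur > Mprev + 1e-12) return false  // CDF of row i must lie below row i-1
    }
  }
  return true
}
```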
7.3 Monotonicity of value functions
Theorem 7.3 Consider an MDP where the state space \(\ALPHABET S\) is totally ordered. Suppose the following conditions are satisfied.
C1. For every \(a \in \ALPHABET A\), the per-step cost \(c_t(s,a)\) is weakly increasing in \(s\).
C2. For every \(a \in \ALPHABET A\), the transition matrix \(P(a)\) is stochastically monotone.
Then, the value function \(V_t(s)\) is weakly increasing in \(s\).
7.4 Submodularity
A twice continuously differentiable function \(f\) on \(\reals^2\) is submodular iff \[ \frac{ \partial^2 f(x,y) }{ \partial x \partial y } \le 0, \quad \forall x,y. \] If the inequality is reversed, then the function is supermodular.
Submodularity is a useful property because it implies monotonicity of the arg min.
Theorem 7.4 Let \(\ALPHABET X\) be a partially ordered set, \(\ALPHABET Y\) be a totally ordered set, and \(f \colon \ALPHABET X \times \ALPHABET Y \to \reals\) be a submodular function. Suppose that for all \(x\), \(\arg \min_{y \in \ALPHABET Y} f(x,y)\) exists. Then, \[ π(x) := \max \{ y^* \in \arg \min_{y \in \ALPHABET Y} f(x,y) \} \] is weakly increasing in \(x\).
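Theorem 7.4 can be illustrated numerically on a finite grid. The sketch below (plain JavaScript; the helper names are ours) computes the largest minimizer of \(f(x, \cdot)\) for the submodular function \(f(x,y) = (x-y)^2\), whose largest minimizer is weakly increasing in \(x\):

```javascript
// Largest minimizer of f(x, .) over y in {0, ..., m}.
function largestArgmin(f, x, m) {
  let best = Infinity, arg = 0
  for (let y = 0; y <= m; y++) {
    const v = f(x, y)
    if (v <= best) { best = v; arg = y }  // '<=' keeps the largest minimizer
  }
  return arg
}

// f(x,y) = (x - y)^2 is submodular: its cross partial derivative is -2 <= 0.
const f = (x, y) => (x - y) ** 2
```

Here \(\arg\min_y f(x,y) = x\) (clamped to the grid), which is indeed weakly increasing in \(x\).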
The analogue of Theorem 7.4 for supermodular functions is as follows.
Theorem 7.5 Let \(\ALPHABET X\) be a partially ordered set, \(\ALPHABET Y\) be a totally ordered set, and \(f \colon \ALPHABET X \times \ALPHABET Y \to \reals\) be a supermodular function. Suppose that for all \(x\), \(\arg \min_{y \in \ALPHABET Y} f(x,y)\) exists. Then, \[ π(x) := \min \{ y^* \in \arg \min_{y \in \ALPHABET Y} f(x,y) \} \] is weakly decreasing in \(x\).
7.5 Monotonicity of optimal policy
Theorem 7.6 Consider an MDP where the state space \(\ALPHABET S\) and the action space \(\ALPHABET A\) are totally ordered. Suppose that, in addition to (C1) and (C2), the following condition is satisfied.
C3. For any weakly increasing function \(v\), \[ c_t(s,a) + \EXP[ v(S_{t+1}) | S_t = s, A_t = a]\] is submodular in \((s,a)\).
Let \(π^*_t(s) = \max\{ a^* \in \arg \min_{a \in \ALPHABET A} Q_t(s,a) \}\). Then, \(π^*_t(s)\) is weakly increasing in \(s\).
It is difficult to verify condition (C3). The following conditions are sufficient for (C3).
Lemma 7.1 Consider an MDP with totally ordered state and action spaces. Suppose
- \(c_t(s,a)\) is submodular in \((s,a)\).
- For all \(s' \in \ALPHABET S\), \(H(s' | s,a) = 1 - \sum_{z \le s'} P_{sz}(a)\) is submodular in \((s,a)\).
Then condition (C3) of the previous theorem holds.
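On a finite grid, submodularity of a function \(f(s,a)\) reduces to the difference condition \(f(s+1,a+1) - f(s+1,a) \le f(s,a+1) - f(s,a)\) for all \(s,a\). The brute-force checker below (a plain JavaScript sketch with our own helper name) can be used to test the two conditions of Lemma 7.1 numerically:

```javascript
// Brute-force discrete submodularity check on {0,...,nS-1} x {0,...,nA-1}:
// all cross differences must be non-positive.
function isSubmodular(f, nS, nA) {
  for (let s = 0; s + 1 < nS; s++) {
    for (let a = 0; a + 1 < nA; a++) {
      if (f(s + 1, a + 1) - f(s + 1, a) > f(s, a + 1) - f(s, a) + 1e-12) {
        return false
      }
    }
  }
  return true
}
```

Supermodularity can be checked with the same routine by negating \(f\).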
The analogue of Theorem 7.6 for supermodular functions is as follows.
Theorem 7.7 Consider an MDP where the state space \(\ALPHABET S\) and the action space \(\ALPHABET A\) are totally ordered. Suppose that, in addition to (C1) and (C2), the following condition is satisfied.
C4. For any weakly increasing function \(v\), \[ c_t(s,a) + \EXP[ v(S_{t+1}) | S_t = s, A_t = a]\] is supermodular in \((s,a)\).
Let \(π^*_t(s) = \min\{ a^* \in \arg \min_{a \in \ALPHABET A} Q_t(s,a) \}\). Then, \(π^*_t(s)\) is weakly decreasing in \(s\).
It is difficult to verify condition (C4). The following conditions are sufficient for (C4).
Lemma 7.2 Consider an MDP with totally ordered state and action spaces. Suppose
- \(c_t(s,a)\) is supermodular in \((s,a)\).
- For all \(s' \in \ALPHABET S\), \(H(s' | s,a) = 1 - \sum_{z \le s'} P_{sz}(a)\) is supermodular in \((s,a)\).
Then condition (C4) of the previous theorem holds.
7.6 Constraints on actions
In the results above, we have assumed that the action set \(\ALPHABET A\) is the same for all states. The results also extend to the case when the action at state \(s\) must belong to some set \(\ALPHABET A(s)\) provided the following conditions are satisfied:
- For any \(s \ge s'\), \(\ALPHABET A(s) \supseteq \ALPHABET A(s')\)
- For any \(s \in \ALPHABET S\) and \(a \in \ALPHABET A(s)\), \(a' < a\) implies that \(a' \in \ALPHABET A(s)\).
7.7 Monotone dynamic programming
If we can establish that the optimal policy is monotone, then we can use this structure to implement the dynamic program more efficiently. Suppose \(\ALPHABET S = \{1, \dots, n\}\) and \(\ALPHABET A = \{1, \dots, m\}\). The main idea is as follows. Suppose \(V_{t+1}(\cdot)\) has been calculated. Instead of computing \(Q_t(s,a)\) for every state-action pair and then \(V_t(s)\), proceed as follows:
1. Set \(s = 1\) and \(α = 1\).
2. For all \(a \in \{α, \dots, m\}\), compute \(Q_t(s,a)\) as usual.
3. Compute \[V_t(s) = \min_{ α \le a \le m } Q_t(s,a)\] and set \[π_t^*(s) = \max \{ a \in \{α, \dots, m\} : V_t(s) = Q_t(s,a) \}.\]
4. If \(s = n\), then stop. Otherwise, set \(α = π_t^*(s)\) and \(s = s+1\) and go to step 2.
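The steps above can be sketched in plain JavaScript (this is not part of the notebook above; the names `monotoneStage` and `Q` are ours, and `Q(s, a)` is assumed to evaluate the stage-\(t\) action-value):

```javascript
// One stage of monotone dynamic programming. States are {1,...,n} and
// actions are {1,...,m}; the search for the minimizing action at state s
// starts at alpha, the optimal action at state s-1.
function monotoneStage(Q, n, m) {
  const V = new Array(n + 1), pi = new Array(n + 1)
  let alpha = 1
  for (let s = 1; s <= n; s++) {
    let best = Infinity, arg = alpha
    for (let a = alpha; a <= m; a++) {  // actions below alpha are never checked
      const q = Q(s, a)
      if (q <= best) { best = q; arg = a }  // '<=' keeps the largest minimizer
    }
    V[s] = best
    pi[s] = arg
    alpha = arg  // the policy is monotone, so the next search starts here
  }
  return { V: V, pi: pi }
}
```

In the worst case this is no faster than the standard recursion, but when the policy increases quickly the inner loop shrinks accordingly.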
7.8 Example: A machine replacement model
Let’s revisit the machine replacement problem presented at the beginning of this section. For simplicity, we’ll assume that \(n = ∞\), i.e., the state space is countable. In this case, the transition matrices are given by \[ P_{sz}(0) = \begin{cases} 0, & z < s \\ μ_{z - s}, & z \ge s \end{cases} \quad\text{and}\quad P_{sz}(1) = μ_z, \] where \(μ\) is the PMF of \(W\).
Proposition 7.1 For the machine replacement problem, there exists a sequence of thresholds \(\{s^*_t\}_{t = 1}^T\) such that the optimal policy at time \(t\) is a threshold policy with threshold \(s^*_t\), i.e., \[ π_t(s) = \begin{cases} 0 & \text{if $s < s_t^*$} \\ 1 & \text{otherwise.} \end{cases}\]
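Given the output of the DP cell at the top of these notes, the thresholds can be read off directly. A sketch (the helper name is ours; the records are assumed to have the `{ time, state, action }` shape produced by that cell):

```javascript
// Smallest state at which the replace action (a = 1) is optimal at
// time t, or Infinity if replacement is never optimal at that time.
function threshold(records, t) {
  let sStar = Infinity
  for (const rec of records) {
    if (rec.time === t && rec.action === 1 && rec.state < sStar) {
      sStar = rec.state
    }
  }
  return sStar
}
```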
Exercises
Exercise 7.1 Let \(T\) denote a lower triangular matrix with 1’s on or below the diagonal and 0’s above the diagonal. Then \[ T^{-1}_{ij} = \begin{cases} 1, & \text{if } i = j, \\ -1, & \text{if } i = j + 1, \\ 0, & \text{otherwise}. \end{cases}\]
For example, for a \(4 \times 4\) matrix \[ T = \MATRIX{1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1}, \quad T^{-1} = \MATRIX{1 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 }. \]
Show the following:
- For any two PMFs \(μ^1\) and \(μ^2\), \(\mu^1 \succeq_s \mu^2\) iff \(\mu^1 T \ge \mu^2 T\).
- A Markov transition matrix \(P\) is stochastically monotone iff \(T^{-1} P T \ge 0\).
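As a sanity check, the claims of Exercise 7.1 can be tested numerically. The sketch below (plain JavaScript; all helper names are ours) builds \(T\) with 1’s on and below the diagonal, so that the row vector \(\mu T\) collects the tail sums \(\sum_{i \ge j} \mu_i\), together with the claimed inverse, and verifies \(T T^{-1} = I\):

```javascript
// n x n matrix with 1's on and below the diagonal.
function matT(n) {
  return Array.from({ length: n }, (_, i) =>
    Array.from({ length: n }, (_, j) => (i >= j ? 1 : 0)))
}

// Claimed inverse: 1 on the diagonal, -1 just below it.
function matTinv(n) {
  return Array.from({ length: n }, (_, i) =>
    Array.from({ length: n }, (_, j) => (i === j ? 1 : i === j + 1 ? -1 : 0)))
}

// Matrix product A * B.
function matmul(A, B) {
  return A.map((row, i) =>
    B[0].map((_, j) => row.reduce((acc, aik, k) => acc + aik * B[k][j], 0)))
}

// Row vector times matrix: (mu T)_j = sum_i mu_i T_ij.
function vecmat(mu, A) {
  return A[0].map((_, j) => mu.reduce((acc, mi, i) => acc + mi * A[i][j], 0))
}
```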
Exercise 7.2 Show that the following are equivalent:
- A transition matrix \(P\) is stochastically monotone
- For any two PMFs \(μ^1\) and \(μ^2\), if \(\mu^1 \succeq_s \mu^2\) then \(\mu^1P \succeq_s \mu^2P\).
Exercise 7.3 Show that if two transition matrices \(P\) and \(Q\) have the same dimensions and are stochastically monotone, then so are:
- \(\lambda P + (1 - \lambda) Q\), where \(\lambda \in (0,1)\).
- \(P Q\)
- \(P^k\), for \(k \in \integers_{> 0}\).
Exercise 7.4 Let \(\mu_t\) denote the distribution at time \(t\) of a Markov chain with a stochastically monotone transition matrix. Suppose \(\mu_0 \succeq_s \mu_1\). Show that \(\mu_t \succeq_s \mu_{t+1}\) for all \(t\).
Exercise 7.5 Consider the example of machine repair presented in notes on matrix formulation of MDPs. Prove that the optimal policy for that model is weakly increasing.
Exercise 7.6 Suppose the state space \(\ALPHABET S\) is a symmetric subset of integers of the form \(\{-L, -L + 1, \dots, L-1, L\}\) and the action space \(\ALPHABET A\) is discrete. Let \(\ALPHABET X_{\ge 0}\) denote the set \(\{0, \dots, L\}\).
Let \(P(a)\) denote the controlled transition matrix and \(c_t(s,a)\) denote the per-step cost. To avoid ambiguity, we define the optimal policy as \[ π^*_t(s) = \begin{cases} \max\bigl\{ a' \in \arg\min_{a \in \ALPHABET A} Q_t(s,a) \bigr\}, & \text{if } s \ge 0 \\ \min\bigl\{ a' \in \arg\min_{a \in \ALPHABET A} Q_t(s,a) \bigr\}, & \text{if } s < 0 \end{cases}\] The purpose of this exercise is to identify conditions under which the value function and the optimal policy are even and quasi-convex. We do so using the following steps.
- We say that the transition probability matrix \(P(a)\) is even if for all \(s, s' \in \ALPHABET S\), \(P(s'|s,a) = P(-s'|-s,a)\). Prove the following result.
Proposition 7.2 Suppose the MDP satisfies the following properties:
(A1) For every \(t\) and \(a \in \ALPHABET A\), \(c_t(s,a)\) is an even function of \(s\).
(A2) For every \(a \in \ALPHABET A\), \(P(a)\) is even.
Then, for all \(t\), \(V_t\) and \(π_t\) are even functions.
- Given any probability mass function \(μ\) on \(\ALPHABET S\), define the folded probability mass function \(\tilde μ\) on \(\ALPHABET X_{\ge 0}\) as follows: \[ \tilde μ(s) = \begin{cases} μ(0), & \text{if } s = 0 \\ μ(s) + μ(-s), & \text{if } s > 0. \end{cases} \]
For ease of notation, we use \(\tilde μ = \mathcal F μ\) to denote this folding operation. Note that an immediate consequence of the definition is the following (you don’t have to prove this).
Lemma 7.3 If \(f \colon \ALPHABET S \to \reals\) is even, then for any probability mass function \(μ\) on \(\ALPHABET S\) and \(\tilde μ = \mathcal F μ\), we have \[ \sum_{s \in \ALPHABET S} f(s) μ(s) = \sum_{s \in \ALPHABET X_{\ge 0}} f(s) \tilde μ(s). \]
Thus, the expectation of the function \(f \colon \ALPHABET S \to \reals\) with respect to the PMF \(μ\) is equal to the expectation of the function \(f \colon \ALPHABET X_{\ge 0} \to \reals\) with respect to the PMF \(\tilde μ = \mathcal F μ\).
Now given any probability transition matrix \(P\) on \(\ALPHABET S\), we can define a probability transition matrix \(\tilde P\) on \(\ALPHABET X_{\ge 0}\) as follows: for any \(s \in \ALPHABET S\), \(\tilde P_s = \mathcal F P_s\), where \(P_s\) denotes the \(s\)-th row of \(P\). For ease of notation, we use \(\tilde P = \mathcal F P\) to denote this relationship.
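The folding operation \(\mathcal F\) is straightforward to implement. A plain JavaScript sketch (our own helper names; a PMF on \(\{-L, \dots, L\}\) is stored as an array of length \(2L+1\), with index \(i\) corresponding to state \(i - L\)):

```javascript
// Fold a PMF on {-L,...,L} into a PMF on {0,...,L}.
function fold(mu) {
  const L = (mu.length - 1) / 2
  const folded = new Array(L + 1)
  folded[0] = mu[L]  // state 0 keeps its own mass
  for (let s = 1; s <= L; s++) folded[s] = mu[L + s] + mu[L - s]
  return folded
}

// Fold each row of a transition matrix on {-L,...,L}. When P is even,
// the folded rows for s and -s coincide, so keeping rows s >= 0 suffices.
function foldMatrix(P) {
  const L = (P.length - 1) / 2
  const rows = []
  for (let s = 0; s <= L; s++) rows.push(fold(P[L + s]))
  return rows
}
```

By Lemma 7.3, expectations of even functions are preserved under the fold.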
Now prove the following:
Proposition 7.3 Given the MDP \((\ALPHABET S, \ALPHABET A, P, \{c_t\})\), define the folded MDP as \((\ALPHABET X_{\ge 0}, \ALPHABET A, \tilde P, \{c_t\})\), where \(\tilde P(a) = \mathcal F P(a)\) for all \(a \in \ALPHABET A\). Let \(\tilde Q_t \colon \ALPHABET X_{\ge 0} \times \ALPHABET A \to \reals\), \(\tilde V_t \colon \ALPHABET X_{\ge 0} \to \reals\) and \(\tilde π_t^* \colon \ALPHABET X_{\ge 0} \to \ALPHABET A\) denote the action-value function, value function and the policy of the folded MDP. If the original MDP satisfies conditions (A1) and (A2), then for any \(s \in \ALPHABET S\) and \(a \in \ALPHABET A\), \[ Q_t(s,a) = \tilde Q_t(|s|, a), \quad V_t(s) = \tilde V_t(|s|), \quad π_t^*(s) = \tilde π_t^*(|s|). \]
The result of the previous part implies that if the value function \(\tilde V_t\) and the policy \(\tilde π^*_t\) are monotone increasing, then the value function \(V_t\) and the policy \(π^*_t\) are even and quasi-convex. This gives us a method to verify if the value function and optimal policy are even and quasi-convex.
Now, recall the model of the Internet of Things presented in Q2 of Assignment 3. The numerical experiments that you did in Assignment 3 suggest that the value function and the optimal policy are even and quasi-convex. Prove that this is indeed the case.
Now suppose the distribution of \(W_t\) is not Gaussian but is some general probability density \(\varphi(\cdot)\) and the cost function is \[ c(e,a) = \lambda a + (1 - a) d(e). \] Find conditions on \(\varphi\) and \(d\) such that the value function and optimal policy are even and quasi-convex.
Notes
Stochastic dominance has been employed in various areas of economics, finance, and statistics since the 1930s. See Levy (1992) and Levy (2015) for detailed overviews. The notion of stochastic monotonicity for Markov chains is due to Daley (1968). For a generalization of stochastic monotonicity to continuous state spaces, see Serfozo (1976). The characterization of stochastic monotonicity in Exercise 7.1–Exercise 7.4 are due to Keilson and Kester (1977).
Ross (1974) has an early treatment of monotonicity of optimal policies. The general theory was developed by Topkis (1998). The presentation here follows Puterman (2014). Exercise 7.6 is from Chakravorty and Mahajan (2018).