ECSE 506: Stochastic Control and Decision Theory
Aditya Mahajan
Winter 2022
About  Lectures  Notes  Coursework
TL;DR General conditions are presented under which the optimal policy is monotone. Such a structural property is useful because it makes it easy to search and implement the optimal policy.
Consider the matrix formulation of MDPs and suppose the state space \(\ALPHABET S\) is totally ordered. In many applications, it is useful to know if the value function is increasing (or decreasing) in state.
 Theorem 1

Consider an MDP where the state space \(\ALPHABET S\) is totally ordered. Suppose the following conditions are satisfied.
C1. For every \(a \in \ALPHABET A\), the perstep cost \(c_t(s,a)\) is weakly inceasing in \(s\).
C2. For every \(a \in \ALPHABET A\), the transition matrix \(P(a)\) is stochastically monotone.
Then, the value function \(V_t(s)\) is weakly increasing in \(s\).
 Note

The result above also applies to models with continuous (and totally ordered) state space provided the measurable selection conditions hold so that the arg min at each step of the dynamic program is attained.
Proof
We proceed by backward induction. By definition, \(V_{T+1}(s) = 0\), which is weakly increasing. This forms the basis of induction. Assume that \(V_{t+1}(s)\) is weakly increasing. Now consider, \[Q_t(s,a) = c_t(s,a) + \EXP[V_{t+1}(S_{t+1})  S_t = s, A_t = a].\] For any \(a \in \ALPHABET A\), \(Q_t(s,a)\) is a sum of two weakly increasing functions in \(s\); hence \(Q_t(s,a)\) is weakly increasing in \(s\).
Now consider \(s_1, s_2 \in \ALPHABET S\) such that \(s_1 > s_2\). Suppose \(a_1^*\) is the optimal action at state \(s_1\). Then \[ V_t(s^1) = Q_t(s^1, a_1^*) \stackrel{(a)}\ge Q_t(s^2,a_1^*) \stackrel{(b)}\ge V_t(s_2), \] where \((a)\) follows because \(Q_t(\cdot, u^*)\) is weakly increasing and \((b)\) follows from the definition of the value function.
1 Submodularity
 Definition

Let \(\ALPHABET X\) and \(\ALPHABET Y\) be partially ordered sets. A function \(f \colon \ALPHABET X \times \ALPHABET Y \to \reals\) is called submodular if for any \(x^+ \ge x^\) and \(y^+ \ge y^\), we have \[\begin{equation}\label{eq:submodular} f(x^+, y^+) + f(x^, y^) \le f(x^+, y^) + f(x^, y^+). \end{equation}\]
The function is called supermodular if the inequality in \eqref{eq:submodular} is reversed.
A continuous and differentiable function on \(\reals^2\) is submodular iff \[ \frac{ \partial^2 f(x,y) }{ \partial x \partial y } \le 0, \quad \forall x,y. \] If the inequality is reversed, then the function is supermodular.
Submodularity is a useful property because it implies monotonicity of the arg min.
 Theorem 2

Let \(\ALPHABET X\) be a partially ordered set, \(\ALPHABET Y\) be a totally ordered set, and \(f \colon \ALPHABET X \times \ALPHABET Y \to \reals\) be a submodular function. Suppose that for all \(x\), \(\arg \min_{y \in \ALPHABET Y} f(x,y)\) exists. Then, \[ π(x) := \max \{ y^* \in \arg \min_{y \in \ALPHABET Y} f(x,y) \} \] is weakly increasing in \(x\).
Proof
Consider \(x^+, x^ \in \ALPHABET X\) such that \(x^+ \ge x^\). Since \(f\) is submodular, for any \(y \le π(x^)\), we have \[\begin{equation}\label{eq:1} f(x^+, π(x^))  f(x^+, y) \le f(x^, π(x^))  f(x^, y) \le 0, \end{equation}\] where the last inequality follows because \(π(x^)\) is the arg min of \(f(x^, y)\). Eq. \eqref{eq:1} implies that for all \(y \le π(x^)\), \[ f(x^+, π(x^)) \le f(x^+, y). \] Thus, \(π(x^+) \ge π(x^)\).
The analogue of Theorem 2 for supermodular functions is as follows.
 Theorem 3

Let \(\ALPHABET X\) be a partially ordered set, \(\ALPHABET Y\) be a totally ordered set, and \(f \colon \ALPHABET X \times \ALPHABET Y \to \reals\) be a supermodular function. Suppose that for all \(x\), \(\arg \min_{y \in \ALPHABET Y} f(x,y)\) exists. Then, \[ π(x) := \min \{ y^* \in \arg \min_{y \in \ALPHABET Y} f(x,y) \} \] is weakly decreasing in \(x\).
Proof
The proof is similar to Theorem 2.
Consider \(x^+, x^ \in \ALPHABET X\) such that \(x^+ \ge x^\). Since \(f\) is supermodular, for any \(y \ge π(x^)\), we have \[\begin{equation}\label{eq:2} f(x^+, y)  f(x^+, π(x^)) \ge f(x^, y)  f(x^, π(x^)) \ge 0, \end{equation}\] where the last inequality follows because \(π(x^)\) is the arg min of \(f(x^, y)\). Eq. \eqref{eq:2} implies that for all \(y \ge π(x^)\), \[ f(x^+, y) \ge f(x^+, π(x^)). \] Thus, \(π(x^+) \le π(x^)\).
2 Monotonicity of optimal policy
 Theorem 4

Consider an MDP where the state space \(\ALPHABET S\) and the action space \(\ALPHABET A\) are totally ordered. Suppose that, in addition to (C1) and (C2), the following condition is satisfied.
C3. For any weakly increasing function \(v\), \[ c_t(s,a) + \EXP[ v(S_{t+1})  S_t = s, A_t = a]\] is submodular in \((s,a)\).
Let \(π^*_t(s) = \max\{ a^* \in \arg \min_{a \in \ALPHABET A} Q_t(s,a) \}\). Then, \(π^*(s)\) is weakly increasing in \(s\).
Proof
Conditions (C1) and (C2) imply that the value function \(V_{t+1}(s)\) is weakly increasing. Therefore, condition (C3) implies that \(Q_t(s,a)\) is submodular in \((s,a)\). Therefore, the arg min is weakly increasing in \(x\)
It is difficult to verify condition (C3). The following conditions are sufficient for (C3).
 Lemma 1

Consider an MDP with totally ordered state and action spaces. Suppose
 \(c_t(s,a)\) is submodular in \((s,a)\).
 For all \(s' \in \ALPHABET S\), \(H(s'  s,a) = 1  \sum_{z \le s'} P_{sz}(a)\) is submodular in \((s,a)\).
The condition (C3) of the previous theorem holds.
Proof
Consider \(s^+, s^ \in \ALPHABET S\) and \(a^+, a^ \in \ALPHABET A\) such that \(s^+ > s^\) and \(a^+ > a^\). Define
\[\begin{align*} μ_1(s) &= \tfrac 12 P_{s^ s}(a^) + \tfrac 12 P_{s^+ s}(a^+), \\ μ_2(s) &= \tfrac 12 P_{s^ s}(a^+) + \tfrac 12 P_{s^+ s}(a^). \end{align*}\] Since \(H(s'  s,a)\) is submodular, we have \[ H(s'  s^+, a^+) + H(s'  s^, a^) \le H(s'  s^+, a^) + H(s'  s^, a^+) \] or equivalently, \[\sum_{z \le s'} \big[ P_{s^+ z}(a^+) + P_{s^ z}(a^) \big] \ge \sum_{z \le s'} \big[ P_{s^+ z}(a^) + P_{s^ z}(a^+) \big]. \] which implies \[ M_1(s') \ge M_2(s')\] where \(M_1\) and \(M_2\) are the CDFs of \(μ_1\) and \(μ_2\). Thus, \(μ_1 \preceq_s μ_2\).
Hence, for any weakly increasing function \(v \colon \ALPHABET S \to \reals\), \[ \sum_{s' \in \ALPHABET S} μ_1(s') v(s') \le \sum_{s' \in \ALPHABET S} μ_2(s') v(s').\] Or, equivalently, \[H(s^+, a^+) + H(s^, a^) \le H(s^, a^+) + H(s^+, a^)\] where \(H(s,a) = \EXP[ v(X_{t+1})  X_t = s, U_t = a]\).
Therefore, \(c_t(s,a) + H_t(s,a)\) is submodular in \((s,a)\).
The analogue of Theorem 4 for supermodular functions is as follows.
 Theorem 5

Consider an MDP where the state space \(\ALPHABET S\) and the action space \(\ALPHABET A\) are totally ordered. Suppose that, in addition to (C1) and (C2), the following condition is satisfied.
C4. For any weakly increasing function \(v\), \[ c_t(s,a) + \EXP[ v(S_{t+1})  S_t = s, A_t = a]\] is supermodular in \((s,a)\).
Let \(π^*_t(s) = \min\{ a^* \in \arg \min_{a \in \ALPHABET S} Q_t(s,a) \}\). Then, \(π^*(s)\) is weakly decreasing in \(s\).
Proof
Conditions (C1) and (C2) imply that the value function \(V_{t+1}(s)\) is weakly increasing. Therefore, condition (C4) implies that \(Q_t(s,a)\) is supermodular in \((s,a)\). Therefore, the arg min is decreasing in \(s\)
It is difficult to verify condition (C4). The following conditions are sufficient for (C4).
 Lemma 2

Consider an MDP with totally ordered state and action spaces. Suppose
 \(c_t(s,a)\) is supermodular in \((s,a)\).
 For all \(s' \in \ALPHABET S\), \(H(s'  s,a) = 1  \sum_{z \le s'} P_{sz}(a)\) is supermodular in \((s,a)\).
The condition (C4) of the previous theorem holds.
Proof
Consider \(s^+, s^ \in \ALPHABET S\) and \(a^+, a^ \in \ALPHABET A\) such that \(s^+ > s^\) and \(a^+ > a^\). Define
\[\begin{align*} μ_1(s) &= \tfrac 12 P_{s^ s}(a^) + \tfrac 12 P_{s^+ s}(a^+), \\ μ_2(s) &= \tfrac 12 P_{s^ s}(a^+) + \tfrac 12 P_{s^+ s}(a^). \end{align*}\] Since \(H(s'  s,a)\) is supermodular, we have \[ H(s'  s^+, a^+) + H(s'  s^, a^) \ge H(s'  s^+, a^) + H(s'  s^, a^+) \] or equivalently, \[\sum_{s' \le s'} \big[ P_{s^+ s'}(a^+) + P_{s^ s'}(a^) \big] \le \sum_{s' \le s'} \big[ P_{s^+ s'}(a^) + P_{s^ s'}(a^+) \big]. \] which implies \[ M_1(s') \le M_2(s')\] where \(M_1\) and \(M_2\) are the CDFs of \(μ_1\) and \(μ_2\). Thus, \(μ_1 \succeq_s μ_2\).
Hence, for any weakly increasing function \(v \colon \ALPHABET S \to \reals\), \[ \sum_{s' \in \ALPHABET S} μ_1(s') v(s') \ge \sum_{s' \in \ALPHABET S} μ_2(s') v(s').\] Or, equivalently, \[H(s^+, a^+) + H(s^, a^) \ge H(s^, a^+) + H(s^+, a^)\] where \(H(s,a) = \EXP[ v(X_{t+1})  X_t = s, U_t = a]\).
Therefore, \(c_t(s,a) + H_t(s,a)\) is supermodular in \((s,a)\).
3 Constraints on actions
In the results above, we have assumed that the action set \(\ALPHABET A\) is the same for all states. The results also extend to the case when the action at state \(s\) must belong to some set \(\ALPHABET A(s)\) provided the following conditions are satisfied:
 For any \(s \ge s'\), \(\ALPHABET A(s) \supseteq \ALPHABET A(s')\)
 For any \(s \in \ALPHABET S\) and \(a \in \ALPHABET A(s)\), \(a' < a\) implies that \(a' \in \ALPHABET A(s)\).
4 Monotone dynamic programming
If we can establish that the optimal policy is monontone, then we can use this structure to implement the dynamic program more efficient. Suppose \(\ALPHABET S = \{1, \dots, n\}\) and \(\ALPHABET A = \{1, \dots. m\}\). The main idea is as follows. Suppose \(V_{t+1}(\cdot)\) has been caclulated. Insead of computing \(Q_t(s,a)\) and \(V_t(s)\), proceed as follows:
Set \(s = 1\) and \(α = 1\).
For all \(u \in \{α, \dots, m\}\), compute \(Q_t(s,a)\) as usual.
Compute
\[V_t(s) = \min_{ α \le a \le m } Q_t(s,a)\]
and set
\[π_t^*(s) = \max \{ a \in \{α, \dots, m\} : V_t(s) = Q_t(s,a) \}.\]
If \(s = n\), then stop. Otherwise, set \(α = π_t^*(s)\) and \(s = s+1\) and go to step 2.
5 Example: A machine replacement model
In this section, we use Theorem 4 to establish the structure of the optimal policy in a model of machine replacement.
Consider a manufacturing process, where the machine used for manufacturing deteriorates over time. Let \(\ALPHABET S = \{0, 1, \dots \}\) represent the condition of the machine. The higher the value of \(s\), the worse the condition of the equipment. Note that, for simplicity, we have assumed that the state space is countable.
At each decision epoch, the decision maker can choose actions from the set \(\ALPHABET A = \{0,1\}\). Action \(0\) corresponds to operating the equipment as is for an additional period, while action 1 corresponds to scrapping the equipment and replacing it with a new and identical piece of equipment. We assume that there is a PMF \(μ\) on \(\ALPHABET S\) such that \[ P_{sz}(0) = \begin{cases} 0, & z < s \\ μ_{z  s}, & z \ge s \end{cases} \quad\text{and}\quad P_sz(1) = μ_z. \] The transition matrix \(P(0)\) states that the equipment detiorates from \(s\) to \(s + k\), with probability \(μ_k\) for all states \(s\).
We assume that running the machine in state \(s\) costs \(h(s)\), where \(h(\cdot)\) is a weakly increasing function. Replacing the machine costs a constant amount~\(K\).
We now verify the conditions (C1)–(C4) for the model.
C1. For \(a = 0\), \(c(s,0) = h(s)\), which is weakly increasing by assumption. For \(a = 1\), \(c(s,1) = K\), which is trivially weakly increasing.
C2. For \(a = 0\), \(P(0)\) is stochastically monotone (because the CDF of \(P(\cdot  s, 0)\) lies above the CDF of \(P(\cdot  s+1, 0)\)). For \(a = 1\), all rows of \(P(1)\) are the same; therefore \(P(1)\) is stochastically monotone.
Since (C1) and (C2) are satisfied, by Theorem 1, we can assert that the value function is weakly increasing.
C3. \(c(s,1)  c(s,0) = K  h(s)\), which is weakly decreasing in \(s\). Therefore, \(c(s,a)\) is submodular in \((s,a)\).
C4. Recall that \(H(s's,a) = 1  \sum_{z \le s'} P_{sz}(a).\) Therefore,
\[H(s's,0) = 1  \sum_{z = s}^{s'} μ_{z s} = 1  \sum_{k = 0}^{s'  s} μ_k = 1  M_{s'  s},\] where \(M\) is the CMF of \(μ\), and \[H(s's,1) = 1  \sum_{z \le s'} μ_z = 1  M_{s'},\]
Therefore, \(H(s's,1)  H(s's,0) = M_{s's}  M_{s'}\). For any fixed \(s'\), \(H(s's,1)  H(s's,0)\) is weakly decreasing in \(s\). There \(H(s's,a)\) is submodular in \((s,a)\).
Since (C1)–(C4) are satisfied, the optimal policy is weakly increasing in~\(s\). Since there are only two actions, it means that for every time, there exists a state \(s^*_t\) with the property that if \(s\) exceeds \(s^*_t\), the optimal decision is to replace the machine; and if \(s \le s^*_t\), then the optimal decision is to operate the machine for another period.
Exercises
Consider the example of machine repair presented in notes on matrix formulation of MDPs. Prove that the optimal policy for that model is weakly increasing.
Suppose the state space \(\ALPHABET S\) is a symmetric subset of integers of the form \(\{L, L + 1, \dots, L1, L\}\) and the action space \(\ALPHABET A\) is discrete. Let \(\ALPHABET X_{\ge 0}\) denote the set \(\{0, \dots, L\}\).
Let \(P(a)\) denote the controlled transition matrix and \(c_t(s,a)\) denote the perstep cost. To avoid ambiguity, we define the optimal policy as \[ π^*_t(s) = \begin{cases} \max\bigl\{ a' \in \arg\min_{a \in \ALPHABET A} Q_t(s,a) \bigr\}, & \text{if } s \ge 0 \\ \min\bigl\{ a' \in \arg\min_{a \in \ALPHABET A} Q_t(s,a) \bigr\}, & \text{if } s < 0 \end{cases}\] The purpose of this exercise is to identify conditions under which the value function and the optimal policy are even and quasiconvex. We do so using the following steps.
We say that the transition probability matrix \(P(a)\) is even if for all \(s, s' \in \ALPHABET S\), \(P(s's,a) = P(s's,a)\). Prove the following result.
Suppose the MDP satisfies the following properties:
(A1) For every \(t\) and \(a \in \ALPHABET A\), \(c_t(s,a)\) is even function of \(s\).
(A2) For every \(a \in \ALPHABET A\), \(P(a)\) is even.
Then, for all \(t\), \(V_t\) and \(π_t\) are even functions.
Given any probability mass function \(μ\) on \(\ALPHABET S\), define the folded probability mass function \(\tilde μ\) on \(\ALPHABET X_{\ge 0}\) as follows: \[ \tilde μ(s) = \begin{cases} μ(0), & \text{if } s = 0 \\ μ(s) + μ(s), & \text{if } s > 0. \end{cases} \]
For ease of notation, we use \(\tilde μ = \mathcal F μ\) to denote this folding operation. Note that an immediate consequence of the definition is the following (you don’t have to prove this).
If \(f \colon \ALPHABET S \to \reals\) is even, then for any probability mass function \(μ\) on \(\ALPHABET S\) and \(\tilde μ = \mathcal F μ\), we have \[ \sum_{s \in \ALPHABET S} f(s) μ(s) = \sum_{s \in \ALPHABET X_{\ge 0}} f(s) \tilde μ(s). \]
Thus, the expectation of the function \(f \colon \ALPHABET S \to \reals\) with respect to the PMF \(μ\) is equal to the expectation of the function \(f \colon \ALPHABET X_{\ge 0} \to \reals\) with respect to the PMF \(\tilde μ = \mathcal F μ\).
Now given any probability transition matrix \(P\) on \(\ALPHABET S\), we can define a probability transition matrix \(\tilde P\) on \(\ALPHABET X_{\ge 0}\) as follows: for any \(s \in \ALPHABET S\), \(\tilde P_s = \mathcal F P_s\), where \(P_s\) denotes the \(s\)th row of \(P\). For ease of notation, we use \(\tilde P = \mathcal F P\) to denote this relationship.
Now prove the following:
Given the MDP \((\ALPHABET S, \ALPHABET A, P, \{c_t\})\), define the folded MDP as \((\ALPHABET S_{\ge 0}, \ALPHABET A, \tilde P, \{c_t\})\), where \(\tilde P(a) = \mathcal F P(a)\) for all \(a \in \ALPHABET A\). Let \(\tilde Q_t \colon \ALPHABET S_{\ge 0} \times \ALPHABET A \to \reals\), \(\tilde V_t \colon \ALPHABET S_{\ge 0} \to \reals\) and \(\tilde π_t^* \colon \ALPHABET S_{\ge 0} \to \ALPHABET A\) denote the actionvalue function, value function and the policy of the folded MDP. Then, if the original MDP satisfies conditions (A1) and (A2) then, for any \(s \in \ALPHABET S\) and \(a \in \ALPHABET A\), \[ Q_t(s,a) = \tilde Q_t(s, a), \quad V_t(s) = \tilde V_t(s), \quad π_t^*(s) = \tilde π_t^*(s). \]
The result of the previous part implies that if the value function \(\tilde V_t\) and the policy \(\tilde π^*_t\) are monotone increasing, then the value function \(V_t\) and the policy \(π^*_t\) are even and quasiconvex. This gives us a method to verify if the value function and optimal policy are even and quasiconvex.
Now, recall the model of the Internet of Things presented in Q2 of Assignment 3. The numerical experiments that you did in Assignment 3 suggest that the value function and the optimal policy are even and quasiconvex. Prove that this is indeed the case.
Now suppose the distribution of \(W_t\) is not Gaussian but is some general probability density \(\varphi(\cdot)\) and the cost function is \[ c(e,a) = \lambda a + (1  a) d(e). \] Find conditions on \(\varphi\) and \(d\) such that the value function and optimal policy are even and quasiconvex.
References
Ross (1974) has an early treatment of monotonicity of optimal policies. The general theory was developed by Topkis (1998). The presentation here follows Puterman (2014). Exercise 2 is from Chakravorty and Mahajan (2018).
This entry was last updated on 25 Aug 2022 and posted in MDP and tagged stochastic monotonicity, submodularity, monotone policies.