ECSE 506: Stochastic Control and Decision Theory
Aditya Mahajan
Winter 2022
About  Lectures  Notes  Coursework
Let \(\{X_t\}_{t \ge 1}\) be a Markov chain. At each time \(t\), a decision maker observes the state \(X_t\) of the Markov chain and decides whether to continue or stop the process. If the decision maker decides to continue, he incurs a continuation cost \(c_t(X_t)\) and the state evolves. If the DM decides to stop, he incurs a stopping cost of \(s_t(X_t)\) and the problem is terminated. The objective is to determine an optimal stopping time \(\tau\) to minimize \[J(\tau) := \EXP\bigg[ \sum_{t=1}^{\tau1} c_t(X_t) + s_\tau(X_\tau) \bigg].\]
Such problems are called Optimal stopping problems. These can be solved using dynamic programming as follows.
Define the costtogo function of any stopping rule as \[J_t(x; \tau) = \EXP\bigg[ \sum_{s = t}^{\tau  1} c_{\tau}(X_t) + s_\tau(X_\tau) \,\bigg\, \tau > t \bigg]\] and the value function as \[V_t(x) = \inf_{\tau} J_t(x; \tau). \] Then, it can be shown that the value functions satisfy the following recursion:
Dynamic Program for optimal stopping \[ \begin{align*} V_T(x) &= s_T(x) \\ V_t(x) &= \min\{ s_t(x), c_t(x) + \EXP[ V_{t+1}(X_{t+1})  X_t = x]. \end{align*}\]
Consider an optimal stopping problem and define the benefit function as the expected benefit^{1} of delaying the stopping decision at time \(t\), i.e., \[\begin{equation}\label{eq:B} B_t(x) = c_t(x) + \EXP[ V_{t+1}( X_{t+1})  X_t = x]  s_t(x). \end{equation}\] Thus, it is optimal to stop whenever \(B_t(x) \ge 0\).
Note that, we can write the value function in terms of the benefit function as follows: \[\begin{align} V_t(x) &= \min\{ s_t(x), B_t(x) + s_t(x) \} \nonumber \\ &= s_t(x) + [ B_t(x) ]^, \label{eq:V} \end{align}\] where \([y]^\) is a short hand for \(\min\{y, 0\}\).
Now, define the onestep lookahead function as the benefit of delaying the stopping decision by one step, i.e., \[\begin{equation}\label{eq:M} M_t(x) = c_t(x) + \EXP[ s_{t+1}(X_{t+1})  X_t = x]  s_t(x). \end{equation}\]
The benefit function and the onestep lookahead functions are related as follows. \[ B_T(x) = M_T(x) \] and \[ \begin{align*} B_t(x) &= c_t(x) + \EXP[ V_{t+1}(X_{t+1})  X_t = x]  s_t(x) \\ &= c_t(x) + \EXP[ s_{t+1}(X_{t+1}) + [B_{t+1}(X_{t+1})]^  X_t = x]  s_t(x) \\ &= M_t(x) + \EXP[ [B_{t+1}(X_{t+1}]^  X_t = x ]. \end{align*} \]
 Theorem 1

Suppose the state space is totally ordered and the following conditions hold.
 For all \(t\), \(M_t(x)\) is weakly increasing in \(x\).
 \(\{X_t\}_{t \ge 1}\) is stochastic monotone.
Then \(B_t(x)\) is weakly increasing in \(x\) for all \(t\) and there exists a sequence \(\{\lambda_t\}_{t \ge 1}\) such that it is optimal to stop at time \(t\) if and only if \(X_t \ge \lambda_t\).
 Remark

Let’s contrast the above result from the monotonicity of optimal policies in general MDPs. Here, in addition to stochastic monotonicity of the Markov chain, we only require the onestep lookahead function to be monotone. There is no assumption on the submodularity of the cost.
Proof
We first prove monotonicity of \(B_t(x)\). As usual, the proof is by backward induction. For \(t = T\), \(B_T(x) = M_T(x)\). This forms the basis of induction. Now assume that \(B_{t+1}(x)\) is increasing in \(x\) and consider the problem at time \(t\).
Since \(B_{t+1}(x)\) is increasing so is \([B_{t+1}(x)]^{}\). Moreover, since \(\{X_t\}_{t \ge 1}\) is stochastically monotone, \(\EXP[ [B_{t+1}(X_{t+1})]^  X_t = x]\) is increasing in \(x\). Therefore, \[ B_t(x) = M_t(x) + \EXP[ [B_{t+1}(X_{t+1})]^  X_t = x] \] is increasing in \(x\). Thus, by induction, \(B_t(x)\) is increasing in \(x\) for all \(t\).
Recall that it is optimal to stop iff \(B_t(x) \ge 0\). Since \(B_t(x)\) is increasing in \(x\), the optimal decision rule is of a threshold type.
1 Example: TimetoMarket Model
Consider a firm that decides when to introduce a new product. When the firm introduces the product earlier than the competition, it captures a larger market share. However, an early introduction results in high production costs and low profit margins due to low manufacturing yields. Hence, the firm needs to determine the optimal time to enter the market. Suppose that the total market demand \(D\) is deterministic. Let \(\{X_t\}_{t \ge 1}\) denote the number of competitors at time \(t\). It is assumed that \[ X_{t+1} = X_t + W_t, \] where \(\{W_t\}_{t \ge 1}\) is an independent process independent of \(X_1\).
Let \(r(x)\) denote the market share of the firm when it enters the market after the \(x\)th competitor. It is assumed that \(v(x)\) is decreasing and concave in \(x\).
Let \(p_*\) denote the sale price of the product and \(p_t\) denote the production cost at time \(t\). It is assumed that \(p_t\) decreases with \(t\).
The continuation reward is zero and the stopping reward at time \(t\) is \[ s_t(x) = r(x)(p_*  p_t) D. \] When should the firm enter the market?
First observe that \(\{X_t\}_{t \ge 1}\) is a monotone process. Now consider the one step lookahead function \[ \begin{align*} M_t(x) &= \EXP[ s_{t+1}(x + W) ]  s_t(x) \\ &= \EXP[ r(x + W) (p_*  p_{t+1}) D ]  r(x)(p_*  p_t) D \\ &= \EXP[ r(x + W)  r(x) ] (p_*  p_{t+1}) D + r(x)( p_*  p_{t+1})D  r(x)(p_*  p_t ) D \\ &= \EXP[ r(x + W)  r(x) ] (p_*  p_{t+1}) D + r(x) (p_t  p_{t+1}) D. \end{align*} \] Since \(r(x)\) is concave, the first term is decreasing in \(x\). The second term is also decreasing in \(x\) because \(r(x)\) is decreasing in \(x\) and \(p_t \ge p_{t+1}\). Therefore, \(M_t(x)\) is decreasing in \(x\). Hence, by the above theorem[^2], there exist a sequence of thresholds \(\{\lambda_t\}_{t \ge 1}\) such that the firm should enter the market at time \(t\) iff \(X_t \ge \lambda_t\).
Exercises
Derive a version of Theorem 1 where \(M_t(x)\) is weakly decreasing in \(x\).
Derive a version of Theorem 1 for an optimal stopping problem where the objective is reward maximization instead of cost minimization. In particular, assume that \(c_t\) denotes the continuation reward and \(s_t\) denote the stopping reward at time \(t\). Define the benefit function \(B_t(x)\) and the onestep lookahead function \(M_t(x)\) as above.
 Write the benefit function in terms of the onestep lookahead function.
 Derive a version similar to Theorem 1 assuming \(M_t(x)\) is increasing in \(x\).
Selling an Asset
Consider the decision problem by a person selling an asset. Let \(W_t\) denote the offer received by the person at time \(t\). We assume that \(\{W_t\}_{t \ge 1}\) is an i.i.d. process. If the person sells the asset at time \(t\), then he receives a reward equal to the best offer received so far, i.e., \(\max\{W_1, \dots, W_t\}\). If he decides to continue, then he has to pay a continuation cost of \(c\). Show that there exist a sequence of thresholds \(\{λ_t\}_{t \ge 1}\) such that the optimal strategy is to sell the asset when \(\max\{W_1, \dots, W_t\} \ge λ_t\).
References
The result presented in this section is taken from Oh and Özer (2016).
The terminology comes from reward maximization problems. In cost minimization problems, this may be thought of as the disadvantage function.↩︎
This entry was last updated on 31 Mar 2020 and posted in MDP and tagged optimal stopping, threshold policies, monotone policies.