ECSE 506: Stochastic Control and Decision Theory

Aditya Mahajan
Winter 2022

About  |  Lectures  |  Notes  |  Coursework

Theory: Optimality of threshold policies in optimal stopping

Let \(\{X_t\}_{t \ge 1}\) be a Markov chain. At each time \(t\), a decision maker observes the state \(X_t\) of the Markov chain and decides whether to continue or stop the process. If the decision maker decides to continue, he incurs a continuation cost \(c_t(X_t)\) and the state evolves. If the DM decides to stop, he incurs a stopping cost of \(s_t(X_t)\) and the problem is terminated. The objective is to determine an optimal stopping time \(\tau\) to minimize \[J(\tau) := \EXP\bigg[ \sum_{t=1}^{\tau-1} c_t(X_t) + s_\tau(X_\tau) \bigg].\]

Such problems are called Optimal stopping problems. These can be solved using dynamic programming as follows.

Define the cost-to-go function of any stopping rule as \[J_t(x; \tau) = \EXP\bigg[ \sum_{s = t}^{\tau - 1} c_{\tau}(X_t) + s_\tau(X_\tau) \,\bigg|\, \tau > t \bigg]\] and the value function as \[V_t(x) = \inf_{\tau} J_t(x; \tau). \] Then, it can be shown that the value functions satisfy the following recursion:

Dynamic Program for optimal stopping \[ \begin{align*} V_T(x) &= s_T(x) \\ V_t(x) &= \min\{ s_t(x), c_t(x) + \EXP[ V_{t+1}(X_{t+1}) | X_t = x]. \end{align*}\]

Consider an optimal stopping problem and define the benefit function as the expected benefit1 of delaying the stopping decision at time \(t\), i.e., \[\begin{equation}\label{eq:B} B_t(x) = c_t(x) + \EXP[ V_{t+1}( X_{t+1}) | X_t = x] - s_t(x). \end{equation}\] Thus, it is optimal to stop whenever \(B_t(x) \ge 0\).

Note that, we can write the value function in terms of the benefit function as follows: \[\begin{align} V_t(x) &= \min\{ s_t(x), B_t(x) + s_t(x) \} \nonumber \\ &= s_t(x) + [ B_t(x) ]^-, \label{eq:V} \end{align}\] where \([y]^-\) is a short hand for \(\min\{y, 0\}\).

Now, define the one-step look-ahead function as the benefit of delaying the stopping decision by one step, i.e., \[\begin{equation}\label{eq:M} M_t(x) = c_t(x) + \EXP[ s_{t+1}(X_{t+1}) | X_t = x] - s_t(x). \end{equation}\]

The benefit function and the one-step look-ahead functions are related as follows. \[ B_T(x) = M_T(x) \] and \[ \begin{align*} B_t(x) &= c_t(x) + \EXP[ V_{t+1}(X_{t+1}) | X_t = x] - s_t(x) \\ &= c_t(x) + \EXP[ s_{t+1}(X_{t+1}) + [B_{t+1}(X_{t+1})]^- | X_t = x] - s_t(x) \\ &= M_t(x) + \EXP[ [B_{t+1}(X_{t+1}]^- | X_t = x ]. \end{align*} \]

Theorem 1

Suppose the state space is totally ordered and the following conditions hold.

  1. For all \(t\), \(M_t(x)\) is weakly increasing in \(x\).
  2. \(\{X_t\}_{t \ge 1}\) is stochastic monotone.

Then \(B_t(x)\) is weakly increasing in \(x\) for all \(t\) and there exists a sequence \(\{\lambda_t\}_{t \ge 1}\) such that it is optimal to stop at time \(t\) if and only if \(X_t \ge \lambda_t\).


Let’s contrast the above result from the monotonicity of optimal policies in general MDPs. Here, in addition to stochastic monotonicity of the Markov chain, we only require the one-step look-ahead function to be monotone. There is no assumption on the submodularity of the cost.


We first prove monotonicity of \(B_t(x)\). As usual, the proof is by backward induction. For \(t = T\), \(B_T(x) = M_T(x)\). This forms the basis of induction. Now assume that \(B_{t+1}(x)\) is increasing in \(x\) and consider the problem at time \(t\).

Since \(B_{t+1}(x)\) is increasing so is \([B_{t+1}(x)]^{-}\). Moreover, since \(\{X_t\}_{t \ge 1}\) is stochastically monotone, \(\EXP[ [B_{t+1}(X_{t+1})]^- | X_t = x]\) is increasing in \(x\). Therefore, \[ B_t(x) = M_t(x) + \EXP[ [B_{t+1}(X_{t+1})]^- | X_t = x] \] is increasing in \(x\). Thus, by induction, \(B_t(x)\) is increasing in \(x\) for all \(t\).

Recall that it is optimal to stop iff \(B_t(x) \ge 0\). Since \(B_t(x)\) is increasing in \(x\), the optimal decision rule is of a threshold type.

1 Example: Time-to-Market Model

Consider a firm that decides when to introduce a new product. When the firm introduces the product earlier than the competition, it captures a larger market share. However, an early introduction results in high production costs and low profit margins due to low manufacturing yields. Hence, the firm needs to determine the optimal time to enter the market. Suppose that the total market demand \(D\) is deterministic. Let \(\{X_t\}_{t \ge 1}\) denote the number of competitors at time \(t\). It is assumed that \[ X_{t+1} = X_t + W_t, \] where \(\{W_t\}_{t \ge 1}\) is an independent process independent of \(X_1\).

Let \(r(x)\) denote the market share of the firm when it enters the market after the \(x\)-th competitor. It is assumed that \(v(x)\) is decreasing and concave in \(x\).

Let \(p_*\) denote the sale price of the product and \(p_t\) denote the production cost at time \(t\). It is assumed that \(p_t\) decreases with \(t\).

The continuation reward is zero and the stopping reward at time \(t\) is \[ s_t(x) = r(x)(p_* - p_t) D. \] When should the firm enter the market?

First observe that \(\{X_t\}_{t \ge 1}\) is a monotone process. Now consider the one step look-ahead function \[ \begin{align*} M_t(x) &= \EXP[ s_{t+1}(x + W) ] - s_t(x) \\ &= \EXP[ r(x + W) (p_* - p_{t+1}) D ] - r(x)(p_* - p_t) D \\ &= \EXP[ r(x + W) - r(x) ] (p_* - p_{t+1}) D + r(x)( p_* - p_{t+1})D - r(x)(p_* - p_t ) D \\ &= \EXP[ r(x + W) - r(x) ] (p_* - p_{t+1}) D + r(x) (p_t - p_{t+1}) D. \end{align*} \] Since \(r(x)\) is concave, the first term is decreasing in \(x\). The second term is also decreasing in \(x\) because \(r(x)\) is decreasing in \(x\) and \(p_t \ge p_{t+1}\). Therefore, \(M_t(x)\) is decreasing in \(x\). Hence, by the above theorem[^2], there exist a sequence of thresholds \(\{\lambda_t\}_{t \ge 1}\) such that the firm should enter the market at time \(t\) iff \(X_t \ge \lambda_t\).


  1. Derive a version of Theorem 1 where \(M_t(x)\) is weakly decreasing in \(x\).

  2. Derive a version of Theorem 1 for an optimal stopping problem where the objective is reward maximization instead of cost minimization. In particular, assume that \(c_t\) denotes the continuation reward and \(s_t\) denote the stopping reward at time \(t\). Define the benefit function \(B_t(x)\) and the one-step look-ahead function \(M_t(x)\) as above.

    1. Write the benefit function in terms of the one-step look-ahead function.
    2. Derive a version similar to Theorem 1 assuming \(M_t(x)\) is increasing in \(x\).
  3. Selling an Asset

    Consider the decision problem by a person selling an asset. Let \(W_t\) denote the offer received by the person at time \(t\). We assume that \(\{W_t\}_{t \ge 1}\) is an i.i.d. process. If the person sells the asset at time \(t\), then he receives a reward equal to the best offer received so far, i.e., \(\max\{W_1, \dots, W_t\}\). If he decides to continue, then he has to pay a continuation cost of \(c\). Show that there exist a sequence of thresholds \(\{λ_t\}_{t \ge 1}\) such that the optimal strategy is to sell the asset when \(\max\{W_1, \dots, W_t\} \ge λ_t\).


The result presented in this section is taken from Oh and Özer (2016).

Oh, S. and Özer, Ö. 2016. Characterizing the structure of optimal stopping policies. Production and Operations Management 25, 11, 1820–1838. DOI: 10.1111/poms.12579.

  1. The terminology comes from reward maximization problems. In cost minimization problems, this may be thought of as the disadvantage function.↩︎

This entry was last updated on 31 Mar 2020 and posted in MDP and tagged optimal stopping, threshold policies, monotone policies.