11  Optimal stopping

Updated

October 2, 2024

Keywords

optimal stopping, threshold policies, monotone policies

Let $\{S_t\}_{t \ge 1}$ be a Markov chain. At each time $t$, a decision maker observes the state $S_t$ of the Markov chain and decides whether to continue or stop the process. If the decision maker decides to continue, he incurs a continuation cost $c_t(S_t)$ and the state evolves. If the DM decides to stop, he incurs a stopping cost of $d_t(S_t)$ and the problem is terminated. The objective is to determine an optimal stopping time $\tau$ to minimize
$$J(\tau) := \mathbb{E}\bigg[ \sum_{t=1}^{\tau-1} c_t(S_t) + d_\tau(S_\tau) \bigg].$$

Such problems are called optimal stopping problems. They can be solved using dynamic programming as follows.

Define the cost-to-go function of any stopping rule $\tau$ as
$$J_{t_0}(s; \tau) = \mathbb{E}\bigg[ \sum_{t=t_0}^{\tau-1} c_t(S_t) + d_\tau(S_\tau) \,\bigg|\, S_{t_0} = s, \tau > t_0 \bigg]$$
and the value function as $V_t(s) = \inf_\tau J_t(s; \tau)$. Then, it can be shown that the value functions satisfy the following recursion:

Dynamic Program for optimal stopping

$$V_T(s) = d_T(s), \qquad
V_t(s) = \min\bigl\{ d_t(s),\; c_t(s) + \mathbb{E}[ V_{t+1}(S_{t+1}) \mid S_t = s ] \bigr\}.$$
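For a finite state space, this recursion is straightforward to implement. The sketch below is purely illustrative: the transition matrix `P` and the cost arrays `c` and `d` are hypothetical inputs, not part of the model above.

```python
import numpy as np

def solve_optimal_stopping(P, c, d):
    """Backward induction for a finite-horizon optimal stopping problem.

    P : (n, n) transition matrix of the Markov chain.
    c : (T, n) continuation costs, c[t, s] = c_t(s).
    d : (T, n) stopping costs, d[t, s] = d_t(s).
    Returns the value functions V[t, s] and a boolean rule stop[t, s].
    """
    T, n = d.shape
    V = np.zeros((T, n))
    stop = np.zeros((T, n), dtype=bool)
    V[-1], stop[-1] = d[-1], True          # V_T(s) = d_T(s)
    for t in range(T - 2, -1, -1):
        cont = c[t] + P @ V[t + 1]         # c_t(s) + E[V_{t+1}(S_{t+1}) | S_t = s]
        V[t] = np.minimum(d[t], cont)
        stop[t] = d[t] <= cont             # stop when stopping is no costlier
    return V, stop
```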

Consider an optimal stopping problem and define the benefit (or advantage¹) function as the expected benefit of delaying the stopping decision at time $t$, i.e.,
$$B_t(s) = c_t(s) + \mathbb{E}[ V_{t+1}(S_{t+1}) \mid S_t = s ] - d_t(s). \tag{1}$$
Thus, it is optimal to stop whenever $B_t(s) \ge 0$.

¹ The terminology comes from reward maximization problems. In cost minimization problems, this may be thought of as the disadvantage function.

Note that we can write the value function in terms of the benefit function as follows:
$$V_t(s) = \min\{ d_t(s),\; B_t(s) + d_t(s) \} = d_t(s) + [B_t(s)]^-, \tag{2}$$
where $[y]^-$ is shorthand for $\min\{y, 0\}$.

Now, define the one-step look-ahead function as the benefit of delaying the stopping decision by one step, i.e.,
$$M_t(s) = c_t(s) + \mathbb{E}[ d_{t+1}(S_{t+1}) \mid S_t = s ] - d_t(s). \tag{3}$$

The benefit function and the one-step look-ahead function are related as follows: $B_T(s) = M_T(s)$ and
$$\begin{aligned}
B_t(s) &= c_t(s) + \mathbb{E}[ V_{t+1}(S_{t+1}) \mid S_t = s ] - d_t(s) \\
&= c_t(s) + \mathbb{E}\bigl[ d_{t+1}(S_{t+1}) + [B_{t+1}(S_{t+1})]^- \,\big|\, S_t = s \bigr] - d_t(s) \\
&= M_t(s) + \mathbb{E}\bigl[ [B_{t+1}(S_{t+1})]^- \,\big|\, S_t = s \bigr].
\end{aligned}$$

Theorem 11.1 Suppose the state space is totally ordered and the following conditions hold.

  1. For all $t$, $M_t(s)$ is weakly increasing in $s$.
  2. $\{S_t\}_{t \ge 1}$ is stochastically monotone.

Then $B_t(s)$ is weakly increasing in $s$ for all $t$, and there exists a sequence of thresholds $\{\lambda_t\}_{t \ge 1}$ such that it is optimal to stop at time $t$ if and only if $S_t \ge \lambda_t$.

We first prove monotonicity of $B_t(s)$. As usual, the proof is by backward induction. For $t = T$, $B_T(s) = M_T(s)$, which is weakly increasing in $s$ by assumption. This forms the basis of induction. Now assume that $B_{t+1}(s)$ is increasing in $s$ and consider the problem at time $t$.

Since $B_{t+1}(s)$ is increasing, so is $[B_{t+1}(s)]^-$. Moreover, since $\{S_t\}_{t \ge 1}$ is stochastically monotone, $\mathbb{E}[\,[B_{t+1}(S_{t+1})]^- \mid S_t = s]$ is increasing in $s$. Therefore, $B_t(s) = M_t(s) + \mathbb{E}[\,[B_{t+1}(S_{t+1})]^- \mid S_t = s]$ is increasing in $s$. Thus, by induction, $B_t(s)$ is increasing in $s$ for all $t$.

Recall that it is optimal to stop iff $B_t(s) \ge 0$. Since $B_t(s)$ is increasing in $s$, the optimal decision rule is of the threshold type, with $\lambda_t = \min\{ s : B_t(s) \ge 0 \}$.
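The recursion $B_t(s) = M_t(s) + \mathbb{E}[\,[B_{t+1}(S_{t+1})]^- \mid S_t = s]$ gives a direct way to compute the thresholds numerically. The sketch below assumes a hypothetical finite, totally ordered state space with a stochastically monotone transition matrix `P` and look-ahead values `M` satisfying the conditions of Theorem 11.1.

```python
import numpy as np

def thresholds_from_lookahead(P, M):
    """Compute benefit functions and stopping thresholds.

    P : (n, n) stochastically monotone transition matrix.
    M : (T, n) one-step look-ahead values, M[t, s] increasing in s.
    Uses B_T = M_T and B_t = M_t + E[ [B_{t+1}(S_{t+1})]^- | S_t = s ].
    Returns B[t, s] and, for each t, lambda_t = min{s : B_t(s) >= 0}.
    """
    T, n = M.shape
    B = np.zeros((T, n))
    B[-1] = M[-1]
    for t in range(T - 2, -1, -1):
        B[t] = M[t] + P @ np.minimum(B[t + 1], 0.0)
    lam = [int(np.argmax(B[t] >= 0)) if (B[t] >= 0).any() else None  # None: never stop
           for t in range(T)]
    return B, lam
```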

Remark on the assumptions

Let’s contrast the result of Theorem 11.1 with the monotonicity of optimal policies in general MDPs. Here, in addition to stochastic monotonicity of the Markov chain, we only require the one-step look-ahead function to be monotone. There is no assumption on the submodularity of the cost.

11.1 Example: Time-to-Market Model

Consider a firm that decides when to introduce a new product. When the firm introduces the product earlier than the competition, it captures a larger market share. However, an early introduction results in high production costs and low profit margins due to low manufacturing yields. Hence, the firm needs to determine the optimal time to enter the market. Suppose that the total market demand $D$ is deterministic. Let $\{S_t\}_{t \ge 1}$ denote the number of competitors at time $t$. It is assumed that $S_{t+1} = S_t + W_t$, where $\{W_t\}_{t \ge 1}$ is an independent process that is also independent of $S_1$.

Let $r(s)$ denote the market share of the firm when it enters the market after the $s$-th competitor. It is assumed that $r(s)$ is decreasing and concave in $s$.

Let $p$ denote the sale price of the product and $p_t$ denote the production cost at time $t$. It is assumed that $p_t$ decreases with $t$.

The continuation reward is zero and the stopping reward at time $t$ is $d_t(s) = r(s)(p - p_t)D$. When should the firm enter the market?

First observe that $\{S_t\}_{t \ge 1}$ is a monotone process. Now consider the one-step look-ahead function:
$$\begin{aligned}
M_t(s) &= \mathbb{E}[ d_{t+1}(s + W) ] - d_t(s) \\
&= \mathbb{E}[ r(s + W) ](p - p_{t+1})D - r(s)(p - p_t)D \\
&= \mathbb{E}[ r(s + W) - r(s) ](p - p_{t+1})D + r(s)(p - p_{t+1})D - r(s)(p - p_t)D \\
&= \mathbb{E}[ r(s + W) - r(s) ](p - p_{t+1})D + r(s)(p_t - p_{t+1})D.
\end{aligned}$$
Since $r(s)$ is concave, the first term is decreasing in $s$. The second term is also decreasing in $s$ because $r(s)$ is decreasing in $s$ and $p_t \ge p_{t+1}$. Therefore, $M_t(s)$ is decreasing in $s$. Hence, by the analogue of Theorem 11.1 for reward maximization with a decreasing look-ahead function (see Exercises 11.1 and 11.2), there exists a sequence of thresholds $\{\lambda_t\}_{t \ge 1}$ such that the firm should enter the market at time $t$ iff $S_t \ge \lambda_t$.
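As a sanity check, one can compute the thresholds numerically for specific primitives. The choices below ($r$ linear, hence weakly concave; Bernoulli competitor arrivals; a geometrically decaying production cost) are hypothetical, purely for illustration.

```python
import numpy as np

T, n, q, p, D = 10, 25, 0.3, 1.0, 1.0
s = np.arange(n)
r = 1.0 - 0.04 * s                              # market share r(s): decreasing, weakly concave
p_t = lambda t: 0.2 + 0.6 * 0.8 ** t            # production cost, decreasing in t
d = np.array([r * (p - p_t(t)) * D for t in range(T)])   # stopping reward d_t(s)

V = d[-1].copy()                                # at the horizon, the firm must enter
lam = [0] * T
for t in range(T - 2, -1, -1):
    # E[V_{t+1}(s + W_t)] with W_t ~ Bernoulli(q); the top state is absorbing
    cont = (1 - q) * V + q * np.append(V[1:], V[-1])
    enter = d[t] >= cont
    V = np.maximum(d[t], cont)
    lam[t] = int(np.argmax(enter)) if enter.any() else n
print(lam)   # enter the market at time t iff S_t >= lam[t]
```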

11.2 Example: Optimal choice of the best alternative

A decision maker (DM) wants to select the best alternative from a set of $T$ alternatives. The DM evaluates the alternatives sequentially. After evaluating alternative $t$, the DM knows whether alternative $t$ was the best alternative so far or not. Based on this information, the DM must decide whether to choose alternative $t$ and stop the search, or to permanently reject alternative $t$ and evaluate the remaining alternatives. The DM may reject the last alternative and not make a choice at all. All alternatives are equally likely to be the best. Find the optimal choice strategy that maximizes the probability of picking the best alternative.

This optimization problem is known by different names, including the secretary problem (in which the alternatives correspond to finding the best candidate for a secretarial position), the marriage problem (in which the alternatives correspond to finding the best spouse), the game of googol (in which the alternatives correspond to finding the biggest number), the parking problem (in which the alternatives correspond to finding the nearest parking spot), and so on.

We can view this as an optimal stopping problem with binary state $S_t$, where $S_t = 1$ means that alternative $t$ is the best alternative so far. Thus, $\{S_t\}_{t \ge 1}$ is an independent process with $\mathbb{P}(S_t = 1) = 1/t$. The continuation reward is zero. The DM receives a stopping reward only if the current alternative is the best, i.e., the current alternative is the best so far ($S_t = 1$) and better than all future alternatives ($S_\tau = 0$ for all $\tau > t$). Thus, the expected stopping reward conditioned on $S_t$ is
$$d_t(s) = \mathbb{1}\{s = 1\}\, \mathbb{P}(S_{t+1:T} = 0 \mid S_t = s) = \frac{s\,t}{T}.$$
Thus, the optimal strategy is given by the solution of the following dynamic program.

Dynamic program

$$V_{T+1}(s) = 0, \qquad
V_t(s) = \max\Bigl\{ \frac{s\,t}{T},\; \mathbb{E}[ V_{t+1}(S_{t+1}) ] \Bigr\}.$$

Lemma 11.1 Define $L_t = V_t(0) = \frac{t}{t+1} V_{t+1}(0) + \frac{1}{t+1} V_{t+1}(1)$. Then, $V_t(1) = \max\bigl\{ \frac{t}{T}, L_t \bigr\}$ and, therefore,
$$L_t - L_{t+1} = \Bigl[ \frac{1}{T} - \frac{L_{t+1}}{t+1} \Bigr]^+ \quad \text{with } L_T = 0,$$
where $[y]^+$ is shorthand for $\max\{y, 0\}$.

Proof

The result follows immediately from the definition of $L_t$.

Note that it is never optimal to select an alternative if it is not the best so far (i.e., $S_t = 0$). Thus, we can completely characterize an optimal strategy by solving for $\{L_t\}_{t=1}^{T}$ in a backward manner.
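The backward recursion for $\{L_t\}$ takes only a few lines to implement. Here is a minimal sketch; the printed values for $T = 100$ are a standard check (critical time 38, success probability about 0.371).

```python
def secretary(T):
    """Solve L_t = L_{t+1} + max(1/T - L_{t+1}/(t+1), 0) backward from L_T = 0.

    Returns [L_1, ..., L_T] and the critical time t0, the first time at
    which stopping on a best-so-far alternative (reward t/T) is optimal.
    """
    L = [0.0] * (T + 1)                  # L[t] holds L_t; L[T] = 0
    for t in range(T - 1, 0, -1):
        L[t] = L[t + 1] + max(1.0 / T - L[t + 1] / (t + 1), 0.0)
    t0 = next(t for t in range(1, T + 1) if t / T >= L[t])
    return L[1:], t0

L, t0 = secretary(100)
print(t0, L[0])     # expect t0 = 38 and L_1 ≈ 0.371 (close to 1/e)
```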

Theorem 11.2  

  1. There exists a critical time $t_0 < T$ such that it is optimal to reject all alternatives up to time $t_0 - 1$ and then select the first alternative that is superior to all previous ones, if it occurs.

  2. The critical time $t_0$ is the smallest integer such that $\sum_{k=t_0}^{T-1} \frac{1}{k} < 1$.

  3. The value functions are given by
$$L_t = \begin{cases}
\dfrac{t}{T} \displaystyle\sum_{k=t}^{T-1} \frac{1}{k}, & \text{for } t \ge t_0 - 1, \\[1ex]
L_{t_0 - 1}, & \text{for } t < t_0 - 1.
\end{cases}$$

  4. For large $T$, $t_0 \approx T/e$ and the probability of selecting the best alternative is approximately $1/e$.

Note that $L_t - L_{t+1} \ge 0$. Thus, $L_t$ is decreasing with time.

Recall that $V_t(1) = \max\{ t/T, L_t \}$, where the first term is increasing with time and the second term is decreasing with time. Thus, the critical time $t_0$ is the first time when $t/T \ge L_t$. Since $L_T = 0$ and $T/T = 1$, such a $t_0 < T$ exists.

For any $t < t_0$ (i.e., $t/T < L_t$),
$$L_{t-1} = L_t + \Bigl[ \frac{1}{T} - \frac{L_t}{t} \Bigr]^+ = L_t.$$

For any $t \ge t_0$ (i.e., $t/T \ge L_t$), we have $(t+1)/T \ge L_{t+1}$. Therefore,
$$L_t = L_{t+1} + \Bigl[ \frac{1}{T} - \frac{L_{t+1}}{t+1} \Bigr]^+
= L_{t+1} + \frac{1}{T} - \frac{L_{t+1}}{t+1}
= \frac{t}{T} \Bigl[ \frac{1}{t} + \frac{T}{t+1} L_{t+1} \Bigr].$$
This expression can be shown to equal the form given in point 3 by induction: substituting $L_{t+1} = \frac{t+1}{T} \sum_{k=t+1}^{T-1} \frac{1}{k}$ gives $L_t = \frac{t}{T} \bigl[ \frac{1}{t} + \sum_{k=t+1}^{T-1} \frac{1}{k} \bigr] = \frac{t}{T} \sum_{k=t}^{T-1} \frac{1}{k}$. The same computation applies at $t = t_0 - 1$ (since $t_0/T \ge L_{t_0}$), which extends the closed form to $t \ge t_0 - 1$.

For large $T$, $\sum_{k=t}^{T-1} \frac{1}{k} \approx \int_t^T \frac{1}{k}\, dk = \log\frac{T}{t}$. Thus, $t_0 \approx T/e$. Moreover, $V_1(0) = V_1(1) = L_1 = L_{t_0 - 1} \approx \frac{t_0}{T} \approx \frac{1}{e}$.

11.3 Example: Call options

An investor has a call option to buy one share of a stock at a fixed price of $p$ dollars and has $T$ days to exercise this option. For simplicity, we assume that the investor makes a decision at the beginning of each day.

The investor may decide not to exercise the option, but if he does exercise it when the stock price is $s$, he effectively gets $(s - p)$.

Assume that the price of the stock evolves with independent increments, i.e., the price on day $t+1$ is $S_{t+1} = S_t + W_t$, where $\{W_t\}_{t \ge 1}$ is an i.i.d. process with mean $\mu$.

Let $\tau$ denote the stopping time when the investor exercises his option. Then the optimization problem is to maximize $\mathbb{E}\bigl[ (S_\tau - p)\, \mathbb{1}\{\tau \le T\} \bigr]$.

Since this is an optimal stopping problem with perfect state observation, the optimal strategy is given by the solution of the following dynamic program.

Dynamic program

$$V_T(s) = \max\{ s - p, 0 \}, \qquad
V_t(s) = \max\bigl\{ s - p,\; \mathbb{E}[ V_{t+1}(s + W) ] \bigr\}.$$

Lemma 11.2  

  1. For all $t$, $V_t(s)$ is increasing in $s$.
  2. For all $t$, $V_t(s) - s$ is decreasing in $s$.
  3. For all $s$, $V_t(s) \ge V_{t+1}(s)$.

The first property follows immediately from the monotonicity of the terminal reward and the monotonicity of the dynamics. Since the model is time-homogeneous, to show the third property it suffices to show that $V_{T-1}(s) \ge V_T(s)$. Observe that
$$V_{T-1}(s) = \max\{ s - p, \mathbb{E}[ V_T(s + W) ] \} \ge \max\{ s - p, 0 \} = V_T(s),$$
where the inequality uses $V_T(s) \ge 0$.

Now we prove the second property using backward induction. At $t = T$, $V_T(s) - s = \max\{ -p, -s \}$, which is decreasing in $s$. This forms the basis of induction. Now assume that $V_{t+1}(s) - s$ is decreasing in $s$. Then,
$$V_t(s) - s = \max\bigl\{ -p,\; \mathbb{E}[ V_{t+1}(s + W) ] - s \bigr\}
= \max\bigl\{ -p,\; \mathbb{E}[ V_{t+1}(s + W) - (s + W) ] + \mathbb{E}[W] \bigr\}.$$
By the induction hypothesis, the second term is decreasing in $s$. The maximum of a constant and a decreasing function is decreasing in $s$. Thus, $V_t(s) - s$ is decreasing in $s$. This completes the induction step.

Lemma 11.3 At any time $t$, if it is optimal to exercise the option when the stock price is $s$, then it is optimal to exercise it at all $s' \ge s$.

Since it is optimal to exercise at $s$, we must have
$$s - p \ge \mathbb{E}[ V_{t+1}(s + W) ]. \tag{4}$$
Since $V_{t+1}(s) - s$ is decreasing in $s$ (Lemma 11.2), we have that for any $s' \ge s$,
$$\mathbb{E}[ V_{t+1}(s' + W) - s' ] \le \mathbb{E}[ V_{t+1}(s + W) - s ] \le -p, \tag{5}$$
where the last inequality follows from (4). Eq. (5) implies that $\mathbb{E}[ V_{t+1}(s' + W) ] \le s' - p$. Thus, the stopping action is also optimal at $s'$.

Theorem 11.3 The optimal strategy is of the threshold type. In particular, there exist numbers $\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_T$ such that it is optimal to exercise the option at time $t$ if and only if $s_t \ge \alpha_t$.

Let $D_t = \{ s : \pi_t(s) = 1 \}$ denote the set of states where it is optimal to stop at time $t$. The previous lemma shows that $D_t$ is of the form $[\alpha_t, \infty)$, where $\alpha_t = \min\{ s : \pi_t(s) = 1 \}$ and we set $\alpha_t = \infty$ if it is not optimal to stop in any state. This proves the threshold property.

To show that the thresholds are decreasing with time, it suffices to show that $D_t \subseteq D_{t+1}$. Suppose $s \in D_t$. Then,
$$s - p \ge \mathbb{E}[ V_{t+1}(s + W) ] \ge \mathbb{E}[ V_{t+2}(s + W) ], \tag{6}$$
where the first inequality follows because $s \in D_t$ and the second because $V_{t+1}(s) \ge V_{t+2}(s)$ (Lemma 11.2). Eq. (6) implies that $s \in D_{t+1}$. Hence, $D_t \subseteq D_{t+1}$ and, therefore, the optimal thresholds are decreasing.
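The threshold structure is easy to see numerically. The sketch below implements the dynamic program on a discretized price grid; the strike, grid, horizon, and the three-point distribution for $W$ (chosen with negative mean so that early exercise can be optimal) are all hypothetical.

```python
import numpy as np

p_strike, T = 10.0, 20
grid = np.linspace(0.0, 30.0, 301)                     # discretized stock prices
w_vals = np.array([-1.0, 0.0, 0.5])                    # support of W (mean < 0)
w_probs = np.array([0.4, 0.3, 0.3])

V = np.maximum(grid - p_strike, 0.0)                   # V_T(s) = max{s - p, 0}
alpha = [np.inf] * T
alpha[-1] = p_strike                                   # at T, exercise iff s >= p
for t in range(T - 2, -1, -1):
    # E[V_{t+1}(s + W)] by linear interpolation on the grid (clamped at edges)
    cont = sum(q * np.interp(grid + w, grid, V) for w, q in zip(w_vals, w_probs))
    exercise = (grid - p_strike) >= cont
    V = np.maximum(grid - p_strike, cont)
    if exercise.any():
        alpha[t] = grid[np.argmax(exercise)]           # alpha_t = min{s : exercise optimal}
print(alpha)   # the thresholds should be decreasing in t, as in Theorem 11.3
```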

Exercises

Exercise 11.1 Derive a version of Theorem 11.1 where $M_t(s)$ is weakly decreasing in $s$.

Exercise 11.2 Derive a version of Theorem 11.1 for an optimal stopping problem where the objective is reward maximization instead of cost minimization. In particular, assume that $c_t$ denotes the continuation reward and $d_t$ denotes the stopping reward at time $t$. Define the benefit function $B_t(s)$ and the one-step look-ahead function $M_t(s)$ as above.

  1. Write the benefit function in terms of the one-step look-ahead function.
  2. Derive a result similar to Theorem 11.1, assuming that $M_t(s)$ is increasing in $s$.

Exercise 11.3 (Selling an asset) Consider the decision problem faced by a person selling an asset. Let $W_t$ denote the offer received by the person at time $t$. We assume that $\{W_t\}_{t \ge 1}$ is an i.i.d. process. If the person sells the asset at time $t$, then he receives a reward equal to the best offer received so far, i.e., $\max\{W_1, \dots, W_t\}$. If he decides to continue, then he has to pay a continuation cost of $c$. Show that there exists a sequence of thresholds $\{\lambda_t\}_{t \ge 1}$ such that the optimal strategy is to sell the asset at time $t$ iff $\max\{W_1, \dots, W_t\} \ge \lambda_t$.

Notes

Theorem 11.1 and the time-to-market model are taken from Oh and Özer (2016).

The solution of the secretary problem presented here is due to Lindley (1961). For a history of the secretary problem and some generalizations, see Ferguson (1989) and Freeman (1983).

The above model for pricing call options was introduced by Taylor (1967).
