# ECSE 506: Stochastic Control and Decision Theory

## Aditya Mahajan

Winter 2022

### About | Lectures | Notes | Coursework

Markov decision processes (MDP) are the simplest model of a stochastic control system. The dynamic behavior of an MDP is modeled by an equation of the form \[ \begin{equation} S_{t+1} = f_t(S_t, A_t, W_t) \label{eq:state} \end{equation}\] where \(S_t \in \ALPHABET S\) is the state, \(A_t \in \ALPHABET A\) is the control input, and \(W_t \in \ALPHABET W\) is the noise. An agent/controller observes the state and chooses the control input \(A_t\).

Eq. \eqref{eq:state} is a *non-linear* *τtochastic* state-space model—*non-linear* because \(f_t\) can be any nonlinear function; *τtochastic* because the system is driven by stochastic noise \(\{W_t\}_{t \ge 1}\).

The controller can be as sophisticated as we want. In principle, it can analyze the entire history of observations and control actions to choose the current control action. Thus, the control action can be written as \[ A_t = π_t(S_{1:t}, A_{1:t-1}),\] where \(S_{1:t}\) is a shorthand for \((S_1, \dots, S_t)\) and a similar interpretation holds for \(A_{1:t-1})\). The function \(π_t\) is called the *control law* (or *policy* or *τtrategy*) at time \(t\).

At each time, the system incurs a cost that may depend on the current state and control action. This cost is denoted by \(c_t(S_t, A_t)\). The system operates for a time horizon \(T\). During this time, it incurs a total cost \[ \sum_{t=1}^T c_t(S_t, A_t). \]

The initial state \(S_1\) and the noise process \(\{W_t\}_{t \ge 1}\) are random variables defined on a common probability space (these are called *primitive random variables*) and are mutually independent. This seemingly benign assumption is critical for the theory that we present to go through.

Suppose we have to design such a controller. We are told the probability distribution of the initial state and the noise. We are also told the system update functions \((f_1, \dots, f_T)\) and the cost functions \((c_1, \dots, c_T)\). We are asked to choose a *control strategy* \(π = (π_1, \dots, π_T)\) to minimize the expected total cost \[ J(π) := \EXP\bigg[ \sum_{t=1}^T c_t(S_t, A_t) \bigg]. \] How should we proceed?

At first glance, the problem looks intimidating. It appears that we have to design a very sophisticated controller: one that analyzes all past data to choose a control input. However, this is not the case. A remarkable result is that the optimal control station can discard all past data and choose the control input based only on the current state of the system. Formally, we have the following:

**Optimality of Markov strategies.** For the system model described above, there is no loss of optimality in chosing the control action according to \[ A_t = π_t(S_t), \quad t=1, \dots, T.\] Such a control strategy is called a *Markov strategy*.

##
#### Proof of optimality of Markov strategies

The above result claims that the cost incurred by the best Markovian strategy is the same as the cost incurred by the best history dependent strategy. This appears to be a tall claim, so lets see how we can prove it. The main idea of the proof is to repeatedly apply Blackwell’s principle of irrelevant information -Blackwell (1964)

**Two-Step Lemma.** Consider an MDP that operates for two steps (\(T=2\)). Then there is no loss of optimality in restricting attention to a Markov control strategy at time \(t=2\).

Note that \(π_1\) is Markov because it can only depend \(S_1\).

#### Proof

Fix \(π_1\) and look at the problem of optimizing \(π_2\). The total cost is \[ \EXP[ c_1(S_1, π_1(S_1)) + c_2(S_2, π_2(S_{1:2}, A_1)) ]\] The choice of \(π_2\) does not influence the first term. So, for a fixed \(π_1\), minimizing the total cost is the equivalent to minimizing the second term. Now, from Blackwell’s principle of irrelevant information, there exists a \(π_2^* \colon S_2 \mapsto A_2\) such that for any \(π_2\) \[\EXP[c_2(S_2, π_2^*(S_2) ] \le \EXP[c_2(S_2, π_2(S_{1:2}, A_2) ].\]

**Three-Step Lemma.** Consider an MDP that operates for three steps (\(T=3\)). Assume that the control law \(π_3\) at time \(t=3\) is Markov, i.e., \(A_3 = π_3(S_3)\). Then, there is no loss of optimality in restricting attention to Markov control law at time \(t=2\).

#### Proof

Fix \(π_1\) and \(π_3\) and look at optimizing \(π_2\). The total cost is \[ \EXP[ c_1(S_1, π_1(S_1)) + c_2(S_2, π_2(S_{1:2}, A_1)) + c_3(S_3, π_3(S_3)].\]

The choice of \(π_2\) does not affect the first term. So, for a fixed \(π_1\) and \(π_3\), minimizing the total cost is the same as minimizing the last two terms. Let us look at the last term carefully. Bu the law of iterated expectations, we have \[ \EXP[ c_3(S_3, π_3(S_3) ] = \EXP[ \EXP[ c_3(S_3, π_3(S_3)) | S_2, A_2 ] ]. \] Now, \[\begin{align*} \EXP[ c_3(S_3, π_3(S_3)) | S_2 = s_2, A_2 = a_2 ] &= \sum_{s_3 \in \ALPHABET S} c_3(s_3, π_3(s_3)) \\ &= \PR( w_2 \in \ALPHABET W : f_2(s_2, a_2, w_2) = s_3 ) \\ &=: h_2(s_2, a_2). \end{align*}\] The key point is that \(h_2(s_2, a_2)\) does not depend on \(π_1\) or \(π_2\).

Thus, the total expected cost affected by the choice of \(π_2\) can be written as \[\begin{align*} \EXP[ c_2(S_2, A_2) + c_3(S_3, A_3) ] &= \EXP[ c_2(S_2, A_2) + h_2(S_2, A_2) ] \\ &=: \EXP[ \tilde c_2(S_2, A_2) ]. \end{align*}\] Now, by Blackwell’s principle of irrelevant information, there exists a \(π_2^* : S_2 \mapsto A_2\) such that for any \(π_2\), we have \[ \EXP[ \tilde c_2(S_2, π_2^*(S_2))] \le \EXP[ \tilde c_2(S_2, π_2(S_{1:2}, A_1) ].\]

Now we have enough background to present the proof of optimality of Markov strategies.

#### Proof of optimality of Markov strategies

The main idea is that any system can be thought of as a two- or three-step system by aggregating time. Suppose that the system operates for \(T\) steps. It can be thought of as a two-step system where \(t \in \{1, \dots, T - 1\}\) corresponds to step 1 and \(t = T\) corresponds to step 2. From the two-step lemma, there is no loss of optimality in restricting attention to Markov control law at step 2 (i.e., at time \(t=T\)), i.e., \[ A_T = π_T(S_T). \]

Now consider a system where we are using a Markov strategy at time \(t=T\). This system can be thought of as a three-step system where \(t \in \{1, \dots, T-2\}\) corresponds to step 1, \(t = T-1\) corresponds to step 2, and \(t=T\) corresponds to step 3. Since the controller at time \(T\) is Markov, the assumption of the three step lemma is satisfied. Thus, by that lemma, there is no loss of optimality in restricting attention to Markov controllers at step 2 (i.e., at time \(t=T-1\)), i.e., \[A_{T-1} = π_{T-1}(S_{T-1}).\]

Now consider a system where we are using a Markov strategy at time \(t \in \{T-1, T\}\). This can be thought of as a three-step system where \(t \in \{1, \dots, T - 3\}\) correspond to step 1, \(t = T-2\) correspond to step 2, and \(t \in \{T-1, T\}\) correspond to step 3. Since the controllers at time \(t \in \{T-1, T\}\) are Markov, the assumption of the three-step lemma is satisfied. Thus, by that lemma, there is no loss of optimality in restricting attention to Markov controllers at step 2 (i.e., at time \(t=T-2\)), i.e., \[A_{T-2} = π_{T-2}(S_{T-2}).\]

Proceeding this way, we continue to think of the system as a three step system by different relabeling of time. Once we have shown that the controllers at times \(t \in \{s+1, s+2, \dots, T\}\) are Markov, we relabel time as follows: \(t=\{1, \dots, s-1\}\) corresponds to step 1, \(t = s\) corresponds to step 2, and \(t \in \{s+1, \dots, T\}\) corresponds to step 3. Since the controllers at time \(t \in \{s+1, \dots, T\}\) are Markov, the assumption of the three-step lemma is satisfied. Thus, by that lemma, there is no loss of optimality in restricting attention to Markov controllers at stage 2 (i.e. at time \(s\)), i.e., \[A_τ = π_τ(S_τ).\]

Proceeding until \(s=2\), completes the proof.

# 1 Performance of Markov strategies

We have shown that there is no loss of optimality to restrict attention to Markov strategies. One of the advantages of Markov strategies is that it is easy to recursively compute their performance. In particular, given any Markov strategy \(π = (π_1, \dots, π_T)\), define *the cost-to-go functions* as follows: \[V^{π}_t(s) = \EXP^π \bigg[ \sum_{τ = t}^{T} c_τ(S_τ, π_τ(S_τ)) \biggm| S_t =
s\bigg]. \] Note that \(V^{π}_t(s)\) only depends on the future strategy \((π_t, \dots, π_T)\). These functions can be computed recursively as follows: \[\begin{align*}
V^{π}_t(s) &= \EXP^π \bigg[ \sum_{τ = t}^{T} c_τ(S_τ, π_τ(S_τ)) \biggm| S_t =
s \bigg] \\
&= \EXP^π \bigg[ c_t(s, π_t(s)) + \EXP^π \bigg[ \sum_{τ = t+1}^T
c_τ(S_τ, π_τ(S_τ)) \biggm| S_{t+1} \bigg] \biggm| S_t = s \bigg]
\\
&= \EXP^π\big[ c_t(s, π_t(s)) + J_{t+1}(S_{t+1}; π) \big| S_t = s \big].
\end{align*}\]

# 2 Dynamic Programming Decomposition

Now we are ready to state the main result of MDPs

**Theorem (Dynamic program)** Recursive define *value functions* \(\{V_t\}_{t = 1}^{T+1} \colon \ALPHABET S \to \reals\) as follows: \[ \begin{equation} \label{eq:DP-1}
V_{T+1}(s) = 0
\end{equation} \] and for \(t \in \{T, \dots, 1\}\): \[\begin{align}
Q_t(s,a) &= c(s,a) + \EXP[ V_{t+1}(S_{t+1}) | S_t = s, A_t = a]
\nonumber \\
&= c(s,a) + \EXP[ V_{t+1}(f_t(s,a,W_t)) ], \label{eq:DP-2}
\end{align}\] and define \[ \begin{equation} \label{eq:DP-3}
V_t(s) = \min_{a \in \ALPHABET A} Q_t(s,a).
\end{equation} \] Then, a Markov policy is optimal if and only if it satisfies \[ \begin{equation} \label{eq:verification}
π_t^*(s) = \arg \min_{a \in \ALPHABET A} Q_t(s,a).
\end{equation} \]

Instead of proving the above result, we prove a related result

**Theorem (The comparison principle)** For any Markov strategy \(π\) \[ V^{π}_t(s) \ge V_t(s) \] with equality at \(t\) if and only if the *future strategy* \(π_{t:T}\) satisfies the verification step \eqref{eq:verification}.

Note that the comparison principle immediately implies that the strategy obtained using dynamic programming is optimal.

The comparison principle also allows us to interpret the value functions. The value function at time \(t\) is the minimum of all the cost-to-go functions over all future strategies. The comparison principle also allows us to interpret the optimal policy (the interpretation is due to Bellman and is colloquially called Bellman’s principle of optimality).

**Bellman’s principle of optimality.** An optimal policy has the property that whatever the initial state and the initial decisions are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

##
#### Proof of the comparison principle

The proof proceeds by backward induction. Consider any Markov strategy \(π = (π_1, \dots, π_T)\). For \(t = T\), \[ \begin{align*} V_T(s) &= \min_{a \in \ALPHABET A} Q_T(s,a) \\ &\stackrel{(a)}= \min_{a \in \ALPHABET A} c_T(s,a) \\ &\stackrel{(b)}\le c_T(s, π_T(s)) \\ &\stackrel{(c)}= V^{π}_T(s), \end{align*} \] where \((a)\) follows from the definition of \(Q_T\), \((b)\) follows from the definition of minimization, and \((c)\) follows from the definition of \(J_T\). Equality holds in \((b)\) iff the policy \(π_T\) is optimal. This result forms the basis of induction.

Now assume that the statement of the theorem is true for \(t+1\). Then, for \(t\) \[ \begin{align*} V_t(s) &= \min_{a \in \ALPHABET A} Q_t(s,a) \\ &\stackrel{(a)}= \min_{a \in \ALPHABET A} \Big\{ c_t(s,a) + \EXP[ V_{t+1}(S_{t+1}) | S_t = s, A_t = a] \Big\} \\ &\stackrel{(b)}\le \Big\{ c_t(s,π_t(s)) + \EXP[ V_{t+1}(S_{t+1}) | S_t = s, A_t = π_t(s)] \Big\} \\ &\stackrel{(c)}\le \Big\{ c_t(s,π_t(s)) + \EXP[ J_{t+1}(S_{t+1}; π) | S_t = s, A_t = π_t(s)] \Big\} \\ &\stackrel{(d)}= V^{π}_t(s), \end{align*} \] where \((a)\) follows from the definition of \(Q_t\), \((b)\) follows from the definition of minimization, \((c)\) follows from the induction hypothesis, and \((d)\) follows from the definition of \(J_t\). We have equality in step \((b)\) iff \(π_t\) satisfies the verification step \eqref{eq:verification} and have equality in step \((c)\) iff \(π_{t+1:T}\) is optimal (this is part of the induction hypothesis). Thus, the result is true for time \(t\) and, by the principle of induction, is true for all time. \(\Box\)

# 3 Variations of a theme

## 3.1 Cost depends on next state

In the basic model that we have considered above, we assumed that the per-step cost depends only on the current state and current actions. In some applications, such as the inventory management model considered in class, it is more natural to have a cost function where the cost depends on the current state, current action, and the next state. Conceptually, such problems can be treated in the same way as the standard model.

In particular, suppose we have a per-step cost given by \(c_t(S_t,A_t,S_{t+1})\), where the objective is to minimize \[ J(π) = \EXP\Bigl[ \sum_{t=1}^T c_t(S_t, A_t, S_{t+1}) \Bigr]. \]

Define \[ \tilde c_t(s, a) = \EXP[ c_t(s, a, S_{t+1}) | S_t = s, A_t = a ] = \EXP[ c_t(s,a, f_t(s,a, W_t) ]. \] Then, by the towering property of conditional expectation, we can write

\[ \begin{align*} J(π) &= \EXP\Bigl[ \sum_{t=1}^T \EXP[ c_t(S_t, A_t, S_{t+1}) | S_t, A_t] \Bigr] \\ &= \EXP\Bigl[ \sum_{t=1}^T \tilde c_t(S_t, A_t) \Bigr]. \end{align*} \]

Thus, we can equivalently consider this as our standard model with the per-step cost given by \(\tilde c_t(S_t, A_t)\). We can write the recursive step of the dynamic program as follows: \[ Q_t(s,a) = \EXP[ c_t(s,a, S_{t+1}) + V_{t+1}(S_{t+1}) | S_t = s, A_t = a ].\]

For numerically solving the dynamic program when the cost is time-homogeneous (i.e., does not depend on \(t\)), it is more efficient to compute \(\tilde c\) once and recuse that in the dynamic program recursion.

## 3.2 Discounted cost

In some applications, it is common to consider a discounted expected cost given by \[ J(π) = \EXP\Bigl[ \sum_{t=1}^T γ^{t-1} c_t(S_t, A_t) \Bigr] \] where \(γ \in (0,1)\) is called the discount factor.

There are two interpretations of the discount factor \(γ\). The first interpretation is an economic interpretation to determine the *present value* of a utility that will be received in the future. For example, suppose a decision maker is indifferent between receiving 1 dollar today or \(s\) dollars tomorrow. This means that the decision maker discounts the future at a rate \(1/s\), so \(γ = 1/s\).

The second interpretation is that of an absorbing state. Suppose we are operating a machine that generates a value of $1 each day. However, there is a probability \(p\) that the machine will break down at the end of the day. Thus, the expected return for today is $1 while the expected return for tomorrow is \((1-p)\) (which is the probability that the machine is still working tomorrow). In this case, the discount factor is defined as \((1-p)\). See Shwartz (2001) for a detailed discussion of this alternative.

The recursive step of the dynamic program for such models can be written as \[ Q_t(s,a) = c_t(s,a) + γ \, \EXP[ V_{t+1}( S_{t+1}) | S_t = s, A_t = a ].\]

## 3.3 Multiplicative cost

So far, we have assumed that the cost is additive. The dynamic proramming decomposition also works for models with multiplicative cost. In particular, suppose that the performance of any policy is given by \[ J(π) = \EXP\Bigl[ \prod_{t=1}^T c_t(S_t, A_t) \Bigr] \] where the per-step cost function is positive. Then, it can be shown that the optimal policy is given by the following dynamic program.

**Dynamic Program for multiplicative cost** Initialize \(V_{T+1}(s) = 1\) and recursively compute \[ \begin{align*}
Q_t(s,a) &= c_t(s,a) \EXP[ V_{t+1}(S_{t+1}) | S_t = s, A_t = a ], \\
V_t(s) &= \min_{a \in \ALPHABET A} Q_t(s,a).
\end{align*} \]

## 3.4 Exponential cost function

A special class of multiplicative cost function is exponential of sum: \[J(π) = \EXP\Bigl[ \exp\Bigl( \theta \sum_{t=1}^T c_t(S_t, A_t) \Bigr) \Bigr]. \]

When \(\theta > 0\), the above captures risk-averse preferences and when \(\theta < 0\), it corresponds to risk-seeking preferences. This is equivalent to a multiplicative cost \[J(π) = \EXP\Bigl[ \prod_{t=1}^T \exp( \theta c_t(S_t, A_t)) \Bigr]. \] Therefore, the dynamic program for multiplicative cost is also applicable for this model.

## 3.5 Optimal stopping

Let \(\{S_t\}_{t \ge 1}\) be a Markov chain. At each time \(t\), a decision maker observes the state \(S_t\) of the Markov chain and decides whether to continue or stop the process. If the decision maker decides to continue, he incurs a *continuation cost* \(c_t(S_t)\) and the state evolves. If the DM decides to stop, he incurs a *τtopping cost* of \(d_t(S_t)\) and the problem is terminated. The objective is to determine an optimal *τtopping time* \(\tau\) to minimize \[J(\tau) := \EXP\bigg[ \sum_{t=1}^{\tau-1} c_t(S_t) + d_\tau(S_\tau)
\bigg].\]

Such problems are called *Optimal stopping problems*.

Define the *cost-to-go function* of any stopping rule as \[V^{\tau}_t(s) = \EXP\bigg[ \sum_{τ = t}^{\tau - 1} c_{\tau}(S_t) +
d_\tau(S_\tau) \,\bigg|\, \tau > t \bigg]\] and the *value function* as \[V_t(s) = \inf_{\tau} V^{\tau}_t(s). \] Then, it can be shown that the value functions satisfy the following recursion:

**Dynamic Program for optimal stopping** \[ \begin{align*}
V_T(s) &= s_T(s) \\
V_t(s) &= \min\{ s_t(s), c_t(s) + \EXP[ V_{t+1}(S_{t+1}) | S_t = s].
\end{align*}\]

For more details on the optimal stopping problems, see Ferguson (2008).

## 3.6 Minimax setup

*To be written*

# References

The proof idea for the optimality of Markov strategies is based on a proof by Witsenhausen (1979) on the structure of optimal coding strategies for real-time communication. Note that the proof does not require us to find a dynamic programming decomposition of the problem. This is in contrast with the standard textbook proof where the optimality of Markov strategies is proved as part of the dynamic programming decomposition.

*The Annals of Mathematical Statistics*

*35*, 2, 863–865. DOI: 10.1214/aoms/1177703586.

*IEEE Transactions on Automatic Control*

*46*, 4, 644–647. DOI: 10.1109/9.917668.

*Bell System Technical Journal*

*58*, 6, 1437–1451.

This entry was last updated on 20 Jan 2022 and posted in MDP and tagged markov strategies, dynamic programming, comparison principle, principle of irrelevant information.