# ECSE 506: Stochastic Control and Decision Theory

Numerics: Matrix formulation of Markov decision processes

# 1 MDPs as controlled Markov chains

In this section, we present a matrix formulation for finite state finite action MDPs, which is useful for computing the solutions numerically. Let’s start with the function model described earlier and assume that $$\ALPHABET S$$ and $$\ALPHABET A$$ are finite sets and that the cost function and the probability distribution of $$\{W_t\}_{t \ge 1}$$ are time-homogeneous. Then, the following is a fundamental property of MDPs:

Lemma

For any $$s_1, s_2, \dots, s_T \in \ALPHABET S$$ and $$a_1, \dots, a_T \in \ALPHABET A$$, we have $\PR(S_{t+1} = s_{t+1} | S_{1:t} = s_{1:t}, A_{1:t} = a_{1:t}) = \PR(S_{t+1} = s_{t+1} | S_{t} = s_t , A_t = a_t).$ Moreover, the right hand side does not depend on the policy $$π$$.

#### Proof

For a given policy $$π$$ consider \begin{align*} \hskip 2em & \hskip -2em \PR(S_{t+1} = s_{t+1} | S_{1:t} = s_{1:t}, A_{1:t} = a_{1:t}) \\ &\stackrel{(a)}= \PR(\{ w_t \in \ALPHABET W : f_t(s_t, a_t, w_t) = s_{t+1} \} | S_{1:t} = s_{1:t}, A_{1:t} = a_{1:t}) \\ &\stackrel{(b)}= \PR(\{ w_t \in \ALPHABET W : f_t(s_t, a_t, w_t) = s_{t+1} \}) \\ &= \PR(S_{t+1} = s_{t+1} | S_{t} = s_{t}, A_{t} = a_{t}) \end{align*} where $$(a)$$ follows from the update equation for $$S_{t+1}$$ and $$(b)$$ follows because the primitive random variables are independent. The expression in the right hand side of $$(b)$$ does not depend on the control strategy $$π$$.

# 2 Matrix formulation of an MDP

Let $$|\ALPHABET S| = n$$ and $$|\ALPHABET A| = m$$. The idea behind the matrix formulation of MDP is to think of $$\{S_t\}_{t \ge 1}$$ as a controlled Markov process and define the $$n \times n$$ tranistion matrices $$\{P(a)\}_{a \in \ALPHABET A}$$ as follows: $P_{ss'}(a) = \PR(S_{t+1} = s' | S_t = s, A_t = a).$

Observe that $\EXP[V_{t+1}(S_{t+1}) | S_t = s, A_t = a] = \sum_{s' \in \ALPHABET S} V_{t+1}(s') \PR(S_{t+1} = s' | S_t = s, A_t = a).$ Now, if we think of $$V_t$$ as a $$n \times 1$$ column vector, then we can write $\EXP[V_{t+1}(S_{t+1}) | S_t = s, A_t = a] = [ P(a)V_{t+1} ]_{s},$ or, equivalently, $\MATRIX{ \EXP[V_{t+1}(S_{t+1}) | S_t = 1, A_t = a] \\ \vdots \\ \EXP[V_{t+1}(S_{t+1}) | S_t = n, A_t = a] } = P(a)V_{t+1}.$

Now, think of $$c_t$$ and $$Q_t$$ as $$n \times m$$ matrices. Then, we can write $Q_t = c_t + \MATRIX { P(1)V_{t+1} & \cdots & P(n) V_{t+1} }.$

Define $$\bar P = \MATRIX{P(1) \\ \vdots \\ P(n)}$$, then $Q_t = c_t + \text{reshape}( \bar P \, V_{t+1}, n, m).$ Hence, the key step of dynamic programming can be done using simple linear algebra and therefore implemented efficiently in any programming language.

# 3 An Example: Machine repair

Consider a machine which can be in one of $$n$$ states $$\ALPHABET S = \{1, \dots, n\}$$, where state $$1$$ denotes the best state and state $$n$$ denotes the worse state.

There are two actions $$\ALPHABET A = \{1, 2\}$$, where $$a=1$$ means that the current machine is replaced by a brand new one (which starts in state $$1$$) and $$a=2$$ means that the existing machine is used. When action $$a=2$$ is chosen, the machine state will either remain the same (with probability $$1 - θ$$) or deteriorate by one unit (with probability $$θ$$).

For example, if $$n = 3$$, then $P(1) = \MATRIX{1 & 0 & 0\\ 1 & 0 & 0\\ 1 & 0 & 0}, \qquad P(2) = \MATRIX{1-θ & θ & 0\\ 0 & 1-θ & θ\\ 0 & 0 & 1}.$

Now suppose that the replace action $$a = 1$$ costs $$λ$$ unit (irrespective of the state) and the do nothing action $$a = 2$$ costs $$0$$ in state $$1$$, $$1$$ in state $$2$$, and $$5$$ in state $$3$$. $c = \MATRIX{λ & 0 \\ λ & 1 \\ λ & 5 }.$

Then, the value functions and optimal policy for $$T = 4$$ and specific values of $$θ$$ and $$λ$$ are shown below. For visualization we show the replace action by $$\bullet$$ and the do thing action by $$\circ$$. You can play around with the parameters $θ =$ and $λ =$

Couple of features of the optimal solution are worth noting:

• For $$λ$$ roughly in the range of $$(5,15)$$ the optimal policy chooses the replace action in state $$3$$ at time $$1$$ even though taking the do nothing action incurs less instanteous cost.

• There is a range of values of $$λ$$ for which the the optimal policy is the same. This argument is sometimes used to illustrate why searching in the policy space can be more efficient than searching in the value space.

• It appears that the optimal policy satisifies a monotonicity property: for low states take “do nothing” action and for higher states take the “replace” action. The actual value of what constitutes a low or a high state depends on the parameters of the model. Later in the course, we will study sufficient conditions to establish monotonicity of optimal policy. It is easy to show that the setup described above satisfies these sufficient conditions, so the optimal policy is monotone for all choices of model parameters.

This entry was last updated on 08 Feb 2022 and posted in MDP and tagged numerical methods.