ECSE 506: Stochastic Control and Decision Theory

Aditya Mahajan
Winter 2022


Numerics: Matrix formulation of Markov decision processes

1 MDPs as controlled Markov chains

In this section, we present a matrix formulation for finite-state, finite-action MDPs, which is useful for computing solutions numerically. Let’s start with the function model described earlier and assume that \(\ALPHABET S\) and \(\ALPHABET A\) are finite sets and that the cost function and the probability distribution of \(\{W_t\}_{t \ge 1}\) are time-homogeneous. Then, the following is a fundamental property of MDPs:

Lemma

For any \(s_1, s_2, \dots, s_T \in \ALPHABET S\) and \(a_1, \dots, a_T \in \ALPHABET A\), we have \[ \PR(S_{t+1} = s_{t+1} | S_{1:t} = s_{1:t}, A_{1:t} = a_{1:t}) = \PR(S_{t+1} = s_{t+1} | S_{t} = s_t , A_t = a_t).\] Moreover, the right hand side does not depend on the policy \(π\).

Proof

For a given policy \(π\) consider \[\begin{align*} \hskip 2em & \hskip -2em \PR(S_{t+1} = s_{t+1} | S_{1:t} = s_{1:t}, A_{1:t} = a_{1:t}) \\ &\stackrel{(a)}= \PR(\{ w_t \in \ALPHABET W : f_t(s_t, a_t, w_t) = s_{t+1} \} | S_{1:t} = s_{1:t}, A_{1:t} = a_{1:t}) \\ &\stackrel{(b)}= \PR(\{ w_t \in \ALPHABET W : f_t(s_t, a_t, w_t) = s_{t+1} \}) \\ &= \PR(S_{t+1} = s_{t+1} | S_{t} = s_{t}, A_{t} = a_{t}) \end{align*}\] where \((a)\) follows from the update equation for \(S_{t+1}\) and \((b)\) follows because the primitive random variables are independent. The expression in the right hand side of \((b)\) does not depend on the control strategy \(π\).

2 Matrix formulation of an MDP

Let \(|\ALPHABET S| = n\) and \(|\ALPHABET A| = m\). The idea behind the matrix formulation of an MDP is to think of \(\{S_t\}_{t \ge 1}\) as a controlled Markov process and define the \(n \times n\) transition matrices \(\{P(a)\}_{a \in \ALPHABET A}\) as follows: \[ P_{ss'}(a) = \PR(S_{t+1} = s' | S_t = s, A_t = a).\]

Observe that \[\EXP[V_{t+1}(S_{t+1}) | S_t = s, A_t = a] = \sum_{s' \in \ALPHABET S} V_{t+1}(s') \PR(S_{t+1} = s' | S_t = s, A_t = a). \] Now, if we think of \(V_{t+1}\) as an \(n \times 1\) column vector, then we can write \[ \EXP[V_{t+1}(S_{t+1}) | S_t = s, A_t = a] = [ P(a)V_{t+1} ]_{s}, \] or, equivalently, \[ \MATRIX{ \EXP[V_{t+1}(S_{t+1}) | S_t = 1, A_t = a] \\ \vdots \\ \EXP[V_{t+1}(S_{t+1}) | S_t = n, A_t = a] } = P(a)V_{t+1}. \]
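As a quick numerical check of this identity, here is a minimal numpy sketch (the language and all numbers below are illustrative choices, not part of the notes):

```python
import numpy as np

# Illustrative numbers only: a 3-state transition matrix P(a) for one fixed
# action a, and a stand-in for the value function V_{t+1}.
P_a = np.array([[0.9, 0.1, 0.0],
                [0.0, 0.8, 0.2],
                [0.0, 0.0, 1.0]])
V_next = np.array([0.0, 1.0, 5.0])

# Entry s of the product P(a) V_{t+1} is E[V_{t+1}(S_{t+1}) | S_t = s, A_t = a].
expected_V = P_a @ V_next
print(expected_V)
```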

Now, think of \(c_t\) and \(Q_t\) as \(n \times m\) matrices. Then, we can write \[ Q_t = c_t + \MATRIX { P(1)V_{t+1} & \cdots & P(m) V_{t+1} }. \]

Define the \(nm \times n\) matrix \(\bar P = \MATRIX{P(1) \\ \vdots \\ P(m)}\). Then \[ Q_t = c_t + \text{reshape}( \bar P \, V_{t+1}, n, m), \] where \(\text{reshape}(\cdot, n, m)\) rearranges the \(nm \times 1\) vector \(\bar P V_{t+1}\) into an \(n \times m\) matrix, column by column. Hence, the key step of dynamic programming reduces to simple linear algebra and can therefore be implemented efficiently in any programming language.
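For concreteness, here is a minimal numpy sketch of this step; the language, the function name `backward_step`, and the array layout are our own choices, not part of the notes:

```python
import numpy as np

def backward_step(c, P, V_next):
    """One step of the dynamic program in matrix form.

    c      : (n, m) cost matrix with entries c[s, a]
    P      : (m, n, n) array, where P[a] is the n x n transition matrix P(a)
    V_next : length-n vector representing V_{t+1}

    Returns Q_t (n x m), V_t (length n), and the minimizing action per state.
    """
    m = P.shape[0]
    # Stacking the vectors P(a) V_{t+1} as columns is the same operation as
    # reshape(P̄ V_{t+1}, n, m) in the text (a column-major reshape).
    EV = np.column_stack([P[a] @ V_next for a in range(m)])
    Q = c + EV
    return Q, Q.min(axis=1), Q.argmin(axis=1)
```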

3 An Example: Machine repair

Consider a machine which can be in one of \(n\) states \(\ALPHABET S = \{1, \dots, n\}\), where state \(1\) denotes the best state and state \(n\) denotes the worst state.

There are two actions \(\ALPHABET A = \{1, 2\}\), where \(a=1\) means that the current machine is replaced by a brand new one (which starts in state \(1\)) and \(a=2\) means that the existing machine is used. When action \(a=2\) is chosen, the machine state will either remain the same (with probability \(1 - θ\)) or deteriorate by one unit (with probability \(θ\)); the worst state \(n\) simply remains at \(n\).

For example, if \(n = 3\), then \[ P(1) = \MATRIX{1 & 0 & 0\\ 1 & 0 & 0\\ 1 & 0 & 0}, \qquad P(2) = \MATRIX{1-θ & θ & 0\\ 0 & 1-θ & θ\\ 0 & 0 & 1}. \]

Now suppose that the replace action \(a = 1\) costs \(λ\) units (irrespective of the state) and the do-nothing action \(a = 2\) costs \(0\) in state \(1\), \(1\) in state \(2\), and \(5\) in state \(3\). \[ c = \MATRIX{λ & 0 \\ λ & 1 \\ λ & 5 }. \]
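In code, the model above can be written down directly; this is a sketch for \(n = 3\), and the numerical values of \(θ\) and \(λ\) are placeholders chosen only for illustration:

```python
import numpy as np

theta, lam = 0.2, 2.0   # placeholder values of θ and λ, chosen only for illustration

# Transition matrices for n = 3 as given above: index 0 is the replace action
# (a = 1) and index 1 is the do-nothing action (a = 2).
P = np.array([
    [[1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]],
    [[1 - theta, theta, 0.0],
     [0.0, 1 - theta, theta],
     [0.0, 0.0, 1.0]],
])

# Cost matrix c[s, a]: replacing costs λ in every state; doing nothing costs 0, 1, 5.
c = np.array([[lam, 0.0],
              [lam, 1.0],
              [lam, 5.0]])
```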

Then, the value functions and optimal policy for \(T = 4\) and specific values of \(θ\) and \(λ\) can be computed as sketched below. For visualization, we show the replace action by \(\bullet\) and the do-nothing action by \(\circ\). You can play around with the parameters \(θ\) and \(λ\).
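Continuing from the two sketches above (which define `backward_step`, `P`, and `c`), backward induction for \(T = 4\) looks as follows; taking the terminal value to be zero is our own assumption, since the notes do not spell out the terminal cost:

```python
# Backward induction for horizon T = 4, assuming V_{T+1} ≡ 0.
T = 4
V = np.zeros(c.shape[0])
for t in range(T, 0, -1):
    Q, V, pi = backward_step(c, P, V)
    # Display the policy with • for replace (a = 1) and ∘ for do nothing (a = 2).
    print(f"t = {t}: policy = " + " ".join("•" if a == 0 else "∘" for a in pi))
```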

A couple of features of the optimal solution are worth noting:
