20 Periodic MDPs

Author

Affiliation

Aditya Mahajan

McGill University

Updated

December 23, 2023

Keywords

MDPs, discounted MDPs, periodic MDPs

Summary

In many applications, the system dynamics and cost vary in a periodic manner. For example, demand varies according to time of day, weather patterns vary according to season, etc. Such periodic models may be considered as the simplest case of non-stationary dynamics. In this section, we study how to model and analyze periodic MDPs.

The natural model for a periodic system with period \(L\) is as follows. Let \([t] = (t \bmod L)\) and \(\ALPHABET L = \{0, \dots, L-1\}\). Then the dynamics and per-step cost at time \(t\) are given by \(P_{[t]}\) and \(c_{[t]}\), where

\(c_\ell : \ALPHABET S × \ALPHABET A \to \reals\), for \(\ell \in \ALPHABET L\)
\(P_\ell : \ALPHABET S × \ALPHABET A \to Δ(\ALPHABET S)\), for \(\ell \in \ALPHABET L\)

Now consider an infinite-horizon \(L\)-periodic MDP. Since the model is not stationary, we cannot directly use the standard results for infinite horizon MDPs. However, it is relatively simple to characterize the solution of periodic MDPs–by using state augmentation. In particular, we define a time-homogeneous MDP with state space \(\ALPHABET S × \ALPHABET L\), action space \(\ALPHABET A\), per-step dynamics \[ \bar P((s',\ell') \mid (s,\ell), a) = P_{\ell}(s' \mid s,a) \IND\{ \ell' ≡ \ell + 1 \pmod L \} \] and the per-step cost \[ \bar c((s,\ell), a) = c_{\ell}(s,a) \IND\{ \ell' ≡ \ell + 1 \pmod L \}. \] The above time-homogenous MDP with state \((S_t, [t])\) is equivalent to the original periodic MDP. Therefore, an immediate implication of Theorem 12.2 is the following:

Proposition 20.1 There exists a time-homongeous optimal policy \(π^* \colon \ALPHABET S \times \ALPHABET L \to \ALPHABET A\) such that the optimal action at time \(t\) is \(a_t = π^*(s_t, [t])\).

Equivalently, there exists a deterministic \(L\)-periodic optimal policy \[ π^* = (π^*_0, π^*_1, \dots, π^*_{L-1}, π^*_0, π^*_1, \dots, π^*_{L-1}, \dots), \quad π^*_\ell \colon \ALPHABET S \to \ALPHABET A, \ell \in \ALPHABET L, \] such that the optimal action at time \(t\) is \(a_t = π^*_{[t]}(s_t)\).

To succinctly write the procedure to identify this policy, we define the following Bellman operators:

for any \(μ \colon \ALPHABET S \to \ALPHABET A\) and \(\ell \in \ALPHABET L\), define \(\BELLMAN_\ell^{μ}\) and \(\BELLMAN_\ell^*\) to be the Bellman operator for policy \(μ\) and Bellman optimality operator for the model \((c_\ell, P_\ell)\)
For any \(v \in \reals^n\) and \(\ell \in \ALPHABET L\), define \(\GREEDY_{\ell}(v)\) to be the greedy policy wrt \(v\) in model \((c_\ell, P_\ell)\), i.e., if \(μ \in \GREEDY_{\ell}(v)\) then \(\BELLMAN_\ell^{μ} v = \BELLMAN_\ell^* v\).

Policy evaluation

Let \(π = (π_0, \dots, π_{L-1})\) be any periodic policy and let \(V^π_t\) denote its performance starting at time \(t\). Then, \[ V^π_t = v^π_{[t]} \] where \((v^π_0, v^π_1, \dots, v^π_{L-1})\) satisfy \[ v^π_\ell = \BELLMAN^{π_\ell}_{\ell} v^π_{[\ell + 1]}, \quad \ell \in \ALPHABET L. \]

An alterative method is to evaluate the policy every \(L\) time steps. For ease of notation, we assume that the system starts at time \(0\) (rather than at time \(1\) as we have been assuming so far). In particular, define: \[ \bar c^π(s) = \EXP^{π}\biggl[ \sum_{\ell \in \ALPHABET L} γ^{\ell-1} c(S_{\ell}, A_{\ell}) \biggm| S_0 = s \biggr], \] to be the performance of policy \(π\) over one period. Note that this can be computed via dynamic programming as: \[ \bar c^π = \BELLMAN^{π_0}_{0} \BELLMAN^{π_1}_{1} \cdots \BELLMAN^{π_{L-1}}_{L-1} \mathbf{0} \] where \(\mathbf{0}\) is the all zeros vector. Now define \[ \bar P^π = P^{π_0}_0 P^{π_1}_1 \cdots P^{π_{L-1}}_{L-1} \] to be the \(L\) step transition matrix. Then, the value of policy \(π\) (evaluated every \(L\) steps) is just like a regular MDP and therefore satisfies: \[ v^π_0 = \bar c^π + γ^L \bar P^π v^π_0 \] or, equivalently, \[ v^π_0 = (I - γ^L \bar P^π)^{-1} \bar c^π \] which may be thought of as the equivalent of Proposition 12.1 for periodic MDPs.

Optimal policy

Let \((v^*_0, v^*_1, \ldots, v^*_{L-1})\) satisfy \[ v^*_{\ell} = \BELLMAN^*_{\ell} v^*_{[\ell+1]}, \quad \ell \in \ALPHABET L \] Moreover, let \((π^*_0, \dots, π^*_{L-1})\) be such that \[ π^*_\ell = \GREEDY_\ell(v^*_{[\ell+1]}), \quad \ell \in \ALPHABET L. \] Then the periodic policy \(π^* = (π^*_0, π^*_1, \ldots, π^*_{L-1})\) is policy and its performance \(V^*_t\) starting at time \(t\) satisfies \[ V^*_t = v^*_{[t]} \]

Notes

Our presentation borrows heavily from the tutorial slides of Scherrer (2016).

Periodic MDPs were first investigated in Riis (1965), who proposed the policy evaluation formula and presented a variation of policy iteration for periodic MDPs.