# ECSE 506: Stochastic Control and Decision Theory

## Aditya Mahajan

Winter 2022

### About | Lectures | Notes | Coursework

*Note*: Throughout this section, we assume that \(\ALPHABET S\) and \(\ALPHABET A\) are finite and \(|\ALPHABET S|= n\) and \(|\ALPHABET A| = m\).

We know that if \(V \le \mathcal B V\) then \(V \le V^* = \mathcal B V^*\). Thus, \(V^*\) is the “largest” \(V\) that satisfies the constraint \(V \le \mathcal B V\). This constraint can be written as a finite system of linear equations: \[\begin{equation} \label{eq:constaints} V(s) \le c(s,a) + γ\sum_{z \in \ALPHABET S} P(z|s,a) V(z), \qquad s \in \ALPHABET S, a \in \ALPHABET A. \end{equation}\] This observation provides the basis for a linear programming formulation of the discounted MDP.

### Primal Linear Program

Choose a weight function \(p \colon \ALPHABET S \to \reals_{\ge 0}\) such that \(\sum_{s \in \ALPHABET S} p(s) = 1\). This can be thought of as the distribution of the initial state. Then the primal LP corresponding to \eqref{eq:constaints} is \[\begin{gather} \max \sum_{s \in \ALPHABET S} p(s) V(s) \notag \\ \text{subject to}\quad V(s) - γ \sum_{z \in \ALPHABET S} P(z|s,a)V(z) \le c(s,a), \quad \forall s \in \ALPHABET S, a \in \ALPHABET A. \label{eq:primal} \end{gather}\]

### Dual Linear Program

Now consider the dual version of the primal LP \eqref{eq:primal}. We associate a positive dual variable \(\mu(s,a)\) for each constraint consider the following linear program:^{1} \[\begin{gather}
\min \sum_{s \in \ALPHABET S} \sum_{a \in \ALPHABET A} \mu(s,a) c(s,a)
\notag \\
\text{subject to}\quad
\mu(s,a) - γ\sum_{z \in \ALPHABET S} \sum_{a \in \ALPHABET A}
P(s | z, a) \mu(z,a) = p(s), \quad \forall s \in \ALPHABET S, \notag \\
\mu(s,a) \ge 0, \quad \forall s \in \ALPHABET S, a \in \ALPHABET A.
\label{eq:dual}
\end{gather}\]

The primal LP has \(n\) variables (\(V(1), V(2), \dots, V(n)\)) and \(n \times m\) constraints, while the dual LP has \(n \times m\) variables. So, in general, the dual is more efficient to solve.

### Basic solution and stationary policies

We state the following results without proof. See Puterman (2014, Sec 6.9) for details.

Let \(π\) be any possibly randomized history dependent policy. Define the

*occupancy measure*\[\mu_π(s,a) = \sum_{z \in \ALPHABET S} p(z) \sum_{t=1}^\infty γ^{t-1} \PR^{π}(S_t = s, A_t = a \mid S_1 = z).\]

Then, \(\mu_π\) is a feasible solution to the dual LP \eqref{eq:dual}.

Suppose \(\mu(s,a)\) is a feasible solution to the dual LP. Then for each \(s \in \ALPHABET S\) such that \(\sum_{a \in \ALPHABET A} \mu(s,a) > 0\), define a stationary randomized policy \(h\) by \[\begin{equation} \label{eq:policy} h(a|s) h_\mu(a|s) = \frac{ \mu(s,a) } { \displaystyle \sum_{a \in \ALPHABET A} \mu(s,a) }. \end{equation} \] Then, \(\mu_h\) is a feasible solution to the dual LP and \(\mu_h(s,a) = \mu(s,a)\) for all \(s \in \ALPHABET S\) and \(a \in \ALPHABET A\).

A feasible solution of an LP is called a

*basic feasible solution*if it cannot be expression as a nontrivial convex combination of other feasible solutions. A key property of a basic feasible solution is that when the LP has \(k\) constraints, then all basic feasible solutions have at most \(k\) positive components. Using this, one can show that if \(\mu\) is a basic feasible solution, then the policy \(h\) given by \eqref{eq:policy} is a deterministic policy.Similarly, for any deterministic policy \(h\), the occupancy measure \(\mu_h\) is a basic feasible solution to the dual LP.

If \(\mu^*\) is an

*optimal*solution to the dual LP, then \(h_{\mu}\) is an optimal policy.If \(\mu^*\) is an

*optimal basic*solution to the dual LP, then \(h_{\mu}\) is a deterministic optimal policy.If \(h\) is a possibly randomized optimal policy for the MDP, them \(\mu_h\) is a optimal solution to the dual LP.

If \(h\) is a deterministic optimal policy for the MDP, then \(\mu_h\) is a optimal basic solution to the dual LP.

As long as \(p(s) > 0\) for all \(s\), the policy \(h_\mu\) based of optimal solution the dual LP does not depend on \(p(s)\). This means that \(p(s)\) need not sum to 1.

There are various trade-offs between solving a DP using value iteration, policy iteration, and LP and we will not go into these trade-offs here. But one benefit of the LP formulation is that it is easy to add constraints, which we discuss next.

# 1 Constrained MDPs

Suppose, in addition to the per-step cost function \(c \colon \ALPHABET S \times \ALPHABET A \to \reals\), we have a per-step constraint function \(d \colon \ALPHABET S \times \ALPHABET A \to \reals\) and we are interested in the following constrained optimization problems:

\[\begin{gather*} \min \EXP\Bigl[ \sum_{t=1}^\infty γ^{t-1} c(S_t, A_t) \Bigr] \\ \text{subject to}\quad \EXP\Bigl[ \sum_{t=1}^\infty γ^{t-1} d(S_t, A_t) \Bigr] \le D. \end{gather*}\]

The dual LP in this case is given by \[\begin{gather} \min \sum_{s \in \ALPHABET S} \sum_{a \in \ALPHABET A} \mu(s,a) c(s,a) \notag \\ \text{subject to}\quad \mu(s,a) - γ\sum_{s \in \ALPHABET S} \sum_{a \in \ALPHABET A} P(s | z, a) \mu(s,a) = p(s), \quad \forall s \in \ALPHABET S, \notag \\ \sum_{s \in \ALPHABET S} \sum_{a \in \ALPHABET A} \mu(s,a) d(s,a) \le D \notag \\ \mu(s,a) \ge 0, \quad \forall s \in \ALPHABET S, a \in \ALPHABET A. \end{gather}\]

If we interpret \(\mu(s,a)\) as the occupation measure of any policy, then this formulation follows immediately.

# References

The material in this section is taken from Puterman (2014). See Altman (1999) for a detailed treatment of constrained MDPs.

*Constrained markov decision processes*. CRC Press. Available at: http://www-sop.inria.fr/members/Eitan.Altman/TEMP/h.pdf.

*Markov decision processes: Discrete stochastic dynamic programming*. John Wiley & Sons. DOI: 10.1002/9780470316887.

Wikipedia has a decent description of how to construct a dual LP from a primal LP. I find the explanation in Sebastien Lahaie’s notes to be more illuminating.↩︎

This entry was last updated on 12 Mar 2022 and posted in MDP and tagged infinite horizon, discounted cost, linear programming, constrained mdps.