# ECSE 506: Stochastic Control and Decision Theory

## Aditya Mahajan Winter 2022

Theory: Linear Programming Formulation and Constrained MDPs

Note: Throughout this section, we assume that $$\ALPHABET S$$ and $$\ALPHABET A$$ are finite and $$|\ALPHABET S|= n$$ and $$|\ALPHABET A| = m$$.

We know that if $$V \le \mathcal B V$$ then $$V \le V^* = \mathcal B V^*$$. Thus, $$V^*$$ is the “largest” $$V$$ that satisfies the constraint $$V \le \mathcal B V$$. This constraint can be written as a finite system of linear equations: $$$\label{eq:constaints} V(s) \le c(s,a) + γ\sum_{z \in \ALPHABET S} P(z|s,a) V(z), \qquad s \in \ALPHABET S, a \in \ALPHABET A.$$$ This observation provides the basis for a linear programming formulation of the discounted MDP.

### Primal Linear Program

Choose a weight function $$p \colon \ALPHABET S \to \reals_{\ge 0}$$ such that $$\sum_{s \in \ALPHABET S} p(s) = 1$$. This can be thought of as the distribution of the initial state. Then the primal LP corresponding to \eqref{eq:constaints} is $\begin{gather} \max \sum_{s \in \ALPHABET S} p(s) V(s) \notag \\ \text{subject to}\quad V(s) - γ \sum_{z \in \ALPHABET S} P(z|s,a)V(z) \le c(s,a), \quad \forall s \in \ALPHABET S, a \in \ALPHABET A. \label{eq:primal} \end{gather}$

### Dual Linear Program

Now consider the dual version of the primal LP \eqref{eq:primal}. We associate a positive dual variable $$\mu(s,a)$$ for each constraint consider the following linear program:1 $\begin{gather} \min \sum_{s \in \ALPHABET S} \sum_{a \in \ALPHABET A} \mu(s,a) c(s,a) \notag \\ \text{subject to}\quad \mu(s,a) - γ\sum_{z \in \ALPHABET S} \sum_{a \in \ALPHABET A} P(s | z, a) \mu(z,a) = p(s), \quad \forall s \in \ALPHABET S, \notag \\ \mu(s,a) \ge 0, \quad \forall s \in \ALPHABET S, a \in \ALPHABET A. \label{eq:dual} \end{gather}$

The primal LP has $$n$$ variables ($$V(1), V(2), \dots, V(n)$$) and $$n \times m$$ constraints, while the dual LP has $$n \times m$$ variables. So, in general, the dual is more efficient to solve.

### Basic solution and stationary policies

We state the following results without proof. See Puterman (2014, Sec 6.9) for details.

1. Let $$π$$ be any possibly randomized history dependent policy. Define the occupancy measure

$\mu_π(s,a) = \sum_{z \in \ALPHABET S} p(z) \sum_{t=1}^\infty γ^{t-1} \PR^{π}(S_t = s, A_t = a \mid S_1 = z).$

Then, $$\mu_π$$ is a feasible solution to the dual LP \eqref{eq:dual}.

2. Suppose $$\mu(s,a)$$ is a feasible solution to the dual LP. Then for each $$s \in \ALPHABET S$$ such that $$\sum_{a \in \ALPHABET A} \mu(s,a) > 0$$, define a stationary randomized policy $$h$$ by $$$\label{eq:policy} h(a|s) h_\mu(a|s) = \frac{ \mu(s,a) } { \displaystyle \sum_{a \in \ALPHABET A} \mu(s,a) }.$$$ Then, $$\mu_h$$ is a feasible solution to the dual LP and $$\mu_h(s,a) = \mu(s,a)$$ for all $$s \in \ALPHABET S$$ and $$a \in \ALPHABET A$$.

3. A feasible solution of an LP is called a basic feasible solution if it cannot be expression as a nontrivial convex combination of other feasible solutions. A key property of a basic feasible solution is that when the LP has $$k$$ constraints, then all basic feasible solutions have at most $$k$$ positive components. Using this, one can show that if $$\mu$$ is a basic feasible solution, then the policy $$h$$ given by \eqref{eq:policy} is a deterministic policy.

4. Similarly, for any deterministic policy $$h$$, the occupancy measure $$\mu_h$$ is a basic feasible solution to the dual LP.

5. If $$\mu^*$$ is an optimal solution to the dual LP, then $$h_{\mu}$$ is an optimal policy.

6. If $$\mu^*$$ is an optimal basic solution to the dual LP, then $$h_{\mu}$$ is a deterministic optimal policy.

7. If $$h$$ is a possibly randomized optimal policy for the MDP, them $$\mu_h$$ is a optimal solution to the dual LP.

8. If $$h$$ is a deterministic optimal policy for the MDP, then $$\mu_h$$ is a optimal basic solution to the dual LP.

9. As long as $$p(s) > 0$$ for all $$s$$, the policy $$h_\mu$$ based of optimal solution the dual LP does not depend on $$p(s)$$. This means that $$p(s)$$ need not sum to 1.

There are various trade-offs between solving a DP using value iteration, policy iteration, and LP and we will not go into these trade-offs here. But one benefit of the LP formulation is that it is easy to add constraints, which we discuss next.

# 1 Constrained MDPs

Suppose, in addition to the per-step cost function $$c \colon \ALPHABET S \times \ALPHABET A \to \reals$$, we have a per-step constraint function $$d \colon \ALPHABET S \times \ALPHABET A \to \reals$$ and we are interested in the following constrained optimization problems:

$\begin{gather*} \min \EXP\Bigl[ \sum_{t=1}^\infty γ^{t-1} c(S_t, A_t) \Bigr] \\ \text{subject to}\quad \EXP\Bigl[ \sum_{t=1}^\infty γ^{t-1} d(S_t, A_t) \Bigr] \le D. \end{gather*}$

The dual LP in this case is given by $\begin{gather} \min \sum_{s \in \ALPHABET S} \sum_{a \in \ALPHABET A} \mu(s,a) c(s,a) \notag \\ \text{subject to}\quad \mu(s,a) - γ\sum_{s \in \ALPHABET S} \sum_{a \in \ALPHABET A} P(s | z, a) \mu(s,a) = p(s), \quad \forall s \in \ALPHABET S, \notag \\ \sum_{s \in \ALPHABET S} \sum_{a \in \ALPHABET A} \mu(s,a) d(s,a) \le D \notag \\ \mu(s,a) \ge 0, \quad \forall s \in \ALPHABET S, a \in \ALPHABET A. \end{gather}$

If we interpret $$\mu(s,a)$$ as the occupation measure of any policy, then this formulation follows immediately.

# References

The material in this section is taken from Puterman (2014). See Altman (1999) for a detailed treatment of constrained MDPs.

Altman, Eitan. 1999. Constrained markov decision processes. CRC Press. Available at: http://www-sop.inria.fr/members/Eitan.Altman/TEMP/h.pdf.
Puterman, M.L. 2014. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons. DOI: 10.1002/9780470316887.

1. Wikipedia has a decent description of how to construct a dual LP from a primal LP. I find the explanation in Sebastien Lahaie’s notes to be more illuminating.↩︎

This entry was last updated on 12 Mar 2022 and posted in MDP and tagged infinite horizon, discounted cost, linear programming, constrained mdps.