# 17 Thirfty and equalizing policies

infinite horizon, martingales, discounted cost

There is a relationship between optimal control and martingale theory, which leads to an alternative characterization of optimality. It is easier to describe this relationship for reward models, so in this section we will assume that we are interested in maximizing rewards rather than minimizing costs. We will also assume that the per-step reward \(r \colon \ALPHABET S × \ALPHABET A \to \reals_{\ge 0}\). We do not assume that \(\ALPHABET S\) or \(\ALPHABET A\) are finite.

Let \(\mathcal{F}_t = σ(S_{1:t}, A_{1:t-1})\) denote the information available to the decision maker at time \(t\). Note that the actions \(\{A_t\}_{t \ge 1}\) are \(\{\mathcal{F}_t\}_{t \ge 1}\)-adapted. For any Markov deterministic policy, let \[ V^{π}(s) = \EXP^{π}\biggl[ \sum_{t=1}^{∞} γ^{t-1} r(S_t, A_t) \biggm| S_1 = s \biggr] \] and let \(V^*(s) = \sup_{π \in Π_D}V^{π}(s)\), where \(Π_D\) is the set of all deterministic policies. We assume that the model satisfies sufficient regularity conditions such that \(V^*\) is measurable and satisfies the usual dynamic programming equation: \(V^* = \BELLMAN^* V^*\).

Fix an initial state \(S_1 = s\) and a policy \(π\). Define random sequences \(\{M^{π,s}_t\}_{t \ge 1}\) and \(\{J^{π,s}_t\}_{t \ge 1}\) defined as follows: \[ J^{π,s}_t \coloneqq \sum_{τ=1}^t γ^{t-1} r(S_t, A_t) \] and \[ M^{π,s}_1 = V^*(S_1) \quad\text{and}\quad M^{π,s}_{t+1} = J^{π,s}_t + γ^t V^*(S_{t+1}), \quad t \ge 1. \] Note that both \(\{J^{π,s}_t\}_{t \ge 1}\) and \(\{M^{π,s}_t\}_{t \ge 1}\) are adapted sequence w.r.t. \(\{\ALPHABET F_t\}_{t \ge 1}\).

**Lemma 17.1** For every initial state \(s\) and policy \(π\), the adapted sequences \(\{M^{π,s}_t\}_{t \ge 1}\) and \(\{γ^{t-1} V^*(S_t)\}_{t \ge 1}\) are non-negative supermartingaes w.r.t. \(\{\ALPHABET F_t\}_{t \ge 1}\) under \(\PR^{π,s}\).

First note that both sequences are non-negative because we have assumed that \(r(s,a) \ge 0\).

Set \(J^{π,s}_0 = 0\). We have \[ M^{π,s}_{t+1} = J^{π,s}_{t-1} + γ^{n-1}\bigl[ r(S_t, A_t) + γ V^*(S_{t+1}) \bigr]. \] Thus, \[\begin{align} \EXP[M^{π,s}_{t+1} \mid \ALPHABET F_t ] &= J^{π,s}_{t-1} + γ^{t-1} [\BELLMAN^{π} V^*](S_t) \notag \\ &\le J^{π,s}_{t-1} + γ^{t-1} [\BELLMAN^{*} V^*](S_t) \notag \\ &= M^{π,s}_t \label{eq:M-supermartingale} \end{align}\] Therefore, \(\{M^{π,s}_t\}_{t \ge 1}\) is a non-negative martingale.

Now consider, \[ γ^{t} V^*(S_{t+1}) \le γ^{t-1}[ r(S_t, A_t) + γ V^*(S_{t+1}) ] \] because \(r(s,a) \ge 0\). Thus, \[\begin{align*} \EXP[γ^t V^*(S_{t+1}) \mid \ALPHABET F_t ] &\le γ^{t-1} [\BELLMAN^{π} V^*](S_t) \\ &\le γ^{t-1} [\BELLMAN^{*} V^*](S_t) \\ &= γ^{t-1} V^*(S_t). \end{align*}\] Therefore, \(\{γ^{t-1}V^*(S_t)\}_{t \ge 1}\) is a non-negative martingale.

Therefore, it follows from supermartingale convergence theorem that the sequences \(\{M^{π,s}_t\}_{\ge 1}\) and \(\{γ^{t-1} V^*(S_t) \}_{t \ge 1}\) converge almost surely. Define \[ Λ^π(s) \coloneqq \lim_{t \to ∞} \EXP^{π,s}[ M^{π,s}_t ]. \]

**Definition 17.1 (Thirfty and equalizing policies)** A policy \(π\) is called **thrifty** at \(s \in \ALPHABET S\) if \(V^*(s) = Λ^π(s)\). It is called **equalizing** at \(s \in \ALPHABET S\) if \(Λ^π(s) = V^π(s)\).

A policy is thrifty if, with probability one, it makes no “immediate, irremediable mistakes” along any history (see Theorem 17.2); whereas a policy is equalizing, if “it is certain to force the system into states where little further rewards can be anticipated” (see Theorem 17.3).

**Theorem 17.1** A policy \(π\) is optimal at \(s \in \ALPHABET S\) if and only if \(π\) is both thirfty and equalizing at \(s\).

Note that. \[\begin{align} Λ^π(s) &= \lim_{t \to ∞} \biggl[ \EXP^{π,s}[ J^{π,s}_t ] + γ^t \EXP^{π,s}[ V^*(S_{t_1}) ] \biggr] \notag \\ &= V^π(s) + \lim_{t \to ∞} γ^t \EXP^{π,s}[ V^*(S_{t_1}) ] \notag \\ &\ge V^π(s). \label{eq:lambda-bound} \end{align}\]

Moreover, \[\begin{equation}\label{eq:value-function-1} V^*(s) = \EXP^{π,s}[ M^{π,s}_1 ] \ge \lim_{t \to ∞} \EXP^{π,s}[ M^{π,s}_t ] = Λ^{π}(s). \end{equation}\]

The result follows from \eqref{eq:lambda-bound} and \eqref{eq:value-function-1}.

To be honest, I am not 100% sure. The way I understand it, the key simplification here is that we have not assumed that the reward function is uniformly bounded in the sup-norm or a weighted-norm. So, the optimality of a policy is established without invoking the Banach fixed point theorem. This is not super important for discounted cost problems but is very useful in total cost/reward problems.

**Definition 17.2 (Conserving action)** An action \(a \in \ALPHABET A\) conserves \(V^*\) at \(s \in \ALPHABET S\) if \[
a \in \arg\sup_{a' \in \ALPHABET A}
\biggl\{
r(s,\tilde a) + γ \int_{\ALPHABET S}V(s')P(ds'|s,\tilde a)
\biggr\}
\]

Now we provide simple characterization of thrifty and equalizing policies.

**Theorem 17.2** For a given policy \(π\) and initial state \(s\), the following are equivalent:

- the policy \(π\) is thrifty at \(s\).
- the sequence \(\{M^π_t\}_{t \ge 1}\) is a \(\{\ALPHABET F_t\}_{t \ge 1}\) martingale under \(\PR^{π,s}\).
- for all \(t \ge 1\), we have \(\PR^{π,s}(A_t \text{ conserves } V^* \text{ at } S_t) = 1\).

**(a) implies (b):**Since \(π\) is thirfty, we have \[ V^*(s) = \EXP^{π,s}[ M^{π,s}_1 ] \ge \cdots \EXP^{π,s}[ M^{π,s}_t ] \ge \cdots \EXP^{π,s}[ M^{π,s}_{∞} ] = V^*(s). \] Thus, all inequalities must hold with equality. Thus, \(\{M^{π,s}_t\}_{t \ge 1}\) must be a martingale.**(b) implies (a):**Since \(\{M^{π,s}_t\}_{t \ge 1}\) is a martingale. \[ V^*(s) = \EXP^{π,s}[ M^{π,s}_1 ] = \cdots \EXP^{π,s}[ M^{π,s}_t ] = \cdots \EXP^{π,s}[ M^{π,s}_{∞} ] = Λ^{π}(s). \] Thus, \(π\) is thrifty.**(b) implies (c):**Since \(\{M^{π,s}_t\}_{t \ge 1}\) is martingale, the inequality in \eqref{eq:M-supermartingale} (in the proof of Lemma 17.1) must hold with equality, which implies (c).**(c) implies (b):**Under (c), the inequality in \eqref{eq:M-supermartingale} holds with equality. Thus, \(\{M^{π,s}_t\}_{t \ge 1}\) is a martingale.

**Theorem 17.3** A policy \(π\) is equalizing at \(s \in \ALPHABET S\) if and only if we have \[\lim_{t \to ∞} γ^t \EXP^{π,s}[ V^*(S_{t+1}) ] = 0.\]

This is an immediate consequence of \eqref{eq:lambda-bound}.

## Notes

See Davis (1979) for a historical review of martingale methods for stochastic control.

The idea of thirfty and equalizing policies is due to Dubins and Savage (2014). It was used by Blackwell (1970) for characterizing optimal solutions of total reward problems and is a commonly used technique in that literature. The “translation” to discounted cost problems is taken from Karatzas and Sudderth (2010).