10  Reward Shaping

Updated: March 15, 2024

Keywords: reward shaping

Summary

What are the conditions under which two MDPs that have the same dynamics but different cost functions have the same optimal policy? This is an important question in reinforcement learning (where one often shapes the reward function to speed up learning) and inverse reinforcement learning (where one learns the reward function from the behavior of an expert). The following result provides a complete answer to this question. These results are typically established for infinite horizon models. However, in my opinion, it is conceptually simpler to start with the finite horizon model.

Let \(M^1\) and \(M^2\) denote two MDPs on the same state space \(\ALPHABET S\) and action space \(\ALPHABET A\). Both MDPs have the same dynamics \(P = (P_1, \dots, P_T)\), but different cost functions \(c^1 = (c^1_1, \dots, c^1_T)\) and \(c^2 = (c^2_1, \dots, c^2_T)\). We assume that for \(t \in \{1, \dots, T-1\}\), the per-step cost is a function of the current state, current action, and next state (see cost depending on next state),1 and for \(t = T\), the per-step cost is a function of the current state alone.

1 We choose the cost to depend on the next state only for convenience of analysis. The result of Theorem 10.1 can be established for models where the per-step cost depends only on the current state and action by replacing property 2 in Theorem 10.1 by \[ c^2_t(s,a) = c^1_t(s,a) + \sum_{s' \in \ALPHABET S} P_t(s'|s,a) Φ_{t+1}(s') - Φ_t(s), \] which is simply the conditional expectation of the expression in property 2.

Theorem 10.1 Suppose the cost functions in MDPs \(M^1\) and \(M^2\) are related as follows:

  1. For \(t = T\), \[ c^2_T(s) = c^1_T(s) - Φ_T(s). \]

  2. For \(t \in \{1, \dots, T-1\}\), \[ c^2_t(s,a,s_{+}) = c^1_t(s,a,s_{+}) + Φ_{t+1}(s_{+}) - Φ_t(s). \]

Then, for any policy \(π\), \[\begin{equation}\label{eq:result} Q^{\pi,2}_t(s,a) = Q^{\pi,1}_t(s,a) - Φ_t(s) \quad\text{and}\quad V^{\pi,2}_t(s) = V^{\pi,1}_t(s) - Φ_t(s). \end{equation}\]

Sign of potential function

The sign of the potential function is irrelevant. So, we could also have written \[ c^2_t(s,a,s_{+}) = c^1_t(s,a,s_{+}) + Φ_t(s) - Φ_{t+1}(s_{+}) \] and argued that \[ V^{\pi,2}_t(s) = V^{\pi,1}_t(s) + Φ_t(s).\]

We prove the result by backward induction. First note that \[ V^{\pi,2}_T(s) = c^2_T(s) = c^1_T(s) - Φ_T(s) = V^{\pi,1}_T(s) - Φ_T(s). \] This forms the basis of induction. Now suppose that \eqref{eq:result} holds for time \(t+1\). Then consider \[\begin{align*} Q^{\pi,2}_t(s,a) &= \EXP[ c^2_t(s,a,S_{t+1}) + V^{\pi,2}_{t+1}(S_{t+1}) \mid S_t = s, A_t = a ] \\ &\stackrel{(a)}= \EXP[ c^1_t(s,a,S_{t+1}) - Φ_t(s) + Φ_{t+1}(S_{t+1}) \\ &\qquad + V^{\pi,1}_{t+1}(S_{t+1}) - Φ_{t+1}(S_{t+1}) \mid S_t = s, A_t = a ] \\ &= \EXP[ c^1_t(s,a,S_{t+1}) - Φ_t(s) + V^{\pi,1}_{t+1}(S_{t+1}) \mid S_t = s, A_t = a] \\ &= Q^{\pi,1}_t(s,a) - Φ_t(s), \end{align*}\] where \((a)\) follows from property 2 and the induction hypothesis.

Now, \[ \begin{align*} V^{\pi,2}_t(s) &= Q^{\pi,2}_t(s,\pi(s)) \\ &= Q^{\pi,1}_t(s,\pi(s)) - Φ_t(s) \\ &= V^{\pi,1}_t(s) - Φ_t(s). \end{align*}\]

This proves the induction step.
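As a quick numerical sanity check of Theorem 10.1, here is a minimal NumPy sketch: it builds a small random finite-horizon MDP (the sizes \(|\ALPHABET S| = 4\), \(|\ALPHABET A| = 3\), \(T = 5\), and the random seed are arbitrary choices for illustration), shapes the costs with a random potential as in properties 1 and 2, evaluates an arbitrary Markov policy in both models by backward induction, and verifies that the two \(Q\)-functions differ exactly by \(Φ_t(s)\).

```python
# Numerical sanity check of Theorem 10.1 (illustrative sketch; sizes and seed are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
S, A, T = 4, 3, 5                                  # |S|, |A|, horizon

# Dynamics P[t][s, a, s'], costs c1[t][s, a, s'], terminal cost c1_T[s]
P    = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(T - 1)]
c1   = [rng.standard_normal((S, A, S)) for _ in range(T - 1)]
c1_T = rng.standard_normal(S)

# Potential functions Phi_t(s), t = 1, ..., T (stored 0-indexed)
Phi = [rng.standard_normal(S) for _ in range(T)]

# Shaped costs, exactly as in properties 1 and 2 of Theorem 10.1
c2   = [c1[t] + Phi[t + 1][None, None, :] - Phi[t][:, None, None]
        for t in range(T - 1)]
c2_T = c1_T - Phi[-1]

pi = rng.integers(A, size=(T - 1, S))              # an arbitrary Markov policy

def policy_eval(c, c_T):
    """Backward induction for Q^pi_t with next-state-dependent costs."""
    V, Qs = c_T.copy(), []
    for t in reversed(range(T - 1)):
        # Q^pi_t(s,a) = E[ c_t(s,a,S') + V^pi_{t+1}(S') | S_t = s, A_t = a ]
        Q = np.einsum('say,say->sa', P[t], c[t] + V[None, None, :])
        Qs.append(Q)
        V = Q[np.arange(S), pi[t]]                 # V^pi_t(s) = Q^pi_t(s, pi_t(s))
    return list(reversed(Qs))

Q1, Q2 = policy_eval(c1, c1_T), policy_eval(c2, c2_T)
for t in range(T - 1):
    assert np.allclose(Q2[t], Q1[t] - Phi[t][:, None])
print("Verified: Q^{pi,2}_t(s,a) = Q^{pi,1}_t(s,a) - Phi_t(s) for all t")
```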

By an almost analogous argument, we can show that the optimal value functions also satisfy a similar relationship.

Corollary 10.1 Under conditions of Theorem 10.1, \[\begin{equation}\label{eq:result-opt} Q^{2}_t(s,a) = Q^{1}_t(s,a) - Φ_t(s) \quad\text{and}\quad V^{2}_t(s) = V^{1}_t(s) - Φ_t(s). \end{equation}\]

Advantage function

The advantage (or benefit) function given by \[ B_t(s,a) := Q_t(s,a) - V_t(s) \] measures the relative cost of choosing action \(a\) over the optimal action. An implication of \eqref{eq:result-opt} is that reward shaping does not change the advantage function!
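The same kind of numerical check works for Corollary 10.1 and for the advantage function. The sketch below (again with arbitrary sizes and seed, for illustration) runs optimal backward induction for the original and shaped costs and confirms that the optimal \(Q\)-functions shift by \(Φ_t\) while the advantage function \(B_t\) is unchanged.

```python
# Numerical check of Corollary 10.1 and advantage invariance (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
S, A, T = 4, 3, 5                                  # arbitrary sizes

P    = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(T - 1)]
c1   = [rng.standard_normal((S, A, S)) for _ in range(T - 1)]
c1_T = rng.standard_normal(S)
Phi  = [rng.standard_normal(S) for _ in range(T)]

c2   = [c1[t] + Phi[t + 1][None, None, :] - Phi[t][:, None, None]
        for t in range(T - 1)]
c2_T = c1_T - Phi[-1]

def backward_induction(c, c_T):
    """Optimal Q_t for a cost-minimisation MDP with next-state-dependent costs."""
    V, Qs = c_T.copy(), []
    for t in reversed(range(T - 1)):
        Q = np.einsum('say,say->sa', P[t], c[t] + V[None, None, :])
        Qs.append(Q)
        V = Q.min(axis=1)                          # V_t(s) = min_a Q_t(s,a)
    return list(reversed(Qs))

Q1, Q2 = backward_induction(c1, c1_T), backward_induction(c2, c2_T)
for t in range(T - 1):
    assert np.allclose(Q2[t], Q1[t] - Phi[t][:, None])       # Corollary 10.1
    B1 = Q1[t] - Q1[t].min(axis=1, keepdims=True)            # B_t = Q_t - V_t
    B2 = Q2[t] - Q2[t].min(axis=1, keepdims=True)
    assert np.allclose(B1, B2)                               # shaping-invariant
print("Verified: optimal values shift by Phi_t; the advantage function is unchanged")
```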

Remark

Another implication of Theorem 10.1 and Corollary 10.1 is that for any policy \(π\), \[ V^{\pi,2}_t(s) - V^{2}_t(s) = V^{\pi,1}_t(s) - V^1_t(s), \] which follows by subtracting \eqref{eq:result-opt} from \eqref{eq:result}. Thus, reward shaping also preserves near-optimality; i.e., if a policy is approximately optimal in model \(M^1\), then it is approximately optimal in model \(M^2\) as well.

10.1 Generalization to discounted models

Now consider a finite horizon discounted cost problem, where the performance of a policy \(π\) is given by \[ J(π) = \EXP\Bigl[ \sum_{t=1}^{T-1} γ^{t-1} c_t(S_t, A_t, S_{t+1}) + γ^{T-1} c_T(S_T) \Bigr]. \] As argued in the introduction to discounted models, the dynamic program for this case is given by

\[ V^*_{T}(s) = c_T(s) \] and for \(t \in \{T-1, \dots, 1\}\): \[ \begin{align*} Q^*_t(s,a) &= \EXP[ c_t(s,a,S_{t+1}) + γ V^*_{t+1}(S_{t+1}) \mid S_t = s, A_t = a ], \\ V^*_t(s) &= \min_{a \in \ALPHABET A} Q^*_t(s,a). \end{align*} \]

For such models, we have the following.

Corollary 10.2 For discounted cost models, the results of Theorem 10.1 and Corollary 10.1 continue to hold if condition 2 is replaced by

  2. For \(t \in \{1, \dots, T-1\}\),

    \[ c^2_t(s,a,s_{+}) = c^1_t(s,a,s_{+}) + γ Φ_{t+1}(s_{+}) - Φ_t(s). \]
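The proof is essentially the same as that of Theorem 10.1; to see why the shaping term must carry the discount factor, here is a sketch of the induction step. Assuming \(V^{\pi,2}_{t+1} = V^{\pi,1}_{t+1} - Φ_{t+1}\), \[\begin{align*} Q^{\pi,2}_t(s,a) &= \EXP[ c^2_t(s,a,S_{t+1}) + γ V^{\pi,2}_{t+1}(S_{t+1}) \mid S_t = s, A_t = a ] \\ &= \EXP[ c^1_t(s,a,S_{t+1}) + γ Φ_{t+1}(S_{t+1}) - Φ_t(s) \\ &\qquad + γ V^{\pi,1}_{t+1}(S_{t+1}) - γ Φ_{t+1}(S_{t+1}) \mid S_t = s, A_t = a ] \\ &= Q^{\pi,1}_t(s,a) - Φ_t(s). \end{align*}\] The \(γ Φ_{t+1}(S_{t+1})\) terms cancel precisely because the shaping term is discounted at the same rate as the continuation value.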

Infinite horizon models

If the cost function is time-homogeneous, Corollary 10.2 extends naturally to infinite horizon models with a time-homogeneous potential function. A remarkable feature is that if the potential function is chosen as the optimal value function, i.e., \(Φ(s) = V^*(s)\), then the optimal value function of the model with the modified cost \(\tilde c(s,a,s_{+})\) is identically zero!
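To see why, note that with \(Φ = V^*\) the modified cost is \(\tilde c(s,a,s_{+}) = c(s,a,s_{+}) + γ V^*(s_{+}) - V^*(s)\), and the function \(\tilde V \equiv 0\) satisfies the Bellman equation for the modified model: \[ \min_{a \in \ALPHABET A} \EXP[ \tilde c(s,a,S_{+}) + γ \tilde V(S_{+}) \mid S = s, A = a ] = \min_{a \in \ALPHABET A} \EXP[ c(s,a,S_{+}) + γ V^*(S_{+}) \mid S = s, A = a ] - V^*(s) = 0, \] where the last equality is the Bellman equation for \(V^*\). Since the Bellman operator for the modified cost is a contraction, \(\tilde V \equiv 0\) is its unique fixed point.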

10.2 Examples

As an example of reward shaping, see the notes on inventory management. Also see the notes on the martingale approach to stochastic control for an interesting relationship between reward shaping and martingales.

Notes

The idea of reward shaping was proposed by Skinner (1938) to synthesize complex behavior by guiding animals to perform simple functions (see Skinner’s Box Experiment). The formal description of reward shaping comes from Porteus (1975), who established a result similar to Ng et al. (1999) and called it the transformation method. Porteus (1975) also describes transformations of the dynamics which preserve the optimal policy.

Corollary 10.2 was re-established by Ng et al. (1999), who also provided a partial converse. The results of Porteus (1975) and Ng et al. (1999) were restricted to time-homogeneous potential functions. The generalization to time-varying potential functions was presented in Devlin and Kudenko (2012).

The partial converse of Corollary 10.1 established by Ng et al. (1999) states that the shaping presented in Theorem 10.1 is the only additive cost transformation that preserves the set of optimal policies. However, this converse was derived under the assumption that the transition dynamics are complete (see Ng et al. (1999)). A similar converse under a weaker set of assumptions on the transition dynamics is established in Jenner et al. (2022).

For a discussion on practical considerations in using reward shaping in reinforcement learning, see Grzes and Kudenko (2009) and Devlin (2014). As a counter-point, Wiewiora (2003) shows that the advantages of reward shaping can also be achieved by simply adding the potential function to the \(Q\)-function initialization.