9 Reward Shaping
Let \(M^1\) and \(M^2\) denote two MDPs on the same state space \(\ALPHABET S\) and action space \(\ALPHABET A\). Both MDPs have the same dynamics \(P = (P_1, \dots, P_T)\), but different cost functions \(c^1 = (c^1_1, \dots, c^1_T)\) and \(c^2 = (c^2_1, \dots, c^2_T)\). We assume that for \(t \in \{1, \dots, T-1\}\), the per-step cost is a function of the current state, current action, and next state (see cost depending on next state)1 and for \(t = T\), the per-step cost function is just a function of the current state.
1 We choose the cost to depend on the next state only for convenience of analysis. The result of Theorem 9.1 can be established for models where the cost depends only on the per-step cost by replacing property 2 in Theorem 9.1 by \[ c^2_t(s,a) = c^1_t(s,a) + \sum_{s' \in \ALPHABET S} P_t(s'|s,a) Φ_{t+1}(s') - Φ_t(s).\]
Theorem 9.1 Suppose the cost functions in MDPs \(M^1\) and \(M^2\) are related as follows:
For \(t = T\), \[ c^2_T(s) = c^1_T(s) - Φ_T(s). \]
For \(t \in \{1, \dots, T-1\}\), \[ c^2_t(s,a,s_{+}) = c^1_t(s,a,s_{+}) + Φ_{t+1}(s_{+}) - Φ_t(s). \]
Then, for any policy \(π\), \[\begin{equation}\label{eq:result} Q^{\pi,2}_t(s,a) = Q^{\pi,1}_t(s,a) - Φ_t(s) \quad\text{and}\quad V^{\pi,2}_t(s) = V^{\pi,1}_t(s) - Φ_t(s). \end{equation}\]
By almost an analogous argument, we can show that the optimal value functions also satisfy a similar relationship.
Corollary 9.1 Under conditions of Theorem 9.1, \[\begin{equation}\label{eq:result-opt} Q^{2}_t(s,a) = Q^{1}_t(s,a) - Φ_t(s) \quad\text{and}\quad V^{2}_t(s) = V^{1}_t(s) - Φ_t(s). \end{equation}\]
9.1 Generalization to discounted models
Now consider a finite horizon discounted cost problem, where the performance of a policy \(π\) is given by \[ J(π) = \EXP\Bigl[ \sum_{t=1}^{T-1} γ^{t-1} c_t(S_t, A_t) + γ^T c_T(S_T) \Bigr]. \] As argued in the introduction to discounted models, the dynamic prgram for this case is given by
\[ V_{T}(s) = c_T(s) \] and for \(t \in \{T-1, \dots, 1\}\): \[ \begin{align*} Q_t(s,a) &= c(s,a) + γ \EXP[ V_{t+1}(S_{t+1}) | S_t = s, A_t = a ], \\ V_t(s) &= \min_{a \in \ALPHABET A} Q_t(s,a). \end{align*} \]
For such models, we have the following.
Corollary 9.2 For discounted cost models, the results of Theorem 9.1 and Corollary 9.1 continue to hold if condition 2 is replaced by
For \(t \in \{1, \dots, T-1\}\),
\[ c^2_t(s,a,s_{+}) = c^1_t(s,a,s_{+}) + γ Φ_{t+1}(s_{+}) - Φ_t(s). \]
9.2 Examples
As an example of reward shaping, see the notes on inventory management. Also see the notes on martingale approach to stochastic control for an iteresting relationship between reward shaping and martingales.
Notes
The idea of reward shaping was proposed by Skinner (1938) to synthesize complex behavior by guiding animals to perform simple functions (see :Skinner’s Box Experiment). The formal description of reward shaping comes from Porteus (1975), who established a result similar to Ng et al. (1999), and called it the transformation method. Porteus (1975) also describes transformations of the dynamics which preserve the optimal policy.
Corollary 9.2 was also re-established by Ng et al. (1999), who aslo provided a partial converse. The results of Porteus (1975) and Ng et al. (1999) were restricted to time-homogeneous potential functions. The generalization to time-varying potential functions was presented in Devlin and Kudenko (2012).
The partial converse of Corollary 9.1 established by Ng et al. (1999) states that the shaping presented in Theorem 9.1 is the only additive cost transformation that that preserves the set of optimal policy. However, this converse was derived under the assumption that the transition dynamics are complete (see Ng et al. (1999)). A similar converse under a weaker set of assumptions on the transition dynamics is established in Jenner et al. (2022).
For a discussion on practical considerations in using reward shaping in reinforcement learning, see Grzes and Kudenko (2009) and Devlin (2014). As a counter-point, Wiewiora (2003) shows that the advantages of reward shaping can also be achieved by simply adding the potential function to the \(Q\)-function initialization.