ECSE 506: Stochastic Control and Decision Theory

Aditya Mahajan
Winter 2022


Theory: Reward Shaping

Under what conditions do two MDPs with the same dynamics but different cost functions have the same optimal policy? This is an important question in reinforcement learning (where one often shapes the reward function to speed up learning) and in inverse reinforcement learning (where one learns the reward function from the behavior of an expert). The following result provides a complete answer to this question. These results are typically established for infinite horizon models; however, in my opinion, it is conceptually simpler to start with the finite horizon model.

Let \(M^1\) and \(M^2\) denote two MDPs on the same state space \(\ALPHABET S\) and action space \(\ALPHABET A\). Both MDPs have the same dynamics \(P = (P_1, \dots, P_T)\), but different cost functions \(c^1 = (c^1_1, \dots, c^1_T)\) and \(c^2 = (c^2_1, \dots, c^2_T)\). We assume that for \(t \in \{1, \dots, T-1\}\), the per-step cost is a function of the current state, current action, and next state (see cost depending on next state)1, and for \(t = T\), the per-step cost is a function of the current state alone.

Theorem 1

Suppose the cost functions in MDPs \(M^1\) and \(M^2\) are related as follows:

  1. For \(t = T\), \[ c^2_T(s) = c^1_T(s) - Φ_T(s). \]

  2. For \(t \in \{1, \dots, T-1\}\), \[ c^2_t(s,a,s_{+}) = c^1_t(s,a,s_{+}) + Φ_{t+1}(s_{+}) - Φ_t(s). \]

Then, for any policy \(π\), \[\begin{equation}\label{eq:result} Q^{\pi,2}_t(s,a) = Q^{\pi,1}_t(s,a) - Φ_t(s) \quad\text{and}\quad V^{\pi,2}_t(s) = V^{\pi,1}_t(s) - Φ_t(s). \end{equation}\]
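Before going through the formal proof, it is useful to observe that the potential terms telescope along any trajectory: for a trajectory \((s_t, a_t, s_{t+1}, \dots, s_T)\), \[ \sum_{τ=t}^{T-1} c^2_τ(s_τ, a_τ, s_{τ+1}) + c^2_T(s_T) = \sum_{τ=t}^{T-1} c^1_τ(s_τ, a_τ, s_{τ+1}) + c^1_T(s_T) - Φ_t(s_t). \] Thus, the shaped cost-to-go differs from the original cost-to-go only by a term that depends on the state at time \(t\) and not on the actions, which is why the relative ordering of policies is unaffected. Taking expectations under any policy \(π\) gives the value-function relation in \eqref{eq:result}; the proof below makes the same argument by backward induction, which also yields the relation for the \(Q\)-functions.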

Remark 1

The sign of the potential function is irrelevant. So, we could also have written \[ c^2_t(s,a,s_{+}) = c^1_t(s,a,s_{+}) + Φ_t(s) - Φ_{t+1}(s_{+}) \] and argued that \[ V^{\pi,2}_t(s) = V^{\pi,1}_t(s) + Φ_t(s).\]

Proof

We prove the result by backward induction. First note that \[ V^{\pi,2}_T(s) = c^2_T(s) = c^1_T(s) - Φ_T(s) = V^{\pi,1}_T(s) - Φ_T(s). \] This forms the basis of induction. Now suppose that \eqref{eq:result} holds for time \(t+1\). Then, \[\begin{align*} Q^{\pi,2}_t(s,a) &= \EXP[ c^2_t(s,a,S_{t+1}) + V^{\pi,2}_{t+1}(S_{t+1}) \mid S_t = s, A_t = a ] \\ &\stackrel{(a)}= \EXP[ c^1_t(s,a,S_{t+1}) - Φ_t(s) + Φ_{t+1}(S_{t+1}) \\ &\qquad + V^{\pi,1}_{t+1}(S_{t+1}) - Φ_{t+1}(S_{t+1}) \mid S_t = s, A_t = a ] \\ &= \EXP[ c^1_t(s,a,S_{t+1}) - Φ_t(s) + V^{\pi,1}_{t+1}(S_{t+1}) \mid S_t = s, A_t = a] \\ &= Q^{\pi,1}_t(s,a) - Φ_t(s), \end{align*}\] where \((a)\) follows from condition 2 and the induction hypothesis.

Now, \[ \begin{align*} V^{\pi,2}_t(s) &= Q^{\pi,2}_t(s,\pi(s)) \\ &= Q^{\pi,1}_t(s,\pi(s)) - Φ_t(s) \\ &= V^{\pi,1}_t(s) - Φ_t(s). \end{align*}\]

This proves the induction step.

An almost identical argument shows that the optimal value functions satisfy a similar relationship.

Corollary 1

Under conditions of Theorem 1, \[\begin{equation}\label{eq:result-opt} Q^{2}_t(s,a) = Q^{1}_t(s,a) - Φ_t(s) \quad\text{and}\quad V^{2}_t(s) = V^{1}_t(s) - Φ_t(s). \end{equation}\]

Remark 2

The advantage (or benefit) function given by \[ B_t(s,a) := Q_t(s,a) - V_t(s) \] measures the relative cost of choosing action \(a\) over the optimal action. An implication of \eqref{eq:result-opt} is that reward shaping does not change the advantage function!
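Indeed, combining the two relations in \eqref{eq:result-opt}, \[ B^2_t(s,a) = Q^2_t(s,a) - V^2_t(s) = \bigl( Q^1_t(s,a) - Φ_t(s) \bigr) - \bigl( V^1_t(s) - Φ_t(s) \bigr) = B^1_t(s,a). \]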

Remark 3

Another implication of Theorem 1 and Corollary 1 is that for any policy \(π\), \[ V^{\pi,2}_t(s) - V^{2}_t(s) = V^{\pi,1}_t(s) - V^1_t(s). \] Thus, reward shaping also preserves near-optimality; i.e., if a policy is approximately optimal in model \(M^1\), then it is approximately optimal in model \(M^2\) as well.
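The identities above are easy to check numerically. Below is a minimal sketch (not from the original notes) in Python/NumPy that builds a random finite-horizon MDP and a random potential function, runs backward induction for both the original and the shaped costs, and checks \eqref{eq:result-opt} and the invariance of the advantage function; all sizes and variable names are arbitrary illustrative choices.

```python
# Minimal sketch: verify Theorem 1 / Corollary 1 numerically on a random
# finite-horizon MDP. All sizes and variable names are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
S, A, T = 4, 3, 5                      # number of states, actions, horizon

# P[t][a][s, s_next] = P_t(s_next | s, a); each row is a probability vector.
P = [[rng.dirichlet(np.ones(S), size=S) for _ in range(A)] for _ in range(T - 1)]
# Original cost c1[t][s, a, s_next] for t < T and terminal cost c1_T[s].
c1 = [rng.random((S, A, S)) for _ in range(T - 1)]
c1_T = rng.random(S)
# Potential functions Phi[t][s] for t = 1, ..., T (0-indexed here).
Phi = [rng.random(S) for _ in range(T)]

# Shaped costs as in Theorem 1.
c2 = [c1[t] + Phi[t + 1][None, None, :] - Phi[t][:, None, None]
      for t in range(T - 1)]
c2_T = c1_T - Phi[T - 1]

def backward_induction(c, c_T):
    """Optimal Q_t and V_t by backward induction; lists indexed by t."""
    V = [None] * T
    Q = [None] * (T - 1)
    V[T - 1] = c_T
    for t in reversed(range(T - 1)):
        Qt = np.empty((S, A))
        for a in range(A):
            # Q_t(s,a) = sum_{s'} P_t(s'|s,a) [ c_t(s,a,s') + V_{t+1}(s') ]
            Qt[:, a] = np.sum(P[t][a] * (c[t][:, a, :] + V[t + 1]), axis=1)
        Q[t], V[t] = Qt, Qt.min(axis=1)
    return Q, V

Q1, V1 = backward_induction(c1, c1_T)
Q2, V2 = backward_induction(c2, c2_T)

for t in range(T - 1):
    assert np.allclose(Q2[t], Q1[t] - Phi[t][:, None])   # Corollary 1
    assert np.allclose(V2[t], V1[t] - Phi[t])            # Corollary 1
    assert np.allclose(Q2[t] - V2[t][:, None],
                       Q1[t] - V1[t][:, None])           # Remark 2
print("Shaping relations verified on a random MDP.")
```

The same check works for the discounted model of the next section after multiplying the continuation value by \(γ\) in the backward recursion and using \(γ Φ_{t+1}(s_{+}) - Φ_t(s)\) as the shaping term.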

1 Generalization to discounted models

Now consider a finite horizon discounted cost problem, where the performance of a policy \(π\) is given by \[ J(π) = \EXP\Bigl[ \sum_{t=1}^{T-1} γ^{t-1} c_t(S_t, A_t) + γ^{T-1} c_T(S_T) \Bigr]. \]

As argued in the introduction to discounted models, the dynamic program for this case is given by

\[ V_{T}(s) = c_T(s) \] and for \(t \in \{T-1, \dots, 1\}\): \[ \begin{align*} Q_t(s,a) &= c_t(s,a) + γ \EXP[ V_{t+1}(S_{t+1}) | S_t = s, A_t = a ], \\ V_t(s) &= \min_{a \in \ALPHABET A} Q_t(s,a). \end{align*} \]

For such models, we have the following.

Corollary 2

For discounted cost models, the results of Theorem 1 and Corollary 1 continue to hold if condition 2 is replaced by

  2. For \(t \in \{1, \dots, T-1\}\),

    \[ c^2_t(s,a,s_{+}) = c^1_t(s,a,s_{+}) + γ Φ_{t+1}(s_{+}) - Φ_t(s). \]
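A one-step calculation, mirroring step (a) in the proof of Theorem 1 and writing the dynamic program with the cost depending on the next state, shows why the discount factor must multiply \(Φ_{t+1}\). Assuming (as the induction hypothesis) that \(V^{2}_{t+1}(s) = V^{1}_{t+1}(s) - Φ_{t+1}(s)\), \[\begin{align*} Q^{2}_t(s,a) &= \EXP[ c^2_t(s,a,S_{t+1}) + γ V^{2}_{t+1}(S_{t+1}) \mid S_t = s, A_t = a ] \\ &= \EXP[ c^1_t(s,a,S_{t+1}) + γ Φ_{t+1}(S_{t+1}) - Φ_t(s) + γ V^{1}_{t+1}(S_{t+1}) - γ Φ_{t+1}(S_{t+1}) \mid S_t = s, A_t = a ] \\ &= Q^{1}_t(s,a) - Φ_t(s). \end{align*}\] The \(γ\) in the shaping term is exactly what is needed to cancel the discounted potential carried forward by \(V^{2}_{t+1}\); the same steps apply verbatim for a fixed policy \(π\).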

Remark 4

If the cost function is time homogeneous, Corollary 2 extends naturally to infinite horizon models with a time-homogeneous potential function. A remarkable feature is that if the potential function is chosen as the value function, i.e., \(Φ(s) = V(s)\), then the value function of the modified cost \(\tilde c(s,a,s_{+})\) is zero!
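To see this, let \(V\) denote the (time-homogeneous) optimal value function and take \(Φ = V\). The shaped cost is \[ \tilde c(s,a,s_{+}) = c(s,a,s_{+}) + γ V(s_{+}) - V(s), \] and Corollary 2 with this choice of potential gives \(\tilde V(s) = V(s) - Φ(s) = 0\). Equivalently, the shaped \(Q\)-function \(\tilde Q(s,a) = Q(s,a) - V(s)\) is exactly the advantage function, whose minimum over actions is zero.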

2 Examples

As an example of reward shaping, see the notes on inventory management. Also see the notes on the martingale approach to stochastic control for an interesting relationship between reward shaping and martingales.

References

The idea of reward shaping was proposed by Skinner (1938) to synthesize complex behavior by guiding animals to perform simple functions (see Skinner’s Box Experiment). The formal description of reward shaping comes from Porteus (1975), who established a result similar to Ng et al. (1999), and called it the transformation method. Porteus (1975) also describes transformations of the dynamics which preserve the optimal policy.

Corollary 2 was also re-established by Ng et al. (1999), who also provided a partial converse. The results of Porteus (1975) and Ng et al. (1999) were restricted to time-homogeneous potential functions. The generalization to time-varying potential functions was presented in Devlin and Kudenko (2012).

The partial converse of Corollary 1 established by Ng et al. (1999) states that the shaping presented in Theorem 1 is the only additive cost transformation that preserves the set of optimal policies. However, this converse was derived under the assumption that the transition dynamics are complete (see Ng et al. (1999)). A similar converse under a weaker set of assumptions on the transition dynamics is established in Jenner et al. (2022).

For a discussion on practical considerations in using reward shaping in reinforcement learning, see Grzes and Kudenko (2009) and Devlin (2014). As a counter-point, Wiewiora (2003) shows that the advantages of reward shaping can also be achieved by simply adding the potential function to the \(Q\)-function initialization.


Devlin, S. 2014. Potential based reward shaping tutorial. Available at: http://www-users.cs.york.ac.uk/~devlin/presentations/pbrs-tut.pdf.
Devlin, S. and Kudenko, D. 2012. Dynamic potential-based reward shaping. Proceedings of the 11th international conference on autonomous agents and multiagent systems, International Foundation for Autonomous Agents; Multiagent Systems, 433–440.
Grzes, M. and Kudenko, D. 2009. Theoretical and empirical analysis of reward shaping in reinforcement learning. International conference on machine learning and applications, 337–344. DOI: 10.1109/ICMLA.2009.33.
Jenner, E., Hoof, H. van, and Gleave, A. 2022. Calculus on MDPs: Potential shaping as a gradient. Available at: https://arxiv.org/abs/2208.09570v1.
Ng, A.Y., Harada, D., and Russell, S. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. ICML, 278–287. Available at: http://aima.eecs.berkeley.edu/~russell/papers/icml99-shaping.pdf.
Porteus, E.L. 1975. Bounds and transformations for discounted finite Markov decision chains. Operations Research 23, 4, 761–784. DOI: 10.1287/opre.23.4.761.
Skinner, B.F. 1938. Behavior of organisms. Appleton-Century.
Wiewiora, E. 2003. Potential-based shaping and Q-value initialization are equivalent. Journal of Artificial Intelligence Research 19, 1, 205–208.

  1. We choose the cost to depend on the next state only for convenience of analysis. The result of Theorem 1 can be established for models where the per-step cost depends only on the current state and action by replacing condition 2 in Theorem 1 by \[ c^2_t(s,a) = c^1_t(s,a) + \sum_{s' \in \ALPHABET S} P_t(s'|s,a) Φ_{t+1}(s') - Φ_t(s).\]
