17  Inventory management (revisited)

Updated

February 25, 2026

Keywords

inventory management, base-stock policy, reward shaping, structural results, stochastic optimization, infinite horizon, discounted cost

Summary

One of the potential benefits of modeling a system as an infinite horizon discounted cost MDP is that it can be simpler to identify an optimal policy. We illustrate this using the inventory management example.

Let’s reconsider the model for inventory management and assume that it runs for an infinite horizon. For simplicity, we consider the case when the state and action spaces are real valued. Recall that the dynamics are given by \[ S_{t+1} = S_t + A_t - W_t \] We take the per-step cost to be \[c(s,a,s_{+}) = p a + γ h(s_{+}), \] where \[ h(s) = \begin{cases} c_h s, & \text{if $s \ge 0$} \\ -c_s s, & \text{if $s < 0$}, \end{cases}\] and \(c_h\) is the per-unit holding cost, \(c_s\) is the per-unit shortage cost, and \(p\) is the per-unit procurement cost. Note that we have assumed that the holding or shortage cost is discounted because this cost is incurred at the end of the time period.
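As a concrete reading of the model, the dynamics and per-step cost can be transcribed directly. The sketch below uses the parameter values of the running example (\(c_h = 2\), \(c_s = 5\), \(p = 1\), \(γ = 0.9\)); the particular state, action, and demand values are purely illustrative.

```python
# Per-step model of the inventory system (parameters of the running example).
p, gamma, c_h, c_s = 1.0, 0.9, 2.0, 5.0

def h(s):
    # Holding cost for positive inventory, shortage cost for backlog.
    return c_h * s if s >= 0 else -c_s * s

def step(s, a, w):
    # Dynamics S_{t+1} = S_t + A_t - W_t and cost c(s, a, s+) = p*a + gamma*h(s+).
    s_next = s + a - w
    return s_next, p * a + gamma * h(s_next)

s_next, cost = step(10, 5, 20)  # order 5, demand 20: a backlog of 5 units
```

Here `step` returns the next state and the realized per-step cost; with \(s = 10\), \(a = 5\), \(w = 20\) the next state is \(-5\) and the cost is \(5 + 0.9 \times 25 = 27.5\).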

Recall that in the finite horizon setting, the optimal policy was a base-stock policy characterized by thresholds \(\{σ_t\}_{t \ge 1}\). In the infinite horizon discounted setting, we expect the optimal policy to be time-homogeneous, i.e., the thresholds \(\{σ_t\}_{t \ge 1}\) to equal a constant \(σ\) that does not depend on time.

As an illustration, let’s reconsider the example used for the finite horizon setting (where \(c_h = 2\), \(c_s = 5\), \(p=1\), and the demand is Binomial(50,0.4)). We consider the discount factor \(γ = 0.9\). The value function and optimal policy in this case are shown below.

Figure 17.1: Optimal policy for the model with binomial demand. (a) Value function; (b) optimal policy.

We are interested in the following question: Is it possible to identify the optimal threshold of the base-stock policy without explicitly solving the dynamic program? In this section, we show that the answer is affirmative.

As was the case for the finite horizon, we use a post-decision state formulation where we take \(Z_t = S_t + A_t\) as the post-decision state. The corresponding dynamic program is given by the following coupled fixed point equations: \[\begin{align*} H(z) &= \EXP[ h(z - W) + V^*(z - W) ] \\ V^*(s) &= \min_{a \in \reals_{\ge 0}} \bigl\{ pa + γ H(s+a) \bigr\} \end{align*}\] with the optimal time-homogeneous policy given by \[ π^*(s) = \arg \min_{a \in \reals_{\ge 0}} \bigl\{ pa + γ H(s+a) \bigr\} \]
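These coupled fixed-point equations can be solved numerically by value iteration on a truncated integer grid. The sketch below uses the running example's parameters; the truncation range and iteration count are assumptions of this sketch, not part of the notes, and it is an illustration rather than the method used to produce the figures.

```python
import math

# Value iteration for the post-decision-state dynamic program.
# Parameters of the running example; the grid truncation is an assumption.
c_h, c_s, p, gamma = 2.0, 5.0, 1.0, 0.9
n, q = 50, 0.4
pmf = [math.comb(n, k) * q**k * (1 - q)**(n - k) for k in range(n + 1)]

def h(s):
    return c_h * s if s >= 0 else -c_s * s

S_MIN, S_MAX = -60, 80   # truncated state grid (assumed)
Z_MIN = S_MIN + n        # smallest z with z - W always on the grid

def compute_H(V):
    # H(z) = E[h(z - W) + V(z - W)]
    return {z: sum(pmf[w] * (h(z - w) + V[z - w]) for w in range(n + 1))
            for z in range(Z_MIN, S_MAX + 1)}

V = {s: 0.0 for s in range(S_MIN, S_MAX + 1)}
for _ in range(120):
    H = compute_H(V)
    # V(s) = min over post-decision states z = s + a with a >= 0
    V = {s: min(p * (z - s) + gamma * H[z]
                for z in range(max(s, Z_MIN), S_MAX + 1))
         for s in range(S_MIN, S_MAX + 1)}

H = compute_H(V)
# Base-stock threshold: minimizer of f(z) = p*z + gamma*H(z)
sigma = min(range(Z_MIN, S_MAX + 1), key=lambda z: p * z + gamma * H[z])
```

The greedy policy is then the base-stock policy \(π(s) = \max\{σ - s, 0\}\), and the computed \(σ\) can be checked against the closed form derived later in the section.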

Suppose we find the fixed point by starting from an initial guess \(V_0(s) = 0\) and iteratively computing \(\{ V_t \}_{t \ge 0}\) and \(\{H_t \}_{t \ge 0}\). Then, by Theorem 7.1, each \(V_t\) and \(H_t\) is convex and has the structure established there. Therefore, the limits \(V^*\) and \(H\) also have these properties.

In particular, define \[ f(z) = pz + γ H(z) \] Then, \[ V^*(s) = \min_{a \in \reals_{\ge 0}} f(s+a) - p s = f(\max\{s, σ\}) - ps \] where \[ σ = \arg \min_{z \in \reals} f(z) \]

Therefore, we have

\[\begin{align*} f(z) &= pz + γH(z) \\ &= pz + γ \EXP[h(z - W)] + γ \EXP[ V^*(z - W)] \\ &= (1 - γ) p z + γ p μ + γ \EXP[ h(z - W) ] + γ \EXP[ f(\max\{z - W, σ\}) ] \end{align*}\] where \(μ = \EXP[W]\) and the last step substitutes \(V^*(z - W) = f(\max\{z - W, σ\}) - p(z - W)\). For the ease of notation, define \[ c^*(z) = (1 - γ) p z + γ \EXP[h(z - W)] + γ p μ \] Then, we have \[\begin{equation}\label{eq:F} f(z) = c^*(z) + γ \EXP[ f(\max\{z - W, σ\})] \end{equation}\]

The main result is the following.

Lemma 17.1 \(f(z)\) and \(c^*(z)\) have the same minimizer, i.e., \[ \arg\min_{z \in \reals}f(z) = \arg\min_{z \in \reals}c^*(z) \] Therefore, \[ σ = \arg\min_{z \in \reals}c^*(z) \]

For the ease of notation, define \(g(z) = c^*(z)\) and \(\bar f(z) = \EXP[f(\max\{z - W, σ\})]\). Then, \(\eqref{eq:F}\) can be written as \[ f(z) = g(z) + γ \bar f(z) \] We know that all three functions are convex. Moreover, \(σ\) is a minimizer of \(f\). Thus, \[ 0 \in ∂ f(σ), \quad \text{i.e.,} \quad f'_{-}(σ) \le 0 \le f'_{+}(σ). \] We will show that \[ 0 \in ∂ g(σ), \quad \text{i.e.,} \quad g'_{-}(σ) \le 0 \le g'_{+}(σ). \]

Left derivative: Since the demand \(W \ge 0\), we have \(\bar f(z) = f(σ)\) for \(z \le σ\). Thus, \[ f(z) = g(z) + γ f(σ), \quad z \le σ \] which implies that \[ g'_{-}(σ) = f'_{-}(σ) \le 0. \]

Right derivative: By the dominated convergence theorem, we can interchange derivatives and expectation. Therefore, \[ \bar f'_{+}(z) = \EXP[ f'_{+}(z - W) \IND\{ z - W \ge σ \} ], \quad z \ge σ. \] Thus, \[ \bar f'_{+}(σ) = f'_{+}(σ) \PR(W = 0). \]

Taking the right derivative of \(f(z) = g(z) + γ \bar f(z)\), we get \[ f'_{+}(σ) = g'_{+}(σ) + γ f'_{+}(σ) \PR(W = 0) \implies g'_{+}(σ) = f'_{+}(σ)(1 - γ \PR(W = 0)) \] We know that \(f'_{+}(σ) \ge 0\) and \((1 - γ\PR(W = 0)) \ge 0\). Thus, \[ g'_{+}(σ) \ge 0. \]

Hence, \(σ\) is a minimizer of \(g\).

Theorem 17.1 The optimal threshold \(σ\) is the minimizer of \(c^*(z)\). In particular, if \(F\) denotes the CDF of the demand, then \[\begin{equation} σ = F^{-1}\left( \frac{c_s - p(1-γ)/γ}{c_h + c_s} \right). \label{eq:opt-threshold} \end{equation}\]

Proof

The proof idea is the same approach as that used for the newsvendor problem. In particular, let \(f_W\) denote the density of \(W\) and \(F\) denote its CDF. Then, \[\begin{align*} c^*(σ) &= (1-γ) p σ + γ p μ + γ \EXP[h(σ-W)] \\ &= γ \int_{0}^{σ} c_h(σ-w)f_W(w)\,dw + γ \int_{σ}^{∞} c_s(w-σ)f_W(w)\,dw \\ &\quad + (1-γ)p σ + γ p μ \end{align*}\] Therefore, \[\begin{align*} \frac{∂ c^*(σ)}{∂σ} &= γ \int_{0}^{σ} c_h f_W(w)\,dw + γ\int_{σ}^{\infty}[-c_s] f_W(w)\,dw + (1-γ)p \\ &= γ c_h F(σ) - γ c_s(1 - F(σ)) + (1-γ) p \end{align*}\] To find the optimal threshold, we set the derivative to \(0\) and simplify to obtain \eqref{eq:opt-threshold}.

We can use \eqref{eq:opt-threshold} to find the optimal threshold for the example used above. In particular, we need to find the value of \(σ\) at which the CDF of the demand equals the value \((c_s - p(1-γ)/γ)/(c_h + c_s)\), as shown in Figure 17.2 below.

Figure 17.2: The optimal threshold for the example shown above.
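For a discrete demand distribution, \(F^{-1}\) in \eqref{eq:opt-threshold} is the smallest value at which the CDF reaches the critical fractile. A minimal sketch of this computation for the binomial example, using only the standard library:

```python
import math

# Evaluate the threshold formula for the running example:
# c_h = 2, c_s = 5, p = 1, gamma = 0.9, W ~ Binomial(50, 0.4).
c_h, c_s, p, gamma = 2.0, 5.0, 1.0, 0.9
n, q = 50, 0.4

fractile = (c_s - p * (1 - gamma) / gamma) / (c_h + c_s)

# F^{-1}(fractile): smallest k with F(k) >= fractile.
cdf, sigma = 0.0, n
for k in range(n + 1):
    cdf += math.comb(n, k) * q**k * (1 - q)**(n - k)
    if cdf >= fractile:
        sigma = k
        break
```

This reads off the threshold directly from the demand CDF, with no dynamic programming involved.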

Exercises

Exercise 17.1 Suppose that the demand is exponentially distributed with mean \(\mu\), i.e., the density of \(W\) is given by \(e^{-s/\mu}/\mu\). Show that the optimal threshold is given by \[ σ = \mu \log \left[ \frac{ c_h + c_s} { c_h + p (1-γ)/γ} \right]. \]

Hint: Recall that the CDF of the exponential distribution is \(F(s) = 1 - e^{-s/μ}\).
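The closed form of Exercise 17.1 can be sanity-checked against \eqref{eq:opt-threshold}: plugging \(σ\) into the exponential CDF should recover the critical fractile. The holding and shortage costs below are illustrative choices, since the exercise does not fix them.

```python
import math

# Check that sigma = mu * log[(c_h + c_s)/(c_h + p(1-gamma)/gamma)] satisfies
# F(sigma) = (c_s - p(1-gamma)/gamma)/(c_h + c_s) with F(s) = 1 - exp(-s/mu).
mu, p, gamma = 5.0, 1.0, 0.9   # parameters from the notes' simulation
c_h, c_s = 2.0, 5.0            # assumed values (not given in the exercise)

sigma = mu * math.log((c_h + c_s) / (c_h + p * (1 - gamma) / gamma))
fractile = (c_s - p * (1 - gamma) / gamma) / (c_h + c_s)

# The two expressions agree up to floating-point error.
assert abs((1 - math.exp(-sigma / mu)) - fractile) < 1e-12
```

The check is an algebraic identity, so it holds for any positive choice of the cost parameters.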

The plots below verify this result numerically. We simulate the system with parameters \(μ = 5\), \(p = 1\), \(γ = 0.9\).

Figure 17.3: Optimal policy for model with exponential demand. The black curve shows the optimal policy computed via value iteration and the red line shows the threshold computed via the formula of Exercise 17.1.

Notes

As far as I know, the explicit formula of the threshold presented in Theorem 17.1 has not appeared in the literature before. A partial version of this result is presented in Whittle (1982), using essentially the idea of reward shaping. It is interesting to note that Whittle (1982) uses the idea of reward shaping 17 years before the paper by Ng et al. (1999) on reward shaping. It is possible that Whittle was using the results of Porteus (1975).

As established in the notes on Lipschitz MDPs, it can be shown that the optimal value function for the inventory management model above is Lipschitz continuous.

In the analysis above, we did not formally verify that all the limits exist and the infimizations are well defined. For a detailed discussion of such technical details, see Feinberg (2016).

For models with a positive fixed ordering cost, an efficient algorithm for computing the optimal \((s,S)\) policy is presented in Zheng and Federgruen (1991).