ECSE 506: Stochastic Control and Decision Theory

Aditya Mahajan
Winter 2022

About  |  Lectures  |  Notes  |  Coursework

Example: The Newsvendor Problem
Image credit:

TL;DR The newsvendor problem is a simple model of stochastic optimization problem where a decision has to be made when there is uncertainty about the outcome. It also shows that for some stochastic optimization problems it is possible to obtain the qualitative properties of the nature of optimal solution.

Each morning, a newsvendor has to decide how many newspapers to buy before knowing the demand during the day. The newsvendor purchases a newspaper at a cost of \(\$p\) per newspaper and sells them at a cost of \(\$q\) per newspaper, where \(q > p\). Any unsold newspapers at the end of the day have no salvage value.

Let \(a\) denote the number of newspapers bought and \(W\) denotes the demand. If \(W < a\), then the newsvendor will sell \(W\) newspapers and receive a total earnings of \(q W - p a\). If \(W \ge a\), then the newsvendor will sell \(a\) newspapers and receive a total earning of \(q a - p a\). Thus, the reward is \(r(a,W)\), where

\[r(a, w) = \begin{cases} q w - p a, & \text{if } w < a, \\ q a - p a, & \text{if } w \ge a. \end{cases} \]

An illustration of the reward function is shown below.

Figure 1
Reward as a function of \(w\) for a fixed value of \(a\). You can change the values of \(p\) and \(a\) by moving the sliders. Here \(q = 1\).

The problem above has discrete action and discrete demand. To build intuition, we first consider the case where both the actions and demand are continuous. Let \(f(w)\) denote the probability density of the demand and \(F(w)\) denote the cumulative probability density. Then, the expected reward is \[ \begin{equation} \label{eq:J} J(a) = \int_{0}^a [ q w - p a ] f(w) dw + \int_{a}^\infty [ q a - p a ] f(w) dw. \end{equation}\]

As an illustration, the PDF of a particular distribution (see the plot later for exact choices of parameters) is shown below. Try to move the slider for \(a\) to try and find an (approximately) optimal value of the action.

Figure 2
The value of expected cost as a function of action. In the first plot, the blue shaded region represents the probability of getting a demand less than the ordered goods and the red shared region represents the probability of getting a demand greater than the ordered goods. The second plot shows the reward function \(r(a,\cdot)\), which depends on the value of \(p\) and \(q\).

We can also plot the performance as a function of action, and identify its maximma as shown in the plot below.

Figure 3
The value of expected cost as a function of action. The blue dot shows the value of action chosen in Figure 2. The red dot shows the optimal value.

The plot of \(J(a)\) is concave. This suggests that we can use calculus to find the optimal value. In particular, to find the optimal action, we need to compute the \(a\) such that \(dJ(a)/da = 0\). We first recall the Leibniz integral rule: \[ \dfrac{d}{dx} \left( \int_{p(x)}^{q(x)} f(x,t) dt \right) = f(x, q(x)) \cdot \dfrac {d}{dx} q(x) - f(x, p(x)) \cdot \dfrac {d}{dx} p(x) + \int_{p(x)}^{q(x)} \dfrac{\partial}{\partial x} f(x,t) dt. \]

Thus, the derivative of the first term of \eqref{eq:J} is \[ [q a - p a ] f(a) + \int_{0}^a [ -p ] f(w) dw = [q a - p a ] f(a) - p F(a). \]

Similarly, the derivative of the second term of \eqref{eq:J} is \[ - [q a - p a] f(a) + \int_{a}^{\infty} (q-p)f(w)dw = - [q a - p a] f(a) + (q -p)[ 1 - F(a)]. \]

Combining the two, we get that \[ \dfrac{dJ(a)}{da} = - p F(a) + (q - p) [ 1 - F(a) ]. \]

Equating this to \(0\), we get \[ F(a) = \dfrac{ q - p }{ q} \quad\text{or}\quad a = F^{-1} \left( \dfrac{ q - p }{ q } \right). \] In the literature, the quantity \((q-p)/q\) is called the critical fractile.

An illustration of the optimal decision rule is shown below.

Figure 4
The value of optimal action for different values of \(p\). The optimal action is such that the area of the red shaded region in the curve equals the critical factile value \((q - p)/q\).
In this example, it is assumed that \(W\) is distributed according to Kumaraswamy distribution with parameters \((a,b) = (2,5)\) and support \([0,100]\). The value of \(q=1\).

Strictly speaking, we also need to verify that the function \(J(a)\) is concave. To do so, we compute the second derivative: \[ \frac{d^2 J(a)}{da^2} = - p f(a) - (q - p) f(a) = -q f(a) \le 0. \] Hence, the \(J(a)\) is concave and the extremum value computed above is a maximum.

Now, we come back to the problem with discrete actions and discrete demand. Suppose \(W\) takes the values \(\ALPHABET W = \{ w_1, w_2, \dots, w_k \}\) (where \(w_1 < w_2 < \cdots < w_k\)) with probabilities \(\{ μ_1, μ_2, \dots, μ_k \}\). It is ease to see that in this case the action \(a\) should be in the set \(\{ w_1, w_2, \dots, w_k \}\).

As an illustration, consider the case when \(\ALPHABET W = \{0, 1, \dots, 100\}\).

Figure 5
The value of expected cost as a function of action. This is similar to Figure 1, except that demand and actions can take discrete values.

In the discrete case, the brute force search is easier (because there are a finite rather than continuous number of values). We cannot directly use the ideas from calculus because functions over discrete domain are not differentiable. But we can use a very similar idea. Instead of checking if \(dJ(a)/da = 0\), we check the sign of \(J(w_{i+1}) - J(w_i)\).

The expected reward for choice \(w_i\) is \[ \begin{align*} J(w_i) &= \sum_{j < i} μ_j [ q w_j - p w_i ] + \sum_{j \ge i} μ_j [q w_i - p w_i] \\ &= -p w_i + q \Bigl[ \sum_{j < i} μ_j w_j + \sum_{j \ge i} μ_j w_i \Bigr]. \end{align*}\]

Thus, \[ \begin{align*} J(w_{i+1}) - J(w_i) &= -p w_{i+1} + q \Bigl[ \sum_{j < i+1} μ_j w_j + \sum_{j \ge i+1} μ_j w_{i+1} \Bigr] \\ &\quad + p w_i - q \Bigl[ \sum_{j < i} μ_j w_j + \sum_{j \ge i} μ_j w_i \Bigr] \\ &= -p (w_{i+1} - w_i) + q \Bigl[ \sum_{j \ge i + 1} μ_j ( w_{i+1} - w_i) \Bigr] \\ &= \big( - p + q [ 1 - M_i ] \big) (w_{i+1} - w_i), \end{align*}\] where \(M_i\) is the cumulative probability mass function. Note that \[ M_i \le \dfrac{q-p}{q} \iff -p + q [ 1 - M_i ] \ge 0. \] Thus, for all \(i\) such that \(M_i \le (q-p)/q\), \(J(w_{i+1}) \ge J(w_i)\). On the other hand, for all \(i\) such that \(M_i > (q-p)/q)\), \(J(w_{i+1}) < J(w_i)\). Thus, the optimal amount to order is the largest \(w_i\) such that \(M_i \le (q-p)/q\).

Note that the structure of the optimal solution is the same for continuous and discrete demand distributions.


  1. Qualitative properties of optimal solution. The numerical results shown in Figure 4 suggest that, for a fixed value of \(q\), the optimal value of \(a\) is a decreasing function of \(p\). Intuitively, this makes sense: if the purchase price of the newspaper increases, but the selling price remains the same, then the newsvendor should buy less newspapers. Formally prove this statement.

    Hint: The CDF of a distribution is a weakly increasing function.

  2. Monotonicity of optimal action. Consider two scenarios for the case with continuous demand and actions. In scenario 1, the demand is distributed according to PDF \(f_1\). In scenario 2, it is distributed according to PDF \(f_2\). Suppose \(F_1(w) \le F_2(w)\) for all \(w\). Show that the optimal action \(a_1\) for scenario 1 is greater than the optimal action \(a_2\) for scenario 2.

    Hint: Plot the two CDFs and try to interpret the optimal decision rule graphically.

  3. Selling random wind. The amount \(W\) of power generated by the wind turbine is a positive real-valued random variable with probability density function \(f\). The operator of the wind turbine has to commit to provide a certain amount of power in the day-ahead market. The price of power is \(\$p\) per MW.

    If the operator commits to provide \(a\) MW of power and the wind generation \(W\) is less than \(a\), then he has to buy the balance \(a - W\) from a reserves market at the cost of \(\$ q\) per unit, where \(q > p\). Thus, the reward of the operator is \(r(a,W)\) where \[ r(a, w) = \begin{cases} p a, & \text{if } w > a \\ p a - q (a - w), & \text{if } w < a. \end{cases}\]

    Find the value of commitment \(a\) that maximizes the expected reward.


Perhaps the earliest model of the newsvendor problem appeared in Edgeworth (1888) in the context of a bank setting the level of cash reserves to cover demands from its customers. The solution to the basic model presented above and some of its variants was provided in Morse and Kimball (1951); Arrow et al. (1952); Whitin (1953). See Porteus (2008) for an accessible introduction.

The property \(F_1(w) \le F_2(w)\) used in Exercise 2 is called stochastic dominance. Later in the course, we will study how stochastic dominance is useful to establish monotonicity properties of general MDPs.

The example of selling random wind in Exercise 3 is taken from Bitar et al. (2012).

Arrow, K.J., Harris, T., and Marschak, J. 1952. Optimal inventory policy. Econometrica 20, 1, 250–272. DOI: 10.2307/1907830.
Bitar, E., Poolla, K., Khargonekar, P., Rajagopal, R., Varaiya, P., and Wu, F. 2012. Selling random wind. 2012 45th hawaii international conference on system sciences, IEEE, 1931–1937.
Edgeworth, F.Y. 1888. The mathematical theory of banking. Journal of the Royal Statistical Society 51, 1, 113–127. Available at:
Morse, P. and Kimball, G. 1951. Methods of operations research. Technology Press of MIT.
Porteus, E.L. 2008. Building intuition: Insights from basic operations management models and principles. In: D. Chhajed and T.J. Lowe, eds., Springer, 115–134. DOI: 10.1007/978-0-387-73699-0.
Whitin, S. 1953. The theory of inventory management. Princeton University Press.

This entry was last updated on 11 Jan 2022 and posted in Stochastics and tagged stochastic optimization.