ECSE 506: Stochastic Control and Decision Theory
Aditya Mahajan
Winter 2022
TL;DR This stylized model of optimal gambling was introduced by Kelly (1956) to highlight a relationship between channel capacity (which had recently been proposed by Shannon) and gambling. Our motivation for studying this model is to use it as an illustrative example to show that it is sometimes possible to identify the optimal strategy and value function of an MDP in closed form.
Imagine a gambler who goes to a casino with an initial fortune of \(s_1\) dollars, places bets over time, and must leave after \(T-1\) bets. Let \(S_t\) denote the gambler’s fortune before the \(t\)-th bet; thus, \(S_1 = s_1\) and \(S_T\) is the fortune with which the gambler leaves. In this example, the time index counts betting opportunities.
At time \(t\), the gambler may place a bet for any amount \(A_t\) less than or equal to his current fortune \(S_t\). If he wins the bet (denoted by the event \(W_t = 1\)), the casino gives him the amount that he had bet. If he loses the bet (denoted by the event \(W_t = -1\)), he pays the casino the amount that he had bet. Thus, the dynamics can be written as
\[ S_{t+1} = S_t + W_t A_t. \]
The outcomes of the bets \(\{W_t\}_{t \ge 1}\) are primitive random variables, i.e., they are independent of each other, the gambler’s initial fortune, and his betting strategy.
The gambler’s utility is \(\log S_T\), the logarithm of his final fortune.^{1} Thus, the reward function may be written as
\[ r_t(s, a) = 0 \quad \text{and} \quad r_T(s) = \log s. \]
Find the strategy that maximizes the gambler’s expected utility, \(\EXP[\log S_T]\).
To fix ideas, let’s try to find the optimal policy on our own. [The interactive version of these notes lets the reader specify an example strategy and, assuming \(S_1 = 100\), plots its performance.] As such experiments show, most intuitive policies do not do so well.
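Since the interactive plot is not reproduced here, below is a minimal Monte Carlo sketch in Python (my addition, not part of the original notes; the parameters \(p = 0.55\) and \(T = 100\), and the restriction to constant-fraction strategies, are purely illustrative). It estimates the objective \(\EXP[\log S_T]\) for strategies that always bet a fixed fraction of the current fortune.

```python
import math
import random

def expected_log_fortune(fraction, s1=100.0, T=100, p=0.55,
                         runs=5000, seed=0):
    """Monte Carlo estimate of E[log S_T] when the gambler always
    bets `fraction` of the current fortune (illustrative parameters)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        s = s1
        for _ in range(T - 1):
            a = fraction * s                    # bet amount, in [0, s]
            s += a if rng.random() < p else -a  # S_{t+1} = S_t + W_t A_t
        total += math.log(s)                    # s > 0 since fraction < 1
    return total / runs

for f in [0.0, 0.1, 0.5, 0.9]:
    print(f, expected_log_fortune(f))
```

Running this suggests that only small betting fractions beat not betting at all (for which \(\EXP[\log S_T] = \log 100 \approx 4.6\)); the best fraction turns out to be \(2p - 1 = 0.1\), as derived below. We will now see how to compute the optimal policy using dynamic programming.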
1 Optimal gambling strategy and value functions
The above model of optimal gambling is a Markov decision process. Therefore, the optimal solution is given by dynamic programming.
 Dynamic program

Define the following value functions \(V_t \colon \reals_{\ge 0} \to \reals\): \[ V_T(s) = \log s \] and for \(t \in \{T-1, \dots, 1\}\): \[ \begin{align*} Q_t(s,a) &= \EXP[ r_t(s,a) + V_{t+1}(S_{t+1}) \mid S_t = s, A_t = a] \\ &= p V_{t+1}(s+a) + (1-p) V_{t+1}(s-a), \end{align*} \] where \(p = \PR(W_t = 1)\), and \[ \begin{align*} V_t(s) &= \max_{a \in [0, s]} Q_t(s,a), \\ π_t(s) &= \arg \max_{a \in [0, s]} Q_t(s,a). \end{align*} \]
Then the strategy \(π = (π_1, \dots, π_{T-1})\) is optimal.
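To get a feel for the dynamic program, here is a brute-force Python implementation (my addition, not part of the original notes). It restricts fortunes and bets to integers, so it is only a discrete approximation of the continuous-action model; the horizon \(T = 4\), the bound \(s_{\max} = 80\), and \(p = 0.7\) are illustrative choices.

```python
import math

def backward_induction(T=4, p=0.7, s_max=80):
    """Solve the dynamic program by brute force over integer fortunes
    0..s_max and integer bets (a discrete approximation)."""
    NEG = float("-inf")
    # Terminal value: V_T(s) = log s, with the convention log 0 = -infinity.
    V = {T: [math.log(s) if s > 0 else NEG for s in range(s_max + 1)]}
    pi = {}
    for t in range(T - 1, 0, -1):
        V[t], pi[t] = [], []
        for s in range(s_max + 1):
            best_q, best_a = NEG, 0
            for a in range(min(s, s_max - s) + 1):  # keep s + a on the grid
                q = p * V[t + 1][s + a] + (1 - p) * V[t + 1][s - a]
                if q > best_q:
                    best_q, best_a = q, a
            V[t].append(best_q)
            pi[t].append(best_a)
    return V, pi

V, pi = backward_induction()
# The theorem below gives pi_t(s) = (2p - 1) s = 0.4 s for p = 0.7:
print([pi[1][s] for s in range(11)])  # close to 0.4 * s, up to rounding
```

For initial fortunes \(s \le 10\), the computed policy is close to \(0.4\,s\), which matches the closed-form solution derived next.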
The above model is one of the rare instances where the optimal strategy and value function of an MDP can be identified in closed form.
 Theorem

When \(p \le 0.5\):
 the optimal strategy is to not gamble, specifically \(π_t(s) = 0\);
 the value function is \(V_t(s) = \log s\).
When \(p > 0.5\):
 the optimal strategy is to bet a fraction of the current fortune, specifically \(π_t(s) = (2p - 1)s\);
 the value function is \(V_t(s) = \log s + (T - t) C\), where^{2} \[ C = \log 2 + p \log p + (1-p) \log (1-p).\]
We prove the two cases separately.
Proof when \(p \le 0.5\)
Let \(p = \PR(W_t = 1)\) and \(q = \PR(W_t = -1)\). Then \(p \le 0.5\) implies that \(p \le 1 - p = q\).
We proceed by backward induction. For \(t = T\), we have that \(V_T(s) = \log s\). This forms the basis of induction. Now assume that the claim holds for \(t+1\), i.e., \(V_{t+1}(s) = \log s\). Consider
\[ Q_t(s,a) = p V_{t+1}(s+a) + q V_{t+1}(s-a). \]
Differentiating both sides w.r.t. \(a\), we get \[ \begin{align*} \frac{\partial Q_t(s,a)}{\partial a} &= \frac{p}{s+a} - \frac{q}{s-a} \\ &= \frac{(p-q)s - (p+q)a}{s^2 - a^2} \\ &= \frac{-(q-p)s - a}{s^2 - a^2} \\ &\le 0, \end{align*} \] where the last equality uses \(p + q = 1\) and the inequality uses \(q \ge p\).
This implies that \(Q_t(s,a)\) is decreasing in \(a\). Therefore,
\[ π_t(s) = \arg\max_{a \in [0, s]} Q_t(s,a) = 0. \]
Moreover, \[ V_t(s) = Q_t(s, π_t(s)) = \log s.\]
This completes the induction step.
Proof when \(p > 0.5\)
As in the previous case, let \(p = \PR(W_t = 1)\) and \(q = \PR(W_t = -1)\). Then \(p > 0.5\) implies that \(p > 1 - p = q\).
We proceed by backward induction. For \(t = T\), we have that \(V_T(s) = \log s\). This forms the basis of induction. Now assume that the claim holds for \(t+1\), i.e., \(V_{t+1}(s) = \log s + (T - t - 1)C\). Consider
\[ Q_t(s,a) = p V_{t+1}(s+a) + q V_{t+1}(s-a). \]
Differentiating both sides w.r.t. \(a\), we get \[ \begin{align*} \frac{\partial Q_t(s,a)}{\partial a} &= \frac{p}{s+a} - \frac{q}{s-a} \\ &= \frac{(p-q)s - (p+q)a}{s^2 - a^2} \\ &= \frac{(p-q)s - a}{s^2 - a^2}, \end{align*} \] where the last equality uses \(p + q = 1\).
Setting \(\partial Q_t(s,a)/\partial a = 0\), we get that the optimal action is
\[ π_t(s) = (p-q) s. \]
Note that \((p-q) \in (0,1]\), so this is a feasible action. Furthermore,
\[ \frac{\partial^2 Q_t(s,a)}{\partial a^2} = -\frac{p}{(s+a)^2} - \frac{q}{(s-a)^2} < 0; \] hence the above action is indeed the maximizer. Moreover, \[ \begin{align*} V_t(s) &= Q_t(s, π_t(s)) \\ &= p V_{t+1}(s + π_t(s)) + q V_{t+1}(s - π_t(s)) \\ &= \log s + p \log (1 + (p-q)) + q \log (1 - (p-q)) + (T - t - 1)C \\ &= \log s + p \log 2p + q \log 2q + (T - t - 1)C \\ &= \log s + (T - t) C, \end{align*} \] where the last step uses \(p \log 2p + q \log 2q = \log 2 + p \log p + (1-p) \log (1-p) = C\).
This completes the induction step.
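As a quick numerical sanity check of the theorem (my addition), we can maximize a single stage of the dynamic program by brute force; the additive constant \((T - t - 1)C\) in \(V_{t+1}\) does not affect the maximizer, so it is omitted.

```python
import math

def best_bet(p, s=100.0, steps=10**5):
    """Brute-force maximizer of p log(s + a) + (1 - p) log(s - a)
    over a grid of bets a in [0, s)."""
    grid = (i * s / steps for i in range(steps))
    return max(grid, key=lambda a: p * math.log(s + a)
                                 + (1 - p) * math.log(s - a))

print(best_bet(0.3))  # ~0.0  : do not gamble when p <= 0.5
print(best_bet(0.7))  # ~40.0 : bet (2p - 1) s when p > 0.5
```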
2 Generalized model
Suppose that the terminal reward \(r_T(s)\) is monotone increasing^{3} in \(s\).
 Theorem

For the generalized optimal gambling problem:
 For each \(t\), the value function \(V_t(s)\) is monotone increasing in \(s\).
 For each \(s\), the value function \(V_t(s)\) is monotone decreasing in \(t\).
Proof of monotonicity in \(s\)
We proceed by backward induction. For \(t = T\), \(V_T(s) = r_T(s)\), which is monotone increasing in \(s\). Assume that \(V_{t+1}(s)\) is increasing in \(s\). Now consider \(s_1, s_2 \in \reals_{\ge 0}\) such that \(s_1 \le s_2\). Then for any \(a \le s_1\), we have that
\[ \begin{align*} Q_t(s_1, a) &= p V_{t+1}(s_1 + a) + q V_{t+1}(s_1 - a) \\ & \stackrel{(a)}{\le} p V_{t+1}(s_2 + a) + q V_{t+1}(s_2 - a) \\ & = Q_t(s_2, a), \end{align*} \] where \((a)\) uses the induction hypothesis. Now consider
\[ \begin{align*} V_t(s_1) &= \max_{a \in [0, s_1]} Q_t(s_1, a) \\ & \stackrel{(b)}{\le} \max_{a \in [0, s_1]} Q_t(s_2, a) \\ & \le \max_{a \in [0, s_2]} Q_t(s_2, a) \\ &= V_t(s_2), \end{align*} \] where \((b)\) uses monotonicity of \(Q_t\) in \(s\). This completes the induction step.
Proof of monotonicity in \(t\)
This is a simple consequence of the following:
\[V_t(s) = \max_{a \in [0, s]} Q_t(s,a) \ge Q_t(s,0) = V_{t+1}(s).\]
Exercises
The purpose of this series of exercises is to generalize the basic result to a model where the gambler can bet on many mutually exclusive outcomes (think of betting on multiple horses in a horse race).
Given positive numbers \((p_1, \dots, p_n)\) and \(s > 0\), consider the following constrained optimization problem: \[\max \sum_{i=1}^n p_i \log w_i\] subject to:
 \(w_i \ge 0\)
 \(\sum_{i=1}^n w_i \le s\).
Show that the optimal solution is given by \[ w_i = \frac{p_i}{p} s\] where \(p = \sum_{i=1}^n p_i\).
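The claimed solution can be sanity-checked numerically. The sketch below (my addition) solves a hypothetical instance with \(n = 3\) and \(s = 1\) using scipy; the same check, with the objective modified accordingly, applies to the next exercise as well.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical instance: n = 3, s = 1.
p = np.array([0.5, 0.3, 0.2])
s = 1.0

res = minimize(
    lambda w: -np.dot(p, np.log(w)),    # maximize sum_i p_i log w_i
    x0=np.full(len(p), s / len(p)),     # a feasible starting point
    bounds=[(1e-9, s)] * len(p),        # w_i >= 0 (kept away from zero)
    constraints=[{"type": "ineq", "fun": lambda w: s - np.sum(w)}],
    method="SLSQP",
)
print(res.x)            # ~ [0.5, 0.3, 0.2]
print(p / p.sum() * s)  # claimed optimum: w_i = (p_i / p) s
```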
Given positive numbers \((p_1, \dots, p_n)\) and \(s > 0\), consider the following constrained optimization problem: \[\max \sum_{i=1}^n p_i \log (s - a + n a_i)\] subject to:
 \(a_i \ge 0\)
 \(a = \sum_{i=1}^n a_i \le s\).
Show that the optimal solution is given by \[ a_i = \frac{p_i}{p} s\] where \(p = \sum_{i=1}^n p_i\).
Consider a variation of the optimal gambling problem where, at each time, the gambler can place bets on many mutually exclusive outcomes. Suppose there are \(n\) outcomes, with success probabilities \((p_1, \dots, p_n)\). Let \((A_{1,t}, \dots, A_{n,t})\) denote the amounts that the gambler bets on each outcome. The total amount \(A_t := \sum_{i=1}^n A_{i,t}\) must not exceed the gambler’s fortune \(S_t\). If \(W_t\) denotes the winning outcome, then the gambler’s wealth evolves according to \[ S_{t+1} = S_t - A_t + n A_{W_t, t}.\] For example, if there are three outcomes, the gambler’s current wealth is \(s\), the gambler bets \((a_1, a_2, a_3)\), and outcome 2 wins, then the gambler wins \(3 a_2\) and his fortune at the next time is \[ s - (a_1 + a_2 + a_3) + 3 a_2. \]
The gambler’s utility is \(\log S_T\), the logarithm of his final wealth. Find the strategy that maximizes the gambler’s expected utility.
Hint: Argue that the value function is of the form \(V_t(s) = \log s + (T - t)C\), where \[C = \log n - H(p_1, \dots, p_n)\] and \(H(p_1, \dots, p_n) = -\sum_{i=1}^n p_i \log p_i\) is the entropy of a random variable with pmf \((p_1, \dots, p_n)\).
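As a quick check of the constant in the hint (my addition, for a hypothetical pmf): when \(\sum_i p_i = 1\), the proportional strategy \(a_i = p_i s\) from the previous exercises turns the fortune into \(S_{t+1} = n\, p_{W_t} S_t\), so the expected one-step growth of \(\log S_t\) is exactly \(C\).

```python
import math

# Hypothetical pmf over n = 3 mutually exclusive outcomes:
p = [0.5, 0.3, 0.2]
n = len(p)

# Betting a_i = p_i * s turns the fortune into S_{t+1} = n * p_W * S_t,
# so the expected one-step log growth is sum_i p_i log(n p_i):
growth = sum(pi * math.log(n * pi) for pi in p)
entropy = -sum(pi * math.log(pi) for pi in p)

print(growth)                 # ~0.0689
print(math.log(n) - entropy)  # the same number: C = log n - H(p)
```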
 Side Remark

The constant \(C\) is the capacity of a symmetric discrete memoryless channel with \(n\) outputs in which, for every input, the output probabilities are a permutation of \((p_1, \dots, p_n)\).
References
The above model (including the model described in the exercises) was introduced by Kelly (1956). However, Kelly restricted attention to strategies that bet a constant fraction of the current fortune and found the optimal fraction. This strategy is sometimes referred to as the Kelly criterion. As far as I know, the dynamic programming treatment of the problem is due to Ross (1974). Ross also considered variations where the objective was to maximize the probability of reaching a preassigned fortune or to maximize the time until going broke.
A generalization of the above model to general logarithmic and exponential utilities is presented in Ferguson and Gilstein (2004).
See the notes on risk-sensitive utility for a discussion on the choice of the utility function.↩︎
The constant \(C\) defined above is equal to the capacity of a binary symmetric channel! In fact, the above model was introduced by Kelly (1956) to show a gambling interpretation of information rates.↩︎
I use the convention that increasing means weakly increasing. The alternative term nondecreasing implicitly assumes that we are talking about a totally ordered set.↩︎