```js
// bet(t, states, outcomes)
//   t: current time
//   states: array of states (the gambler's fortune so far)
//   outcomes: array of outcomes of the previous bets
//
// As an illustration, we implement the policy that bets
// half of the wealth as long as one is winning.
function bet(t, states, outcomes) {
  if (t == 0) {
    return 0.5 * states[t];
  } else {
    return outcomes[t - 1] == 1 ? 0.5 * states[t] : 0;
  }
}
```
5 Optimal gambling
Imagine a gambler who goes to a casino with an initial fortune of \(S_1\) dollars, places bets over time, and must leave at time \(T\). Let \(S_t\) denote the gambler’s fortune at time \(t\), i.e., after he has placed \(t-1\) bets.
At time \(t\), the gambler may place a bet for any amount \(A_t\) less than or equal to his current fortune \(S_t\). If he wins the bet (denoted by the event \(W_t = 1\)), the casino gives him the amount that he had bet. If he loses the bet (denoted by the event \(W_t = -1\)), he pays the casino the amount that he had bet. Thus, the dynamics can be written as
\[ S_{t+1} = S_t + W_t A_t. \]
The outcomes of the bets \(\{W_t\}_{t \ge 1}\) are primitive random variables, i.e., they are independent of each other, the gambler’s initial fortune, and his betting strategy. We assume that they are identically distributed with \(\Pr(W_t = 1) = p\) and \(\Pr(W_t = -1) = 1 - p\).
The gambler’s utility is \(\log S_T\), the logarithm of his final fortune.1 Thus, the reward function may be written as
1 See the notes on risk-sensitive utility for a discussion on the choice of the utility function. One way to think of the reward is as follows: suppose we put \(S_1\) dollars in a bank that accrues continuously compounded interest of \(R\) per unit time; then we get \(S_1 e^{RT}\) dollars after \(T\) time steps. Thus, we may view \(\log(S_T/S_1)\) as the effective (random) interest rate.
\[ r_t(s, a) = 0 \quad \text{and} \quad r_T(s) = \log s. \]
Find the strategy that maximizes the gambler’s utility, \(\EXP[\log S_T]\).
5.1 Computational experiment
To fix ideas, let’s try to find the optimal policy on our own. An example strategy is given in the code above: bet half of the current fortune as long as the previous bet was won.
In the interactive version of these notes, the performance of this policy is plotted for a chosen initial fortune \(S_1\); selecting “optimal” instead shows the performance of the optimal policy (derived below).
As such experiments show, most intuitive policies do not do so well. We will now see how to compute the optimal policy using dynamic programming.
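The comparison can also be reproduced offline with a short simulation. The sketch below is not part of the notes’ interactive code; the win probability \(p = 0.7\), the horizon, the number of runs, and the helper names (`simulate`, `rand`) are arbitrary illustrative choices. It estimates \(\EXP[\log S_T]\) under the example policy and under the optimal policy derived below, using a small LCG so the run is reproducible:

```javascript
const p = 0.7, T = 100, runs = 2000, s1 = 1;

// small deterministic LCG (MINSTD) so the experiment is reproducible
let seed = 12345;
const rand = () => (seed = (seed * 48271) % 2147483647) / 2147483647;

// estimate E[log S_T] for a strategy given as a function of
// the current fortune and whether the previous bet was won
function simulate(betAmount) {
  let total = 0;
  for (let run = 0; run < runs; run++) {
    let s = s1;
    let lastWin = true;                  // treat t = 0 as "winning"
    for (let t = 0; t < T; t++) {
      const a = betAmount(s, lastWin);   // bet, then observe the outcome
      lastWin = rand() < p;
      s += lastWin ? a : -a;             // S_{t+1} = S_t + W_t A_t
    }
    total += Math.log(s);
  }
  return total / runs;
}

const halfWhileWinning = simulate((s, lastWin) => (lastWin ? 0.5 * s : 0));
seed = 12345;                            // reuse the same bet outcomes
const kelly = simulate(s => (2 * p - 1) * s);
console.log(halfWhileWinning, kelly);    // the second strategy should come out ahead
```

Because both strategies are evaluated on the same sequence of outcomes, the gap between them is not an artifact of the sampling noise.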
5.2 Optimal gambling strategy and value functions
The above model of optimal gambling is a Markov decision process. Therefore, the optimal solution is given by dynamic programming.
Proposition 5.1 (Dynamic programming decomposition) Define the following value function \(V_t \colon \reals_{\ge 0} \to \reals\) \[ V_T(s) = \log s \] and for \(t \in \{T-1, \dots, 1\}\): \[ \begin{align*} Q_t(s,a) &= \EXP[ r_t(s,a) + V_{t+1}(S_{t+1}) \,|\, S_t = s, A_t = a] \\ &= p V_{t+1}(s+a) + (1-p) V_{t+1}(s-a), \end{align*} \] and \[ \begin{align*} V_t(s) &= \max_{a \in [0, s]} Q_t(s,a), \\ π_t(s) &= \arg \max_{a \in [0, s]} Q_t(s,a). \\ \end{align*} \]
Then the strategy \(π = (π_1, \dots, π_{T-1})\) is optimal.
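To see the dynamic program in action, consider the last step: since \(V_T(s) = \log s\), we have \(Q_{T-1}(s, a) = p \log(s + a) + (1 - p) \log(s - a)\). The sketch below (the grid resolution, the values \(s = 100\) and \(p = 0.7\), and the name `argmaxBet` are arbitrary illustrative choices) maximizes this expression over a grid of bets:

```javascript
// Numerically maximize Q_{T-1}(s, a) = p*log(s+a) + (1-p)*log(s-a)
// over a grid of bets a in [0, s).
function argmaxBet(s, p, steps = 100000) {
  let bestA = 0;
  let bestQ = -Infinity;
  for (let i = 0; i < steps; i++) {
    const a = (i / steps) * s;           // a < s keeps log(s - a) finite
    const q = p * Math.log(s + a) + (1 - p) * Math.log(s - a);
    if (q > bestQ) { bestQ = q; bestA = a; }
  }
  return bestA;
}

// For s = 100 and p = 0.7 the maximizer is (up to the grid resolution)
// (2p - 1)*s = 40, matching the theorem below; for p <= 0.5 it is 0.
console.log(argmaxBet(100, 0.7), argmaxBet(100, 0.4));
```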
Theorem 5.1 (Optimal gambling strategy) When \(p \le 0.5\):
- the optimal strategy is to not gamble, specifically \(π_t(s) = 0\);
- the value function is \(V_t(s) = \log s\).
When \(p > 0.5\):
- the optimal strategy is to bet a fraction of the current fortune, specifically \(π_t(s) = (2p - 1)s\);
- the value function is \(V_t(s) = \log s + (T - t) C\), where \[ C = \log 2 + p \log p + (1-p) \log (1-p).\]
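The constant \(C\) has a direct interpretation: under the bet \(a = (2p - 1)s\), the fortune is multiplied by \(2p\) on a win and by \(2(1 - p)\) on a loss, so the expected log-growth per bet is \(p \log(2p) + (1 - p)\log(2(1 - p))\), which simplifies to \(C\). A quick numerical confirmation (a sketch; the function names are illustrative):

```javascript
// Expected log-growth per bet under the policy a = (2p - 1)*s
function expectedLogGrowth(p) {
  return p * Math.log(2 * p) + (1 - p) * Math.log(2 * (1 - p));
}

// The constant C from the theorem
function C(p) {
  return Math.log(2) + p * Math.log(p) + (1 - p) * Math.log(1 - p);
}

console.log(expectedLogGrowth(0.7), C(0.7)); // agree up to floating point
```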
The constant \(C\) defined in Theorem 5.1 is equal to the capacity of a binary symmetric channel! In fact, the above model was introduced by Kelly (1956) to show a gambling interpretation of information rates.
We prove the two cases separately.
5.3 Generalized model
Suppose that the terminal reward \(r_T(s)\) is monotone increasing2 in \(s\).
2 I use the convention that increasing means weakly increasing. The alternative term non-decreasing implicitly assumes that we are talking about a totally ordered set.
Theorem 5.2 For the generalized optimal gambling problem:
- For each \(t\), the value function \(V_t(s)\) is monotone increasing in \(s\).
- For each \(s\), the value function \(V_t(s)\) is monotone decreasing in \(t\).
Exercises
Exercise 5.1 Given positive numbers \((p_1, \dots, p_n)\), consider the following constrained optimization problem: \[\max \sum_{i=1}^n p_i \log w_i\] subject to:
- \(w_i \ge 0\)
- \(\sum_{i=1}^n w_i \le s\).
Show that the optimal solution is given by \[ w_i = \frac{p_i}{p} s\] where \(p = \sum_{i=1}^n p_i\).
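As a numerical sanity check (not a proof), note that the objective is strictly concave, so the proportional allocation should beat every other feasible point. The sketch below (the numbers \((p_1, p_2, p_3) = (1, 2, 3)\), \(s = 10\), and the helper names are arbitrary illustrative choices) compares it against randomly drawn feasible points:

```javascript
// objective(p, w) = sum_i p_i * log(w_i)
function objective(p, w) {
  return p.reduce((acc, pi, i) => acc + pi * Math.log(w[i]), 0);
}

const p = [1, 2, 3];                       // arbitrary positive numbers, p = 6
const s = 10;
const pSum = p.reduce((a, b) => a + b, 0);
const wStar = p.map(pi => (pi / pSum) * s);
const best = objective(p, wStar);

// small deterministic LCG so the check is reproducible
let seed = 1;
const rand = () => (seed = (seed * 48271) % 2147483647) / 2147483647;

let beaten = false;
for (let trial = 0; trial < 1000; trial++) {
  let w = p.map(() => rand());
  const total = w.reduce((a, b) => a + b, 0);
  w = w.map(wi => (wi / total) * s);       // rescale so that sum(w_i) = s
  if (objective(p, w) > best + 1e-12) beaten = true;
}
console.log(beaten);                       // no random point should win
```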
Exercise 5.2 Given positive numbers \((p_1, \dots, p_n)\), consider the following constrained optimization problem: \[\max \sum_{i=1}^n p_i \log (s - a + na_i)\] subject to:
- \(a_i \ge 0\)
- \(a = \sum_{i=1}^n a_i \le s\).
Show that the optimal solution is given by \[ a_i = \frac{p_i}{p} s\] where \(p = \sum_{i=1}^n p_i\).
Exercise 5.3 Consider an alternative of the optimal gambling problem where, at each time, the gambler can place bets on many mutually exclusive outcomes. Suppose there are \(n\) outcomes, with success probabilities \((p_1, \dots, p_n)\). Let \((A_{1,t}, \dots, A_{n,t})\) denote the amount that the gambler bets on each outcome. The total amount \(A_t := \sum_{i=1}^n A_{i,t}\) must be less than the gambler’s fortune \(S_t\). If \(W_t\) denotes the winning outcome, then the gambler’s wealth evolves according to \[ S_{t+1} = S_t - A_t + n A_{W_t, t}.\] For example, if there are three outcomes, the gambler’s current wealth is \(s\), the gambler bets \((a_1, a_2, a_3)\), and outcome 2 wins, then the gambler wins \(3 a_2\) and his fortune at the next time is \[ s - (a_1 + a_2 + a_3) + 3 a_2. \]
The gambler’s utility is \(\log S_T\), the logarithm of his final wealth. Find the strategy that maximizes the gambler’s expected utility.
Hint: Argue that the value function is of the form \(V_t(s) = \log s + (T -t)C\), where \[C = \log n - H(p_1, \dots, p_n)\] where \(H(p_1, \dots, p_n) = - \sum_{i=1}^n p_i \log p_i\) is the entropy of a random variable with pmf \((p_1, \dots, p_n)\).
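The hint’s constant can be checked directly: if \(p_1 + \dots + p_n = 1\) and the gambler bets his entire fortune proportionally, \(a_i = p_i s\), then outcome \(i\) yields \(S_{t+1} = n p_i s\), so the expected log-growth per bet is \(\sum_i p_i \log(n p_i) = \log n - H(p_1, \dots, p_n)\). A sketch (the pmf and the function names are arbitrary illustrative choices):

```javascript
// Expected log-growth of the proportional bet a_i = p_i * s
function logGrowth(p) {
  const n = p.length;
  return p.reduce((acc, pi) => acc + pi * Math.log(n * pi), 0);
}

// The constant from the hint: log n - H(p_1, ..., p_n)
function capacityConstant(p) {
  const n = p.length;
  const H = -p.reduce((acc, pi) => acc + pi * Math.log(pi), 0);
  return Math.log(n) - H;
}

const p = [0.5, 0.25, 0.25];
console.log(logGrowth(p), capacityConstant(p)); // agree up to floating point
```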
The constant \(C\) is the capacity of a symmetric discrete memoryless channel with \(n\) outputs in which, for every input, the output probabilities are a permutation of \((p_1, \dots, p_n)\).
Notes
The above model (including the model described in the exercise) was introduced by Kelly (1956). However, Kelly restricted attention to strategies that bet a constant fraction of the current fortune and found the optimal fraction. This strategy is sometimes referred to as the Kelly criterion. As far as I know, the dynamic programming treatment of the problem is due to Ross (1974). Ross also considered variations where the objective is to maximize the probability of reaching a preassigned fortune or to maximize the time until becoming broke.
A generalization of the above model to general logarithmic and exponential utilities is presented in Ferguson and Gilstein (2004).