6  Optimal gambling

Updated

January 27, 2024

Keywords

MDPs, Dynamic programming, Optimal gambling, Kelly strategy

Image credit: http://commons.wikimedia.org/wiki/File:Gambling-ca-1800.jpg
Summary

This stylized model of optimal gambling was introduced by Kelly (1956) to highlight a relationship between gambling and channel capacity (which Shannon had proposed shortly before). Our motivation for studying this model is to use it as an illustrative example showing that it is sometimes possible to identify the optimal strategy and value function of an MDP in closed form.

Imagine a gambler who goes to a casino with an initial fortune of \(s_1\) dollars, places bets over time, and must leave at time \(T\). Let \(S_t\) denote the gambler’s fortune at time \(t\) (so \(S_1 = s_1\)); in this example, time denotes the number of times that the gambler has bet.

At time \(t\), the gambler may place a bet for any amount \(A_t\) less than or equal to his current fortune \(S_t\). If he wins the bet (denoted by the event \(W_t = 1\)), the casino gives him the amount that he had bet. If he loses the bet (denoted by the event \(W_t = -1\)), he pays the casino the amount that he had bet. Thus, the dynamics can be written as

\[ S_{t+1} = S_t + W_t A_t. \]

The outcomes of the bets \(\{W_t\}_{t \ge 1}\) are primitive random variables, i.e., they are independent of each other, the gambler’s initial fortune, and his betting strategy.

The gambler’s utility is \(\log S_T\), the logarithm of his final fortune.1 Thus, the reward function may be written as

1 See the notes on risk-sensitive utility for a discussion of the choice of the utility function. One way to think of the reward is as follows: suppose we put \(S_1\) dollars in a bank that pays continuously compounded interest at rate \(R\) per unit time; then we will have \(S_1 e^{RT}\) dollars after \(T\) time steps. Thus, we may view \(\log(S_T/S_1)\) as the effective (random) interest rate.

For an interesting discussion of the impact of the utility function on optimal gambling, see this post on LessWrong.

\[ r_t(s, a) = 0 \quad \text{and} \quad r_T(s) = \log s. \]

Find the strategy that maximizes the gambler’s utility, \(\EXP[\log S_T]\).

6.1 Computational experiment

To fix ideas, let’s try to find the optimal policy on our own. In the interactive version of these notes, you can specify an example betting strategy and an initial fortune \(S_1\), and the plot below shows its performance; selecting “optimal” shows the performance of the optimal policy (derived below).

Figure 6.1: Plot of the performance of the selected strategy for a horizon of \(T\) bets. The gray curves show the performance over \(n\) different sample paths and the red curve shows their mean; the final mean value of the reward is also shown in red. For ease of visualization, the utility at each stage (i.e., \(\log s_t\)) is plotted, even though the reward is only received at the terminal time step. You can toggle the strategy selector to see how the optimal strategy performs (and how close you came to it).

As we can see, most intuitive policies do not do so well. We will now see how to compute the optimal policy using dynamic programming.
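For readers following along without the interactive widget, here is a minimal Python sketch of the same experiment (not part of the original notes; all parameter values are illustrative assumptions). It simulates a “bet a constant fraction \(f\) of the current fortune” strategy over several sample paths and compares the empirical value of \(\EXP[\log S_T]\) with the optimal value derived in the next section.

```python
import numpy as np

# Simulate a "bet a constant fraction f of the current fortune" strategy.
# All parameters below are assumed for illustration; f = 2p - 1 corresponds
# to the optimal (Kelly) strategy derived in the next section.
p, T, n, s1 = 0.6, 100, 1000, 100.0   # win prob., horizon, sample paths, initial fortune
f = 0.1                               # fraction of fortune bet at each stage

rng = np.random.default_rng(0)
S = np.full(n, s1)
mean_log = [np.mean(np.log(S))]       # the curve plotted in Figure 6.1
for t in range(T - 1):
    A = f * S                                          # bet A_t = f * S_t
    W = rng.choice([1, -1], size=n, p=[p, 1 - p])      # outcome of the bet
    S = S + W * A                                      # dynamics: S_{t+1} = S_t + W_t A_t
    mean_log.append(np.mean(np.log(S)))

C = np.log(2) + p * np.log(p) + (1 - p) * np.log(1 - p)
print("empirical E[log S_T]          :", mean_log[-1])
print("optimal value log s1 + (T-1)C :", np.log(s1) + (T - 1) * C)
```

Trying different values of `f` (or other strategies) shows the gap from the optimal value; the comparison line uses the closed-form expression established in Theorem 6.1 below.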

6.2 Optimal gambling strategy and value functions

The above model of optimal gambling is a Markov decision process. Therefore, the optimal solution is given by dynamic programming.

Proposition 6.1 (Dynamic programming decomposition) Define the following value function \(V^*_t \colon \reals_{\ge 0} \to \reals\) \[ V^*_T(s) = \log s \] and for \(t \in \{T-1, \dots, 1\}\): \[ \begin{align*} Q^*_t(s,a) &= \EXP[ r_t(s,a) + V^*_{t+1}(S_{t+1}) \,|\, S_t = s, A_t = a] \\ &= p V^*_{t+1}(s+a) + (1-p) V^*_{t+1}(s-a), \end{align*} \] and \[ \begin{align*} V^*_t(s) &= \max_{a \in [0, s]} Q^*_t(s,a), \\ π^*_t(s) &= \arg \max_{a \in [0, s]} Q^*_t(s,a). \\ \end{align*} \]

Then the strategy \(π^* = (π^*_1, \dots, π^*_{T-1})\) is optimal.
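The backward induction of Proposition 6.1 can also be carried out numerically. The sketch below (an illustration under assumed parameters, not part of the original notes) discretizes the fortune and the bet fraction; because of the discretization and interpolation, its output only approximates the closed-form solution derived next.

```python
import numpy as np

# Approximate backward induction for the gambling MDP of Proposition 6.1.
# The fortune is discretized on a grid and the bet is expressed as a fraction
# of the current fortune; off-grid values are linearly interpolated, so the
# result is only an approximation of the exact closed-form solution.
p, T = 0.6, 5                                  # assumed win probability and horizon
s_grid = np.linspace(1.0, 1000.0, 2000)        # fortune grid (avoid s = 0, where log s = -inf)
frac_grid = np.linspace(0.0, 0.99, 100)        # candidate bets a = f * s

V = np.log(s_grid)                             # V*_T(s) = log s
for t in range(T - 1, 0, -1):                  # t = T-1, ..., 1
    Q = np.empty((len(s_grid), len(frac_grid)))
    for j, f in enumerate(frac_grid):
        Q[:, j] = (p * np.interp(s_grid * (1 + f), s_grid, V)
                   + (1 - p) * np.interp(s_grid * (1 - f), s_grid, V))
    V = Q.max(axis=1)                          # V*_t on the grid
    best_frac = frac_grid[Q.argmax(axis=1)]    # maximizing bet fraction at time t

# Compare with the closed form of Theorem 6.1 at a fortune in the middle of the grid.
C = np.log(2) + p * np.log(p) + (1 - p) * np.log(1 - p)
i = np.searchsorted(s_grid, 50.0)
print("DP bet fraction at s≈50:", best_frac[i], "  (theory: 2p-1 =", 2 * p - 1, ")")
print("DP value        at s≈50:", V[i], "  (theory:", np.log(s_grid[i]) + (T - 1) * C, ")")
```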

Remark

The above model is one of the rare instances where the optimal strategy and value function of an MDP can be identified in closed form.

Theorem 6.1 (Optimal gambling strategy) When \(p \le 0.5\):

  • the optimal strategy is to not gamble, specifically \(π^*_t(s) = 0\);
  • the value function is \(V^*_t(s) = \log s\).

When \(p > 0.5\):

  • the optimal strategy is to bet a fraction of the current fortune, specifically \(π^*_t(s) = (2p - 1)s\);
  • the value function is \(V^*_t(s) = \log s + (T - t) C\), where \[ C = \log 2 + p \log p + (1-p) \log (1-p).\]

The constant \(C\) defined in Theorem 6.1 is equal to the capacity of a binary symmetric channel with crossover probability \(1-p\)! In fact, the above model was introduced by Kelly (1956) to show a gambling interpretation of information rates.
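For example, if \(p = 0.6\), then \[ C = \log 2 + 0.6 \log 0.6 + 0.4 \log 0.4 \approx 0.0201 \text{ nats}, \] so the optimal strategy bets \(π^*_t(s) = 0.2\,s\) and the expected log-fortune grows by roughly \(0.02\) nats per bet.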

We prove the two cases separately.

Let \(p = \PR(W_t = 1)\) and \(q = \PR(W_t = -1)\). Then \(p \le 0.5\) implies that \(p \le 1 - p = q\).

We proceed by backward induction. For \(t = T\), we have that \(V^*_T(s) = \log s\). This forms the basis of induction. Now assume that for \(t+1\), \(V^*_{t+1}(s) = \log s\). Now consider

\[ Q^*_t(s,a) = p V^*_{t+1}(s+a) + qV^*_{t+1}(s-a). \]

Differentiating both sides w.r.t. \(a\), we get \[ \begin{align*} \frac { \partial Q^*_t(s,a) } {\partial a} &= \frac p { s + a} - \frac q { s - a } \\ & = \frac { (p - q) s - (p + q) a } { s^2 - a^2 } \\ & = \frac { - (q - p) s - a } {s^2 - a^2 } \\ &\le 0, \end{align*} \] where the last inequality holds because \(q \ge p\) and \(a \ge 0\), and is strict unless \(p = q\) and \(a = 0\).

This implies that \(Q^*_t(s,a)\) is non-increasing in \(a\). Therefore,

\[ π^*_t(s) = \arg\max_{a \in [0, s]} Q^*_t(s,a) = 0. \]

Moreover, \[ V^*_t(s) = Q^*_t(s, π^*_t(s)) = \log s.\]

This completes the induction step.

As in the previous case, let \(p = \PR(W_t = 1)\) and \(q = \PR(W_t = -1)\). Then \(p > 0.5\) implies that \(p > 1 - p = q\).

We proceed by backward induction. For \(t = T\), we have that \(V^*_T(s) = \log s\). This forms the basis of induction. Now assume that for \(t+1\), \(V^*_{t+1}(s) = \log s + (T -t - 1)C\). Now consider

\[ Q^*_t(s,a) = p V^*_{t+1}(s+a) + qV^*_{t+1}(s-a). \]

Differentiating both sides w.r.t. \(a\), we get \[ \begin{align*} \frac { \partial Q^*_t(s,a) } {\partial a} &= \frac p { s + a} - \frac q { s - a } \\ & = \frac { (p - q) s - (p + q) a } { s^2 - a^2 } \\ & = \frac { (p - q) s - a } {s^2 - a^2 } \end{align*} \]

Setting \(\partial Q^*_t(s,a)/\partial a = 0\), we get that the optimal action is

\[ π^*_t(s) = (p-q) s. \]

Note that \((p-q) \in (0,1]\); therefore, the proposed bet is feasible, i.e., it lies in \([0, s]\). Furthermore,

\[ \frac { \partial^2 Q^*_t(s,a) } {\partial a^2} = - \frac p { (s + a)^2 } - \frac q { (s - a)^2 } < 0; \] hence the above action is indeed the maximizer. Moreover, \[ \begin{align*} V^*_t(s) &= Q^*_t(s, π^*_t(s)) \\ &= p V^*_{t+1}(s + π^*_t(s)) + q V^*_{t+1}( s - π^*_t(s) )\\ &= \log s + p \log (1 + (p-q)) + q \log (1 - (p-q)) + (T - t -1)C \\ &= \log s + p \log 2p + q \log 2q + (T - t - 1)C \\ &= \log s + C + (T - t - 1)C \\ &= \log s + (T - t) C, \end{align*} \] where we used \(1 + (p-q) = 2p\), \(1 - (p-q) = 2q\), and \(p \log 2p + q \log 2q = \log 2 + p \log p + q \log q = C\).

This completes the induction step.

6.3 Generalized model

Suppose that the terminal reward \(r_T(s)\) is monotone increasing2 in \(s\).

2 I use the convention that increasing means weakly increasing. The alternative term non-decreasing implicitly assumes that we are talking about a totally ordered set.

Theorem 6.2 For the generalized optimal gambling problem:

  • For each \(t\), the value function \(V^*_t(s)\) is monotone increasing in \(s\).
  • For each \(s\), the value function \(V^*_t(s)\) is monotone decreasing in \(t\).

We proceed by backward induction. By definition, \(V^*_T(s) = r_T(s)\), which is monotone increasing in \(s\). This forms the basis of induction. Now assume that \(V^*_{t+1}(s)\) is increasing in \(s\), and consider \(V^*_t(s)\). Take any \(s_1, s_2 \in \reals_{\ge 0}\) such that \(s_1 \le s_2\). Then, for any \(a \le s_1\), we have that

\[ \begin{align*} Q^*_t(s_1, a) &= p V^*_{t+1}(s_1+a) + q V^*_{t+1}(s_1-a) \\ & \stackrel{(a)}{\le} p V^*_{t+1}(s_2 + a) + q V^*_{t+1}(s_2 - a) \\ & = Q^*_t(s_2, a), \end{align*} \] where \((a)\) uses the induction hypothesis. Now consider

\[ \begin{align*} V^*_t(s_1) &= \max_{a \in [0, s_1]} Q^*_t(s_1, a) \\ & \stackrel{(b)}{\le} \max_{a \in [0, s_1]} Q^*_t(s_2, a) \\ & \le \max_{a \in [0, s_2]} Q^*_t(s_2, a) \\ &= V^*_t(s_2), \end{align*} \] where \((b)\) uses monotonicity of \(Q^*_t\) in \(s\). This completes the induction step.

This is a simple consequence of the following:

\[V^*_t(s) = \max_{a \in [0, s]} Q^*_t(s,a) \ge Q^*_t(s,0) = V^*_{t+1}(s).\]

Exercises

Note

The purpose of this series of exercises is to generalize the basic result to a model where the gambler can bet on many mutually exclusive outcomes (think of betting on multiple horses in a horse race).

Exercise 6.1 Given positive numbers \((p_1, \dots, p_n)\), consider the following constrained optimization problem: \[\max \sum_{i=1}^n p_i \log w_i\] subject to:

  • \(w_i \ge 0\)
  • \(\sum_{i=1}^n w_i \le s\).

Show that the optimal solution is given by \[ w_i = \frac{p_i}{p} s\] where \(p = \sum_{i=1}^n p_i\).

Exercise 6.2 Given positive numbers \((p_1, \dots, p_n)\), consider the following constrained optimization problem: \[\max \sum_{i=1}^n p_i \log (s - a + na_i)\] subject to:

  • \(a_i \ge 0\)
  • \(a = \sum_{i=1}^n a_i \le s\).

Show that the optimal solution is given by \[ a_i = \frac{p_i}{p} s\] where \(p = \sum_{i=1}^n p_i\).

Exercise 6.3 Consider a variation of the optimal gambling problem where, at each time, the gambler can place bets on many mutually exclusive outcomes. Suppose there are \(n\) outcomes, with success probabilities \((p_1, \dots, p_n)\). Let \((A_{1,t}, \dots, A_{n,t})\) denote the amounts that the gambler bets on each outcome. The total amount \(A_t := \sum_{i=1}^n A_{i,t}\) must not exceed the gambler’s fortune \(S_t\). If \(W_t\) denotes the winning outcome, then the gambler’s wealth evolves according to \[ S_{t+1} = S_t - A_t + n A_{W_t, t}.\] For example, if there are three outcomes, the gambler’s current wealth is \(s\), the gambler bets \((a_1, a_2, a_3)\), and outcome 2 wins, then the gambler receives \(3 a_2\) and his fortune at the next time is \[ s - (a_1 + a_2 + a_3) + 3 a_2. \]

The gambler’s utility is \(\log S_T\), the logarithm of his final wealth. Find the strategy that maximizes the gambler’s expected utility.

Hint: Argue that the value function is of the form \(V^*_t(s) = \log s + (T -t)C\), where \[C = \log n - H(p_1, \dots, p_n)\] where \(H(p_1, \dots, p_n) = - \sum_{i=1}^n p_i \log p_i\) is the entropy of a random variable with pmf \((p_1, \dots, p_n)\).

The constant \(C\) is the capacity of a symmetric discrete memoryless channel with \(n\) outputs in which, for every input, the output probabilities are a permutation of \((p_1, \dots, p_n)\).
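As a quick numerical sanity check of the hint (a sketch with assumed outcome probabilities, not a solution of the exercise), the following snippet simulates the proportional strategy \(A_{i,t} = p_i S_t\) for the model of Exercise 6.3 and compares the empirical growth rate of \(\log S_t\) with \(C = \log n - H(p_1, \dots, p_n)\).

```python
import numpy as np

# Simulate the proportional betting strategy for the multi-outcome model of
# Exercise 6.3 (assumed probabilities) and compare the empirical per-bet growth
# of log S_t with the hinted constant C = log n - H(p_1, ..., p_n).
p = np.array([0.5, 0.3, 0.2])           # assumed outcome probabilities (sum to 1)
n, T, s1 = len(p), 5000, 1.0

rng = np.random.default_rng(0)
S = s1
for t in range(T - 1):
    a = p * S                            # bet A_{i,t} = p_i * S_t on outcome i
    w = rng.choice(n, p=p)               # index of the winning outcome W_t
    S = S - a.sum() + n * a[w]           # dynamics: S_{t+1} = S_t - A_t + n A_{W_t,t}

C = np.log(n) + (p * np.log(p)).sum()    # C = log n - H(p)
print("empirical growth rate:", np.log(S / s1) / (T - 1))
print("hinted constant C    :", C)
```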

Notes

The above model (including the model described in the exercise) was introduced by Kelly (1956). However, Kelly restricted attention to strategies that bet a constant fraction of the current fortune and found the optimal fraction. This strategy is sometimes referred to as the Kelly criterion. As far as I know, the dynamic programming treatment of the problem is due to Ross (1974). Ross also considered variations where the objective is to maximize the probability of reaching a preassigned fortune or to maximize the time until going broke.

A generalization of the above model to general logarithmic and exponential utilities is presented in Ferguson and Gilstein (2004).