Reinforcement learning in restless bandits

Scalable regret via Thompson sampling

Aditya Mahajan

McGill University

20 Nov, 2023

Motivation

  • Restless multi-armed bandits are a popular framework for solving resource allocation and scheduling problems.

Salient features

  • The Whittle index policy avoids the curse of dimensionality (in the number of arms); a minimal sketch of the index-based arm selection is shown after this list.
  • It is a heuristic, but it is theoretically motivated and performs well in a large class of models. See Gast et al. (2023) for asymptotic optimality guarantees.
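
To make the per-arm structure concrete, below is a minimal sketch (in Python; the names are illustrative, not from the talk) of how an index policy acts: each arm's index is computed from that arm's own state, and the \(m\) arms with the largest indices are activated. The per-arm index functions are assumed to be precomputed from the arm models.

```python
import numpy as np

def whittle_action(states, index_fns, m):
    """Activate the m arms with the largest Whittle indices.

    states    : list of per-arm states
    index_fns : index_fns[i](s) returns the (precomputed) Whittle
                index of arm i in state s
    m         : number of arms that may be activated
    """
    # Each index depends only on that arm's own state, so the cost is
    # linear in the number of arms ...
    indices = np.array([f(s) for f, s in zip(index_fns, states)])
    # ... and the joint decision reduces to a top-m selection.
    active = np.argsort(indices)[-m:]
    action = np.zeros(len(states), dtype=int)
    action[active] = 1
    return action
```

With tabulated indices, e.g. `index_fns[i] = lambda s, i=i: W[i][s]` for hypothetical per-arm tables `W`, the per-step complexity is \(O(n \log n)\) rather than exponential in \(n\).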

Limitation

  • Assumes perfect knowledge of the system model!

\(\def\EXP{\mathbb{E}} \def\IND{\mathbb{1}} \def\PR{\mathbb{P}} \def\ALPHABET{\mathcal} \def\S{\mathsf{S}} \def\A{\mathsf{A}} \def\REGRET{\mathcal{R}} \def\bs{\boldsymbol} \def\OO{\tilde{\mathcal{O}}}\)

RMAB with unknown system model

  • Suppose the dynamics of the arms are unknown.

  • Our motivation: Using RMABs for operator allocation in human-robot teams.

  • Can we use reinforcement learning (RL) to learn good policies?

Two challenges

  • Naive model-free RL cannot exploit the structure of RMABs … but see Avrachenkov and Borkar (2022) for using Q-learning (QL) to learn Whittle indices (WI); a simplified sketch of the idea follows this list.

  • The curse of dimensionality pops up again when we consider regret.
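
For context on the idea referenced above, here is a heavily simplified two-timescale sketch (Python; discounted criterion, illustrative names and step sizes; not the exact algorithm from Avrachenkov and Borkar (2022)): Q-values of a single arm with a passivity subsidy \(λ\) are learned on a fast timescale, while \(λ\) is adjusted slowly until the arm is indifferent between active and passive in a chosen state; the fixed point is an estimate of that state's Whittle index.

```python
import numpy as np

def learn_whittle_index(step, n_states, target_state,
                        iters=20_000, gamma=0.95):
    """Toy two-timescale scheme for a single arm.

    step(s, a) -> (next_state, reward) simulates one arm transition.
    Q-values (fast timescale) use a subsidy lam for the passive action;
    lam (slow timescale) is driven towards the value that makes the two
    actions indifferent at `target_state`.
    """
    Q = np.zeros((n_states, 2))   # Q[s, 0] = passive, Q[s, 1] = active
    lam = 0.0                     # current subsidy (index) estimate
    s = 0
    for k in range(1, iters + 1):
        alpha = 1.0 / (1.0 + k ** 0.6)   # fast step size (Q-values)
        beta = 1.0 / (1.0 + k)           # slow step size (subsidy)
        a = np.random.randint(2)         # uniform exploration
        s_next, r = step(s, a)
        # The subsidy lam is added to the reward of the passive action.
        target = r + (lam if a == 0 else 0.0) + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        # If active is still preferred at target_state, raise the subsidy.
        lam += beta * (Q[target_state, 1] - Q[target_state, 0])
        s = s_next
    return lam
```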

Review: Regret of RL in MDPs

  • Consider an average-cost MDP \(\langle \mathcal{S}, \mathcal A, P, r\rangle\) with optimal performance: \[ J^\star = \sup_{ \pi \in \Pi} \lim_{T \to \infty} \frac 1T \EXP^{\pi} \biggl[ \sum_{t=1}^T r(S_t, A_t) \biggr] \]
  • Regret of a history-dependent and randomized learning policy \(π\) (an empirical estimate is sketched after this list): \[ \mathcal R(T; π) = \EXP^{\pi}\biggl[ T J^\star - \sum_{t=1}^T r(S_t, A_t) \biggr]. \]

  • Various algorithms with regret guarantees in the literature: UCRL, REGAL, SCAL, TSDE, …
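
To make the definition above concrete, here is a minimal Monte-Carlo sketch (Python, illustrative names) for estimating the regret of a learning agent from simulated runs; it assumes the benchmark average reward is computed separately.

```python
import numpy as np

def empirical_regret(run_agent, J_bench, T, n_runs=50):
    """Monte-Carlo estimate of R(T) = E[ T * J_bench - sum_t r_t ].

    run_agent(T) -> array of the T per-step rewards collected by the
                    learning policy in one independent run.
    J_bench      -> benchmark average reward (e.g. the optimal J*).
    """
    regrets = []
    for _ in range(n_runs):
        rewards = run_agent(T)
        regrets.append(T * J_bench - np.sum(rewards))
    # Return the mean regret estimate and its standard error.
    return np.mean(regrets), np.std(regrets) / np.sqrt(n_runs)
```

The same estimator applies later in the talk's setting by replacing \(J^\star\) with \(J(μ)\), the average reward of the Whittle index policy.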

Review: Regret of RL in MDPs

  • Lower bound \(\tilde Ω(\underline κ \sqrt{\S\A T})\)
  • Upper bound \(\OO(\bar κ \S \sqrt{\A T})\)
  • Here \(\S = |\mathcal S|\), \(\A = |\mathcal A|\), and \(\underline κ\), \(\bar κ\) are model-dependent constants.

Implication for RMAB

  • Consider an RMAB with \(n\) arms, of which at most \(m\) can be activated at each time. Viewed as a single MDP, the state-space size is \(\prod_{i=1}^n \S_i\) and the action-space size is \(\binom{n}{m}\), so the upper bound above gives \[ \text{Regret} = \OO\biggl( \prod_{i=1}^n \S_i \sqrt{ \binom{n}{m} T } \biggr) \]
  • The regret upper bound is exponential in \(n\) (curse of dimensionality); a quick numerical illustration follows below.
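
As a numerical illustration of this growth (hypothetical numbers, not from the talk): with \(\S_i = 5\) states per arm, \(m = 1\), and \(T = 10^6\), the leading factor \(\prod_i \S_i = 5^n\) dominates the bound already for moderate \(n\).

```python
from math import comb, sqrt

# Hypothetical parameters: 5 states per arm, activate m = 1 arm, T = 1e6.
# The upper bound scales as (prod_i S_i) * sqrt(C(n, m) * T).
S_i, m, T = 5, 1, 10**6
for n in [2, 5, 10, 20]:
    bound = (S_i ** n) * sqrt(comb(n, m) * T)
    print(f"n = {n:2d}: ~{bound:.2e}")
```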

Are we doomed?

Does the regret definition make sense?

  • Why consider regret w.r.t. the optimal policy?

  • Better to consider regret w.r.t. the Whittle index policy

    \[ \mathcal R(T; π) = \EXP^{\pi}\biggl[ T J(\mu) - \sum_{t=1}^T r(S_t, A_t) \biggr] \] where \(μ\) is the Whittle index policy.

  • How does this regret scale with the number of arms?

Let’s check that experimentally

  • Environment A: Machine maintenance model (a minimal sketch of one such arm follows this list)

    • Machines deteriorate stochastically over time
    • Each machine incurs a cost that depends on its state
    • One repairman, who can replace a machine at a cost.
  • Environment B: Link scheduling model

    • Multiple queues communicating on a common channel
    • Only one queue can be scheduled at a time
    • The others incur a holding cost that depends on their queue lengths
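
To fix ideas before the experiments, here is a minimal sketch of a single arm in a machine-maintenance-style environment (Python; the deterioration probability and costs are hypothetical placeholders, not the parameters used in the experiments).

```python
import numpy as np

class MachineArm:
    """One restless arm: the machine's condition deteriorates over time.

    States 0..n_states-1 order the condition from 'new' to 'worn out'.
    Passive (active=0): deteriorate by one state with probability p and
    pay a holding cost that grows with the state.
    Active (active=1): replace the machine (reset to state 0) and pay a
    fixed repair cost.
    """

    def __init__(self, n_states=5, p=0.3, repair_cost=3.0, rng=None):
        self.n_states = n_states
        self.p = p
        self.repair_cost = repair_cost
        self.rng = rng or np.random.default_rng()
        self.state = 0

    def step(self, active):
        if active:
            cost = self.repair_cost
            self.state = 0
        else:
            cost = float(self.state)     # holding cost grows with wear
            if self.rng.random() < self.p:
                self.state = min(self.state + 1, self.n_states - 1)
        return self.state, -cost         # reward = negative cost
```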

Regret of QWI on Env A

Plots: Regret/\(\sqrt{T}\) vs \(T\); Regret vs \(n\).

Regret of QWI on Env B