Scalable regret via Thompson sampling
McGill University
20 Nov, 2023
\(\def\EXP{\mathbb{E}} \def\IND{\mathbb{1}} \def\PR{\mathbb{P}} \def\ALPHABET{\mathcal} \def\S{\mathsf{S}} \def\A{\mathsf{A}} \def\REGRET{\mathcal{R}} \def\bs{\boldsymbol} \def\OO{\tilde{\mathcal{O}}}\)
Suppose the dynamics of the arms are unknown.
Our motivation: Using RMABs for operator allocation in human-robot teams.
Naive model-free RL cannot exploit the structure of RMABs … but see Avrachenkov and Borkar (2022) for using Q-learning to learn the Whittle indices.
The curse of dimensionality pops up again when we consider regret.
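As a rough illustration of that idea, here is a minimal single-arm sketch of Whittle-index Q-learning. The simulator `env_step`, the step sizes `alpha` and `beta`, and the discounted-reward formulation are illustrative assumptions, not the exact scheme of Avrachenkov and Borkar (2022).

```python
import numpy as np

def whittle_index_qlearning(env_step, n_states, episodes=100_000,
                            gamma=0.99, alpha=0.1, beta=0.01):
    """Tabular two-timescale sketch: for every reference state s_ref,
    learn a subsidy lam[s_ref] for the passive action so that the
    passive and active Q-values at s_ref coincide; that subsidy is an
    estimate of the Whittle index of s_ref.
    `env_step(s, a) -> (reward, next_state)` simulates a single arm."""
    Q = np.zeros((n_states, n_states, 2))   # Q[s_ref, s, a]
    lam = np.zeros(n_states)                # candidate Whittle indices
    for _ in range(episodes):
        s, a = np.random.randint(n_states), np.random.randint(2)
        r, s_next = env_step(s, a)
        for s_ref in range(n_states):
            # The passive action (a == 0) additionally earns the subsidy lam[s_ref].
            r_eff = r + (lam[s_ref] if a == 0 else 0.0)
            target = r_eff + gamma * Q[s_ref, s_next].max()
            Q[s_ref, s, a] += alpha * (target - Q[s_ref, s, a])
        # Slower timescale: move lam until Q(s_ref, s_ref, 1) == Q(s_ref, s_ref, 0).
        idx = np.arange(n_states)
        lam += beta * (Q[idx, idx, 1] - Q[idx, idx, 0])
    return lam
```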
Regret of a history-dependent and randomized learning policy \(π\): \[ \mathcal R(T; π) = \EXP^{\pi}\biggl[ T J^\star - \sum_{t=1}^T r(S_t, A_t) \biggr]. \]
Various characterizations of regret in the literature: UCRL, REGAL, SCAL, TSDE, …
Why consider regret w.r.t. the optimal policy?
Better to consider regret w.r.t. the Whittle index policy:
\[ \mathcal R(T; π) = \EXP^{\pi}\biggl[ T J(\mu) - \sum_{t=1}^T r(S_t, A_t) \biggr] \] where \(μ\) is the Whittle index policy.
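A Monte-Carlo sketch of how this regret can be estimated empirically is below; the rollout helpers `run_learner` and `run_whittle` are hypothetical placeholders.

```python
import numpy as np

def empirical_regret(run_learner, run_whittle, horizon, n_runs=100):
    """Monte-Carlo estimate of R(T; pi) = T * J(mu) - E[ sum_t r_t ].

    `run_learner(T)` and `run_whittle(T)` are assumed to return the
    per-step reward sequence of the learning policy and of the Whittle
    index policy mu, respectively; J(mu) is approximated by the average
    per-step reward of the Whittle index policy over the same horizon."""
    learner_return = np.mean([run_learner(horizon).sum() for _ in range(n_runs)])
    J_mu = np.mean([run_whittle(horizon).mean() for _ in range(n_runs)])
    return horizon * J_mu - learner_return
```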
How does this regret scale with the number of arms?
Environment A: Machine maintenance model
Environment B: Link scheduling model
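For reference, a generic episodic posterior-sampling (Thompson sampling) loop for a restless bandit is sketched below. The arm simulators in `envs`, the helper `whittle_policy`, and the fixed episode length are assumptions for illustration, not necessarily the algorithm analysed in this talk.

```python
import numpy as np

def thompson_sampling_rmab(envs, whittle_policy, horizon, episode_len=50):
    """Generic episodic posterior-sampling loop for a restless bandit.

    Each arm simulator in `envs` is assumed to expose `n_states` and
    `step(s, a) -> (reward, next_state)`; `whittle_policy(models, states)`
    is assumed to map sampled transition models and the current joint
    state to a vector of binary actions (one per arm)."""
    counts = [np.ones((e.n_states, 2, e.n_states)) for e in envs]  # Dirichlet counts
    states = [0] * len(envs)
    total_reward, t = 0.0, 0
    while t < horizon:
        # Start of episode: sample one transition model per arm from the posterior.
        models = [np.array([[np.random.dirichlet(c[s, a]) for a in range(2)]
                            for s in range(c.shape[0])])
                  for c in counts]
        for _ in range(min(episode_len, horizon - t)):
            actions = whittle_policy(models, states)
            for i, a in enumerate(actions):
                r, s_next = envs[i].step(states[i], a)
                counts[i][states[i], a, s_next] += 1  # Bayesian update of the counts
                total_reward += r
                states[i] = s_next
            t += 1
    return total_reward
```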