Reinforcement learning in restless bandits

Scalable regret via Thompson sampling

Aditya Mahajan

McGill University

20 Nov, 2023

Motivation

  • Restless multi-armed bandits are a popular framework for solving resource allocation and scheduling problems.

Salient features

  • The Whittle index policy avoids the curse of dimensionality (in the number of arms); a minimal sketch of the index-based arm selection is shown after this list.
  • It is a heuristic, but it is theoretically motivated and performs well in a large class of models. See Gast et al. (2023) for asymptotic optimality guarantees.
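
To make the per-arm structure concrete, below is a minimal sketch (in Python; the names are illustrative, not from the talk) of how an index policy acts: each arm's index is computed from that arm's own state, and the \(m\) arms with the largest indices are activated. The per-arm index functions are assumed to be precomputed from the arm models.

```python
import numpy as np

def whittle_action(states, index_fns, m):
    """Activate the m arms with the largest Whittle indices.

    states    : list of per-arm states
    index_fns : index_fns[i](s) returns the (precomputed) Whittle
                index of arm i in state s
    m         : number of arms that may be activated
    """
    # Each index depends only on that arm's own state, so the cost is
    # linear in the number of arms ...
    indices = np.array([f(s) for f, s in zip(index_fns, states)])
    # ... and the joint decision reduces to a top-m selection.
    active = np.argsort(indices)[-m:]
    action = np.zeros(len(states), dtype=int)
    action[active] = 1
    return action
```

With tabulated indices, e.g. `index_fns[i] = lambda s, i=i: W[i][s]` for hypothetical per-arm tables `W`, the per-step complexity is \(O(n \log n)\) rather than exponential in \(n\).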

Limitation

  • Assumes perfect knowledge of the system model!

\(\def\EXP{\mathbb{E}} \def\IND{\mathbb{1}} \def\PR{\mathbb{P}} \def\ALPHABET{\mathcal} \def\S{\mathsf{S}} \def\A{\mathsf{A}} \def\REGRET{\mathcal{R}} \def\bs{\boldsymbol} \def\OO{\tilde{\mathcal{O}}}\)

RMAB with unknown system model

  • Suppose the dynamics of the arms are unknown.

  • Our motivation: Using RMABs for operator allocation in human-robot teams.

  • Can we use reinforcement learning (RL) to learn good policies?

Two challenges

  • Naive model-free RL cannot exploit the structure of RMABs … but see Avrachenkov and Borkar (2022) for using Q-learning (QL) to learn Whittle indices (WI); a simplified sketch of the idea follows this list.

  • The curse of dimensionality pops up again when we consider regret.
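
For context on the idea referenced above, here is a heavily simplified two-timescale sketch (Python; discounted criterion, illustrative names and step sizes; not the exact algorithm from Avrachenkov and Borkar (2022)): Q-values of a single arm with a passivity subsidy \(λ\) are learned on a fast timescale, while \(λ\) is adjusted slowly until the arm is indifferent between active and passive in a chosen state; the fixed point is an estimate of that state's Whittle index.

```python
import numpy as np

def learn_whittle_index(step, n_states, target_state,
                        iters=20_000, gamma=0.95):
    """Toy two-timescale scheme for a single arm.

    step(s, a) -> (next_state, reward) simulates one arm transition.
    Q-values (fast timescale) use a subsidy lam for the passive action;
    lam (slow timescale) is driven towards the value that makes the two
    actions indifferent at `target_state`.
    """
    Q = np.zeros((n_states, 2))   # Q[s, 0] = passive, Q[s, 1] = active
    lam = 0.0                     # current subsidy (index) estimate
    s = 0
    for k in range(1, iters + 1):
        alpha = 1.0 / (1.0 + k ** 0.6)   # fast step size (Q-values)
        beta = 1.0 / (1.0 + k)           # slow step size (subsidy)
        a = np.random.randint(2)         # uniform exploration
        s_next, r = step(s, a)
        # The subsidy lam is added to the reward of the passive action.
        target = r + (lam if a == 0 else 0.0) + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        # If active is still preferred at target_state, raise the subsidy.
        lam += beta * (Q[target_state, 1] - Q[target_state, 0])
        s = s_next
    return lam
```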

Review: Regret of RL in MDPs

  • Consider an average-cost MDP \(\langle \mathcal{S}, \mathcal A, P, r\rangle\) with optimal performance: \[ J^\star = \sup_{ \pi \in \Pi} \lim_{T \to \infty} \frac 1T \EXP^{\pi} \biggl[ \sum_{t=1}^T r(S_t, A_t) \biggr] \]
  • Regret of a history-dependent and randomized learning policy \(π\) (an empirical estimate is sketched after this list): \[ \mathcal R(T; π) = \EXP^{\pi}\biggl[ T J^\star - \sum_{t=1}^T r(S_t, A_t) \biggr]. \]

  • Various algorithms with regret guarantees in the literature: UCRL, REGAL, SCAL, TSDE, …
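
To make the definition above concrete, here is a minimal Monte-Carlo sketch (Python, illustrative names) for estimating the regret of a learning agent from simulated runs; it assumes the benchmark average reward is computed separately.

```python
import numpy as np

def empirical_regret(run_agent, J_bench, T, n_runs=50):
    """Monte-Carlo estimate of R(T) = E[ T * J_bench - sum_t r_t ].

    run_agent(T) -> array of the T per-step rewards collected by the
                    learning policy in one independent run.
    J_bench      -> benchmark average reward (e.g. the optimal J*).
    """
    regrets = []
    for _ in range(n_runs):
        rewards = run_agent(T)
        regrets.append(T * J_bench - np.sum(rewards))
    # Return the mean regret estimate and its standard error.
    return np.mean(regrets), np.std(regrets) / np.sqrt(n_runs)
```

The same estimator applies later in the talk's setting by replacing \(J^\star\) with \(J(μ)\), the average reward of the Whittle index policy.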

Review: Regret of RL in MDPs

  • Lower bound \(\tilde Ω(\underline κ \sqrt{\S\A T})\)
  • Upper bound \(\OO(\bar κ \S \sqrt{\A T})\)
  • Here \(\S = |\mathcal S|\), \(\A = |\mathcal A|\), and \(\underline κ\), \(\bar κ\) are model-dependent constants.

Implication for RMAB

  • Consider an RMAB with \(n\) arms, of which at most \(m\) can be activated at each time. Viewed as a single MDP, the state-space size is \(\prod_{i=1}^n \S_i\) and the action-space size is \(\binom{n}{m}\), so the upper bound above gives \[ \text{Regret} = \OO\biggl( \prod_{i=1}^n \S_i \sqrt{ \binom{n}{m} T } \biggr) \]
  • The regret upper bound is exponential in \(n\) (curse of dimensionality); a quick numerical illustration follows below.
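
As a numerical illustration of this growth (hypothetical numbers, not from the talk): with \(\S_i = 5\) states per arm, \(m = 1\), and \(T = 10^6\), the leading factor \(\prod_i \S_i = 5^n\) dominates the bound already for moderate \(n\).

```python
from math import comb, sqrt

# Hypothetical parameters: 5 states per arm, activate m = 1 arm, T = 1e6.
# The upper bound scales as (prod_i S_i) * sqrt(C(n, m) * T).
S_i, m, T = 5, 1, 10**6
for n in [2, 5, 10, 20]:
    bound = (S_i ** n) * sqrt(comb(n, m) * T)
    print(f"n = {n:2d}: ~{bound:.2e}")
```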

Are we doomed?

Does the regret definition make sense?

  • Why consider regret w.r.t. the optimal policy?

  • Better to consider regret w.r.t. the Whittle index policy

    \[ \mathcal R(T; π) = \EXP^{\pi}\biggl[ T J(\mu) - \sum_{t=1}^T r(S_t, A_t) \biggr] \] where \(μ\) is the Whittle index policy.

  • How does this regret scale with the number of arms?

Let’s check that experimentally

  • Environment A: Machine maintenance model (a minimal sketch of one such arm follows this list)

    • Machines deteriorate stochastically over time
    • Each machine incurs a cost that depends on its state
    • One repairman, who can replace a machine at a cost.
  • Environment B: Link scheduling model

    • Multiple queues communicating on a common channel
    • Only one queue can be scheduled at a time
    • The others incur a holding cost that depends on their queue lengths
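
To fix ideas before the experiments, here is a minimal sketch of a single arm in a machine-maintenance-style environment (Python; the deterioration probability and costs are hypothetical placeholders, not the parameters used in the experiments).

```python
import numpy as np

class MachineArm:
    """One restless arm: the machine's condition deteriorates over time.

    States 0..n_states-1 order the condition from 'new' to 'worn out'.
    Passive (active=0): deteriorate by one state with probability p and
    pay a holding cost that grows with the state.
    Active (active=1): replace the machine (reset to state 0) and pay a
    fixed repair cost.
    """

    def __init__(self, n_states=5, p=0.3, repair_cost=3.0, rng=None):
        self.n_states = n_states
        self.p = p
        self.repair_cost = repair_cost
        self.rng = rng or np.random.default_rng()
        self.state = 0

    def step(self, active):
        if active:
            cost = self.repair_cost
            self.state = 0
        else:
            cost = float(self.state)     # holding cost grows with wear
            if self.rng.random() < self.p:
                self.state = min(self.state + 1, self.n_states - 1)
        return self.state, -cost         # reward = negative cost
```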

Regret of QWI on Env A

Plots: Regret/\(\sqrt{T}\) vs \(T\); Regret vs \(n\).

Regret of QWI on Env B