Scalable regret via Thompson sampling
McGill University
20 Nov, 2023
\(\def\EXP{\mathbb{E}} \def\IND{\mathbb{1}} \def\PR{\mathbb{P}} \def\ALPHABET{\mathcal} \def\S{\mathsf{S}} \def\A{\mathsf{A}} \def\REGRET{\mathcal{R}} \def\bs{\boldsymbol} \def\OO{\tilde{\mathcal{O}}}\)
Suppose the dynamics of the arms are unknown.
Our motivation: Using RMABs for operator allocation in human-robot teams.
Naive model-free RL cannot exploit the structure of RMABs … but see Avrachenkov and Borkar (2022) for using Q-learning to learn the Whittle index.
The curse of dimensionality pops up again when we consider regret
Regret of a history-dependent and randomized learning policy \(π\): \[ \mathcal R(T; π) = \EXP^{\pi}\biggl[ T J^\star - \sum_{t=1}^T r(S_t, A_t) \biggr]. \]
Various characterizations of regret in the literature: UCRL, REGAL, SCAL, TSDE, …
Why consider regret w.r.t. the optimal policy?
Better to consider regret w.r.t. the Whittle index policy
\[ \mathcal R(T; π) = \EXP^{\pi}\biggl[ T J(\mu) - \sum_{t=1}^T r(S_t, A_t) \biggr] \] where \(μ\) is the Whittle index policy.
How does this regret scale with number of arms?
Environment A: Machine maintenance model
Environment B: Link scheduling model
\(n\) arms: \(\langle \ALPHABET S^i, \ALPHABET A^i = \{0, 1\}, P^i, r^i \rangle\)
Global state \(\boldsymbol S_t = (S^1_t, \dots, S^n_t)\) and global action \(\boldsymbol A_t = (A^1_t, \dots, A^n_t)\).
An action is feasible if \(\| \boldsymbol A_t \|_1 = m\).
Rewards: Two reward models
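To make the model concrete, here is a minimal environment sketch (not from the talk), assuming for illustration that the global reward is the sum of the per-arm rewards \(r^i(S^i_t, A^i_t)\); the class `RestlessBandit` and its methods are hypothetical names.

```python
import numpy as np

class RestlessBandit:
    """Minimal n-arm restless bandit (a sketch). Arm i has transition kernels
    P[i][a] of shape (S_i, S_i) for a in {0, 1} and rewards r[i] of shape (S_i, 2)."""

    def __init__(self, P, r, m, rng=None):
        self.P, self.r, self.m = P, r, m              # per-arm models and activation budget m
        self.n = len(P)
        self.rng = rng or np.random.default_rng()
        self.state = np.zeros(self.n, dtype=int)      # global state (S^1_t, ..., S^n_t)

    def step(self, action):
        """Apply a feasible global action (||A_t||_1 = m); every arm evolves, active or not."""
        action = np.asarray(action)
        assert action.sum() == self.m, "exactly m arms must be activated"
        reward = 0.0
        for i in range(self.n):
            s, a = self.state[i], action[i]
            reward += self.r[i][s, a]                 # sum-of-per-arm-rewards model (assumed)
            next_dist = self.P[i][a][s]               # row P^i(. | s, a)
            self.state[i] = self.rng.choice(len(next_dist), p=next_dist)
        return self.state.copy(), reward
```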
Let \(\Pi\) denote the set of all randomized history-dependent policies.
\[ \max_{\pi \in \Pi} \liminf_{T \to ∞} \frac 1T \EXP^π\biggl[ \sum_{t=1}^T \boldsymbol r(\boldsymbol S_t, \boldsymbol A_t) \biggr] \quad \text{s.t. } \| \boldsymbol A_t \|_1 = m, \forall t \]
Replace the hard constraint \(\| \boldsymbol A_t \|_1 = m, \forall t\) by the soft constraint
\[ \limsup_{T \to ∞} \frac 1T \EXP^π \biggl[ \sum_{t=1}^T \| \boldsymbol A_t \|_1 \biggr] = m. \]
Key intuition: The Lagrange relaxation of the soft-constraint problem is equivalent to \(n\) decoupled optimization problems: for all \(i \in [n]\) \[ \max_{ π^i_{λ} \colon \ALPHABET S^i \to \{0, 1\} } \liminf_{T \to ∞} \frac 1T \EXP^{ π^i_{λ} } \biggl[ \sum_{t=1}^T \Bigl[ r^i(S^i_t, A^i_t) - λ A^i_t \Bigr] \biggr]. \]
Let \(π^i_{λ}\) be an optimal policy for this problem.
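Under indexability (assumption A1 below), the Whittle index of a state is the critical charge \(λ\) at which \(π^i_{λ}\) switches from active to passive in that state. A minimal sketch of computing it by bisection over \(λ\), with relative value iteration for the penalised single-arm problem (function names and the bracketing interval are illustrative):

```python
import numpy as np

def rvi(P, r, iters=5000, tol=1e-10):
    """Relative value iteration for a unichain average-reward MDP.
    P[a] is the (S, S) transition matrix of action a; r is (S, A) rewards.
    Returns the greedy (optimal) action in each state at convergence."""
    S, A = r.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = np.stack([r[:, a] + P[a] @ v for a in range(A)], axis=1)   # Q-values, shape (S, A)
        v_new = q.max(axis=1)
        v_new = v_new - v_new[0]                                       # keep the bias bounded
        if np.abs(v_new - v).max() < tol:
            v = v_new
            break
        v = v_new
    return q.argmax(axis=1)

def whittle_index(P, r, s, lam_lo=-10.0, lam_hi=10.0, iters=50):
    """Whittle index of state s for one arm: the critical activation charge lambda at
    which the lambda-penalised decoupled problem becomes indifferent in state s.
    Assumes indexability (A1); the bracket [lam_lo, lam_hi] is illustrative."""
    for _ in range(iters):
        lam = 0.5 * (lam_lo + lam_hi)
        r_lam = r.copy()
        r_lam[:, 1] -= lam                  # charge lambda for taking the active action
        if rvi(P, r_lam)[s] == 1:
            lam_lo = lam                    # activation still optimal: raise the charge
        else:
            lam_hi = lam
    return 0.5 * (lam_lo + lam_hi)
```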
Assumption A1: For all \(i \in [n]\) and \(θ^i \in Θ^i\), the RB \(\langle \ALPHABET S^i, \{0, 1\}, P^i(θ^i), r^i \rangle\) is indexable.
Assumption A2: For all \(\pmb θ = (θ^1, \dots, θ^n) \in \pmb Θ\), the Whittle index policy \(μ_{\pmb θ}\) has a well-defined DP solution \((J_{\pmb θ}, \boldsymbol V_{\pmb θ})\). \(\implies\) \(J_{\pmb θ}\) is independent of the initial state
The ergodicity coefficient of \(\pmb{P_{θ}}\) (transitions under WI policy) is defined as \[ λ_{\pmb{P_{θ}}} = 1 - \min_{\boldsymbol{s, s' \in \ALPHABET S}} \sum_{\boldsymbol {z \in \ALPHABET S}} \min\bigl\{ \pmb{P_{θ}}(\pmb{z} \mid \pmb{s}), \pmb{P_{θ}}(\pmb{z} \mid \pmb{s'}) \bigr\}. \]
Assumption A3: There exists a \(λ^\star < 1\) such that \(\sup_{\pmb{θ} \in \pmb{Θ}} λ_{\boldsymbol{P}_{\pmb θ}} \le λ^\star\)
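For small instances, the ergodicity coefficient of the closed-loop matrix \(\pmb{P_{θ}}\) can be checked numerically; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def ergodicity_coefficient(P):
    """Ergodicity coefficient of a stochastic matrix P (rows are distributions):
    lambda_P = 1 - min_{s, s'} sum_z min{ P(z|s), P(z|s') }."""
    S = P.shape[0]
    overlap = min(np.minimum(P[s], P[sp]).sum() for s in range(S) for sp in range(S))
    return 1.0 - overlap

# Example: a strictly positive closed-loop kernel has coefficient strictly below 1.
P = np.array([[0.6, 0.4],
              [0.3, 0.7]])
print(ergodicity_coefficient(P))   # 0.3; (A3) asks that the sup over all models stays below 1
```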
\(\phi^i_{t}\) denotes the posterior on \(θ^i_\star\) at time \(t\): \[ \phi^i_{t+1}(d θ) = \frac{ P^i(s^i_{t+1} \mid s^i_t, a^i_t; θ) \phi^i_t(d θ) } { \int P^i(s^i_{t+1} \mid s^i_t, a^i_t; \tilde θ) \phi^i_t(d \tilde θ) } \]
If the prior is a conjugate distribution on \(Θ^i\), then the posterior can be updated in closed form.
The exact form of the posterior and the posterior update are not important for regret analysis.
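For implementation, with finite per-arm state spaces and an independent Dirichlet prior on each row \(P^i(\cdot \mid s, a)\), the closed-form update above reduces to incrementing a transition count. A minimal sketch (class and method names are illustrative):

```python
import numpy as np

class DirichletArmPosterior:
    """Independent Dirichlet prior on each row P^i(. | s, a) of one arm's kernel (a sketch).
    The conjugate posterior update is a count increment; sampling gives a Thompson draw."""

    def __init__(self, num_states, num_actions=2, prior=1.0):
        # alpha[s, a, :] are the Dirichlet parameters of the row P^i(. | s, a)
        self.alpha = np.full((num_states, num_actions, num_states), prior)

    def update(self, s, a, s_next):
        self.alpha[s, a, s_next] += 1.0          # closed-form conjugate update

    def sample(self, rng):
        """Draw one transition kernel theta^i ~ posterior, shape (S, A, S)."""
        return np.array([[rng.dirichlet(self.alpha[s, a])
                          for a in range(self.alpha.shape[1])]
                         for s in range(self.alpha.shape[0])])
```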
Theorem 1: Under assumptions (A1)–(A3): \[ \ALPHABET R(T; \texttt{RB-TSDE}) < 40 α \frac {R_{\max}}{1 - λ^\star} \bar \S_n \sqrt{T \log T} \]
Corollary 1: Under assumptions (A1)–(A3):
Theorem 2: Under assumptions (A1)–(A4): \[ \ALPHABET R(T; \texttt{RB-TSDE}) < 12 \max\bigl\{ n \sqrt{\bar \S_n}, \bar \S_n \bigr\} \max\bigl\{ R_{\max}, D_{\max} \boldsymbol L_v \bigr\} \sqrt{K T \log T} \]
Corollary 2: Under assumptions (A1)–(A4):
Regret of TSDE is evaluated as follows:
\[\begin{align*} \REGRET(T) &= \underbrace{ \EXP\biggl[ T J_\star - \sum_{k=1}^{K_T} T_k J_k \biggr]}_{\text{regret due to sampling error $:= \REGRET_0(T)$}} \\ &\quad + \underbrace{ \EXP\biggl[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1} - 1} \bs{V}_k(\bs{S}_{t+1}) - \bs{V}_k(\bs{S}_{t}) \biggr]}_{\text{regret due to time-varying policy $:= \REGRET_1(T)$}} \\ &\quad + \underbrace{ \EXP\biggl[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1} - 1} \bigl[ \bs{P}_k \bs{V}_k \bigr](\bs{S}_{t}) - \bs{V}_k(\bs{S}_{t+1}) \biggr]}_{\text{regret due to model mismatch $:= \REGRET_2(T)$}}. \end{align*}\]
where \(K_T =\) number of episodes until time \(T\).
Bound on \(\REGRET_0(T)\): \(\displaystyle\qquad \EXP\biggl[ T J_\star - \sum_{k=1}^{K_T} T_k J_k \biggr] \le J_{\star} \EXP[K_T]\)
Bound on \(\REGRET_1(T)\):
\[ \EXP\biggl[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1} - 1} \bs{V}_k(\bs{S}_{t+1}) - \bs{V}_k(\bs{S}_{t}) \biggr] \le \EXP\biggl[\sum_{k=1}^{K_T} \text{sp}(\bs V_k) \biggr] \]
Bound on \(\REGRET_2(T)\):
\[ \EXP\Bigl[ \bigl[ \bs{P}_k \bs{V}_k \bigr](\bs{S}_{t}) - \bs{V}_k(\bs{S}_{t+1}) \Bigr] \le \EXP\Bigl[ \tfrac 12 \text{sp}(\bs V_k) \bigl\| \bs P_k(\cdot \mid \bs S_t, \bs A_t) - \bs P_{\star}( \cdot \mid \bs S_t, \bs A_t ) \bigr\|_{1} \Bigr] \]
Thus, to bound the regret, we need to bound \(\EXP[K_T]\), the span \(\text{sp}(\bs V_k)\), and the estimation error \(\bigl\| \bs P_k(\cdot \mid \bs S_t, \bs A_t) - \bs P_{\star}(\cdot \mid \bs S_t, \bs A_t) \bigr\|_{1}\).
General negative result. It is shown in Khun (2023) that there exist RMABs for which \(\text{sp}(\bs V_{\star}) \ge C\) for an arbitrary \(C\).
We impose (A3) to bound \(\text{sp}(\bs V_k)\). Under (A3), the average-cost Bellman operator is a contraction with contraction factor \(λ^{\star}\), and we can show \[ \text{sp}(\bs V_k) \le 2 α R_{\max}/ (1 - λ^{\star}). \]
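A quick numerical sanity check of this kind of span bound for a fixed closed-loop chain, using the generic inequality \(\text{sp}(\bs V) \le 2 \max_{\bs s} |\bs r(\bs s)| / (1 - λ_{\bs P})\); the constant \(α R_{\max}\) in the display above depends on the reward normalisation used in the talk, so this sketch is only illustrative:

```python
import numpy as np

def bias_and_span(P, r, iters=10000, tol=1e-12):
    """Bias (relative value function) of a fixed closed-loop chain (P, r) via
    relative value iteration; returns the bias and its span sp(V) = max V - min V."""
    v = np.zeros(len(r))
    for _ in range(iters):
        v_new = r + P @ v
        v_new = v_new - v_new[0]
        if np.abs(v_new - v).max() < tol:
            v = v_new
            break
        v = v_new
    return v, v.max() - v.min()

# Closed-loop chain under some fixed policy, with bounded per-step reward.
P = np.array([[0.6, 0.4], [0.3, 0.7]])
r = np.array([1.0, 0.0])
lam = 1 - min(np.minimum(P[i], P[j]).sum() for i in range(2) for j in range(2))
v, span = bias_and_span(P, r)
print(span, 2 * np.abs(r).max() / (1 - lam))   # span (~1.43) stays below the bound (~2.86)
```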
How can we verify A3? Ergodicity coefficient is upper bounded by contraction factor
\[ \bar λ_{\pmb θ} = 1 - \min_{\substack{ \bs{s, s' \in \ALPHABET S} \\ \bs {a, a' \in \ALPHABET A}(m)}} \sum_{\bs{z \in \ALPHABET S}} \min\bigl\{ \bs P_{\pmb θ}(\bs z \mid \bs s, \bs a), \bs P_{\pmb θ}(\bs z \mid \bs s', \bs a') \bigr\} \]
Similarly define \(\bar λ^i_{θ^i}\) for an arm.
Assumption (A3’). There exist \(τ^\star\) and \(λ^\star < 1\) such that for all \(\pmb θ \in \pmb Θ\), \[λ_{\bs P^{\color{red}{τ^*}}_{\pmb θ}} \le λ^\star.\]
Under (A3’), \[ \text{sp}(\bs V_{θ}) \le 2 τ^\star α R_{\max}/ (1 - λ^\star). \]
Doesn’t really remove potential exponential dependence on \(n\)
But may be easier to verify (e.g., reach a distinguished state in \(τ^*\) steps)
Moreover \(\text{sp}(\bs V_k) \le \bs L_v n D_{\max}\)
No dependence on ergodicity coefficient!
Why \(n^{1.5}\) and not \(n\)?
Continuous state space?
What about non-indexable models?
Other low-complexity heuristics
More general models (weakly coupled MDPs)
Possible to characterize regret w.r.t. the optimal policy
Nonetheless, the theory goes through, with similar bounds on regret.
RB-TSDE exploits the structure of the dynamics and maintains separate priors on \(\{ θ^i_{\star} \}_{i \in [n]}\)
Needs \(2n \S^2\) parameters rather than \(\S^n\) parameters.
Implications:
More nuanced stopping condition for each episode and, therefore, tighter bound on number of episodes
Faster convergence of estimates \(\{ \hat θ^i_k \}_{i \in [n]}\) to \(\{ θ^i_{\star} \}_{i \in [n]}\).
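Putting the pieces together, a schematic of the learning loop these points describe, built from the illustrative helpers sketched earlier (`RestlessBandit`, `whittle_index`, `DirichletArmPosterior`). The episode-stopping rule shown is a standard TSDE-style one (end an episode when it outlasts the previous one or some arm's visit count doubles) and is only a stand-in for the more nuanced condition mentioned above.

```python
import numpy as np

def rb_tsde(env, posteriors, horizon, m, rng):
    """Schematic RB-TSDE-style loop (a sketch, not the exact algorithm from the talk):
    separate per-arm posteriors, episodic Thompson sampling, Whittle-index actions."""
    state = env.state.copy()
    t, episode_len, prev_len = 0, 0, 0
    counts = [np.zeros(p.alpha.shape[:2]) for p in posteriors]   # per-arm (s, a) visit counts
    rewards = []

    while t < horizon:
        # Start of an episode: draw one model per arm from its posterior.
        thetas = [p.sample(rng) for p in posteriors]
        counts_start = [c.copy() for c in counts]
        prev_len, episode_len = episode_len, 0

        while t < horizon:
            # Activate the m arms with the largest Whittle indices at their current states.
            wi = [whittle_index(thetas[i].transpose(1, 0, 2), env.r[i], state[i])
                  for i in range(env.n)]
            action = np.zeros(env.n, dtype=int)
            action[np.argsort(wi)[-m:]] = 1

            next_state, reward = env.step(action)
            rewards.append(reward)
            for i in range(env.n):                               # per-arm posterior/count updates
                posteriors[i].update(state[i], action[i], next_state[i])
                counts[i][state[i], action[i]] += 1
            state, t, episode_len = next_state, t + 1, episode_len + 1

            # Illustrative TSDE-style stopping rule: end the episode when it outlasts the
            # previous one, or when some arm's (s, a) visit count has doubled.
            doubled = any((counts[i] > 2 * np.maximum(counts_start[i], 1)).any()
                          for i in range(env.n))
            if episode_len > prev_len or doubled:
                break

    return rewards   # per-step rewards, e.g. to plot cumulative regret against T * J
```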