32  Rate of convergence for stochastic approximation

Updated

July 21, 2025

Keywords

reinforcement learning, stochastic approximation

32.1 Finite-time convergence rates for exponentially stable systems

Consider the simplest form of stochastic approximation \[ θ_{t+1} = θ_t + α_t \bigl[ f(θ_t) + ξ_{t+1} \bigr], \quad t \ge 0. \]
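Before stating the assumptions, it is useful to keep in mind what one iteration looks like in code. The following is a minimal Python sketch; the mean field \(f(θ) = -θ\), the step-size constants, and the Gaussian noise model are illustrative choices, not part of the development above.

```python
import numpy as np

# A minimal sketch of the recursion above. The mean field f, the step-size
# constants C and q, and the Gaussian noise model are illustrative choices.

rng = np.random.default_rng(0)

def f(theta):
    # hypothetical mean field with unique root theta* = 0
    return -theta

def run_sa(theta0, T, C=0.5, q=0.7, sigma=0.1):
    theta = np.asarray(theta0, dtype=float)
    for t in range(T):
        alpha = C / (t + 1) ** q                          # alpha_t = C / (t+1)^q
        xi = sigma * rng.standard_normal(theta.shape)     # martingale-difference noise
        theta = theta + alpha * (f(theta) + xi)           # theta_{t+1} = theta_t + alpha_t [f(theta_t) + xi_{t+1}]
    return theta

print(run_sa(theta0=[1.0, -2.0], T=10_000))   # the iterate approaches theta* = 0
```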

Suppose the following assumptions are satisfied.

F1. \(θ^*\) is a solution of the equation \(f(θ) = 0\).

F2. The function \(f\) is globally Lipschitz continuous with constant \(L\), i.e., for any \(θ_1, θ_2 \in \reals^d\), \[ \NORM{f(θ_1) - f(θ_2)}_{2} \le L \NORM{θ_1 - θ_2}_2. \]

N1. \(\{ξ_t\}_{t \ge 1}\) is a [martingale difference sequence][MDS] with respect to the filtration \(\{ \mathcal F_t\}_{t \ge 0}\), i.e., \[ \EXP[ ξ_{t+1} \mid \mathcal F_t ] = 0, \text{ a.s.}, \quad \forall t \ge 0. \]

N2. The noise \(\{ξ_t\}_{t \ge 1}\) satisfies \[ \EXP[ \| ξ_{t+1} \|_2^2 \mid \mathcal F_t ] \le σ^2 \bigl(1 + \NORM{θ_t - θ^*}_2^2 \bigr) \; \text{a.s.}, \quad \forall t \ge 0, \] for some finite constant \(σ^2\).

Furthermore, we assume that there exists a Lyapunov function \(V(θ)\) that satisfies the following properties:

V1. The Lyapunov function is upper and lower bounded by quadratic functions, i.e., there exist constants \(a, b > 0\) such that \[ a \| θ - θ^*\|_2^2 \le V(θ) \le b \| θ - θ^* \|_2^2, \quad \forall θ \in \reals^d. \]

V2. The Lyapunov function is \(M\)-smooth, i.e., its gradient is \(2M\)-Lipschitz: for all \(θ_1, θ_2 \in \reals^d\), \[ \NORM{ \GRAD V(θ_1) - \GRAD V(θ_2) }_2 \le 2M \NORM{θ_1 - θ_2}_2, \] which implies \(V(θ_2) \le V(θ_1) + \IP{\GRAD V(θ_1)}{θ_2 - θ_1} + M \NORM{θ_2 - θ_1}_2^2\).

V3. The Lyapunov function certifies exponential stability, i.e., there exists a \(ρ > 0\) such that for any \(θ \in \reals^d\), we have \[ \dot V(θ) \coloneqq \IP{\GRAD V(θ)}{f(θ)} \le - 2ρ V(θ). \] For \(V(θ) = \NORM{θ - θ^*}_2^2\), this is equivalent to \(\IP{θ - θ^*}{f(θ)} \le -ρ \NORM{θ - θ^*}_2^2\).

In addition, we impose the following assumption on the learning rates.

R0. The learning rates are deterministic.

For simplicity, we will state the rate of convergence for \(V(θ) = \NORM{θ - θ^*}_2^2\), which always satisfies (V1) and (V2). The following intermediate result is useful for the rate of convergence analysis. It is an immediate application of Proposition 37.1 and has been rediscovered multiple times in the literature.

Proposition 32.1 Let \(Δ_t = \EXP[ \NORM{θ_t - θ^*}_2^2 ]\). Under assumptions (F1), (F2), (N1), (N2), (R0), and \(V(θ) = \NORM{θ - θ^*}^2\) satisfying (V3), we have \[ Δ_T \le Φ_{0,T} Δ_0 + σ^2 \sum_{t=0}^{T-1} α_t^2 Φ_{t,T} \] where \(Φ_{t,T} = \prod_{s=t}^{T-1}(1 - 2 α_s ρ + α_s^2(L^2 + σ^2))\).

As discussed in the proof of Theorem 31.3, since \(V(θ)\) is \(M\)-smooth with \(M = 1\) (for \(V(θ) = \NORM{θ - θ^*}_2^2\), the expansion in fact holds with equality), we have \[ V(θ + η) \le V(θ) + \IP{\GRAD V(θ)}{η} + \NORM{η}_2^2. \]

Therefore, taking \(θ = θ_t\) and \(η = θ_{t+1} - θ_t = α_t f(θ_t) + α_t ξ_{t+1}\) gives \[\begin{align*} V(θ_{t+1}) &\le V(θ_t) + 2 α_t \IP{θ_t-θ^*}{f(θ_t)} + 2 α_t \IP{θ_t-θ^*}{ξ_{t+1}} \notag \\ &\quad + α_t^2 \bigl[ \NORM{f(θ_t)}_2^2 + \NORM{ξ_{t+1}}_2^2 + 2 \IP{f(θ_t)}{ξ_{t+1}} \bigr] \end{align*}\] Taking expectations and using (F1), (F2), (N1), (N2), and (V3), we get \[ Δ_{t+1} \le \bigl[1 - 2 α_t ρ + α_t^2 (L^2 + σ^2)\bigr] Δ_t + α_t^2 σ^2. \]

Therefore, the bound follows from Proposition 37.1.

A key idea for establishing the rate of convergence is that, for diminishing step sizes, eventually \(α_t < ρ/(L^2 + σ^2)\). Once that happens, \[ 1 - 2 α_t ρ + α_t^2(L^2 + σ^2) < 1 - α_t ρ. \] Therefore, after that time, \[ Φ_{t,T} \le \prod_{s=t}^{T-1}(1 - ρ α_s) \le \exp\left(-ρ \sum_{s=t}^{T-1} α_s \right). \]
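To get a feel for these quantities, here is a small numerical sketch that evaluates the bound of Proposition 32.1 for \(α_t = C/(t+1)^q\) and checks that it decays at the \(\mathcal{O}(1/T^q)\) rate suggested by the discussion below. The constants \(ρ, L, σ, C, q, Δ_0\) are illustrative, chosen so that \(C < ρ/(L^2 + σ^2)\).

```python
import numpy as np

# Sketch: evaluate the bound of Proposition 32.1 numerically. The constants
# (rho, L, sigma, C, q, Delta_0) are illustrative, with C < rho/(L^2 + sigma^2).

rho, L, sigma = 1.0, 1.0, 0.5
C, q, Delta0 = 0.3, 0.7, 4.0

def prop_32_1_bound(T):
    alpha = C / (np.arange(T) + 1.0) ** q
    gamma = 1 - 2 * rho * alpha + alpha ** 2 * (L ** 2 + sigma ** 2)
    Phi = np.cumprod(gamma[::-1])[::-1]             # Phi[t] = prod_{s=t}^{T-1} gamma_s
    return Phi[0] * Delta0 + sigma ** 2 * np.sum(alpha ** 2 * Phi)

for T in [100, 1_000, 10_000, 100_000]:
    print(T, prop_32_1_bound(T), prop_32_1_bound(T) * T ** q)
# the last column approaches a constant as T grows, consistent with O(1/T^q) decay
```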

To understand the rate of convergence for \(α_t = C/(t+1)^q\), we first present some elementary inequalities.

Lemma 32.1 \[ \sum_{t=0}^{T-1} \dfrac{1}{(t+1)^q} \le φ_q(T) \coloneqq \begin{cases} \dfrac{T^{1-q}}{1-q}, & q \in (0,1) \\ 1 + \log T, & q = 1 \\ \text{a finite constant independent of } T, & q > 1 \end{cases} \]

Moreover, for \(1 \le t < T\), \[ \frac 12 \bigl[ φ_q(T) - φ_q(t) \bigr] \le \sum_{s=t}^{T-1} \dfrac{1}{(s+1)^q} \le φ_q(T) - φ_q(t). \]

These bounds can be established by bounding sums by integrals.
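For instance, the upper bounds can be sanity-checked numerically (the particular values of \(q\) and \(T\) below are arbitrary):

```python
import numpy as np

# Quick numerical sanity check of the upper bounds in Lemma 32.1.

for T in [10, 100, 1000]:
    for q in [0.3, 0.5, 0.9]:
        s = np.sum(1.0 / (np.arange(T) + 1.0) ** q)
        assert s <= T ** (1 - q) / (1 - q)       # q in (0,1): sum <= T^{1-q}/(1-q)
    s1 = np.sum(1.0 / (np.arange(T) + 1.0))
    assert s1 <= 1 + np.log(T)                   # q = 1: harmonic sum <= 1 + log T
print("Lemma 32.1 upper bounds hold for the tested values")
```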

The second result characterizes the rate of decay of terms of the form appearing as the second term in Proposition 32.1.

Lemma 32.2 Let \(α_t = C/(t+1)^q\) for some \(q \in (0,1)\) and \(β_t\) be a positive sequence such that \(β_t = C'/(t+1)^p\), for some \(p \in (0,1)\). Define \[ A_T = \sum_{t=0}^{T-1} \left[ \prod_{s=t}^{T-1} (1 - 2 ρ α_s + α_s^2(L^2 + σ^2)) \right] α_t β_t \] Let \(T_0\) be such that \[ α_t \in \left( \frac{1}{ρ(t+1)}, \min\left\{ \frac{ρ}{L^2 + σ^2}, \frac{1}{ρ} \right\} \right), \quad t \ge T_0. \] Then, \[\begin{align*} A_T &\le \exp\left( - ρ \sum_{t=T_0}^{T-1} α_t \right) \ABS{A_{T_0} - ρ^{-1} β_{T_0}} + \frac{C'p}{ρT} \bigl[ φ_{p}(T) - φ_{p}(T_0) \bigr] + \frac{C'}{ρ T^p} \\ &\le \exp\left( - ρ \sum_{t=T_0}^{T-1} α_t \right) \ABS{A_{T_0} - ρ^{-1} β_{T_0}} + \frac{C'}{(1-p)\, ρ T^p} \\ &\le \mathcal{O}\left( \frac{1}{T^p} \right). \end{align*}\]

For simplicity of notation, let \(γ_t = 1 - 2 ρ α_t + α_t^2 (L^2 + σ^2)\). Observe that for \(t \ge T_0\), \(γ_t \le (1 - ρ α_t) \le 1\). Therefore, for \(T \ge T_0\), \[\begin{align*} A_{T+1} &= γ_T \sum_{t=0}^{T-1} \left[ \prod_{s=t}^{T-1} γ_s \right] α_t β_t + γ_T α_T β_T = γ_T \bigl( A_T + α_T β_T \bigr) \\ &\le (1 - ρ α_T) A_T + α_T β_T \end{align*}\]

Define an operator \(\BELLMAN_T\) as follows: \[ \BELLMAN_T(A) = (1 - ρ α_T) A + α_T β_T. \] Observe that \[ \BELLMAN_T(A^1) - \BELLMAN_T(A^2) = (1 - ρ α_T)(A^1 - A^2). \] Therefore, for any \(T \ge T_0\), \(\BELLMAN_T\) is a contraction with contraction factor \((1 - ρ α_T) < 1\). Therefore, by the Banach fixed point theorem, \(\BELLMAN_T\) has a unique fixed point. It is easy to verify that this fixed point is \(ρ^{-1} β_T\).

From the mean-value theorem, \(β_{t-1} - β_{t} \le C'p/t^{p+1}\). Therefore, \[\begin{align*} A_T - ρ^{-1} β_T &\le \BELLMAN_{T-1}(A_{T-1}) - \BELLMAN_{T-1}(ρ^{-1} β_{T-1}) \\ &\le (1 - ρ α_{T-1}) \ABS{ A_{T-1} - ρ^{-1} β_{T-1} } \\ &\le (1 - ρ α_{T-1}) \ABS{ A_{T-1} - ρ^{-1} β_{T-2} } + (1 - ρ α_{T-1}) ρ^{-1} \ABS{ β_{T-1} - β_{T-2} } \\ &\le \cdots \\ &\le \prod_{t=T_0}^{T-1}(1- ρ α_t)\ABS{A_{T_0} - ρ^{-1} β_{T_0}} + \sum_{t=T_0 + 1}^{T-1} \left[ \prod_{s=t}^{T-1} (1 - ρ α_s) \right] ρ^{-1} \ABS{β_{t-1} - β_t} \\ &\stackrel{(a)}\le \exp\left( - ρ \sum_{t=T_0}^{T-1} α_t \right) \ABS{A_{T_0} - ρ^{-1} β_{T_0}} + \sum_{t=T_0 + 1}^{T-1} \left[ \prod_{s=t}^{T-1} \left( 1 - \frac{1}{s+1} \right) \right] \frac{C'p}{ρ t^{p+1}} \\ &= \exp\left( - ρ \sum_{t=T_0}^{T-1} α_t \right) \ABS{A_{T_0} - ρ^{-1} β_{T_0}} + \frac{C'p}{ρT} \sum_{t={T_0 + 1}}^{T-1} \frac{1}{t^p} \\ &\le \exp\left( - ρ \sum_{t=T_0}^{T-1} α_t \right) \ABS{A_{T_0} - ρ^{-1} β_{T_0}} + \frac{C'p}{ρT} \bigl[ φ_{p}(T) - φ_{p}(T_0) \bigr] \end{align*}\] where \((a)\) follows from the definition of \(T_0\) (which gives \(1 - ρ α_s \le 1 - \tfrac{1}{s+1}\)) and the inequality \(1 - x \le e^{-x}\). The bound in the lemma then follows since \(A_T \le \ABS{A_T - ρ^{-1} β_T} + ρ^{-1} β_T\) and \(ρ^{-1} β_T \le C'/(ρ T^p)\).
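As a sanity check, one can compute \(A_T\) directly for moderate horizons and verify that \(T^p A_T\) remains bounded; the constants below are illustrative, with \(C < ρ/(L^2 + σ^2)\).

```python
import numpy as np

# Sketch: compute A_T of Lemma 32.2 directly and verify that T^p * A_T stays
# bounded, i.e., A_T = O(1/T^p). All constants are illustrative.

rho, L, sigma = 1.0, 1.0, 0.5
C, q = 0.3, 0.7          # alpha_t = C /(t+1)^q
Cp, p = 1.0, 0.5         # beta_t  = C'/(t+1)^p

def A(T):
    t = np.arange(T)
    alpha = C / (t + 1.0) ** q
    beta = Cp / (t + 1.0) ** p
    gamma = 1 - 2 * rho * alpha + alpha ** 2 * (L ** 2 + sigma ** 2)
    Phi = np.cumprod(gamma[::-1])[::-1]      # Phi[t] = prod_{s=t}^{T-1} gamma_s
    return np.sum(Phi * alpha * beta)

for T in [100, 1_000, 10_000, 100_000]:
    print(T, A(T), A(T) * T ** p)            # the last column remains bounded
```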

We now present the rate of convergence, based on a simplified version of the analysis in Hu and Fu (2025) and Bach and Moulines (2011).

Proposition 32.2 (Rate of convergence) For simplicity, we assume that \(C\) is chosen such that \(C < ρ/(L^2 + σ^2)\) (so that \(α_t < ρ/(L^2 + σ^2)\) for all \(t\)). Then, we have the following:

  • When \(q \in (0,1)\), let \(T_0\) be such that \[ α_t \in \left( \frac{1}{ρ(t+1)}, \frac{1}{ρ} \right), \quad t \ge T_0. \] Then, for all \(T \ge T_0\), we have \[\begin{align*} Δ_T &\le \left[ Δ_0 + σ^2\left| A_{T_0} - \frac{C}{ρ(T_0+1)^q} \right| \right] \exp\left( - \frac{ρC}{1-q} T^{1-q} \right) + \frac{σ^2 C}{(1-q)ρ} \frac{1}{T^q} \\ &\le \mathcal{O}\left(\frac {1}{T^q} \right). \end{align*}\]

  • When \(q = 1\), we have \[ Δ_T \le Δ_0 \frac{1}{T^{ρC}} + σ^2 C^2 \frac{φ_{2-ρC}(T)}{T^{ρC}}. \]

    Note that when \(2 - ρC \in (0,1)\), i.e., \(ρC \in (1,2)\), we have \[ \frac{φ_{2-ρC}(T)}{T^{ρC}} \le \frac{1}{ρC - 1} \cdot \frac{1}{T} \] while when \(2 - ρC = 1\), i.e., \(ρC = 1\), \[ \frac{φ_{2-ρC}(T)}{T^{ρC}} \le \frac{1 + \log T}{T} \] and when \(2 - ρC > 1\), i.e., \(ρC < 1\), we have \[ \frac{φ_{2-ρC}(T)}{T^{ρC}} \le \mathcal{O}\left(\frac{1}{T^{ρC}}\right). \] Thus, the rate of convergence depends on the choice of \(C\). In particular, if \(ρC > 1\), then we get a convergence rate of \(1/T\) (and \((1+\log T)/T\) when \(ρC = 1\)), while if \(ρC < 1\), we get a slower convergence rate of \(1/T^{ρC}\).

Since \(C < ρ/(L^2 + σ^2)\), we have \(α_t < ρ/(L^2 + σ^2)\) for all \(t\). Therefore, from Lemma 32.2, we have the following:

  • When \(q \in (0,1)\), Lemma 32.1 implies that the first term in Proposition 32.1 is bounded as \[ Φ_{0,T} \le \exp( -ρC φ_q(T) ) = \exp\left( - \frac{ρC}{1-q} T^{1-q} \right). \] Furthermore, from Lemma 32.2 with \(β_t = α_t\) (i.e., \(C' = C\) and \(p = q\)), the second term in Proposition 32.1 is bounded as \[ σ^2 \left| A_{T_0} - \frac{C}{ρ(T_0+1)^q} \right| \exp\left( - \frac{ρC}{1-q} T^{1-q} \right) + \frac{σ^2 C}{(1-q)ρ} \frac{1}{T^q}. \]

  • When \(q = 1\), we have \[\begin{align*} Φ_{t,T} &= \prod_{s=t}^{T-1}(1 - 2 α_s ρ + α_s^2 (L^2 + σ^2)) \le \prod_{s=t}^{T-1}(1 - α_s ρ) \le \exp\left(-ρC \sum_{s=t}^{T-1} \frac{1}{s+1} \right) \\ &\le \exp\left( - ρC \log\left(\frac{T}{t+1}\right)\right) = \left( \frac{t+1}{T} \right)^{ρC} \end{align*}\]

    Furthermore, we have \[ \sum_{t=0}^{T-1} α_t^2 Φ_{t,T} \le \frac{C^2}{T^{ρC}}\sum_{t=0}^{T-1} \frac{1}{(t+1)^{2 - ρC}} \le \frac{C^2}{T^{ρC}} φ_{2 - ρC}(T) \]

    Substituting these bounds into the two terms of Proposition 32.1 gives the result; a numerical illustration of the \(q = 1\) case follows below.
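To see how the guaranteed rate in the \(q = 1\) case depends on the constant \(C\), one can evaluate the bound \(Δ_0/T^{ρC} + σ^2 C^2 φ_{2-ρC}(T)/T^{ρC}\) for a few values of \(ρC\). The values of \(Δ_0\), \(σ\), and \(ρ\) below are illustrative; the code evaluates the bound, not the actual error.

```python
import numpy as np

# Sketch: evaluate the q = 1 bound of Proposition 32.2 for several values of
# rho*C. The constants Delta_0, sigma, rho are illustrative.

Delta0, sigma, rho = 4.0, 0.5, 1.0

def phi(q, T):
    # phi_q(T) from Lemma 32.1
    if q < 1:
        return T ** (1 - q) / (1 - q)
    if q == 1:
        return 1 + np.log(T)
    return np.sum(1.0 / np.arange(1, T + 1) ** q)   # finite for q > 1

def q1_bound(C, T):
    return Delta0 / T ** (rho * C) + sigma ** 2 * C ** 2 * phi(2 - rho * C, T) / T ** (rho * C)

for C in [0.5, 1.0, 1.5]:
    print(C, [q1_bound(C, T) for T in (10**3, 10**4, 10**5)])
# rho*C = 1.5: the bound shrinks ~10x per decade (rate 1/T);
# rho*C = 0.5: it shrinks only ~10^0.5 per decade (rate 1/T^{rho*C}).
```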

Remarks
  • In the analysis above, we started with the assumption that \(C\) is small enough that \(α_t < ρ/(L^2 + σ^2)\) for all \(t\). If \(C\) is larger, then from time \(t=0\) up to \(τ = \inf\{ t : α_t < ρ/(L^2 + σ^2) \}\), the error may grow polynomially. After this initial growth period, it starts decaying at a rate that depends on \(ρC\).

  • The result of Proposition 32.2 characterizes the rate of convergence in the mean-squared sense. From Jensen’s inequality, we get \[ \EXP[ \NORM{θ_T - θ^*}_2 ] \le \sqrt{ \EXP[ \NORM{θ_T - θ^*}_2^2 ] } \le \begin{dcases} \mathcal{O}\left(\dfrac{1}{T^{q/2}}\right), & q \in (0,1) \\ \mathcal{O}\left(\dfrac{1}{\sqrt{T}}\right), & q = 1 \text{ and } ρC > 1 \end{dcases} \]

32.2 Using averaging to improve the rate of convergence

It is possible to improve the rate of convergence by averaging. The standard method to do so is called Polyak-Ruppert averaging, which takes the Cesàro average of the iterates \[ \hat Θ_T = \frac{1}{T+1} \sum_{t=0}^{T} θ_t. \]
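The following is a minimal sketch of how the averaged iterate can be maintained alongside the basic recursion; the mean field \(f(θ) = -(θ - θ^*)\), the noise model, and the constants are illustrative, and this is not a substitute for the analysis of the averaged iterate.

```python
import numpy as np

# Sketch of Polyak-Ruppert averaging on top of the basic recursion.
# The mean field, constants, and noise model are illustrative.

rng = np.random.default_rng(0)
theta_star, sigma, C, q, T = 0.0, 0.5, 0.5, 0.7, 100_000

theta = 1.0
theta_bar = theta                                   # hat_theta_0 = theta_0
for t in range(T):
    alpha = C / (t + 1) ** q
    xi = sigma * rng.standard_normal()
    theta = theta + alpha * (-(theta - theta_star) + xi)
    theta_bar += (theta - theta_bar) / (t + 2)      # running average of theta_0, ..., theta_{t+1}

print(abs(theta - theta_star), abs(theta_bar - theta_star))
# the averaged iterate is typically closer to theta* than the last iterate
```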

To be completed

32.3 Rates of convergence when the system is not exponentially stable

We now consider the setting where we do not have a Lyapunov function certifying exponential stability. Instead, we assume that the following relaxed condition holds:

V3’. There exist \(ρ > 0\) and \(c > 0\) such that for any \(θ \in \reals^d\), we have \[ \dot V(θ) \coloneqq \IP{\GRAD V(θ)}{f(θ)} \le c - 2ρ V(θ). \]

In this setting, an alternative characterization of the rate of convergence (for stochastic gradient descent) was provided in Karimi et al. (2019), which leverages the idea of random stopping due to Nemirovski et al. (2009).

Proposition 32.3 Suppose we run stochastic approximation for a random time \(τ \in \{0, 1, \dots, T\}\), where \[ \PR(τ = t) = \frac{α_t}{α_0 + \cdots + α_T}. \]

Let \(Δ_t = \EXP[ \NORM{θ_t - θ^*}^2_2 ]\). Under assumptions (F1), (F2), (N1), (N2), (R0), and \(V(θ) = \NORM{θ - θ^*}^2\) satisfying (V3’), and \(α_t < ρ/(L^2 + σ^2)\), we have \[ \EXP[ Δ_{τ} ] = \dfrac{ \sum_{t=0}^{T} α_t Δ_t } {\sum_{t=0}^T α_t } \le \dfrac{ Δ_0 - Δ_{T+1} + σ^2 \sum_{t=0}^{T} α_t^2 } {ρ \sum_{t=0}^T α_t } + \frac{c}{ρ}. \]

Taking \(α_t = C/(t+1)^{1/2}\), where \(C < ρ/(L^2 + σ^2)\), the right hand side is bounded by \[ \frac{(Δ_0 - Δ_{T+1}) + σ^2 C^2 \bigl(1 + \log (T+1)\bigr)}{ρ C\sqrt{T}} + \frac{c}{ρ} \le \mathcal{O}\left(\frac{\log T}{\sqrt{T}}\right) + \frac{c}{ρ}. \]

As argued in the proof of Proposition 32.1, under the stated assumptions we have \[ \EXP[\NORM{θ_{t+1}-\theta^*}_2^2] \le \bigl[1 - 2 α_t ρ + α_t^2 (L^2 + σ^2)\bigr] \EXP[\NORM{θ_t - θ^*}_2^2] + \textcolor{red}{α_t c} + α_t^2 σ^2 \] Furthermore, since \(α_t < ρ/(L^2 + σ^2)\), we have \(1 - 2 α_t ρ + α_t^2 (L^2 + σ^2) < 1 - α_t ρ\). Thus, \[ \EXP[\NORM{θ_{t+1}-\theta^*}_2^2] \le \bigl[1 - α_t ρ \bigr] \EXP[\NORM{θ_t - θ^*}_2^2] + \textcolor{red}{α_t c} + α_t^2 σ^2 \] Then, by rearranging terms, we can write the above equation as \[ α_t ρ Δ_t \le Δ_t - Δ_{t+1} + \textcolor{red}{α_t c} + α_t^2 σ^2. \] Summing from \(t=0\) to \(t=T\), and dividing by \(ρ \sum_{t=0}^T α_t\), we get the first bound.

The second bound follows immediately from Lemma 32.1.
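The random-stopping device is straightforward to implement. The following is a minimal sketch: draw \(τ\) with \(\PR(τ = t) \propto α_t\), run the recursion up to time \(τ\), and return \(θ_τ\). The mean field and all constants are illustrative.

```python
import numpy as np

# Sketch of the random-stopping device in Proposition 32.3. The mean field
# f(theta) = -(theta - theta*) and the constants are illustrative.

rng = np.random.default_rng(0)
theta_star, sigma, T, C = 0.0, 0.5, 10_000, 0.3

alpha = C / np.sqrt(np.arange(T + 1) + 1.0)          # alpha_t = C/(t+1)^{1/2}, t = 0, ..., T
tau = int(rng.choice(T + 1, p=alpha / alpha.sum()))  # P(tau = t) = alpha_t / (alpha_0 + ... + alpha_T)

theta = 1.0
for t in range(tau):                                 # iterate only up to the sampled stopping time
    xi = sigma * rng.standard_normal()
    theta = theta + alpha[t] * (-(theta - theta_star) + xi)

print(tau, theta)                                    # tau and the randomly stopped iterate theta_tau
```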

Remarks
  • The convergence rate of \(\mathcal{O}(1/\sqrt{T})\) in Proposition 32.3 is worse than the best rate of \(\mathcal{O}(1/T)\) obtained in Proposition 32.2. However, the key advantage is that it is guaranteed without the exponential stability assumption (V3). When used with projection-based stochastic approximation, one can bound \(Δ_0 - Δ_{T+1}\) in terms of the diameter of the set onto which the iterates are projected.

  • The analysis also works for constant step sizes. In particular, if we pick \(α_t = C/\sqrt{\textcolor{red}{T}}\) (note that \(α_t\) does not depend on \(t\)), then the right hand side of the bound is \[ \frac{(Δ_0 - Δ_{T+1}) + σ^2 C^2}{ρ C\sqrt{T}} + \frac{c}{ρ}. \]

Exercises

Notes

Lemma 32.2 is from Hu et al. (2025). The first part of Proposition 32.2 is adapted from Hu and Fu (2025), where it was stated for stochastic gradient descent with biased gradients. Similar analysis is presented in Bach and Moulines (2011), with a more nuanced characterization of the “growth term” \(A_{T_0}\). The second part of Proposition 32.2 is adapted from Bach and Moulines (2011), where it was stated for stochastic gradient descent. We have simplified the analysis by assuming that \(α_t < ρ/(L^2 + σ^2)\). Bach and Moulines (2011) do not make this assumption and also characterize the rate of expansion until the time when \(α_t\) becomes less than \(ρ/(L^2 + σ^2)\).

Bach and Moulines (2011) show that for \(q \in (1/2, 1)\), the rate of convergence improves when using Polyak-Ruppert averaging.

The idea of running stochastic approximation for a random amount of time is based on Nemirovski et al. (2009).