31 Stochastic approximation
Suppose \(f \colon \reals^d \to \reals^d\) and it is desired to find a solution \(θ^*\) to the equation \(f(θ) = 0\). There are many methods for determining the value of \(θ^*\) by successive approximation, where we start with an initial guess \(θ_0\) and then recursively obtain a new value \(θ_{t+1}\) as a function of the previously obtained \(θ_0, \dots, θ_{t}\), the values \(f(θ_0), \dots, f(θ_{t})\), and possibly those of the derivatives \(f'(θ_0), \dots, f'(θ_{t})\), etc. If \[ \lim_{t \to ∞} θ_t = θ^*, \] irrespective of the initial condition \(θ_0\), then the successive approximation method is effective.
In many applications, the function \(f\) may be unknown, so it is not possible to compute \(f(θ)\) exactly; however, it may be possible to conduct an experiment that returns the value of \(f(θ)\) corrupted by noise. Stochastic approximation refers to recursive algorithms of the form \[ \begin{equation} \label{eq:SA} θ_{t+1} = θ_t + α_t[ f(θ_t) + ξ_{t+1} ], \quad t \ge 0 \end{equation} \] where \(\{α_t\}_{t \ge 0}\) is a sequence of positive numbers and \(\{ξ_t\}_{t \ge 0}\) is a noise sequence.
We are interested in the limit behavior of the sequence \(\{θ_t\}_{t \ge 0}\).
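To make the recursion \eqref{eq:SA} concrete, here is a minimal simulation sketch (not from the source); the function name sa_iterate, the step size \(α_t = 1/(t+1)\), and the Gaussian noise are illustrative choices.

using Random

# A minimal sketch of the recursion θ_{t+1} = θ_t + α_t (f(θ_t) + ξ_{t+1}).
# The step size α_t = 1/(t+1) and the Gaussian noise are illustrative choices.
function sa_iterate(f, θ0; T = 10_000, σ = 0.1)
    θ = copy(θ0)
    for t in 1:T
        ξ = σ * randn(length(θ))           # noise ξ_{t+1}
        θ = θ + (1/(t+1)) * (f(θ) + ξ)     # stochastic approximation update
    end
    return θ
end

Random.seed!(1)
println(sa_iterate(θ -> -θ, [1.0, -2.0]))  # close to [0, 0], the root of f(θ) = -θ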
Example 31.1 Consider an initially empty urn to which black or red balls are added one at a time. Let \(r_t\) denote the number of red balls at time \(t\) and \(θ_t = r_t/t\) denote the fraction of red balls at time \(t\). Suppose that \[ \PR(\text{next ball is red} \mid \text{all past}) = p(\text{fraction of red balls}) \] where \(p \colon [0,1] \to [0,1]\) is prespecified.
In this model, the sequence \(\{r_t\}_{t \ge 1}\) follows the recursion \[ r_{t+1} = r_t + w_{t+1} \] where \(w_{t+1} = \IND\{(t+1)\text{-st ball is red}\}\). Since \(θ_{t+1} = r_{t+1}/(t+1) = (t θ_t + w_{t+1})/(t+1)\), we obtain \[ θ_{t+1} = θ_t + \frac{1}{t+1}( w_{t+1} - θ_t ) \] with \(θ_0 = 0\). We can rewrite the above equation as \[ θ_{t+1} = θ_t + \frac{1}{t+1}\bigl[ (p(θ_t) - θ_t ) + (w_{t+1} - p(θ_t)) \bigr]. \] Define \(ξ_{t+1} = w_{t+1} - p(θ_t)\). Note that \(\{ξ_t\}_{t \ge 1}\) is a martingale difference sequence, i.e., \(\EXP[ ξ_{t+1} \mid θ_0, w_{1:t} ] = 0\). Thus, the above equation is of the form \eqref{eq:SA} with \(α_t = 1/(t+1)\) and \(f(θ_t) = p(θ_t) - θ_t\).
The key idea behind stochastic approximation is that under appropriate conditions, the iteration \eqref{eq:SA} almost surely converges to an equilibrium point of the ODE \[ \begin{equation} \label{eq:ODE} \dot θ(t) = f(θ(t)) \end{equation} \] with initial condition \(θ(0) = θ_0\). For instance, for Example 31.1, this means that under appropriate conditions, the discrete-time iterates \(\{θ_t\}_{t \ge 0}\) track the solution of the ODE \eqref{eq:ODE}. In particular, they converge to the equilibrium set \(H = \{ θ : p(θ) = θ \}\).
Suppose that \(p(θ)\) is such that there exists a \(θ_\circ\) such that \(p(θ) > θ\) for \(θ \in (θ_\circ, 1)\) and \(p(θ) < θ\) for \(θ \in (0, θ_\circ)\). See Figure 31.1, for example. Then, the set of equilibrium points is \(H = \{0, θ_\circ, 1\}\). Of these, \(0\) and \(1\) are stable and \(θ_\circ\) is unstable. Stochastic approximation theory shows that the iterates \(\eqref{eq:SA}\) will converge to either \(0\) or \(1\). Thus, along each sample path, the iterates \(\{θ_t\}_{t \ge 1}\) get ‘locked into’ one color, which will eventually dominate.
Multiple runs starting with different initial conditions are shown in Figure 31.2. Note that the discrete-time iterates are stochastic and oscillate, as expected. However, they eventually converge to the stable equilibrium points of the ODE \(\eqref{eq:ODE}\).
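The simulation below needs a concrete choice of \(p(\cdot)\). The exact function used to generate Figure 31.1 and Figure 31.2 is not reproduced here, so the smoothstep function below is an assumed stand-in with the shape described above (equilibria at \(0\), \(θ_\circ = 1/2\), and \(1\)).

# Assumed stand-in for p(⋅): p(θ) < θ on (0, 1/2) and p(θ) > θ on (1/2, 1),
# so the equilibria of the ODE θ̇ = p(θ) - θ are {0, 1/2, 1}, with 0 and 1 stable.
p(θ) = 3θ^2 - 2θ^3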
using Random, Distributions   # Distributions provides Bernoulli

Random.seed!(42)

T = 10   # Number of time steps
M = 25   # Number of different runs

# Step sizes α_t = 1/(t+1)
α = zeros(T)
for t in 1:T
    α[t] = 1/(t+1)
end

# Each column of θ is one run of the urn recursion, started from a different initial condition
θ = zeros(T, M)
for m in 1:M
    θ[1,m] = mod(m/M + rand()/M, 1)
    for t in 1:T-1
        θ[t+1,m] = θ[t,m] + α[t]*(rand(Bernoulli(p(θ[t,m]))) - θ[t,m])
    end
end
We now summarize sufficient conditions for convergence.
31.1 List of assumptions
Let \(\mathcal F_t = σ(θ_{1:t}, ξ_{1:t}, α_{1:t})\). We state the following set of assumptions but note that not every assumption is needed for every result.
31.1.1 Assumptions on the function \(f\)
F1. \(θ^*\) is a solution of the equation \(f(θ) = 0\).
F1’. \(θ^*\) is the unique solution of the equation \(f(θ) = 0\).
F2. The function \(f\) is globally Lipschitz-continuous with constant \(L\), i.e., for any \(θ_1, θ_2 \in \reals^d\), \[ \| f(θ_1) - f(θ_2) \|_{2} \le L \| θ_1 - θ_2 \|_2. \]

F2’. The function \(f\) is twice differentiable and globally Lipschitz continuous with constant \(L\).
Assumption (F2) implies that for each \(θ \in \reals^d\), there is a unique function \(s(\cdot, θ)\) that satisfies the ODE \[ \frac{ds(t,θ)}{dt} = f(s(t,θ)), \quad s(0,θ) = θ. \]
F3. The equilibrium \(θ^*\) of the ODE \(\dot θ = f(θ)\) is globally asymptotically stable.
F3’. The equilibrium \(θ^*\) of the ODE \(\dot θ = f(θ)\) is globally exponentially stable. Thus, there exist constants \(μ \ge 1\) and \(γ > 0\) such that \[ \| s(t,θ) - θ^*\|_2 \le μ\|θ - θ^*\|_2 \exp(-γ t), \quad \forall t \ge 0, \forall θ \in \reals^d. \]
F4. There is a finite constant \(K\) such that \[ \| \nabla^2 f_i(θ) \|_{S} \cdot \| θ - θ^*\|_2 \le K, \quad \forall i \in \{1, \dots, d\}, \forall θ \in \reals^d, \] where \(\|\cdot\|_S\) denotes the spectral norm of a matrix (i.e., the largest singular value).
Assumption (F4) implies that \[ \left| \frac{∂^2 f_i(θ)}{∂θ_j ∂θ_k}\right| \cdot \| θ - θ^*\|_2 \le K, \quad \forall i, j, k \in \{1,\dots, d\}, \forall θ \in \reals^d. \]
31.1.2 Conditions on the noise
N1. \(\{ξ_t\}_{t \ge 1}\) is a martingale difference sequence with respect to \(\{ \mathcal F_t\}_{t \ge 1}\), i.e., \[ \EXP[ ξ_{t+1} \mid \mathcal F_t ] = 0, \text{ a.s.}, \quad \forall t \ge 1. \]
N2. The noise \(\{ξ_t\}_{t \ge 1}\) satisfies \[ \EXP[ \| ξ_{t+1} \|_2^2 \mid \mathcal F_t ] \le σ^2( 1 + \| θ_t - θ^*\|_{2}^2), \quad \text{a.s. } \forall t \ge 1 \] for some finite constant \(σ^2\).
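As an illustration, the urn noise of Example 31.1 satisfies both conditions: since \(w_{t+1} \in \{0,1\}\) and \(p(θ_t) \in [0,1]\), we have \(|ξ_{t+1}| \le 1\), and hence \[ \EXP[ ξ_{t+1} \mid \mathcal F_t ] = \EXP[ w_{t+1} \mid \mathcal F_t ] - p(θ_t) = 0 \quad \text{and} \quad \EXP[ \| ξ_{t+1} \|_2^2 \mid \mathcal F_t ] \le 1 \le σ^2( 1 + \| θ_t - θ^* \|_2^2 ) \] with \(σ^2 = 1\).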
31.1.3 Conditions on the learning rate
R1. \(\sum_{t \ge 1} α_t^2 < ∞\).
R2. \(\sum_{t \ge 1} α_t = ∞\).
R3. There exist constants \(\underline α, \bar α \in (0,1)\) such that \(\underline α \le α_t \le \bar α\) for all \(t \ge 1\).
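For example, the step size \(α_t = 1/(t+1)\) used in Example 31.1 satisfies (R1) and (R2), since \(\sum_{t \ge 1} 1/(t+1)^2 < ∞\) while \(\sum_{t \ge 1} 1/(t+1) = ∞\); a constant step size \(α_t \equiv α \in (0,1)\) satisfies (R2) and (R3) but not (R1).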
31.2 Gladyshev’s result
The following is a restatement of the result of Gladyshev (1965).
Theorem 31.1 Suppose assumptions (F1’), (N1), and (N2) hold. In addition, the function \(f(\cdot)\) is passive, i.e., for each \(0 < ε < M < ∞\), \[ \sup_{ε < \| θ - θ^*\|_2 < M} \langle θ - θ^*, f(θ) \rangle < 0,\] and there is a finite constant \(K\) such that \[\|f(θ)\|_2 \le K \|θ - θ^*\|_2, \quad \forall θ \in \reals^d.\] Then,
- If (R1) holds, then \(\{θ_t\}\) is bounded almost surely.
- In addition, if (R2) holds, then \(θ_t \to θ^*\) almost surely as \(t \to ∞\).
The second condition, \(\|f(θ)\|_2 \le K \| θ - θ^*\|_2\), implies that \(f(⋅)\) is continuous at \(θ^*\), but it need not be continuous anywhere else.
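As a simple illustration of passivity (not part of the original text), consider \(f(θ) = θ^* - θ\). Then \(\langle θ - θ^*, f(θ) \rangle = -\|θ - θ^*\|_2^2\), whose supremum over any annulus \(ε < \|θ - θ^*\|_2 < M\) is \(-ε^2 < 0\), and \(\|f(θ)\|_2 = \|θ - θ^*\|_2\), so the growth condition holds with \(K = 1\). Hence Theorem 31.1 applies to this \(f\).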
31.3 Borkar-Meyn’s result
The following is a restatement of the result of Borkar and Meyn (2000).
Theorem 31.2 Suppose assumptions (F2), (N1), and (N2) hold. In addition:
There exists a limit function \(f_{∞} \colon \reals^d \to \reals^d\) such that \[ \lim_{r \to ∞} \frac{f(r θ)}{r} = f_{∞}(θ), \quad \forall θ \in \reals^d. \]
The origin is a globally asymptotically stable equilibrium of the ODE \[ \dot θ(t) = f_{∞}(θ(t)). \]
Then,
If (R1) and (R2) hold, then \(\{θ_t\}_{t \ge 1}\) is bounded almost surely.
In addition, if (F3) holds, then \(θ_t \to θ^*\) almost surely as \(t \to ∞\).
If (F3) and (R3) hold, then:
there exist \(α^* > 0\) and \(C_1 < ∞\) such that if \(\bar α \le α^*\), then \[ \limsup_{t \to ∞} \EXP[ \| θ_t\|^2 ] \le C_1. \]
if \(\bar α \le α^*\), then \(θ_t \to θ^*\) in probability. In particular, for any \(ε > 0\), there exists a \(b_1 = b_1(ε) < ∞\) such that \[ \limsup_{t \to ∞} \PR( \| θ_t - θ^* \| \ge ε ) \le b_1 \bar α. \]
In addition, if (F3’) holds, then \(θ_t \to θ^*\) in mean square. In particular, there exists a \(b_2 < ∞\) such that for any initial condition \(θ_0 \in \reals^d\), \[ \limsup_{t \to ∞} \EXP[ \| θ_t - θ^*\|^2 ] \le b_2 \bar α. \]
Borkar and Meyn (2000) also provide rates of convergence of stochastic approximation under stronger assumptions.
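As an illustration of the scaling condition, consider an affine function \(f(θ) = A(θ - θ^*) = Aθ + b\) with \(b = -Aθ^*\). Then \[ f_{∞}(θ) = \lim_{r \to ∞} \frac{A(rθ) + b}{r} = Aθ, \] so the limiting ODE \(\dot θ = f_{∞}(θ)\) is linear, and its origin is globally asymptotically stable exactly when \(A\) is Hurwitz (all eigenvalues have negative real part); in that case (F3) also holds and Theorem 31.2 is applicable.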
31.4 Vidyasagar’s result
The following is a restatement of the result of Vidyasagar (2023).
Consider a continuous function \(f \colon \reals_{\ge 0} \to \reals_{\ge 0}\).
- The function \(f\) is said to belong to class \(\ALPHABET K\) if \(f(0) = 0\) and \(f(\cdot)\) is strictly increasing.
- The function \(f \in \ALPHABET K\) is said to belong to class \(\ALPHABET K \ALPHABET R\) if, in addition, \(f(r) \to ∞\) as \(r \to ∞\).
- The function \(f\) is said to belong to class \(\ALPHABET B\) if \(f(0) = 0\) and, in addition for all \(0 < ε < M < ∞\), we have \[ \inf_{ε \le r \le M} f(r) > 0. \]
Note The notation of function class \(\ALPHABET B\) clashes with that of the Bellman operator. I hope that the distinction will be clear from context.
Example 31.2 Observe that every function \(f\) of class \(\ALPHABET K\) also belongs to class \(\ALPHABET B\), but the converse is not true. For example, let \[ f(r) = \begin{cases} r, & \text{if } r \in [0,1], \\ e^{-(r-1)}, & \text{if } r > 1. \end{cases} \] Then, \(f\) belongs to class \(\ALPHABET B\) because, for any \(0 < ε < M < ∞\), \(\inf_{ε \le r \le M} f(r) \ge \min\{ ε, e^{-(M-1)} \} > 0\).
However, since \(f(r) \to 0\) as \(r \to ∞\), \(f\) cannot be bounded below by any function of class \(\ALPHABET K\).
Theorem 31.3 Suppose assumptions (F1), (F2), (N1), and (N2) hold. In addition, suppose that there exists a twice differentiable Lyapunov function \(V \colon \reals^d \to \reals_{\ge 0}\) that satisfies the following conditions:
There exist constants \(a, b > 0\) such that \[\begin{equation}\label{eq:vidyasagar-cond-1} a \| θ - θ^*\|_2^2 \le V(θ) \le b \| θ - θ^* \|_2^2, \quad \forall θ \in \reals^d. \end{equation}\]
There is a finite constant \(M\) such that \[\begin{equation}\label{eq:vidyasagar-cond-2} \| \GRAD^2 V(θ) \|_S \le 2M, \quad \forall θ \in \reals^d. \end{equation}\] Then,
If \(\dot V(θ) \coloneqq \langle \GRAD V(θ), f(θ) \rangle \le 0\) for all \(θ \in \reals^d\) and (R1) holds, then the iterates \(\{θ_t\}_{t \ge 1}\) are bounded almost surely.
If, in addition, (R2) holds and there exists a function \(\phi \in \ALPHABET B\) such that \[\begin{equation}\label{eq:vidyasagar-cond-3} \dot V(θ) \le - \phi(\| θ - θ^*\|_2), \quad \forall θ \in \reals^d, \end{equation}\] then \(θ_t \to θ^*\) almost surely as \(t \to ∞\).
Consider the ODE \[ \dot \theta = f(\theta), \quad \theta \in \reals^d. \] Consider a continuously differentiable function \(V \colon \reals^d \to \reals_{\ge 0}\) and let \(\GRAD V\) denote its gradient. Then, the time-derivative of \(V\) along the trajectories of the ODE is given by \[ \dot V(\theta) = \GRAD V(\theta) \cdot \dot \theta = \GRAD V(\theta) \cdot f(\theta), \] where the first equality follows from the chain rule. Thus, the conditions of Theorem 31.3 assert that there exists a Lyapunov function for the ODE (even though the convergence proof itself does not use any ODE analysis!)
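For example, for \(V(θ) = \NORM{θ - θ^*}_2^2\), we have \(\GRAD V(θ) = 2(θ - θ^*)\) and therefore \[ \dot V(θ) = 2 \langle θ - θ^*, f(θ) \rangle. \] This is the choice of Lyapunov function used in Section 31.5 below.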
Note that the typical conditions of Lyapunov stability assert that if there exist a Lyapunov function \(V \colon \reals^d \to \reals\) and functions \(η_1, η_2 \in \ALPHABET K \ALPHABET R\), \(\textcolor{red}{\phi \in \ALPHABET K}\) such that \[\begin{align*} η_1(\NORM{θ - θ^*}_2) &\le V(θ) \le η_2(\NORM{θ-θ^*}_2), \quad &&\forall θ \in \reals^d, \\ \dot V(θ) &\le - \phi(\NORM{θ -θ^*}_2), \quad &&\forall θ \in \reals^d, \end{align*}\] then \(θ^*\) is a globally asymptotically stable equilibrium of the ODE \(\dot θ = f(θ)\). It is shown in (Vidyasagar 2023, Theorem 4) that this condition can be weakened to \(\phi \in \ALPHABET B\). Thus, the conditions of Theorem 31.3 imply (F3).
It is worthwhile to compare the conditions of Theorem 31.2 and Theorem 31.3.
In Theorem 31.2, it is assumed that (F1’) holds, while Theorem 31.3 only assumes (F1). That is, Theorem 31.3 does not require \(θ^*\) to be the unique solution of \(f(θ) = 0\).
The assumptions on \(\dot V\) in part 1 of Theorem 31.3 imply only that \(θ^*\) is a locally stable equilibrium of the ODE \eqref{eq:ODE}. This is in contrast to Theorem 31.2, which requires \(θ^*\) to be a globally asymptotically stable equilibrium.
As an illustration, consider \(f \colon \reals \to \reals\) given by \[ f(θ) = \begin{cases} -1 + \sin(θ + π/2) = \cos(θ) - 1, & θ \ge 0, \\ -f(-θ), & θ < 0. \end{cases} \] The roots of \(f(θ) = 0\) are \(θ \in \{ 2 πn : n \in \integers \}\). Suppose \(θ^* = 0\) is the solution of interest. Since \(f(θ) = 0\) has multiple solutions, \(θ^* = 0\) cannot be globally asymptotically stable, so (F3) does not hold. More importantly, since \(f\) is bounded, the limit function is \(f_{∞} ≡ 0\) because \[ f_{∞}(θ) = \lim_{r \to ∞} \frac{f(r θ)}{r} = 0. \] So, the origin is not a globally asymptotically stable equilibrium of the ODE \(\dot θ = f_{\infty}(θ)\), and therefore the results of Theorem 31.2 are not applicable. Nonetheless, it is easy to see that the first part of Theorem 31.3 is applicable.
In particular, consider the Lyapunov function \(V(θ) = θ^2\). Then, \(\dot V(θ) = 2 θ f(θ) \le 0\) for all \(θ\): for \(θ \ge 0\), \(f(θ) = \cos(θ) - 1 \le 0\), and for \(θ < 0\), \(f(θ) = 1 - \cos(θ) \ge 0\). Therefore, the assumptions of part 1 of Theorem 31.3 are satisfied. Consequently, whenever (R1) is satisfied, \(\{θ_t\}_{t \ge 1}\) is almost surely bounded.
However, note that we cannot verify \(\eqref{eq:vidyasagar-cond-3}\): there is no class-\(\ALPHABET B\) function \(\phi\) with \(\dot V(θ) \le -\phi(|θ|)\) because \(\dot V\) vanishes at every root \(2πn\). Therefore, we cannot argue that \(\theta_t \to \theta^*\) almost surely. This is not surprising: since \(f(θ) = 0\) has multiple solutions, the iterates may converge to any one of them, not necessarily to \(θ^* = 0\).
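A minimal numerical sketch of this example (my own; the step size, noise level, and initial condition are assumed choices): the iterates remain bounded and settle near one of the roots \(2πn\), which one depending on the noise realization and the initial condition.

using Random

# f from the example above: f(θ) = cos(θ) - 1 for θ ≥ 0, extended as an odd function.
f(θ) = θ ≥ 0 ? cos(θ) - 1 : 1 - cos(θ)

# Stochastic approximation recursion with assumed step size α_t = 1/(t+1) and Gaussian noise.
function run_example(θ0; T = 10_000, σ = 0.5)
    θ = θ0
    for t in 1:T
        θ += (1/(t+1)) * (f(θ) + σ * randn())
    end
    return θ
end

Random.seed!(7)
println(run_example(1.5))   # stays bounded; compare with the roots 2πn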
The assumptions in part 2 of Theorem 31.3 ensure that \(θ^*\) is a globally asymptotically stable equilibrium of the ODE \eqref{eq:ODE}. Therefore, assumption (F1’) is implicit in the second part of Theorem 31.3.
Proof of Theorem 31.3. We start by establishing a bound on \(\EXP[V(θ_{t+1}) \mid \ALPHABET F_t]\). To do so, observe that by Taylor series, we have \[ V(θ + η) = V(θ) + \langle \GRAD V(θ), η \rangle + \frac 12 \langle η, \GRAD^2 V(θ + λη)η \rangle \] for some \(λ \in [0,1]\). Since \(\NORM{\GRAD^2 V(θ+λη)}_S \le 2M\), it follows that \[ V(θ + η) \le V(θ) + \langle \GRAD V(θ), η \rangle + M \NORM{η}_2^2. \] Now apply the above bound with \(θ = θ_t\) and \(η = θ_{t+1} - θ_t = α_t f(θ_t) + α_t ξ_{t+1}\). This gives \[\begin{align*} V(θ_{t+1}) &\le V(θ_t) + α_t \langle \GRAD V(θ_t), f(θ_t) \rangle + α_t \langle \GRAD V(θ_t), ξ_{t+1} \rangle \notag \\ &\quad + α_t^2 M \bigl[ \NORM{f(θ_t)}_2^2 + \NORM{ξ_{t+1}}_2^2 + 2 \langle f(θ_t), ξ_{t+1} \rangle \bigr]. \end{align*}\] Recall that \(\langle \GRAD V(θ), f(θ) \rangle \eqqcolon \dot V(θ)\). Taking conditional expectations and using assumptions (N1) and (N2) (the terms linear in \(ξ_{t+1}\) vanish by (N1) and the quadratic term is bounded by (N2)), we get \[\begin{equation*} \EXP[V(θ_{t+1}) \mid \ALPHABET F_t] \le V(θ_t) + α_t \dot V(θ_t) + α_t^2 M \bigl[ \NORM{f(θ_t)}_2^2 + σ^2(1 + \NORM{θ_t - θ^*}_2^2) \bigr]. \end{equation*}\] Assumptions (F1) and (F2) imply that \[ \NORM{f(θ_t)}_2^2 = \NORM{f(θ_t) - f(θ^*)}_2^2 \le L^2 \NORM{θ_t - θ^*}_2^2. \] Substituting in the above bound, we get: \[\begin{equation}\label{eq:vidyasagar-1-pf-step-1} \EXP[V(θ_{t+1}) \mid \ALPHABET F_t] \le V(θ_t) + α_t \dot V(θ_t) + α_t^2 M \bigl[ σ^2 + (σ^2 + L^2)\NORM{θ_t - θ^*}_2^2 \bigr]. \end{equation}\]
Proof of part 1.
Under the stated assumptions, we can simplify \eqref{eq:vidyasagar-1-pf-step-1} as \[\begin{align*} \EXP[V(θ_{t+1}) \mid \ALPHABET F_t] &\stackrel{(a)}\le V(θ_t) + α_t^2 M \bigl[ σ^2 + (σ^2 + L^2)\NORM{θ_t - θ^*}_2^2 \bigr] \notag \\ &\stackrel{(b)}\le \biggl[ 1 + \frac{α_t^2 M}{a} (L^2 + σ^2) \biggr] V(θ_t) + α_t^2 M σ^2 \end{align*}\] where \((a)\) follows from the assumption that \(\dot V(θ) \le 0\) and \((b)\) follows from \(V(θ) \ge a \NORM{θ - θ^*}_2^2\).
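For reference, Theorem 38.2 (an “almost supermartingale” convergence result, a form of the Robbins–Siegmund lemma) is used here in the following form: if \(\{X_t\}\), \(\{β_t\}\), \(\{Y_t\}\), \(\{Z_t\}\) are non-negative \(\ALPHABET F_t\)-adapted processes such that \[ \EXP[ X_{t+1} \mid \ALPHABET F_t ] \le (1 + β_t) X_t + Y_t - Z_t, \quad \forall t \ge 1, \] and \(\sum_{t \ge 1} β_t < ∞\) and \(\sum_{t \ge 1} Y_t < ∞\) almost surely, then \(\lim_{t \to ∞} X_t\) exists and is finite almost surely, and \(\sum_{t \ge 1} Z_t < ∞\) almost surely.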
Thus, \(\{V(θ_t)\}_{t \ge 1}\) is an “almost” supermartingale. Apply Theorem 38.2 with \(X_t = V(θ_t)\), \(β_t = α_t^2 M(L^2 + σ^2)/a\), \(Y_t = α_t^2 M σ^2\), and \(Z_t = 0\). By (R1), \(\sum_{t \ge 1} β_t < ∞\) and \(\sum_{t \ge 1} Y_t < ∞\), so \(\lim_{t \to ∞} V(θ_t)\) exists almost surely and is finite. Condition \eqref{eq:vidyasagar-cond-1} then implies that \(\{θ_t\}_{t \ge 1}\) is almost surely bounded.
Proof of part 2.
Under the stated assumptions, we can simplify \eqref{eq:vidyasagar-1-pf-step-1} as \[\begin{align*} \EXP[V(θ_{t+1}) \mid \ALPHABET F_t] &\stackrel{(c)}\le \biggl[ 1 + \frac{α_t^2 M}{a} (L^2 + σ^2) \biggr] V(θ_t) + α_t^2 M σ^2 - α_t \phi(\NORM{θ_t - θ^*}_2) \end{align*}\] where the first two terms are simplified in the same way as above and the last term comes from the bound \(\dot V(θ_t) \le -\phi(\NORM{θ_t - θ^*}_2)\) in \eqref{eq:vidyasagar-cond-3}.
We can again apply Theorem 38.2 with \(X_t = V(θ_t)\), \(β_t = α_t^2 M(L^2 + σ^2)/a\), \(Y_t = α_t^2 M σ^2\), and \(Z_t = α_t \phi(\NORM{θ_t - θ^*}_2)\). Thus, we can conclude that there exists a random variable \(ζ\) such that \(V(θ_t) \to ζ\) and \[\begin{equation}\label{eq:vidyasagar-1-pf-step-2} \sum_{t \ge 1} α_t \phi(\NORM{θ_t - θ^*}_2) < ∞, \quad \mathrm{a.s.} \end{equation}\]
Let \(Ω_1 \subset Ω\) denote the set of \(ω\) for which \[ \sup_{t \ge 1} V(θ_t(ω)) < ∞, \quad \lim_{t \to ∞} V(θ_t(ω)) = ζ(ω), \quad \sum_{t \ge 1} α_t \phi(\NORM{θ_t(ω) - θ^*}_2) < ∞. \] From Theorem 38.2, we know that \(P(Ω_1) = 1\). We will now show by contradiction that \(ζ(ω) = 0\) for all \(ω \in Ω_1\). Assume that for some \(ω \in Ω_1\), we have \(ζ(ω) = 2 ε > 0\). Choose a \(T\) such that \(V(θ_t(ω)) \ge ε\) for all \(t \ge T\). Define \(V_M = \sup_{t \ge 1} V(θ_t(ω))\). Then, by \eqref{eq:vidyasagar-cond-1}, we have \[ \sqrt{\frac{ε}{b}} \le \NORM{θ_t(ω) - θ^*}_2 \le \sqrt{\frac{V_M}{a}}, \quad \forall t \ge T. \]
Define \(δ = \inf_{\sqrt{ε/b} \le r \le \sqrt{V_M/a}} \phi(r)\) and observe that \(δ > 0\) because \(\phi\) belongs to class \(\ALPHABET B\). Therefore, \[ \sum_{t \ge T} α_t \phi(\NORM{θ_t - θ^*}_2) \ge \sum_{t \ge T} α_t δ = ∞, \] due to (R2). But this contradicts \eqref{eq:vidyasagar-1-pf-step-2}. Hence, there is no \(ω \in Ω_1\) such that \(ζ(ω) > 0\). Therefore, \(ζ = 0\) almost surely, i.e., \(V(θ_t) \to 0\) almost surely. Finally, it follows from \eqref{eq:vidyasagar-cond-1} that \(θ_t \to θ^*\) almost surely as \(t \to ∞\).
Theorem 31.3 requires the existence of a suitable Lyapunov function that satisfies various conditions. Verifying whether or not such a function exists can be a bottleneck.
As argued above, the conditions of Theorem 31.3 imply (F3). If instead of (F3), we assume the stronger condition (F3’), then it is possible to establish the following “converse” Lyapunov theorem which guarantees the existence of such a Lyapunov function \(V\).
Theorem 31.4 Suppose assumptions (F1’), (F2’), (F3’) and (F4) hold. Then, there exists a twice differentiable function \(V \colon \reals^d \to \reals_{\ge 0}\) such that \(V\) and its derivative \(\dot V \colon \reals^d \to \reals\) defined as \(\dot V(θ) \coloneqq \langle \GRAD V(θ), f(θ) \rangle\) together satisfy the following conditions: there exist positive constants \(a\), \(b\), \(c\), and a finite constant \(M\) such that for all \(θ \in \reals^d\):
- \(a\NORM{θ - θ^*}_2^2 \le V(θ) \le b\NORM{θ - θ^*}_2^2\),
- \(\dot V(θ) \le -c\NORM{θ - θ^*}_2^2\),
- \(\NORM{\GRAD^2 V(θ)}_S \le 2M\).
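As an illustration (not part of the theorem statement), suppose \(f(θ) = A(θ - θ^*)\) for a Hurwitz matrix \(A\), so that (F1’), (F2’), (F3’), and (F4) hold. One valid choice of \(V\) is then explicit: let \(P \succ 0\) be the unique solution of the Lyapunov equation \(A^\top P + P A = -I\) and set \[ V(θ) = (θ - θ^*)^\top P (θ - θ^*). \] The three conditions hold with \(a = λ_{\min}(P)\), \(b = λ_{\max}(P)\), \(c = 1\), and \(M = λ_{\max}(P)\), since \(\GRAD^2 V(θ) = 2P\) and \(\dot V(θ) = (θ - θ^*)^\top (A^\top P + P A)(θ - θ^*) = -\NORM{θ - θ^*}_2^2\).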
Combining Theorem 31.3 and Theorem 31.4, we get the following “self-contained” theorem:
Theorem 31.5 Suppose assumptions (F1’), (F2’), (F3’), and (F4) as well as assumptions (N1) and (N2) hold. Then,
- If (R1) holds then \(\{θ_t\}_{t \ge 1}\) is bounded almost surely.
- If, in addition, (R2) holds then \(\{θ_t\}_{t \ge 1}\) converges almost surely to \(θ^*\) as \(t \to ∞\).
31.5 Example: Temporal difference learning for a pseudo-contraction
Suppose \(\BELLMAN \colon \reals^d \to \reals^d\) is a pseudo-contraction with respect to the Euclidean norm, i.e., it has a fixed point \(θ^*\) and a contraction factor \(γ \in (0, 1)\) such that \[\begin{equation}\label{eq:pseudo-contraction} \NORM{\BELLMAN θ - θ^*}_2 \le γ\NORM{θ - θ^*}_2, \quad \forall θ \in \reals^d. \end{equation}\] We assume access to a noisy oracle: when we give an input \(θ\) to the oracle, it returns \(\BELLMAN θ + ξ\), where \(ξ\) is independent noise. Suppose we run a temporal difference update of the form \[\begin{equation}\label{eq:TD} θ_{t+1} = (1 - α_t) θ_t + α_t \bigl[ \BELLMAN θ_t + ξ_{t+1} \bigr] \end{equation}\] where \(\{α_t\}_{t \ge 1}\) is the learning rate (step size) sequence.
As before, we define \(\ALPHABET F_t = σ(θ_{1:t}, ξ_{1:t}, α_{1:t})\). Then, we have the following result:
Proposition 31.1 Suppose assumptions (N1) and (N2) hold. Then,
If (R1) holds, then the iterates \(\{θ_t\}_{t \ge 1}\) are bounded almost surely.
If, in addition, (R2) holds, then \(θ_t \to θ^*\) almost surely as \(t \to ∞\).
Observe that \(\eqref{eq:TD}\) can be viewed as a special case of \(\eqref{eq:SA}\) with \(f(θ) = \BELLMAN θ - θ\). We will prove the result using Theorem 31.3 with \[ V(θ) = \NORM{θ - θ^*}_2^2 \] as the candidate Lyapunov function. Clearly, \(V\) satisfies \(\eqref{eq:vidyasagar-cond-1}\) (with \(a = b = 1\)) and \(\eqref{eq:vidyasagar-cond-2}\) (with \(M = 1\), since \(\GRAD^2 V(θ) = 2I\)). Consider \[\begin{align} \dot V(\theta) &\coloneqq \langle \GRAD V(θ), f(θ) \rangle \notag \\ &= 2\langle θ - θ^*, \BELLMAN θ - θ \rangle \notag \\ &= 2\langle θ - θ^*, \BELLMAN θ - θ^* \rangle - 2\langle θ - θ^*, θ - θ^* \rangle \notag \\ &\stackrel{(a)}\le 2\NORM{θ - θ^*}_2 \NORM{ \BELLMAN θ - θ^* }_2 - 2\NORM{θ - θ^*}_2^2 \notag \\ &\stackrel{(b)}\le -2(1-γ) \NORM{θ - θ^*}_2^2 \end{align}\] where \((a)\) follows from the Cauchy–Schwarz inequality and \((b)\) follows from \(\eqref{eq:pseudo-contraction}\). Thus, \(\dot V(θ)\) satisfies \(\eqref{eq:vidyasagar-cond-3}\) with \(\phi(x) = 2(1-γ)x^2\). Finally, although (F2) is not among the assumptions of Proposition 31.1, the only place it is used in the proof of Theorem 31.3 is to bound \(\NORM{f(θ)}_2 \le L \NORM{θ - θ^*}_2\); here this bound follows directly from \(\eqref{eq:pseudo-contraction}\) with \(L = 1 + γ\), since \(\NORM{\BELLMAN θ - θ}_2 \le \NORM{\BELLMAN θ - θ^*}_2 + \NORM{θ^* - θ}_2 \le (1+γ)\NORM{θ - θ^*}_2\). Thus, the result follows from Theorem 31.3.
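As a quick numerical illustration (my own sketch; the operator, step size, and noise below are assumed choices and are not from the source), consider the affine pseudo-contraction \(\BELLMAN θ = θ^* + γ(θ - θ^*)\) and run the update \eqref{eq:TD}:

using Random, LinearAlgebra

# Assumed affine pseudo-contraction with fixed point θ_star and factor γ = 0.5,
# so that ‖bellman(θ) - θ_star‖₂ = γ‖θ - θ_star‖₂.
const θ_star = [1.0, -2.0, 0.5]
const γ = 0.5
bellman(θ) = θ_star + γ * (θ - θ_star)

# Temporal difference update θ_{t+1} = (1 - α_t) θ_t + α_t (bellman(θ_t) + ξ_{t+1})
# with α_t = 1/(t+1) and assumed zero-mean Gaussian noise.
function td(θ0; T = 50_000, σ = 0.2)
    θ = copy(θ0)
    for t in 1:T
        ξ = σ * randn(length(θ))
        θ = (1 - 1/(t+1)) * θ + (1/(t+1)) * (bellman(θ) + ξ)
    end
    return θ
end

Random.seed!(0)
println(norm(td(zeros(3)) - θ_star))   # small: the iterates approach θ_star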
Notes
The stochastic approximation algorithm was introduced by Robbins and Monro (1951). See Lai (2003) for a historical overview. The classical references on this material are Borkar (2008), Chen and Guo (1991), and Kushner and Yin (1997). The idea of using martingales to study the convergence of stochastic approximation was introduced by Blum (1954). Also see Gladyshev (1965).
Example 31.1 is borrowed from Borkar (2008), who points out that it was proposed by Arthur (1994) to model the phenomenon of decreasing returns in economics.
The material in this section is adapted from Vidyasagar (2023).