ECSE 506: Stochastic Control and Decision Theory

Aditya Mahajan
Winter 2022

About  |  Lectures  |  Notes  |  Coursework

Theory: Q-learning

Consider an MDP with finite state space \(\ALPHABET S\) and action space \(\ALPHABET A\), where \(|\ALPHABET S| = n\) and \(|\ALPHABET A| = m\). Recall the value iteration algorithm for infinite horizon MDPs. We start with an arbitrary initial \(V_0\) and then recursively compute \(V_{k+1} = \mathcal B V_k\). We first observe that the value iteration can be written in terms of the \(Q\)-function rather than the value function as follows: start with an arbitrary initial \(Q_0\) and the recursively compute \[ \begin{equation} \label{eq:Q} Q_{k+1}(s,a) = c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) \min_{a' \in \ALPHABET A} Q_k(s',a'), \qquad s \in \ALPHABET S, a \in \ALPHABET A, k \ge 0 \end{equation} \]

Suppose that the transition probabilities are unknown, but we have access to a simulator which can be used the sample the next state for any current state and action. Then, one can compute the fixed point of \eqref{eq:Q} using a stochastic approximation variant known as Q-learning.

1 Simple Q-learning algorithm

Let \(Ψ\) denote the simulator, where we use the notation \(S_{+} \sim Ψ(s,a)\) to denote that the next state \(S_{+}\) is sampled using the simulator with the current state input \(s\) and action \(a\). Now, consider the following stochastic approximation algorithm: \[\begin{equation} \label{eq:QL} Q_{k+1}(s,a) = Q_k(s,a) + a_k\bigl[ c(s,a) + γ \min_{a' \in \ALPHABET A} Q_k(Ψ(s,a), a') - Q_k(s,a) \bigr] \end{equation} \] \(s \in \ALPHABET S\), \(a \in \ALPHABET A\), where \(Ψ(s,a)\) is an independently sampled next state from the simulator. Note that in the above equation we are assuming that we update the \(Q\) function for all values of state action pairs \((s,a)\) at each iteration.

The iteration \eqref{eq:QL} may be viewed as a stochastic approximation algorithm where \(θ_k = Q_k \in \reals^{n×m}\) and \(f \colon \reals^{n×m} \to \reals^{n×m}\) given by \(f(Q) = [ f_{sa}(Q) ]\) where \[ f_{sa}(Q) = c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) \min_{a' \in \ALPHABET A} Q(s',a') - Q(s,a). \]

Define the martingale \(\{W_k\}_{k \ge 0}\) by \(W_{k+1} = [W_{k+1}]_{sa}\), where \[ [W_{k+1}]_{sa} = γ \Bigl( \min_{a'} Q_n(Ψ(s,a), a') - \sum_{s' \in \ALPHABET Y} P_{ss'}(a) \min_{a'} Q_n(s', a') \Bigr), \quad s \in \ALPHABET S, a \in \ALPHABET A. \]

Then iteration \eqref{eq:QL} may be viewed as the standard stochastic approximation algorithm where \[ Q_{k+1} = Q_k + a_k[ f(Q_k) + W_{k+1} ]. \]

To check that Q-learning converge, we need to check the convergence conditions of stochastic approximation. Since we sample \(Ψ\) independently at each step, the noise \(\{W_k\}\) is a martingale difference sequence which has bounded variance due to fact that the per-step cost is bounded. Thus, we show that Q-learning converges, we need to check the asymptotic stability properties of the ODE: \[ \begin{equation} \label{eq:Q-ODE} \dot Q(t) = f(Q(t)) \end{equation} \] and \[ \begin{equation} \label{eq:Q-ODE-inf} \dot Q(t) = f_∞(Q(t)) \end{equation} \] where \(f_∞(Q)\) is given by \(f_∞(Q) = [f_∞(Q)]_{sa}\) with \[ [f_∞(Q)]_{sa} = γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) \min_{a' \in \ALPHABET A} Q(s',a') - Q(s,a), \quad s \in \ALPHABET S, a \in \ALPHABET A. \]

We will use the following general property to prove the global asymptotic stability of \eqref{eq:Q-ODE} and \eqref{eq:Q-ODE-inf}.

Lemma 1

Let \(T \colon \reals^d \to \reals^d\) to be \(p\)-norm contraction. Then the ODE \[ \dot θ(t) = T(θ(t)) - θ(t) \] has a unique equilibrium point which is global asymptotically stable.

In particular, suppose \(θ^*\) is the unique fixed poit of \(T\). Then \(L(θ) = \| θ - θ^* \|_{p}\) is a Lyapunov function and \(θ(t) \to θ^*\).

This was originally proved in Borkar and Soumyanatha (1997) .


Let \(\text{sgn}(θ)\) denote the sign function that is \(+1\) is \(θ > 0\), \(0\) if \(θ = 0\), and \(-1\) is \(θ < 0\). Assume that \(1 \le p < ∞\). By the chain rule

\[\begin{align*} \frac{dL(θ)}{dt} &= \sum_{i=1}^d \frac{∂L}{∂θ_i} \frac{dθ_i}{dt} \\ &= \biggl[ \sum_{j = 1}^d | θ_i - θ^*_i |^p \biggr]^{(1-p)/p} \sum_{i=1}^d \text{sign}(θ_i - θ^*_i) | θ_i - θ^*_i |^{1-p} \bigl( T_i(θ) - θ_i \bigr) \end{align*}\] Note that the term outside the summation is \(\|θ - θ^*\|_{p}^{1-p}\) and we can write \(T_i(θ) - θ_i\) as \(T_i(θ) - T_i(θ^*) - (θ_i - θ^*_i)\) (because \(T(θ^*) = θ^*\). Making these substitutions in the above equation, we get \[ \frac{dL(θ)}{dt} = \| θ - θ^*\|_p^{1-p} \biggl[ \sum_{i=1}^d \text{sgn}(θ_i - θ^*_i) | θ_i - θ^*_i|^{p-1} \bigl( T_i(θ) - T_i(θ^*) \bigr) - \| θ - θ^* \|_{p} \] By Holder’s inequality \[\begin{align*} \sum_{i=1}^d & \text{sgn}(θ_i - θ^*_i) | θ_i - θ^*_i|^{p-1} \bigl( T_i(θ) - T_i(θ^*) \bigr) - \| θ - θ^* \|_{p} \\ &\le \biggl[ \sum_{i=1}^d \bigl(\text{sgn}(θ_i - θ^*_i) | θ_i - θ^*_i|^{p-1} \bigr)^{p/(p-1)} \biggr]^{(p-1)/p} \biggl[ \sum_{i=1}^d \bigl( T_i(θ) - T_i(θ^*) \bigr)^p \biggr]^{1/p} \\ &= \| θ - θ^* \|_{[}^{p-1}\, \| T(θ) - T(θ^*) \|_{p}. \end{align*}\] Subsituting this in the previous equation, we get \[ \frac{dL(θ)}{dt} \le \| T(θ) - T(θ^*)\|_p - \| θ - θ^* \|_{p}. \]

Note that we can subsitute \(p=∞\) in the above expression because \(\|z\|_p \to \|z \|_{∞}\) uniformly on compact sets. Since \(T\) is a \(p\)-norm contraction, \[ \| T(θ) - T(θ^*) \|_{p} \le γ \| θ - θ^* \|_{p}. \] Substituting this in the previous equation, we get \[ \frac{dL(θ)}{dt} \le \le - (1-γ) \| θ(t) - θ^* \|_p = -(1-γ) L(θ). \] Thus, \(L(θ) \to 0\) (or equivalently, \(\| θ(t) - θ^* \|_{p} \to 0\) as \(t \to ∞\)\(\Box\)

Define the operator \(F(Q) = [F_{sa}(Q)]\) by \[ \begin{equation} \label{eq:F} F_{sa}(Q) = c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) \min_{a'} Q(s',a'), \end{equation} \] which may be thought as an analogue of the Bellman operator for the \(Q\)-function. Similarly define \(F_∞(Q) = [F_{∞}(Q)]_{sa}\) by \[ [F_∞(Q)]_{sa} = γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) \min_{a'} Q(s',a'), \]

Then, we have have that \[ f(Q) = F(Q) - Q \quad\text{and}\quad f_∞(Q) = F_∞(Q) - Q \] where both \(F\) and \(F_∞\) are sup-norm contractions. (It is immediate from definition that \(F_∞\) is a sup-norm contraction. \(F(Q) = c + F_∞(Q)\) is therefore also a sup-norm contraction). Therefore, Lemma 1 implies that both ODEs \eqref{eq:Q-ODE} and \eqref{eq:Q-ODE-inf} have unique global asymptotically stable equilibrium points. Moreover, \(f_∞(0) = 0\). Therefore origin is the unique equilibrium point of \eqref{eq:Q-ODE-inf}. Therefore, we satisfy the conditions of stochastic approximation. Hence, if the step sizes \(\{a_k\}_{k \ge 0}\) satisfy either conditions (TS) or (BS), then Q-learning converges in the appropriate sense as described in the notes on stochastic approximation.

2 A single trajectory Q-learning algorithm

The previous version of Q-learning algorithm assumes that we have access to a simulator which samples the output for different choices of state-action pairs. In some applications, such a simulator is not available. Rather, one has to learn simply by interacting with the environment. The next variant of Q-learning does that. This variant is an off-policy learning algorithm, which means that the learner is following a behavioral property (which needs to be sufficiently exploring) while learning the optimal policy (on the side).

The algorithm presented in the previous section assumed that we update the \(Q\)-function for all state action pairs at each time. A more practical variation is the following. Let \(g\) be some random policy such that \[ \PR(A_k = a | S_k = s) > 0 \] for all state-action pairs \((s,a)\). Let \(\{x_k, u_k, c_k\}_{k \ge 0}\) denote the sequence of states, actions, and costs obtained when executing the policy \(g\). Then, the “on-trajectory” variation of \eqref{eq:QL} is the following: \[ \begin{equation} \label{eq:Q-traj} Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) + a_k(x_k, u_k)\bigl[ c_k + γ \min_{a' \in \ALPHABET W}Q_k(x_{t+1}, a') - Q_k(x_k, u_k) \bigr]. \end{equation} \] This means that, at the \((t+1)\)-th update, only the component \((x_k, u_k)\) is updated.

The convergence analysis of this algorithm relies on the following result from stochastic approximation (see (Jaakkola et al. 1994))

Lemma 2

A \(\reals^d\) valued stochastic process \(\{Δ_k\}_{k \ge 0}\) given by \[ \begin{equation} \label{eq:DT} Δ_{k+1}(i) = Δ_k(i) + a_k(i) (W_k(i) - Δ_k(i)), \quad i \in \{1, \dots d\}, \end{equation} \] converges to zero almost surely under the following assumptions:

  • \(0 \le a_k(i) \le 1\), \(\sum_{k} a_k(i) = ∞\), and \(\sum_{k} a_k^2(i) < ∞\), for all \(i \in \{1, \dots, d\}\).
  • Let \(\mathcal F_k\) denote the increasing family of \(σ\)-fields \[ \mathcal F_k = σ(Δ_{1:k}, W_{1:k}), \quad k \ge 0. \] Then, for some norm \(\| ⋅ \|\) and for all \(i \in \{1, \dots, d\}\), \[ \| \EXP[ W_k(i) | \mathcal F_k] \| \le γ \| Δ_k \|, \quad \text{where } γ \in (0, 1) \] and \[ \text{var}( W_k(i) | \mathcal F_k) \le C_0(1 + \| Δ_k \|)^2, \quad \text{where $C$ is some constant}. \]

To apply the above result, define the stochastic process \(\{Δ_k\}_{k \ge 1}\) where \(Δ_k \in \reals^{n×m}\) and is given by \[ Δ_k(s,a) = Q_k(s,a) - Q^*(s,a) \] where \(Q^*\) is the optimal \(Q\)-function. Then the iteration \eqref{eq:Q-traj} is of the form \eqref{eq:DT} where \[ W_k(x_k,u_k) = c_k + γ \min_{a' \in \ALPHABET A}Q_k(x_{k+1}, a') - Q^*(x_k, u_k). \] Now, note that \[ \EXP[ W_k(x_k, u_k) ] = (F Q_k)(x_k, u_k) - Q^*(x_k,u_k) = (F Q)(x_k,u_k) - (F Q^*)(x_k,u_k), \] where the operator \(F\) is defined in \eqref{eq:F} and we have used the fact that \(Q^* = FQ^*\). From the contraction property of \(F\), we get that \[ \EXP[ W_k(x_k, u_k) ] \le γ \| Q_k - Q^* \|_∞ = γ \| Δ_k \|_∞. \]

Finally, \[\begin{align*} \text{var}[ W_k(s) | \mathcal F_k ] &= \EXP[ c_k + γ \min_{a' \in \ALPHABET A} Q_{t}(x_{t+1}, a') - (HQ_k)(x_k,u_k) )^2 | \mathcal F_k ] \\ &= \text{var}( c_k + γ \min_{a' \in \ALPHABET A} Q_{t}(x_{t+1}, a') | \mathcal F_k] \end{align*}\] which is bounded because of the fact that \(c\) (and therefore \(Q\)) are bounded. Then, the iteration \eqref{eq:Q-traj} satisfies the properties of Lemma 2. Therefore \(Δ_k \to 0\) or, equivalently, \(Q_k \to Q^*\), almost surely.


The Q-learning algorithm is due to Watkins and Dayan (1992). The proof here is from Borkar and Meyn (2000). Also see Tsitsiklis (1994) for a asynchronous version of Q-learning algorithm.

Borkar, V.S. and Meyn, S.P. 2000. The o.d.e. Method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38, 2, 447–469. DOI: 10.1137/s0363012997331639.
Borkar, V.S. and Soumyanatha, K. 1997. An analog scheme for fixed point computation. I. theory. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 44, 4, 351–355. DOI: 10.1109/81.563625.
Jaakkola, T., Jordan, M.I., and Singh, S.P. 1994. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation 6, 6, 1185–1201. DOI: 10.1162/neco.1994.6.6.1185.
Tsitsiklis, J.N. 1994. Asynchronous stochastic approximation and q-learning. Machine Learning 16, 3, 185–202. DOI: 10.1007/bf00993306.
Watkins, C.J.C.H. and Dayan, P. 1992. Q-learning. Machine Learning 8, 3-4, 279–292. DOI: 10.1007/bf00992698.

This entry was last updated on 16 Dec 2021