# ECSE 506: Stochastic Control and Decision Theory

Theory: Q-learning

Consider an MDP with finite state space $$\ALPHABET S$$ and action space $$\ALPHABET A$$, where $$|\ALPHABET S| = n$$ and $$|\ALPHABET A| = m$$. Recall the value iteration algorithm for infinite horizon MDPs. We start with an arbitrary initial $$V_0$$ and then recursively compute $$V_{k+1} = \mathcal B V_k$$. We first observe that the value iteration can be written in terms of the $$Q$$-function rather than the value function as follows: start with an arbitrary initial $$Q_0$$ and the recursively compute $$$\label{eq:Q} Q_{k+1}(s,a) = c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) \min_{a' \in \ALPHABET A} Q_k(s',a'), \qquad s \in \ALPHABET S, a \in \ALPHABET A, k \ge 0$$$

Suppose that the transition probabilities are unknown, but we have access to a simulator which can be used the sample the next state for any current state and action. Then, one can compute the fixed point of \eqref{eq:Q} using a stochastic approximation variant known as Q-learning.

# 1 Simple Q-learning algorithm

Let $$Ψ$$ denote the simulator, where we use the notation $$S_{+} \sim Ψ(s,a)$$ to denote that the next state $$S_{+}$$ is sampled using the simulator with the current state input $$s$$ and action $$a$$. Now, consider the following stochastic approximation algorithm: $$$\label{eq:QL} Q_{k+1}(s,a) = Q_k(s,a) + a_k\bigl[ c(s,a) + γ \min_{a' \in \ALPHABET A} Q_k(Ψ(s,a), a') - Q_k(s,a) \bigr]$$$ $$s \in \ALPHABET S$$, $$a \in \ALPHABET A$$, where $$Ψ(s,a)$$ is an independently sampled next state from the simulator. Note that in the above equation we are assuming that we update the $$Q$$ function for all values of state action pairs $$(s,a)$$ at each iteration.

The iteration \eqref{eq:QL} may be viewed as a stochastic approximation algorithm where $$θ_k = Q_k \in \reals^{n×m}$$ and $$f \colon \reals^{n×m} \to \reals^{n×m}$$ given by $$f(Q) = [ f_{sa}(Q) ]$$ where $f_{sa}(Q) = c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) \min_{a' \in \ALPHABET A} Q(s',a') - Q(s,a).$

Define the martingale $$\{W_k\}_{k \ge 0}$$ by $$W_{k+1} = [W_{k+1}]_{sa}$$, where $[W_{k+1}]_{sa} = γ \Bigl( \min_{a'} Q_n(Ψ(s,a), a') - \sum_{s' \in \ALPHABET Y} P_{ss'}(a) \min_{a'} Q_n(s', a') \Bigr), \quad s \in \ALPHABET S, a \in \ALPHABET A.$

Then iteration \eqref{eq:QL} may be viewed as the standard stochastic approximation algorithm where $Q_{k+1} = Q_k + a_k[ f(Q_k) + W_{k+1} ].$

To check that Q-learning converge, we need to check the convergence conditions of stochastic approximation. Since we sample $$Ψ$$ independently at each step, the noise $$\{W_k\}$$ is a martingale difference sequence which has bounded variance due to fact that the per-step cost is bounded. Thus, we show that Q-learning converges, we need to check the asymptotic stability properties of the ODE: $$$\label{eq:Q-ODE} \dot Q(t) = f(Q(t))$$$ and $$$\label{eq:Q-ODE-inf} \dot Q(t) = f_∞(Q(t))$$$ where $$f_∞(Q)$$ is given by $$f_∞(Q) = [f_∞(Q)]_{sa}$$ with $[f_∞(Q)]_{sa} = γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) \min_{a' \in \ALPHABET A} Q(s',a') - Q(s,a), \quad s \in \ALPHABET S, a \in \ALPHABET A.$

We will use the following general property to prove the global asymptotic stability of \eqref{eq:Q-ODE} and \eqref{eq:Q-ODE-inf}.

Lemma 1

Let $$T \colon \reals^d \to \reals^d$$ to be $$p$$-norm contraction. Then the ODE $\dot θ(t) = T(θ(t)) - θ(t)$ has a unique equilibrium point which is global asymptotically stable.

In particular, suppose $$θ^*$$ is the unique fixed poit of $$T$$. Then $$L(θ) = \| θ - θ^* \|_{p}$$ is a Lyapunov function and $$θ(t) \to θ^*$$.

This was originally proved in Borkar and Soumyanatha (1997) .

#### Proof

Let $$\text{sgn}(θ)$$ denote the sign function that is $$+1$$ is $$θ > 0$$, $$0$$ if $$θ = 0$$, and $$-1$$ is $$θ < 0$$. Assume that $$1 \le p < ∞$$. By the chain rule

\begin{align*} \frac{dL(θ)}{dt} &= \sum_{i=1}^d \frac{∂L}{∂θ_i} \frac{dθ_i}{dt} \\ &= \biggl[ \sum_{j = 1}^d | θ_i - θ^*_i |^p \biggr]^{(1-p)/p} \sum_{i=1}^d \text{sign}(θ_i - θ^*_i) | θ_i - θ^*_i |^{1-p} \bigl( T_i(θ) - θ_i \bigr) \end{align*} Note that the term outside the summation is $$\|θ - θ^*\|_{p}^{1-p}$$ and we can write $$T_i(θ) - θ_i$$ as $$T_i(θ) - T_i(θ^*) - (θ_i - θ^*_i)$$ (because $$T(θ^*) = θ^*$$. Making these substitutions in the above equation, we get $\frac{dL(θ)}{dt} = \| θ - θ^*\|_p^{1-p} \biggl[ \sum_{i=1}^d \text{sgn}(θ_i - θ^*_i) | θ_i - θ^*_i|^{p-1} \bigl( T_i(θ) - T_i(θ^*) \bigr) - \| θ - θ^* \|_{p}$ By Holder’s inequality \begin{align*} \sum_{i=1}^d & \text{sgn}(θ_i - θ^*_i) | θ_i - θ^*_i|^{p-1} \bigl( T_i(θ) - T_i(θ^*) \bigr) - \| θ - θ^* \|_{p} \\ &\le \biggl[ \sum_{i=1}^d \bigl(\text{sgn}(θ_i - θ^*_i) | θ_i - θ^*_i|^{p-1} \bigr)^{p/(p-1)} \biggr]^{(p-1)/p} \biggl[ \sum_{i=1}^d \bigl( T_i(θ) - T_i(θ^*) \bigr)^p \biggr]^{1/p} \\ &= \| θ - θ^* \|_{[}^{p-1}\, \| T(θ) - T(θ^*) \|_{p}. \end{align*} Subsituting this in the previous equation, we get $\frac{dL(θ)}{dt} \le \| T(θ) - T(θ^*)\|_p - \| θ - θ^* \|_{p}.$

Note that we can subsitute $$p=∞$$ in the above expression because $$\|z\|_p \to \|z \|_{∞}$$ uniformly on compact sets. Since $$T$$ is a $$p$$-norm contraction, $\| T(θ) - T(θ^*) \|_{p} \le γ \| θ - θ^* \|_{p}.$ Substituting this in the previous equation, we get $\frac{dL(θ)}{dt} \le \le - (1-γ) \| θ(t) - θ^* \|_p = -(1-γ) L(θ).$ Thus, $$L(θ) \to 0$$ (or equivalently, $$\| θ(t) - θ^* \|_{p} \to 0$$ as $$t \to ∞$$$$\Box$$

Define the operator $$F(Q) = [F_{sa}(Q)]$$ by $$$\label{eq:F} F_{sa}(Q) = c(s,a) + γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) \min_{a'} Q(s',a'),$$$ which may be thought as an analogue of the Bellman operator for the $$Q$$-function. Similarly define $$F_∞(Q) = [F_{∞}(Q)]_{sa}$$ by $[F_∞(Q)]_{sa} = γ \sum_{s' \in \ALPHABET S} P_{ss'}(a) \min_{a'} Q(s',a'),$

Then, we have have that $f(Q) = F(Q) - Q \quad\text{and}\quad f_∞(Q) = F_∞(Q) - Q$ where both $$F$$ and $$F_∞$$ are sup-norm contractions. (It is immediate from definition that $$F_∞$$ is a sup-norm contraction. $$F(Q) = c + F_∞(Q)$$ is therefore also a sup-norm contraction). Therefore, Lemma 1 implies that both ODEs \eqref{eq:Q-ODE} and \eqref{eq:Q-ODE-inf} have unique global asymptotically stable equilibrium points. Moreover, $$f_∞(0) = 0$$. Therefore origin is the unique equilibrium point of \eqref{eq:Q-ODE-inf}. Therefore, we satisfy the conditions of stochastic approximation. Hence, if the step sizes $$\{a_k\}_{k \ge 0}$$ satisfy either conditions (TS) or (BS), then Q-learning converges in the appropriate sense as described in the notes on stochastic approximation.

# 2 A single trajectory Q-learning algorithm

The previous version of Q-learning algorithm assumes that we have access to a simulator which samples the output for different choices of state-action pairs. In some applications, such a simulator is not available. Rather, one has to learn simply by interacting with the environment. The next variant of Q-learning does that. This variant is an off-policy learning algorithm, which means that the learner is following a behavioral property (which needs to be sufficiently exploring) while learning the optimal policy (on the side).

The algorithm presented in the previous section assumed that we update the $$Q$$-function for all state action pairs at each time. A more practical variation is the following. Let $$g$$ be some random policy such that $\PR(A_k = a | S_k = s) > 0$ for all state-action pairs $$(s,a)$$. Let $$\{x_k, u_k, c_k\}_{k \ge 0}$$ denote the sequence of states, actions, and costs obtained when executing the policy $$g$$. Then, the “on-trajectory” variation of \eqref{eq:QL} is the following: $$$\label{eq:Q-traj} Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) + a_k(x_k, u_k)\bigl[ c_k + γ \min_{a' \in \ALPHABET W}Q_k(x_{t+1}, a') - Q_k(x_k, u_k) \bigr].$$$ This means that, at the $$(t+1)$$-th update, only the component $$(x_k, u_k)$$ is updated.

The convergence analysis of this algorithm relies on the following result from stochastic approximation (see )

Lemma 2

A $$\reals^d$$ valued stochastic process $$\{Δ_k\}_{k \ge 0}$$ given by $$$\label{eq:DT} Δ_{k+1}(i) = Δ_k(i) + a_k(i) (W_k(i) - Δ_k(i)), \quad i \in \{1, \dots d\},$$$ converges to zero almost surely under the following assumptions:

• $$0 \le a_k(i) \le 1$$, $$\sum_{k} a_k(i) = ∞$$, and $$\sum_{k} a_k^2(i) < ∞$$, for all $$i \in \{1, \dots, d\}$$.
• Let $$\mathcal F_k$$ denote the increasing family of $$σ$$-fields $\mathcal F_k = σ(Δ_{1:k}, W_{1:k}), \quad k \ge 0.$ Then, for some norm $$\| ⋅ \|$$ and for all $$i \in \{1, \dots, d\}$$, $\| \EXP[ W_k(i) | \mathcal F_k] \| \le γ \| Δ_k \|, \quad \text{where } γ \in (0, 1)$ and $\text{var}( W_k(i) | \mathcal F_k) \le C_0(1 + \| Δ_k \|)^2, \quad \text{where C is some constant}.$

To apply the above result, define the stochastic process $$\{Δ_k\}_{k \ge 1}$$ where $$Δ_k \in \reals^{n×m}$$ and is given by $Δ_k(s,a) = Q_k(s,a) - Q^*(s,a)$ where $$Q^*$$ is the optimal $$Q$$-function. Then the iteration \eqref{eq:Q-traj} is of the form \eqref{eq:DT} where $W_k(x_k,u_k) = c_k + γ \min_{a' \in \ALPHABET A}Q_k(x_{k+1}, a') - Q^*(x_k, u_k).$ Now, note that $\EXP[ W_k(x_k, u_k) ] = (F Q_k)(x_k, u_k) - Q^*(x_k,u_k) = (F Q)(x_k,u_k) - (F Q^*)(x_k,u_k),$ where the operator $$F$$ is defined in \eqref{eq:F} and we have used the fact that $$Q^* = FQ^*$$. From the contraction property of $$F$$, we get that $\EXP[ W_k(x_k, u_k) ] \le γ \| Q_k - Q^* \|_∞ = γ \| Δ_k \|_∞.$

Finally, \begin{align*} \text{var}[ W_k(s) | \mathcal F_k ] &= \EXP[ c_k + γ \min_{a' \in \ALPHABET A} Q_{t}(x_{t+1}, a') - (HQ_k)(x_k,u_k) )^2 | \mathcal F_k ] \\ &= \text{var}( c_k + γ \min_{a' \in \ALPHABET A} Q_{t}(x_{t+1}, a') | \mathcal F_k] \end{align*} which is bounded because of the fact that $$c$$ (and therefore $$Q$$) are bounded. Then, the iteration \eqref{eq:Q-traj} satisfies the properties of Lemma 2. Therefore $$Δ_k \to 0$$ or, equivalently, $$Q_k \to Q^*$$, almost surely.

# References

The Q-learning algorithm is due to Watkins and Dayan (1992). The proof here is from Borkar and Meyn (2000). Also see Tsitsiklis (1994) for a asynchronous version of Q-learning algorithm.

Borkar, V.S. and Meyn, S.P. 2000. The o.d.e. Method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38, 2, 447–469. DOI: 10.1137/s0363012997331639.
Borkar, V.S. and Soumyanatha, K. 1997. An analog scheme for fixed point computation. I. theory. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 44, 4, 351–355. DOI: 10.1109/81.563625.
Jaakkola, T., Jordan, M.I., and Singh, S.P. 1994. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation 6, 6, 1185–1201. DOI: 10.1162/neco.1994.6.6.1185.
Tsitsiklis, J.N. 1994. Asynchronous stochastic approximation and q-learning. Machine Learning 16, 3, 185–202. DOI: 10.1007/bf00993306.
Watkins, C.J.C.H. and Dayan, P. 1992. Q-learning. Machine Learning 8, 3-4, 279–292. DOI: 10.1007/bf00992698.

This entry was last updated on 16 Dec 2021