30 The learning setup
So far, we have assumed that the decision maker knows the system model. We now consider the case when the system model is not known but one of the following is true:
- the decision maker acts in the environment, i.e., observes the state and the per-step reward and chooses the action.
- the decision maker has access to a system simulator: it can provide a state-action input to the simulator and observe the next state and the realized reward.
- the decision maker has access to an offline dataset of interactions with the environment (i.e., tuples of (current state, current action, current reward, next state)); the data might have been generated by a sub-optimal policy.
To fix ideas, we start with the infinite horizon discounted reward setting. The same fundamental concerns are present in the finite horizon and the infinite horizon average cost settings as well.
Thus, we have the same model as before. There is a controlled Markov process \(\{S_t\}_{t \ge 1}\), \(S_t \in \ALPHABET S\), controlled by the process \(\{A_t\}_{t \ge 1}\), \(A_t \in \ALPHABET A\). The system dynamics are time-homogeneous and are given by the transition matrices \(P(a)\), \(a \in \ALPHABET A\). At each step, the system yields a reward \(R_t = r(S_t, A_t, S_{t+1})\). The dynamics \(P\) and the reward \(r\) are not known to the decision maker.
Let \(θ^∘\) denote the unknown parameters of \(P\) and \(r\). For simplicity, we assume that \(θ^∘ \in Θ\), where \(Θ\) is a compact set. We will assume that the decision maker knows the set \(Θ\) but not the parameter \(θ^∘\).
The objective is to identify a control policy \(π = (π_1, π_2, \dots)\), where, as before, we start with the assumption that the policy is history dependent and the control action at time \(t\) is given by \[ A_t = \pi_t(S_{1:t}, A_{1:t-1}).\]
The performance of a policy is still given by the expected discounted reward under that policy, which of course depends on the model parameters. So, we denote the performance of policy \(π\) under model \(θ\) by \[ J(π,θ) = \EXP^{\pi}_{θ} \biggl[ \sum_{t=1}^{∞} γ^{t-1} R_t \biggr] \] where the notation \(\EXP^{π}_{θ}\) denotes the fact that the expectation depends on the policy \(π\) and the model parameters \(θ\).
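In the learning setting this expectation cannot be computed in closed form, but it can be approximated from sampled trajectories. Below is a minimal sketch of such a Monte Carlo estimate, assuming access to a simulator `step(s, a)` of the (unknown) model and a function `policy(s)` implementing \(π\); both names are placeholders, and the infinite sum is truncated at a finite horizon.

```python
import numpy as np

def estimate_return(step, policy, s0, gamma=0.95, horizon=1000, num_rollouts=100):
    """Monte Carlo estimate of J(pi, theta) by truncated rollouts.

    `step(s, a)` is assumed to return (next_state, reward) for the unknown
    model and `policy(s)` to return an action; both are placeholders.
    Truncating the infinite sum at `horizon` introduces a bias of at most
    gamma**horizon * r_max / (1 - gamma).
    """
    returns = []
    for _ in range(num_rollouts):
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r = step(s, a)
            total += discount * r
            discount *= gamma
        returns.append(total)
    return float(np.mean(returns))
```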
30.1 The learning objective
Modulo the difference between cost minimization and reward maximization, the setup above is the same as the models that we have considered before with one difference: earlier it was assumed that the decision maker knows the system model \((P,r)\) while in the current setup it does not.
The objective is also the same as before: choose a policy to maximize the expected discounted reward. There are different ways to formalize this objective. One possibility is to seek a policy \(π^\star\) that has the property that for all other history dependent policies \(π\) and all model parameters \(θ\), we have: \[ J(π^\star, θ) \ge J(π,θ). \] This is equivalent to a min-max problem: \[\begin{equation}\label{eq:min-max} \tag{min-max-opt} \max_{π} \min_{θ \in Θ} J(π,θ), \end{equation}\] where the max is over all history dependent policies. This approach is akin to a frequentist approach in inference. It is also possible to set up the problem in a Bayesian manner. In particular, suppose there is a prior \(μ\) on the set \(Θ\). Then the Bayesian approach is to solve the following: \[\begin{equation}\label{eq:Bayes} \tag{Bayes-opt} \max_{π} \int_{Θ} J(π,θ) μ(d θ). \end{equation}\]
Both \(\eqref{eq:min-max}\) and \(\eqref{eq:Bayes}\) are difficult problems. Note that unlike the planning setting where we showed that there is no loss of optimality in restricting attention to Markov policies (Theorem 5.1), we have to use history dependent policies in the learning setting.
We will start with a simpler optimization problem. Let \(\pi^\star_{θ}\) denote the optimal time-homogeneous policy for the system with model parameters \(θ\), \(θ \in Θ\). We say that a learning policy is asymptotically optimal if \[\begin{equation} \tag{asym-opt}\label{eq:asym} \lim_{t \to ∞} π_t(s_{1:t},a_{1:t-1}) = π^\star_{θ}(s_t), \quad P_{θ} \hbox{ a.s.} \quad \forall θ \in Θ. \end{equation}\]
30.2 The different learning settings
The choice of an algorithm to find a good policy depends on what information is available to the learning agent. The aspirational setting is that of online learning: an agent operating in the environment generates an experience \((S_1,A_1,R_1,S_2,A_2,R_2,\dots)\) and uses that experience to learn an optimal policy (either in the sense of \(\eqref{eq:asym}\), \(\eqref{eq:min-max}\), or \(\eqref{eq:Bayes}\)). There are two settings of online learning: on-policy learning, where the agent adapts its policy over time, and off-policy learning, where the agent acts according to a behavioral policy but learns an optimal policy based on the gathered experience.
A simpler setting is that of offline learning, where the agent has access to a system simulator and, at any time, can input a state-action pair \((s,a)\) to the simulator and receive \(S_{+} \sim P(\cdot | s,a)\) and \(R = r(s,a,S_{+})\) as outputs. A specific but harder case of offline learning is learning from a dataset generated by a specific (and not necessarily optimal) behavioral policy.
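As an illustration, here is a minimal sketch of how the simulator access could be used to form empirical estimates of \(P\) and \(r\) in the tabular case; the function `simulator(s, a)` and its return convention are assumptions, not part of any specific library.

```python
import numpy as np

def empirical_model(simulator, num_states, num_actions, samples_per_pair=100):
    """Estimate P and r from simulator queries in the tabular case.

    `simulator(s, a)` is assumed to return (next_state, reward), where
    next_state is drawn from the unknown P(. | s, a).
    """
    P_hat = np.zeros((num_states, num_actions, num_states))
    r_hat = np.zeros((num_states, num_actions))
    for s in range(num_states):
        for a in range(num_actions):
            for _ in range(samples_per_pair):
                s_next, reward = simulator(s, a)
                P_hat[s, a, s_next] += 1.0
                r_hat[s, a] += reward
            P_hat[s, a] /= samples_per_pair
            r_hat[s, a] /= samples_per_pair
    return P_hat, r_hat
```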
30.3 High-level overview of the learning algorithms
Loosely speaking, there are two approaches to find an asymptotically optimal policy. The simplest idea is model based learning: use the experience to generate estimates \(\hat θ_t\) of the model parameters, and then choose a policy related to the optimal policy of the estimated parameters. These algorithms differ in how they trade off exploration and exploitation and are broadly classified as follows:
Certainty equivalence or plug-in estimator. In these methods, one generates an estimate \(\hat θ_t\) (e.g., the least squares estimator or the maximum likelihood estimator) and then chooses the control action as a noisy version of \(π^\star_{\hat θ_t}(s_t)\). For continuous state models such as LQR, the noisy version is usually just additive noise, thus the control is \[ a_t = π^\star_{\hat θ_t}(s_t) + ε_t \] where \(ε_t \sim {\cal N}(0,σ_t^2 I)\) and the variance \(σ_t^2\) is slowly reduced in a controlled manner. For discrete state models, the noisy version of the control is often chosen as an \(ε\)-greedy policy, i.e., \[ a_t = \begin{cases} π^\star_{\hat θ_t}(s_t), & \hbox{w.p. } 1 - ε_t \\ \hbox{random action}, & \hbox{w.p. } ε_t \end{cases} \] where the exploration rate \(ε_t\) is slowly reduced in a controlled manner.
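A minimal sketch of the \(ε\)-greedy certainty-equivalence loop for the tabular case is given below. The helpers `estimate_model` (e.g., empirical counts) and `solve_mdp` (e.g., value iteration returning a greedy policy), as well as the `env.reset()`/`env.step(a)` interface, are hypothetical placeholders.

```python
import numpy as np

def epsilon_greedy_ce(env, estimate_model, solve_mdp, num_actions,
                      gamma=0.95, num_steps=10_000):
    """Certainty-equivalence control with epsilon-greedy exploration.

    `env.reset()` is assumed to return an initial state and `env.step(a)` to
    return (next_state, reward); `estimate_model(history)` returns
    (P_hat, r_hat) from the collected (s, a, r, s_next) tuples (it should
    handle an empty history, e.g., by returning a uniform model); `solve_mdp`
    returns the greedy policy of the estimated model.
    """
    rng = np.random.default_rng()
    history = []
    s = env.reset()
    pi_hat = None
    for t in range(1, num_steps + 1):
        eps_t = 1.0 / np.sqrt(t)                 # slowly decaying exploration rate
        P_hat, r_hat = estimate_model(history)
        pi_hat = solve_mdp(P_hat, r_hat, gamma)  # optimal policy of the estimate
        if rng.random() < eps_t:
            a = int(rng.integers(num_actions))   # explore: random action
        else:
            a = pi_hat[s]                        # exploit: certainty-equivalent action
        s_next, r = env.step(a)
        history.append((s, a, r, s_next))
        s = s_next
    return pi_hat
```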
Upper confidence based reinforcement learning (UCRL) In these methods, one generates an upper confidence bound based estimate \(\bar θ_t\) and then chooses a control action \[ a_t = π^\star_{\bar θ_t}(s_t). \]
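One common way to instantiate the optimism in UCRL-style methods (among several) is to run value iteration on the empirical model with a count-based bonus added to the rewards, so that rarely visited state-action pairs appear more attractive. A sketch under these assumptions:

```python
import numpy as np

def optimistic_policy(P_hat, r_hat, counts, gamma=0.95, c=1.0, num_iters=500):
    """Value iteration on the empirical model with a count-based reward bonus.

    `counts[s, a]` is the number of visits to (s, a); the bonus c / sqrt(counts)
    plays the role of a confidence radius, so rarely visited pairs look better
    than their empirical estimate and the agent is drawn to explore them.
    """
    num_states, num_actions = r_hat.shape
    bonus = c / np.sqrt(np.maximum(counts, 1))
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_iters):
        V = Q.max(axis=1)
        Q = r_hat + bonus + gamma * (P_hat @ V)   # P_hat has shape (S, A, S)
    return Q.argmax(axis=1)                        # optimistic greedy policy
```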
Thompson/Posterior sampling (PSRL) This is a Bayesian method where one maintains a posterior distribution \(μ_t\) on the unknown parameters \(θ\) of the system model. At appropriately chosen times, one samples a model \(θ_t\) from the posterior \(μ_t\) and chooses the control action \[ a_t = π^\star_{θ_t}(s_t). \]
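A minimal sketch of the posterior-sampling step for a tabular MDP, assuming a Dirichlet posterior over each row of the transition matrix and, for simplicity, a known (or separately estimated) reward; `solve_mdp` is the same hypothetical planner as before.

```python
import numpy as np

def psrl_step(dirichlet_counts, r_hat, solve_mdp, gamma=0.95, rng=None):
    """One PSRL step: sample a model from the posterior and plan for it.

    `dirichlet_counts[s, a]` holds the Dirichlet parameters of the posterior
    over P(. | s, a); a common choice is to initialize them to ones (uniform
    prior) and add 1 to the entry of the observed next state after each
    transition.
    """
    rng = rng or np.random.default_rng()
    num_states, num_actions, _ = dirichlet_counts.shape
    P_sample = np.zeros_like(dirichlet_counts, dtype=float)
    for s in range(num_states):
        for a in range(num_actions):
            P_sample[s, a] = rng.dirichlet(dirichlet_counts[s, a])
    return solve_mdp(P_sample, r_hat, gamma)   # act greedily w.r.t. the sample
```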
Another class of algorithms is that of model-free algorithms, which mimic the algorithms used to solve MDPs. These are broadly classified as follows:
Value-based algorithms, which may be viewed as sampling-based versions of value iteration. Examples include Q-learning and its variations (see the sketch after this list).
Policy-based algorithms, which may be viewed as sampling-based versions of policy iteration. Examples include actor-critic algorithms based on policy gradients.
Linear-programming based algorithms, which may be viewed as sampling-based versions of the linear programming approach to solving MDPs.
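As an example of the value-based family mentioned above, here is a minimal sketch of tabular Q-learning with \(ε\)-greedy exploration; the `env.reset()`/`env.step(a)` interface is the same hypothetical one as in the earlier sketches.

```python
import numpy as np

def q_learning(env, num_states, num_actions, gamma=0.95,
               alpha=0.1, epsilon=0.1, num_steps=100_000):
    """Tabular Q-learning: a sampling-based, asynchronous form of value iteration."""
    rng = np.random.default_rng()
    Q = np.zeros((num_states, num_actions))
    s = env.reset()
    for _ in range(num_steps):
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            a = int(rng.integers(num_actions))
        else:
            a = int(Q[s].argmax())
        s_next, r = env.step(a)
        # stochastic-approximation version of the Bellman optimality update
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])
        s = s_next
    return Q.argmax(axis=1)   # greedy policy w.r.t. the learned Q
```

Each sampled transition nudges \(Q(s,a)\) toward \(r + γ \max_{a'} Q(s',a')\), which is why the method can be viewed as a sampling-based version of value iteration.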
30.4 Beyond asymptotic optimality
Not all asymptotically optimal learning algorithms are equal. Second order optimality considerations include sample complexity and regret.
Sample complexity: Given an accuracy level \(α\) and a probability \(p\), how many samples \(T = T(α,p)\) are needed such that with probability at least \(1-p\) we have \[ \NORM{V^\star - V^{\pi_T}} \le α? \] The scaling of \(T\) with \(α\) and \(p\) is known as the sample complexity.
Regret: Given a horizon \(T\), what is the difference between the sample-path performance of the learning algorithm until time \(T\) and the performance of the optimal policy until time \(T\)?
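One common way to make the regret definition precise (there are several variants in the literature) is, in the average reward setting, \[ \mathcal{R}(T) = T\, ρ^\star_{θ} - \sum_{t=1}^{T} R_t, \] where \(ρ^\star_{θ}\) is the optimal average reward of the model \(θ\); one then bounds the expectation or a high-probability version of \(\mathcal{R}(T)\) as a function of \(T\).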
The general approach for characterizing sample complexity and regret proceeds as follows. One first establishes lower bounds by constructing models that no algorithm can learn without using a certain number of samples or incurring a certain regret. Then, one tries to upper bound the sample complexity or regret of a specific algorithm. If the lower and the upper bounds match (up to logarithmic factors), we say that the algorithm under consideration is order optimal.
Notes
A good resource for getting started in reinforcement learning is Sutton and Barto (2018). For a more formal treatment, see Bertsekas and Tsitsiklis (1996).