ECSE 506: Stochastic Control and Decision Theory
Aditya Mahajan
Winter 2022
 Assignments

 Assignment 1. Due date: 21 Jan 2022 (10 marks)
 Assignment 2. Due date: 28 Jan 2022 (10 marks)
 Assignment 3. Due date: 4 Feb 2022 (10 marks)
 Assignment 4. Due date: 11 Feb 2022 (10 marks)
 Assignment 5. Due date: 7 Mar 2022 (10 marks)
 Assignment 6. Due date: 18 Mar 2022 (10 marks)
 Project

The report must not be longer than 10–15 pages. The grade distribution for the different parts of the report is as follows.
 Background, presentation of the problem setting, and literature overview: 20%
 Summary of the results, proof outlines, and examples to illustrate the main results: 50%
 Critical evaluation of the contributions: 20%
 Clarity of the presentation: 10%
The potential project topics are listed below. You can also pick up on the discussion of the literature in the reference sections of the notes, and you are free to choose a topic outside this list. Please send me an email with the topic that you want to work on, and we can finalize which papers should be covered in the term project.
You may wish to view some of the past term projects to get an idea of the size and structure of the project reports.
1 Potential project topics
This is a partial list of topics. If you have a specific topic or direction in mind, please feel free to discuss that with me.
1.1 Interesting papers
This is a biased collection of interesting papers that could form the basis of a term project. For some of these, you might have to combine them with other follow-up work to have substantial material for a term project.
An interesting characterization of optimal policies (for general decision problems) is that they must be “thrifty” and “equalizing”. An interpretation of this property for MDPs is presented in Karatzas and Sudderth (2010).
In the class, we primarily focused on structural properties for finite horizon models. An obvious next question is: when do these properties hold for infinite horizon models? An answer to this question in broad generality is provided in Smith and McCardle (2002), who show that many of these results can be viewed as properties of closed convex cones.
1.2 Average cost MDPs
In the course, we primarily focused on discounted cost settings. A related setup is the average cost setting.
A basic treatment of average cost MDPs with finite state and action spaces is covered in all standard books: Kumar and Varaiya (1986), Puterman (2014), and Bertsekas (2011). An accessible treatment for countable or uncountable state spaces is presented in Sennott (1999). A detailed survey of many of the technical results for average cost models over general state spaces is in Arapostathis et al. (1993).
A potential term project on this topic could cover the derivation of the average cost dynamic program under the weak accessibility conditions and a discussion of the value iteration, policy iteration, and linear programming algorithms to solve average cost MDPs.
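To fix ideas, here is a minimal sketch of one such algorithm, relative value iteration, for an average cost MDP. The two-state, two-action model (`P`, `c`) is hypothetical, chosen only to illustrate the update:

```python
import numpy as np

# P[a, s, s']: transition probabilities; c[a, s]: per-stage cost.
# This model is made up purely for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.7, 0.3]]])  # action 1
c = np.array([[1.0, 2.0],                 # cost of action 0 in states 0, 1
              [1.5, 0.5]])                # cost of action 1 in states 0, 1

def relative_value_iteration(P, c, ref_state=0, tol=1e-9, max_iter=100_000):
    """Compute the optimal average cost g and a relative value function h."""
    n = P.shape[1]
    h = np.zeros(n)
    for _ in range(max_iter):
        Q = c + P @ h                 # Q[a, s] = c(s, a) + sum_s' P(s'|s, a) h(s')
        h_new = Q.min(axis=0)         # Bellman update
        g = h_new[ref_state]          # running estimate of the optimal average cost
        h_new -= g                    # renormalize so that h(ref_state) = 0
        if np.max(np.abs(h_new - h)) < tol:
            return g, h_new
        h = h_new
    return g, h

g, h = relative_value_iteration(P, c)
```

At convergence, `(g, h)` satisfy the average cost dynamic program g + h(s) = min_a [c(s, a) + Σ_{s'} P(s'|s, a) h(s')], which is the equation whose derivation (under appropriate accessibility conditions) a term project would cover.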
An interesting connection with discounted cost models is the vanishing discount approach. This is not deep enough to be a term project on its own, but it can either be part of a broader project or connect with Tauberian theorems in analysis.
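The vanishing discount idea can be illustrated numerically: on the small hypothetical model below, the normalized discounted value (1 − β)V_β(s) approaches the optimal average cost as β ↑ 1.

```python
import numpy as np

# A hypothetical two-state, two-action model, used only for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])   # P[a, s, s']
c = np.array([[1.0, 2.0], [1.5, 0.5]])     # c[a, s]

def discounted_value(beta, tol=1e-10):
    """Optimal discounted value function, via value iteration."""
    V = np.zeros(2)
    while True:
        V_new = (c + beta * P @ V).min(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# (1 - beta) * V_beta(0) for discount factors approaching 1.
normalized = [(1 - b) * discounted_value(b)[0] for b in (0.9, 0.99, 0.999)]
```

The sequence `normalized` stabilizes as β → 1, and its limit is the optimal average cost of this model; making this limiting argument rigorous is exactly where the Tauberian theorems come in.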
Another potential term project on this topic is approximation bounds for average cost problems. There is some material on approximate dynamic programming for average cost MDPs in Bertsekas (2013). Some recent results on error bounds for discretization of average cost MDPs are in Saldi, Linder, and Yüksel (add citation).
1.3 Multiarmed bandits
Multiarmed bandits^{1} are a special class of resource allocation problems which have a low-complexity solution. In general, scheduling problems can be modeled as MDPs, but the state space of such problems is exponential in the number of alternatives. This is sometimes referred to as the curse of dimensionality. A conceptual breakthrough in such problems was achieved when Gittins and Jones (1974) and Gittins (1979) showed that a low-complexity solution exists.
A potential term project on this topic is to review the basic result of Gittins (1979). The proof in Gittins (1979) uses an interchange argument, which we did not cover in the course. One option is to give an overview of the method of interchange arguments and then explain Gittins's result using it. Another alternative is to present the dynamic programming proof of Gittins's result, which was given by Whittle (1980). Since then, multiple proofs of Gittins's result have been presented in the literature. A survey of the four most popular proofs is presented in Frostig and Weiss (2016).
Another potential term project is to present the restart-in-state formulation of the multiarmed bandit problem presented in Katehakis and Veinott (1987). An interesting generalization of this idea for multiarmed bandits with commitments is presented in Cowan and Katehakis (2015).
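A minimal sketch of the restart-in-state idea: the Gittins index of state i equals (1 − β) times the value, at i, of an auxiliary MDP in which at every stage one may either continue from the current state or restart from state i. The three-state arm (`P`, `r`) below is hypothetical:

```python
import numpy as np

# A made-up single arm: P[s, s'] is its transition matrix, r[s] the reward
# collected when the arm is played in state s.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 2.0, 3.0])

def gittins_index(P, r, beta=0.9, tol=1e-10):
    """Gittins index of every state, via the restart-in-state formulation."""
    n = len(r)
    index = np.zeros(n)
    for i in range(n):
        V = np.zeros(n)
        while True:
            # Continue: collect r(s) and move according to row s of P.
            # Restart: collect r(i) and move according to row i of P.
            V_new = np.maximum(r + beta * P @ V, r[i] + beta * P[i] @ V)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        index[i] = (1 - beta) * V_new[i]
    return index

index = gittins_index(P, r)
```

Each index is computed by solving a small MDP on the arm's own state space, which is what makes the solution low-complexity compared with the exponential joint state space.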
The result of Gittins (1979) is derived under the assumption that the alternatives that are not selected do not evolve. Models where this assumption is violated are called restless bandits (Whittle 1988). Another potential term project is to present an overview of Gittins (1979), Whittle (1988), and other generalizations of multiarmed bandits. See Mahajan and Teneketzis (2008) for a survey.
1.4 Monotonicity of optimal policies
When we studied monotonicity of optimal policies in class, we assumed that the state and action space were totally ordered. Some of these results generalize to models where the state and action spaces are not totally ordered. An interesting example of such a result is presented in Papadaki and Powell (2007). General conditions for the value function and optimal policy to be monotone are derived in Topkis (1998).
One potential term project on this topic is to review the lattice structure of partially ordered sets and explain how the monotonicity results derived in class can be extended to such sets, following the exposition in Topkis (1998).
Another potential term project is to review monotonicity results for queueing models. There is a rich history on this topic, but perhaps the most comprehensive survey is by Koole (2006).
Another potential term project is to present the generalized monotonicity results for optimal stopping. For example, Oh and Özer (2016) present sufficient conditions under which the optimal policy has a control-band type structure. Oh's PhD thesis also has interesting results and examples on this topic.
1.5 Sensitivity-based analysis and event-based optimization of MDPs
For infinite horizon MDPs, an alternative approach to the value iteration and policy iteration methods studied in class is to investigate the sensitivity of the average performance to the parameters of the Markov policy. Such an approach is called perturbation analysis in the MDP literature and the policy-gradient approach in the RL literature. The sensitivity-based view of MDPs can be used to derive an event-based approach to MDPs. The idea of the event-based approach is to optimize based on "events" rather than on states.
One potential term project is to review the results on event-based optimization following the survey paper Cao (2005) or specific chapters from Cao (2007).
Another potential term project is to present specific applications of event-based optimization. Look at the references in Cao (2005) and Cao (2007) for examples.
Another option is to review these ideas as used in the RL literature. For sensitivity-based analysis (known in RL as policy-gradient methods), see the recent survey by Agarwal et al. (2019).
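The policy-gradient idea can be conveyed in a few lines. Here is a minimal sketch of the score-function (REINFORCE-style) estimator on a hypothetical two-action bandit; the reward means, learning rate, and iteration count are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
reward_mean = np.array([1.0, 2.0])   # hypothetical mean rewards; action 1 is better

theta = np.zeros(2)                  # softmax policy parameters
lr = 0.05
for _ in range(5000):
    p = np.exp(theta - theta.max()); p /= p.sum()   # softmax policy pi(.; theta)
    a = rng.choice(2, p=p)                          # sample an action
    reward = reward_mean[a] + rng.normal()          # noisy reward
    grad_log = -p                                   # gradient of log pi(a; theta)
    grad_log[a] += 1.0                              # ... for a softmax policy
    theta += lr * reward * grad_log                 # stochastic gradient ascent
p = np.exp(theta - theta.max()); p /= p.sum()       # learned policy
```

The update ascends an unbiased estimate of the gradient of the expected reward, which is the same sensitivity-to-policy-parameters quantity that perturbation analysis studies.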
1.6 Constrained MDPs
We briefly talked about constrained MDPs in the notes on linear programming formulation. Potential term projects on this topic include:
A detailed discussion of the linear programming formulation of constrained MDPs, following the presentation of Altman (1999).
A discussion of the convex analytic approach of Borkar (1988). An accessible introduction to this approach is presented in Borkar (2002). The main idea of this approach is to think in terms of the occupancy measure of MDPs and cast the optimization problem as a convex programming problem in the space of probability measures.
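To make the occupancy-measure viewpoint concrete, here is a sketch of the linear program for a discounted constrained MDP. The model (`P`, `c`, `d`, the bound `D`, and the initial distribution) is hypothetical, and the decision variables are the entries of the normalized discounted occupancy measure μ(s, a):

```python
import numpy as np
from scipy.optimize import linprog

# A made-up two-state, two-action model.
n_s, n_a, beta = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.6, 0.4], [0.1, 0.9]]])   # P[a, s, s']
c = np.array([[1.0, 4.0], [2.0, 0.5]])     # c[a, s]: cost to be minimized
d = np.array([[1.0, 0.0], [0.0, 1.0]])     # d[a, s]: constrained cost
alpha = np.array([0.5, 0.5])               # initial state distribution
D = 0.4                                    # bound on the expected discounted d-cost

# Balance (flow) constraints on mu, flattened in (a, s) order:
#   sum_a mu(s', a) - beta * sum_{s, a} P(s'|s, a) mu(s, a) = (1 - beta) alpha(s')
A_eq = np.zeros((n_s, n_a * n_s))
for sp in range(n_s):
    for a in range(n_a):
        for s in range(n_s):
            A_eq[sp, a * n_s + s] = float(s == sp) - beta * P[a, s, sp]
b_eq = (1 - beta) * alpha

res = linprog(c=c.ravel(),                        # minimize the c-cost ...
              A_ub=d.ravel()[None, :], b_ub=[D],  # ... subject to the d-cost bound
              A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
mu = res.x.reshape(n_a, n_s)
pi = mu / mu.sum(axis=0, keepdims=True)  # an optimal (possibly randomized) policy
```

Note that the optimal policy recovered from μ may need to randomize: in this example the unconstrained c-optimal policy violates the d-cost bound, so the constraint is active at the optimum.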
1.7 Stochastic dual dynamic programming (SDDP)
SDDP uses ideas from convex optimization to efficiently solve MDPs over Euclidean state spaces. The method originated in Pereira and Pinto (1991) and has been used extensively in the energy optimization sector. A potential term project on this topic is to review the results of Shapiro (2011).
1.8 Robust MDPs
Distributionally robust MDPs.
1.9 Numerical methods to efficiently solve POMDPs
See the remark at the end of the notes on POMDPs.
1.10 Risk-sensitive POMDPs
Baras et al., Fernandez et al.
1.11 Adaptive control of linear systems
See the chapter in Kumar and Varaiya (1986) for references.
1.12 Adaptive control of Markov chains
See the chapter in Kumar and Varaiya (1986) for references.
1.13 Reinforcement learning
Value-based methods (Q-learning, linear function approximation), policy-based methods (policy-gradient methods, discussed above), and options (related to event-based optimization).
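To give a flavor of the value-based methods, here is a minimal tabular Q-learning sketch on a hypothetical two-state, two-action model (all numbers are made up): the Q-table is learned from simulated transitions with epsilon-greedy exploration.

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])   # P[a, s, s']
c = np.array([[1.0, 2.0], [1.5, 0.5]])     # c[a, s]: per-stage cost
beta, step, eps = 0.9, 0.1, 0.2            # discount, learning rate, exploration

Q = np.zeros((2, 2))                       # Q[a, s]
s = 0
for _ in range(50_000):
    # epsilon-greedy exploration (greedy = cost-minimizing action)
    a = rng.integers(2) if rng.random() < eps else int(Q[:, s].argmin())
    s_next = rng.choice(2, p=P[a, s])      # simulate one transition
    target = c[a, s] + beta * Q[:, s_next].min()
    Q[a, s] += step * (target - Q[a, s])   # stochastic fixed-point iteration
    s = s_next
```

The update is a stochastic approximation of the discounted-cost Bellman operator, so with enough exploration the learned Q-table approaches the Q-factors that value iteration would compute from the model.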
1.14 Applications of MDPs
You could pick any topic on the application of MDPs. Some of the references survey applications in different areas.
1.15 Applications of POMDPs
1.16 Applications of decentralized control systems
Real-time communication: Witsenhausen; Walrand and Varaiya; Teneketzis; Mahajan and Teneketzis.
Networked control systems: Walrand and Varaiya; Mahajan and Teneketzis.
Remote estimation: Lipsa and Martins (2011); Chakravorty and Mahajan (2020).
Mean-field teams: Arabneydi and Mahajan.
References
^{1} There are two classes of models that go under the heading of multiarmed bandits. One is the class of models that I have described here. The second is a different class of models which build on the work of Lai and Robbins (1985), which derived regret bounds for such models. These days, the latter class of models is more popular, primarily because it has been used extensively in algorithms which display online advertisements.