ECSE 506: Stochastic Control and Decision Theory

Aditya Mahajan
Winter 2022


  • The report must be no longer than 10-15 pages. The grade distribution for the different parts of the report is as follows.

    • Background, presentation of the problem setting, and literature overview: 20%
    • Summary of the results, proof outlines, and examples to illustrate the main results: 50%
    • Critical evaluation of the contributions: 20%
    • Clarity of the presentation: 10%
  • Potential project topics are listed below. You can also follow up on the literature discussed in the reference sections of the notes, or pick a topic outside this list. Please send me an email with the topic you want to work on, and we can finalize which papers should be covered in the term project.

  • You may wish to view some of the past term projects to get an idea of the size and structure of the project reports.

1 Potential project topics

This is a partial list of topics. If you have a specific topic or direction in mind, please feel free to discuss that with me.

1.1 Interesting papers

This is a biased collection of interesting papers, any of which could form the basis of a term project. For some of them, you may need to combine the paper with follow-up work to have enough material for a term project.

1.2 Average cost MDPs

In the course, we primarily focused on the discounted cost setting. A related setup is the average cost setting.

A basic treatment of average cost MDPs with finite state and finite action is covered in all standard books: Kumar and Varaiya (1986), Puterman (2014), Bertsekas (2011). An accessible treatment for countable or uncountable state spaces is presented in Sennott (1999). A detailed survey of many of the technical results for average cost models over general state spaces is in Arapostathis et al. (1993).
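As a concrete illustration of the average cost criterion, here is a minimal relative value iteration sketch on a toy two-state maintenance MDP. The model and numbers are my own, chosen for illustration, not taken from the references above.

```python
# Relative value iteration on a toy two-state machine-maintenance MDP.
# States: 0 = working, 1 = broken. Actions: 0 = operate, 1 = repair.

costs = {  # costs[(state, action)]
    (0, 0): 0.0, (1, 0): 2.0,   # operating a broken machine is costly
    (0, 1): 1.0, (1, 1): 1.0,   # repair costs 1 and resets to working
}
trans = {  # trans[(state, action)] = [P(next = 0), P(next = 1)]
    (0, 0): [0.5, 0.5], (1, 0): [0.0, 1.0],
    (0, 1): [1.0, 0.0], (1, 1): [1.0, 0.0],
}
states, actions = [0, 1], [0, 1]

h = [0.0, 0.0]   # relative value function, normalized so h(0) = 0
gain = 0.0
for _ in range(200):
    Th = [min(costs[s, a] + sum(p * h[t] for t, p in enumerate(trans[s, a]))
              for a in actions) for s in states]
    gain = Th[0]                       # gain estimate: (Th)(s0) with h(s0) = 0
    h = [Th[s] - Th[0] for s in states]

print(gain)   # ≈ 1/3, the optimal average cost for this toy model
```

At the fixed point, gain + h(s) = min_a [c(s,a) + Σ_s' P(s'|s,a) h(s')], which is the average cost optimality equation discussed in the references above.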

1.3 Multi-armed bandits

Multi-armed bandits1 are a special class of resource allocation problems which admit a low-complexity solution. In general, scheduling problems can be modeled as MDPs, but the state space of such problems is exponential in the number of alternatives. This is sometimes referred to as the curse of dimensionality. A conceptual breakthrough in such problems was achieved when Gittins and Jones (1974) and Gittins (1979) showed that a low-complexity solution exists.
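One way to make the low-complexity solution concrete is the restart formulation of Katehakis and Veinott (1987): the Gittins index of state i equals (1 - β) times the value, at i, of an MDP in which every state may either continue the arm or restart it from state i. A minimal sketch on a toy deterministic arm (my own numbers, not from the papers):

```python
# Gittins index of each state of a single bandit arm via the
# "restart-in-state-i" formulation of Katehakis and Veinott (1987).
# Toy arm: state 0 pays reward 1 and then moves to the absorbing
# state 1, which pays 0 forever.

beta = 0.5
reward = [1.0, 0.0]
P = [[0.0, 1.0],    # from state 0: go to state 1
     [0.0, 1.0]]    # from state 1: stay in state 1

def gittins_index(i, n_iter=200):
    """Index of state i: value iteration on the MDP where every state
    may either continue the arm or restart it from state i."""
    V = [0.0, 0.0]
    for _ in range(n_iter):
        restart = reward[i] + beta * sum(p * v for p, v in zip(P[i], V))
        V = [max(reward[j] + beta * sum(p * v for p, v in zip(P[j], V)),
                 restart) for j in range(len(V))]
    return (1 - beta) * V[i]

indices = [gittins_index(i) for i in range(2)]
print(indices)   # [1.0, 0.0]: stop playing as soon as the arm leaves state 0
```

The key point is that each arm's index is computed in isolation, so the overall complexity is linear, not exponential, in the number of arms.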

1.4 Monotonicity of optimal policies

When we studied monotonicity of optimal policies in class, we assumed that the state and action space were totally ordered. Some of these results generalize to models where the state and action spaces are not totally ordered. An interesting example of such a result is presented in Papadaki and Powell (2007). General conditions for the value function and optimal policy to be monotone are derived in Topkis (1998).
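To see what a monotone optimal policy looks like, here is a toy machine-replacement example (model and numbers are my own, for illustration only) in which value iteration yields a threshold policy:

```python
# Value iteration on a toy machine-replacement MDP with a monotone
# (threshold) optimal policy. States 0..4 are deterioration levels;
# "keep" incurs cost s and degrades with prob. 0.8, "replace" costs
# 2.5 and resets the state to 0.

beta, N = 0.9, 5

def q_keep(s, V):
    nxt = min(s + 1, N - 1)
    return s + beta * (0.8 * V[nxt] + 0.2 * V[s])

def q_replace(s, V):
    return 2.5 + beta * V[0]

V = [0.0] * N
for _ in range(500):
    V = [min(q_keep(s, V), q_replace(s, V)) for s in range(N)]

policy = [0 if q_keep(s, V) <= q_replace(s, V) else 1 for s in range(N)]
print(policy)   # [0, 0, 1, 1, 1]: keep below a threshold, then replace
```

Because the per-stage cost is nondecreasing in s and the "keep" transitions are stochastically monotone, both the value function and the optimal policy are monotone in the state, which is exactly the kind of structure the references above establish in general.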

1.5 Sensitivity-based analysis and event-based optimization of MDPs

For infinite horizon MDPs, an alternative approach to the value iteration and policy iteration methods studied in class is to investigate the sensitivity of the average performance to the parameters of the Markov policy. Such an approach is called perturbation analysis in the MDP literature and the policy-gradient approach in the RL literature. The sensitivity-based view of MDPs can be used to derive an event-based approach to MDPs. The idea of the event-based approach is to optimize based on “events” rather than on states.
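The sensitivity viewpoint can be made concrete on a small example. The sketch below (a toy two-state model of my own) computes the gradient of the average cost with respect to a policy parameter via the Poisson equation and checks it against a finite-difference estimate:

```python
# Sensitivity of the average cost to a policy parameter on a toy
# two-state MDP. In state 0, action 1 is chosen with probability theta;
# action 0 stays in state 0 at cost 1, action 1 moves to state 1 at
# cost 0. State 1 costs 3 and returns to state 0 with probability 1.

def average_cost(theta):
    # stationary distribution of the chain: 0->0 w.p. 1-theta, 0->1 w.p. theta, 1->0
    rho0, rho1 = 1 / (1 + theta), theta / (1 + theta)
    return rho0 * ((1 - theta) * 1.0 + theta * 0.0) + rho1 * 3.0

theta = 0.5
g = average_cost(theta)

# Poisson equation for the bias h (normalized so h(0) = 0): h(1) = 3 - g.
h1 = 3.0 - g
# Relative action values in state 0: q(0,a) = c(0,a) - g + E[h(next state)]
q00 = 1.0 - g            # stay in state 0, h(0) = 0
q01 = 0.0 - g + h1       # move to state 1
# Policy-gradient formula: grad g = rho(0) * sum_a (d pi(a|0)/d theta) q(0,a)
rho0 = 1 / (1 + theta)
grad = rho0 * (q01 - q00)

# Sanity check against a finite-difference approximation
eps = 1e-6
fd = (average_cost(theta + eps) - average_cost(theta - eps)) / (2 * eps)
print(grad, fd)   # both ≈ 4/9
```

The gradient here needs only the stationary distribution and the relative action values, which is the basic object that perturbation analysis estimates from a single sample path.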

1.6 Constrained MDPs

We briefly talked about constrained MDPs in the notes on the linear programming formulation. A potential term project on this topic is to review the constrained MDP framework of Altman (1999) or the convex analytic approach of Borkar (1988; 2002).
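For concreteness, the sketch below (a toy two-state model of my own, using scipy as an assumed dependency) solves a small discounted MDP through its linear program. Cost constraints would enter as additional linear rows in the dual, which is the occupation-measure LP at the heart of constrained MDPs.

```python
# The linear-programming view of a discounted MDP: V* is the largest V
# satisfying V(s) <= c(s,a) + beta * sum_s' P(s'|s,a) V(s') for all (s,a).
import numpy as np
from scipy.optimize import linprog

beta = 0.9
# states {0: working, 1: broken}, actions {0: operate, 1: repair}
c = {(0, 0): 0.0, (1, 0): 2.0, (0, 1): 1.0, (1, 1): 1.0}
P = {(0, 0): [0.5, 0.5], (1, 0): [0.0, 1.0],
     (0, 1): [1.0, 0.0], (1, 1): [1.0, 0.0]}

# maximize sum_s V(s)  <=>  minimize -sum_s V(s), one constraint per (s,a)
A_ub, b_ub = [], []
for (s, a), cost in c.items():
    row = [0.0, 0.0]
    row[s] += 1.0
    for t, p in enumerate(P[s, a]):
        row[t] -= beta * p
    A_ub.append(row)
    b_ub.append(cost)
res = linprog([-1.0, -1.0], A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * 2, method="highs")
V_lp = res.x

# sanity check against plain value iteration
V = np.zeros(2)
for _ in range(500):
    V = np.array([min(c[s, a] + beta * np.dot(P[s, a], V) for a in (0, 1))
                  for s in (0, 1)])
print(V_lp, V)   # both ≈ [3.103, 3.793]
```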

1.7 Stochastic dual dynamic programming (SDDP)

SDDP uses ideas from convex optimization to efficiently solve MDPs over Euclidean state spaces. The method originated in Pereira and Pinto (1991) and has been used extensively in the energy optimization sector. A potential term project on this topic is to review the results of Shapiro (2011).
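To convey the flavor of the method, here is a Kelley-style cutting-plane sketch for a two-stage newsvendor problem (toy numbers of my own). SDDP nests this cut-building idea over many stages with sampled scenarios; here only the core mechanism, approximating a convex cost-to-go by supporting hyperplanes, is shown.

```python
# Approximate the convex expected recourse cost of a two-stage
# newsvendor problem by cuts, refining where the master solution lies.

demands = [1.0, 2.0, 3.0]            # equally likely demand scenarios
order_cost, hold, short = 1.0, 1.0, 4.0

def recourse(x):
    """Expected second-stage cost and a subgradient at order quantity x."""
    val = grad = 0.0
    for d in demands:
        if x >= d:
            val += hold * (x - d); grad += hold
        else:
            val += short * (d - x); grad -= short
    n = len(demands)
    return val / n, grad / n

grid = [0.05 * i for i in range(0, 81)]   # master solved over a fine grid
cuts = []                                  # list of (value, slope, point)
x_k, best = 0.0, float("inf")
for _ in range(50):
    v, sub = recourse(x_k)
    best = min(best, order_cost * x_k + v)   # incumbent upper bound
    cuts.append((v, sub, x_k))
    # master: minimize order cost + current polyhedral lower model
    def lower(x):
        return order_cost * x + max(vc + sc * (x - xc) for vc, sc, xc in cuts)
    x_k = min(grid, key=lower)

print(best)   # ≈ 11/3, the optimal two-stage cost
```

In a real SDDP implementation the master and subproblems are linear programs and the cuts are shared across stages; the grid-based master above is purely for illustration.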

1.8 Robust MDPs

A potential direction on this topic is to review the literature on distributionally robust MDPs.

1.9 Numerical methods to efficiently solve POMDPs

See the remark at the end of the notes on POMDPs.

1.10 Risk-sensitive POMDPs

See the papers by Baras et al. and Fernandez et al.

1.11 Adaptive control of linear systems

See the chapter in Kumar and Varaiya (1986) for references.

1.12 Adaptive control of Markov chains

See the chapter in Kumar and Varaiya (1986) for references.

1.13 Reinforcement learning

Value-based methods (Q-learning, linear function approximation), policy-based methods (policy gradient methods, discussed above), and options (related to event-based optimization).
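As a small illustration of the value-based family, the sketch below runs tabular Q-learning on a toy two-state maintenance MDP (model and numbers are my own) and recovers the optimal policy without ever using the transition probabilities inside the update:

```python
# Tabular Q-learning on a toy two-state machine-maintenance MDP.
# States {0: working, 1: broken}, actions {0: operate, 1: repair},
# discount 0.9, cost minimization.
import random
random.seed(0)

beta = 0.9
c = {(0, 0): 0.0, (1, 0): 2.0, (0, 1): 1.0, (1, 1): 1.0}
P = {(0, 0): [0.5, 0.5], (1, 0): [0.0, 1.0],
     (0, 1): [1.0, 0.0], (1, 1): [1.0, 0.0]}

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
alpha = 0.1
for _ in range(50000):
    s, a = random.randint(0, 1), random.randint(0, 1)   # uniform exploration
    s_next = random.choices((0, 1), weights=P[s, a])[0]  # P used only to simulate
    target = c[s, a] + beta * min(Q[s_next, 0], Q[s_next, 1])
    Q[s, a] += alpha * (target - Q[s, a])

policy = [min((0, 1), key=lambda a: Q[s, a]) for s in (0, 1)]
print(policy)   # [0, 1]: operate while working, repair when broken
```

The same update with Q represented by a parametric approximator instead of a table is the linear function approximation setting mentioned above.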

1.14 Applications of MDPs

You could pick any topic on applications of MDPs. Some references that survey applications in different areas include communication networks (Altman 2002), water reservoir management (Lamond and Boukhtouta 2002), and sensor management (Mahajan and Teneketzis 2008).

1.15 Applications of POMDPs

1.16 Applications of decentralized control systems


Agarwal, A., Kakade, S.M., Lee, J.D., and Mahajan, G. 2019. Optimality and approximation with policy gradient methods in Markov decision processes.
Altman, E. 2002. Applications of Markov decision processes in communication networks. In: International series in operations research & management science. Springer US, 489–536. DOI: 10.1007/978-1-4615-0805-2_16.
Altman, E. 1999. Constrained Markov decision processes. CRC Press.
Arapostathis, A., Borkar, V.S., Fernandez-Gaucherand, E., Ghosh, M.K., and Marcus, S.I. 1993. Discrete-time controlled Markov processes with average cost criterion - a survey. SIAM Journal of Control and Optimization 31, 2, 282–344.
Bertsekas, D.P. 2011. Dynamic programming and optimal control. Athena Scientific.
Bertsekas, D.P. 2013. Abstract dynamic programming. Athena Scientific, Belmont.
Borkar, V.S. 1988. A convex analytic approach to Markov decision processes. Probability Theory and Related Fields 78, 4, 583–602. DOI: 10.1007/bf00353877.
Borkar, V.S. 2002. Convex analytic methods in Markov decision processes. In: International series in operations research & management science. Springer US, 347–375. DOI: 10.1007/978-1-4615-0805-2_11.
Cao, X.-R. 2005. Basic ideas for event-based optimization of Markov systems. Discrete Event Dynamic Systems 15, 2, 169–197. DOI: 10.1007/s10626-004-6211-4.
Cao, X.-R. 2007. Stochastic learning and optimization. Springer. DOI: 10.1007/978-0-387-69082-7.
Chakravorty, J. and Mahajan, A. 2020. Remote estimation over a packet-drop channel with Markovian state. IEEE Transactions on Automatic Control 65, 5, 2016–2031. DOI: 10.1109/tac.2019.2926160.
Cowan, W. and Katehakis, M.N. 2015. Multi-armed bandits under general deprecation and commitment. Probability in the Engineering and Informational Sciences 29, 1, 51–76. DOI: 10.1017/s0269964814000217.
Frostig, E. and Weiss, G. 2016. Four proofs of Gittins’ multiarmed bandit theorem. Annals of Operations Research 241, 1, 127–165. DOI: 10.1007/s10479-013-1523-0.
Gittins, J.C. 1979. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological) 41, 2, 148–177.
Gittins, J.C. and Jones, D.M. 1974. A dynamic allocation index for the discounted multiarmed bandit problem. In: Progress in statistics. North-Holland, Amsterdam, Netherlands, 241–266.
Karatzas, I. and Sudderth, W.D. 2010. Two characterizations of optimality in dynamic programming. Applied Mathematics and Optimization 61, 3, 421–434. DOI: 10.1007/s00245-009-9093-x.
Katehakis, M.N. and Veinott, A.F. 1987. The multi-armed bandit problem: Decomposition and computation. Mathematics of Operations Research 12, 2, 262–268.
Koole, G. 2006. Monotonicity in Markov reward and decision chains: Theory and applications. Foundations and Trends in Stochastic Systems 1, 1, 1–76. DOI: 10.1561/0900000002.
Kumar, P.R. and Varaiya, P. 1986. Stochastic systems: Estimation, identification and adaptive control. Prentice Hall.
Lai, T.L. and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6, 1, 4–22.
Lamond, B.F. and Boukhtouta, A. 2002. Water reservoir applications of Markov decision processes. In: International series in operations research & management science. Springer US, 537–558. DOI: 10.1007/978-1-4615-0805-2_17.
Lipsa, G.M. and Martins, N.C. 2011. Remote state estimation with communication costs for first-order LTI systems. IEEE Transactions on Automatic Control 56, 9, 2013–2025. DOI: 10.1109/tac.2011.2139370.
Mahajan, A. and Teneketzis, D. 2008. Foundations and applications of sensor management. In: Springer-Verlag, 121–151.
Oh, S. and Özer, Ö. 2016. Characterizing the structure of optimal stopping policies. Production and Operations Management 25, 11, 1820–1838. DOI: 10.1111/poms.12579.
Papadaki, K. and Powell, W.B. 2007. Monotonicity in multidimensional Markov decision processes for the batch dispatch problem. Operations Research Letters 35, 2, 267–272. DOI: 10.1016/j.orl.2006.03.013.
Pereira, M.V.F. and Pinto, L.M.V.G. 1991. Multi-stage stochastic optimization applied to energy planning. Mathematical Programming 52, 1-3, 359–375. DOI: 10.1007/bf01582895.
Puterman, M.L. 2014. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons. DOI: 10.1002/9780470316887.
Sennott, L.I. 1999. Stochastic dynamic programming and the control of queueing systems. Wiley, New York, NY, USA.
Shapiro, A. 2011. Analysis of stochastic dual dynamic programming method. European Journal of Operational Research 209, 1, 63–72. DOI: 10.1016/j.ejor.2010.08.007.
Smith, J.E. and McCardle, K.F. 2002. Structural properties of stochastic dynamic programs. Operations Research 50, 5, 796–809. DOI: 10.1287/opre.50.5.796.365.
Topkis, D.M. 1998. Supermodularity and complementarity. Princeton University Press.
Whittle, P. 1980. Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society. Series B (Methodological) 42, 2, 143–149.
Whittle, P. 1988. Restless bandits: Activity allocation in a changing world. Journal of applied probability 25, A, 287–298.

  1. There are two classes of models that go under the heading of multi-armed bandits. One is the class of models that I have described here. The second is a different class of models building on the work of Lai and Robbins (1985), which derived regret bounds for such models. These days the latter class of models is more popular, primarily because it has been used extensively in algorithms that display online advertisements.↩︎