Potential Project Topics
The project must be done in groups of two. The idea is to review a paper or book chapter related to the material covered in class, give a 15-minute presentation, and write a 10–15 page report on the topic.
Below is a (non-exhaustive) list of potential project topics.
Variations of MDP models
Average cost MDPs
We restricted attention to the discounted cost setting in class. Another infinite-horizon objective is the average cost setting. A potential term project on this topic would be to cover the derivation of the average cost dynamic program under weak accessibility conditions, discuss the corresponding variations of value iteration, policy iteration, and linear programming, and present the vanishing discount approach. This material is covered in all standard books:
Kumar, P.R. and Varaiya, P. 1986. Stochastic systems: Estimation, identification, and adaptive control. Prentice Hall.
Puterman, M.L. 2014. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons. DOI: 10.1002/9780470316887
Bertsekas, D.P. 2011. Dynamic programming and optimal control. Athena Scientific. Available at: http://www.athenasc.com/dpbook.html.
Sennott, L.I. 1999. Stochastic dynamic programming and the control of queueing systems. Wiley, New York, NY, USA.
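To see what the average-cost algorithms look like in practice, here is a minimal sketch of relative value iteration on a hypothetical two-state, two-action model (all transition probabilities and costs below are illustrative, not from any of the references):

```python
import numpy as np

# Relative value iteration for average-cost MDPs: subtract the value at a
# reference state at every sweep so that the iterates stay bounded and the
# offset converges to the optimal average cost.  Model data are illustrative.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, s, s'] for action 0
              [[0.5, 0.5], [0.6, 0.4]]])   # ... and for action 1
c = np.array([[1.0, 2.0],    # per-stage cost c[a, s] for action 0
              [1.5, 0.5]])   # ... and for action 1

h = np.zeros(2)              # relative value function
for _ in range(2000):
    Q = c + np.einsum('aij,j->ai', P, h)   # Bellman backup, per action
    g = Q.min(axis=0)[0]                   # offset: value at reference state 0
    h = Q.min(axis=0) - g                  # h is normalized so h[0] = 0
policy = Q.argmin(axis=0)    # g approximates the optimal average cost
```

At convergence, `g` and `h` satisfy the average cost optimality equation, and `policy` is average-cost optimal (for this unichain, aperiodic example).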
Regularization in MDPs
Regularization is commonly used in many reinforcement learning algorithms. A potential term project on this topic would be to discuss the overview presented in
Geist, M. et al. 2019. A theory of regularized Markov decision processes. International conference on machine learning, PMLR, 2160–2169. Available at: https://proceedings.mlr.press/v97/geist19a.html
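As a concrete instance of the framework, here is a minimal sketch of entropy-regularized ("soft") value iteration, the special case obtained by taking the regularizer to be the (negative) Shannon entropy; the two-state model below is illustrative, not taken from the paper:

```python
import numpy as np

# Entropy-regularized ("soft") value iteration: the max in the Bellman
# operator becomes a temperature-scaled log-sum-exp, and the greedy policy
# becomes a softmax.  All model parameters are illustrative.
tau, gamma = 0.1, 0.9                      # temperature, discount factor
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, s, s']
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],                  # reward r[a, s]
              [0.0, 1.0]])

V = np.zeros(2)
for _ in range(500):
    Q = r + gamma * np.einsum('aij,j->ai', P, V)
    V = tau * np.log(np.exp(Q / tau).sum(axis=0))   # soft maximum over actions
pi = np.exp((Q - V) / tau)   # optimal regularized policy; columns sum to 1
```

As tau goes to zero, the soft maximum approaches the ordinary maximum and `pi` approaches the greedy policy, which is one way to see how the regularized and unregularized theories connect.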
Event driven optimization
In the models discussed in class, decisions need to be made at every time step. One way to reduce the computational complexity is to consider settings where decisions are made only when certain events occur. An overview of this approach is presented in
Cao, X.-R. 2005. Basic ideas for event-based optimization of Markov systems. Discrete Event Dynamic Systems 15, 2, 169–197. DOI: 10.1007/s10626-004-6211-4
For a full treatment see
Cao, X.-R. 2007. Stochastic learning and optimization. Springer. DOI: 10.1007/978-0-387-69082-7
Constrained MDPs
In class, we briefly discussed constrained MDPs. A term project on this topic would be to provide complete details for the discounted case, using, for example, Altman’s book:
Altman, Eitan. 1999. Constrained Markov decision processes. CRC Press. Available at: http://www-sop.inria.fr/members/Eitan.Altman/TEMP/h.pdf
A variation would be to provide an overview of what is called the convex analytic approach, introduced in
Borkar, V.S. 1988. A convex analytic approach to Markov decision processes. Probability Theory and Related Fields 78, 4, 583–602. DOI: 10.1007/bf00353877
An accessible introduction is presented in
Borkar, V.S. 2002. Convex analytic methods in Markov decision processes. In: International series in operations research & management science. Springer US, 347–375. DOI: 10.1007/978-1-4615-0805-2_11
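To make the convex analytic viewpoint concrete, here is a minimal sketch of the discounted occupation-measure linear program for a hypothetical two-state, two-action constrained MDP (it uses `scipy.optimize.linprog`; all model numbers, the budget, and the initial distribution are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

# Convex analytic approach: a discounted constrained MDP becomes a linear
# program over discounted occupation measures mu(a, s).  Data illustrative.
gamma = 0.9
nS, nA = 2, 2
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a, s, s']
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 0.0], [0.0, 1.0]])    # reward r[a, s]
d = np.array([[0.0, 1.0], [1.0, 0.0]])    # constraint cost d[a, s]
alpha = np.array([0.5, 0.5])              # initial state distribution
budget = 2.0                              # E[discounted constraint cost] <= budget

# Variables mu[a, s], flattened as a * nS + s.  Balance constraints:
#   sum_a mu(a, s') - gamma * sum_{a, s} P(s'|s, a) mu(a, s) = alpha(s')
A_eq = np.zeros((nS, nA * nS))
for a in range(nA):
    for s in range(nS):
        for sp in range(nS):
            A_eq[sp, a * nS + s] += (s == sp) - gamma * P[a, s, sp]

res = linprog(c=-r.flatten(),             # maximize discounted reward
              A_ub=[d.flatten()], b_ub=[budget],
              A_eq=A_eq, b_eq=alpha, bounds=(0, None))
mu = res.x.reshape(nA, nS)
pi = mu / mu.sum(axis=0, keepdims=True)   # optimal (possibly randomized) policy
```

The balance constraints force `mu` to be a valid discounted occupation measure (its total mass is 1/(1 − gamma)), and disintegrating `mu` state by state recovers a stationary policy, which is the essence of the convex analytic argument.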
Periodic MDPs
In class, we restricted attention to time-homogeneous models. In practice, especially in applications such as power systems, the dynamics are often periodic. A potential term project on this topic would include an overview of the main ideas of periodic MDPs, introduced in
Riis, J.O. 1965. Discounted Markov programming in a periodic process. Operations Research 13, 6, 920–929. DOI: 10.1287/opre.13.6.920
An accessible overview is provided in
Scherrer, B. 2016. On periodic Markov decision processes. Available at: https://ewrl.files.wordpress.com/2016/12/scherrer.pdf. EWRL Tutorial.
Risk-sensitive MDPs
In the course, we primarily restricted attention to the risk-neutral setup. In many settings, however, one is interested in risk-sensitive utility functions. A potential term project could be to review the formulation of risk-sensitive MDPs (links to be added).
Distribution of returns
In the course, we focused on expected returns. In some settings, one is interested in the distribution of returns. Surprisingly, it is possible to write a dynamic program for the distribution of returns. A potential term project on this topic could be to review Chapters 2 and 5 of the following book:
Bellemare, M.G. et al. 2023. Distributional reinforcement learning. MIT Press. Available at: http://www.distributional-rl.org
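As a taste of the subject, here is a minimal sketch of a categorical (fixed-grid) distributional Bellman update for policy evaluation, in the spirit of the book's treatment of categorical representations; the chain, rewards, and grid below are all illustrative:

```python
import numpy as np

# Categorical distributional policy evaluation: each state's return
# distribution lives on a common grid z; one Bellman update shifts and
# scales the grid by (r, gamma) and projects back by linear interpolation.
gamma = 0.9
P = np.array([[0.9, 0.1], [0.2, 0.8]])    # fixed-policy transition matrix
r = np.array([0.0, 1.0])                  # reward per state
z = np.linspace(0.0, 10.0, 51)            # support grid (r_max/(1-gamma) = 10)
dz = z[1] - z[0]

def project(atoms, probs):
    """Project a distribution on (possibly off-grid) atoms back onto z."""
    out = np.zeros_like(z)
    b = np.clip((atoms - z[0]) / dz, 0, len(z) - 1)
    lo = np.floor(b).astype(int)
    hi = np.minimum(lo + 1, len(z) - 1)
    np.add.at(out, lo, probs * (1 - (b - lo)))   # split mass between the
    np.add.at(out, hi, probs * (b - lo))         # two neighbouring atoms
    return out

eta = np.full((2, len(z)), 1 / len(z))    # initial return distributions
for _ in range(200):
    new = np.zeros_like(eta)
    for s in range(2):
        for sp in range(2):
            new[s] += P[s, sp] * project(r[s] + gamma * z, eta[sp])
    eta = new
mean_return = eta @ z                     # recovers the usual value function
```

A useful sanity check: the linear-interpolation projection preserves means (when no clipping occurs), so `mean_return` reproduces the ordinary value function exactly, while `eta` additionally carries variance, quantiles, and tail information.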
Structural properties of MDPs
Establishing properties beyond monotonicity
In class, we studied sufficient conditions for the value function and the optimal policy to be monotone. But what about other properties, such as convexity?
A general theory for when such structural properties can be established is presented in
Smith, J.E. and McCardle, K.F. 2002. Structural properties of stochastic dynamic programs. Operations Research 50, 5, 796–809. DOI: 10.1287/opre.50.5.796.365
A related result is the theory of comparative statics, which establishes monotonicity in any model parameter (including the discount factor):
Light, B. 2021. Stochastic comparative statics in Markov decision processes. Mathematics of Operations Research 46, 2, 797–810. DOI: 10.1287/moor.2020.1086
A specific example of this type of result is the set of sufficient conditions under which optimal stopping problems have a “control band” structure: continue as long as the state is within an interval; otherwise, stop.
Oh, S. and Özer, Ö. 2016. Characterizing the structure of optimal stopping policies. Production and Operations Management 25, 11, 1820–1838. DOI: 10.1111/poms.12579
Monotonicity for multi-dimensional MDPs
In class, when discussing monotonicity results, we assumed that the state is one-dimensional. The general theory for the multi-dimensional setting is presented in
Topkis, D.M. 1998. Supermodularity and complementarity. Princeton University Press.
These conditions can be difficult to use. Easier-to-verify conditions (for a specific class of models) are presented in
Papadaki, K. and Powell, W.B. 2007. Monotonicity in multidimensional Markov decision processes for the batch dispatch problem. Operations Research Letters 35, 2, 267–272. DOI: 10.1016/j.orl.2006.03.013
A systematic treatment for queueing models is presented in
Koole, G. 2006. Monotonicity in Markov reward and decision chains: Theory and applications. Foundations and Trends in Stochastic Systems 1, 1, 1–76. DOI: 10.1561/0900000002
Structural results for POMDPs
In class, we only discussed structural results for MDPs. A general theory for POMDPs is presented in Part III of the following book:
Krishnamurthy, V. 2025. Partially observed Markov decision processes: Filtering, learning and controlled sensing. Cambridge University Press, Cambridge.
Multi-armed bandits
Multi-armed bandits1 are a special class of resource allocation problems that admit a low-complexity solution. In general, scheduling problems can be modeled as MDPs, but the state space of such problems is exponential in the number of alternatives; this is sometimes referred to as the curse of dimensionality. A conceptual breakthrough was achieved when Gittins showed that a low-complexity solution exists.
Gittins, J.C. and Jones, D.M. 1974. A dynamic allocation index for the discounted multiarmed bandit problem. In: Progress in statistics. North-Holland, Amsterdam, Netherlands, 241–266.
Gittins, J.C. 1979. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological) 41, 2, 148–177.
A potential term project on this topic is to review the basic result of Gittins (1979). The proof in Gittins (1979) uses an interchange argument, which we did not cover in the course. So the term project could first cover the method of interchange arguments and then explain Gittins’ result using it.
An alternative is to present the dynamic programming proof of Gittins’ result, which was given in
Whittle, P. 1980. Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society. Series B (Methodological) 42, 2, 143–149.
Since then, multiple proofs of Gittins’ result have been presented in the literature. A survey of the four most popular proofs is presented in
Frostig, E. and Weiss, G. 2016. Four proofs of Gittins’ multiarmed bandit theorem. Annals of Operations Research 241, 1, 127–165. DOI: 10.1007/s10479-013-1523-0
Another potential term project is to present the restart-in-state formulation of the multi-armed bandit problem presented in
Katehakis, M.N. and Veinott, A.F. 1987. The multi-armed bandit problem: Decomposition and computation. Mathematics of Operations Research 12, 2, 262–268.
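The restart-in-state idea is easy to sketch: the Gittins index of state s equals (1 − γ) times the value, at s, of an auxiliary MDP in which every state offers the choice of continuing or acting as if the arm were restarted in s. A minimal sketch on a hypothetical three-state arm (transition matrix and rewards are illustrative):

```python
import numpy as np

# Gittins indices via the restart-in-state formulation: for each state s,
# solve an MDP where one may either continue from the current state or act
# as if the arm were reset to s.  The three-state arm is illustrative.
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],     # arm transition matrix
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
r = np.array([2.0, 1.0, 0.0])      # reward earned in each arm state

def gittins_index(s, iters=2000):
    V = np.zeros(len(r))
    for _ in range(iters):
        cont = r + gamma * P @ V           # keep playing from current state
        restart = r[s] + gamma * P[s] @ V  # pretend the arm is reset to s
        V = np.maximum(cont, restart)
    return (1 - gamma) * V[s]

indices = [gittins_index(s) for s in range(len(r))]
```

For this chain the rewards deteriorate along the state space, so the indices come out decreasing, as one would expect; each index requires solving only a single-arm MDP, which is the source of the low complexity.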
An interesting generalization of this idea for multi-armed bandits with commitments is presented in
Cowan, W. and Katehakis, M.N. 2015. Multi-armed bandits under general deprecation and commitment. Probability in the Engineering and Informational Sciences 29, 1, 51–76. DOI: 10.1017/s0269964814000217
Structure-aware reinforcement learning
There are many papers that leverage the structural properties of MDPs to develop “structure-aware” RL algorithms, which in general converge much faster than generic RL algorithms. A non-exhaustive list is given below:
Subramanian, J. and Mahajan, A. 2020. Renewal monte carlo: Renewal theory-based reinforcement learning. IEEE Transactions on Automatic Control 65, 8, 3663–3670. DOI: 10.1109/tac.2019.2953089
Provides an RL algorithm for MDPs in which the Markov chain under the optimal policy has a renewal structure. A term project on this topic could include an overview of renewal theory for stochastic processes and tie it to the results presented in the above paper.
Roy, A. et al. 2022. Online reinforcement learning of optimal threshold policies for Markov decision processes. IEEE Transactions on Automatic Control 67, 7, 3722–3729. DOI: 10.1109/tac.2021.3108121
Provides a variation of actor-critic algorithms for MDPs with a threshold structure.
Jitani, A. et al. 2022. Structure-aware reinforcement learning for node-overload protection in mobile edge computing. IEEE Transactions on Cognitive Communications and Networking 8, 4, 1881–1897. DOI: 10.1109/tccn.2022.3195503
An application of the results of Roy et al. (2022) for node overload protection in mobile edge computing.
Fu, F. and Schaar, M. van der 2012. Structure-aware stochastic control for transmission scheduling. IEEE Transactions on Vehicular Technology 61, 9, 3931–3945. DOI: 10.1109/tvt.2012.2213850
Provides a variation of Q-learning for a slight variation of the power-delay tradeoff model discussed in class.
Kunnumkal, S. and Topaloglu, H. 2008. Exploiting the structural properties of the underlying Markov decision problem in the Q-learning algorithm. INFORMS Journal on Computing 20, 2, 288–301. DOI: 10.1287/ijoc.1070.0240
Proposes a variation of the Q-learning algorithm for monotone MDPs.
1 There are two classes of models that go under the heading of multi-armed bandits. One is the class of models described here. The second is a different class of models that builds on the work of Lai and Robbins,
Lai, T.L. and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6, 1, 4–22. DOI: http://dx.doi.org/10.1016/0196-8858(85)90002-8
which derived regret bounds for such models. These days, the latter class of models is more popular, primarily because it has been used extensively in algorithms for displaying online advertisements.
MDP Algorithms
Stochastic dual dynamic programming (SDDP)
SDDP is an efficient way to solve MDPs with continuous state spaces using ideas from mathematical programming. An accessible overview is presented in
Shapiro, A. 2011. Analysis of stochastic dual dynamic programming method. European Journal of Operational Research 209, 1, 63–72. DOI: 10.1016/j.ejor.2010.08.007
Information relaxation
The idea behind information relaxation is to relax the non-anticipativity constraints (that is, to assume that the future random variables can be observed, at a cost). Such a relaxation provides a bound on the optimal performance: a lower bound for cost minimization or an upper bound for reward maximization. By appropriately tuning the cost of observation, one can obtain tighter bounds. An overview of the main theory is presented in this monograph:
Brown, D.B. and Smith, J.E. 2022. Information relaxations and duality in stochastic dynamic programs: A review and tutorial. Foundations and Trends in Optimization 5, 3, 246–339. DOI: 10.1561/2400000027
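The zero-penalty version of the idea (no cost of observation at all) is easy to demonstrate on a toy reward-maximization problem; everything below is an illustrative example, not from the monograph. A decision maker sees T i.i.d. U[0,1] offers one at a time and may stop at most once; revealing the whole sample path in advance yields the clairvoyant bound E[max], which upper-bounds the value of every non-anticipative stopping rule:

```python
import numpy as np

# Zero-penalty information relaxation on a toy optimal stopping problem:
# T iid U[0,1] offers, stop at most once, collect the current offer.
# Relaxing non-anticipativity (the clairvoyant sees the whole path)
# upper-bounds the value of this maximization problem.  Numbers illustrative.
rng = np.random.default_rng(0)
T, n_paths = 5, 100_000
X = rng.uniform(0.0, 1.0, size=(n_paths, T))

# Exact value of the non-anticipative problem by backward induction:
# v_t = E[max(X, v_{t+1})] = (1 + v_{t+1}**2) / 2 for X ~ U[0, 1].
v = 0.0
for _ in range(T):
    v = 0.5 * (1.0 + v ** 2)

# Perfect-information (clairvoyant) bound: just take the best offer.
bound = X.max(axis=1).mean()              # E[max of T uniforms] = T/(T+1)
gap = bound - v                           # the price of clairvoyance
```

The gap between `bound` and `v` is what the penalty term in the duality theory is designed to close: charging the clairvoyant an appropriate cost for using future information tightens the bound toward the true value.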
Improved MDP algorithms
It is possible to improve value iteration by using bounds on the optimal value function. An instance of this idea is presented in:
Porteus, E.L. 1975. Bounds and transformations for discounted finite Markov decision chains. Operations Research 23, 4, 761–784. DOI: 10.1287/opre.23.4.761
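A sketch of the idea, on an illustrative two-state cost model (the model data are made up for illustration): after each value-iteration sweep, the optimal value function can be sandwiched using the minimum and maximum of the one-step update, which yields a much earlier stopping rule than waiting for the iterates themselves to converge.

```python
import numpy as np

# Value iteration with McQueen--Porteus-style bounds: if delta = TV - V,
# then V* lies (elementwise) between TV + gamma/(1-gamma) * min(delta)
# and TV + gamma/(1-gamma) * max(delta).  Model data are illustrative.
gamma = 0.95
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a, s, s']
              [[0.5, 0.5], [0.6, 0.4]]])
c = np.array([[1.0, 2.0], [1.5, 0.5]])    # per-stage cost c[a, s]

V = np.zeros(2)
for k in range(10_000):
    TV = (c + gamma * np.einsum('aij,j->ai', P, V)).min(axis=0)
    delta = TV - V
    lo = TV + gamma / (1 - gamma) * delta.min()   # lower bound on V*
    hi = TV + gamma / (1 - gamma) * delta.max()   # upper bound on V*
    V = TV
    if (hi - lo).max() < 1e-8:                    # span-based stopping rule
        break
V_star = (lo + hi) / 2                            # midpoint estimate of V*
```

Since the stopping rule depends on the span of `delta` rather than its norm, it can fire long before `V` itself has converged, which is the computational payoff of tracking the bounds.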
Another instance is the reward balancing procedure presented in:
Mustafin, A. et al. 2025. MDP geometry, normalization and reward balancing solvers. International conference on artificial intelligence and statistics, PMLR, 2476–2484. Available at: https://proceedings.mlr.press/v258/mustafin25a.html
Another idea is to use operator splitting from numerical linear algebra to modify value and policy iteration:
Rakhsha, A. et al. 2022. Operator splitting value iteration. Advances in neural information processing systems, Curran Associates, Inc., 38373–38385. Available at: https://papers.nips.cc/paper_files/paper/2022/hash/fa809df3ec53cc5781e5078b7d500a5d-Abstract-Conference.html
Policy gradient algorithms
A variation of policy iteration is to use gradient descent on the policy parameters. The performance objective is usually not convex in the policy parameters, which makes it difficult to establish the convergence of such algorithms. A comprehensive overview is provided in
Agarwal, A. et al. 2020. Optimality and approximation with policy gradient methods in Markov decision processes. Proceedings of thirty third conference on learning theory, PMLR, 64–66. Available at: https://proceedings.mlr.press/v125/agarwal20a.html
This is a long paper; a term project on it could focus on the tabular setting.
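A minimal sketch of that tabular setting, with exact (rather than sampled) gradients on an illustrative two-state model: for a softmax policy with parameters theta(s, a), the gradient of the discounted return is d(s) pi(a|s) A(s, a), where d is the unnormalized discounted state-visitation measure and A the advantage. All model numbers and the step size below are assumptions for illustration.

```python
import numpy as np

# Exact policy gradient ascent with a tabular softmax policy.  The
# gradient of J(theta) = E_rho[V_pi] w.r.t. theta[s, a] is
#     d_pi(s) * pi(a|s) * A_pi(s, a),
# with d_pi the unnormalized discounted visitation.  Model is illustrative.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a, s, s']
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 0.0], [0.0, 1.0]])    # reward r[a, s]
rho = np.array([0.5, 0.5])                # initial state distribution

theta = np.zeros((2, 2))                  # softmax parameters theta[s, a]
for _ in range(20_000):
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    P_pi = np.einsum('sa,asj->sj', pi, P)          # policy transition matrix
    r_pi = np.einsum('sa,as->s', pi, r)            # policy reward
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)   # policy evaluation
    Q = r + gamma * np.einsum('aij,j->ai', P, V)   # Q[a, s]
    A = Q.T - V[:, None]                           # advantage A[s, a]
    d = np.linalg.solve(np.eye(2) - gamma * P_pi.T, rho)  # discounted visitation
    theta += 0.1 * d[:, None] * pi * A             # exact gradient ascent step
J = rho @ V                                        # discounted return achieved
```

Even with exact gradients, convergence is slow (the softmax parameters of the optimal actions must diverge for the policy to become deterministic), which is one reason the convergence analysis in the paper is delicate.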
Algorithms for solving POMDPs
See the remark at the end of the notes on POMDPs.