Potential Project Topics

Updated

March 1, 2026

The project must be done in groups of two. The idea is to review a paper or book chapter corresponding to the material covered in class, give a 15-minute presentation, and write a 10-15 page report on the topic.

Below is a (non-exhaustive) list of potential project topics.

Variations of MDP models

  • Average cost MDPs

    We restricted attention to the discounted-cost setting in class. Another infinite-horizon objective is the average-cost setting. A potential term project on this topic would be to cover the derivation of the average-cost dynamic program under weak accessibility conditions, discuss the corresponding variations of value iteration, policy iteration, and linear programming, and present the vanishing-discount approach. This is covered in all standard books:

    Kumar, P.R. and Varaiya, P. 1986. Stochastic systems: Estimation, identification, and adaptive control. Prentice Hall.

    Puterman, M.L. 2014. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons. DOI: 10.1002/9780470316887

    Bertsekas, D.P. 2011. Dynamic programming and optimal control. Athena Scientific. Available at: http://www.athenasc.com/dpbook.html.

    Sennott, L.I. 1999. Stochastic dynamic programming and the control of queueing systems. Wiley, New York, NY, USA.
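    As a concrete illustration, the average-cost variant of value iteration (relative value iteration, as described in, e.g., Puterman, Ch. 8) fits in a few lines. The two-state MDP below is made up for illustration; only the algorithm itself is from the references above.

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP (all numbers are illustrative).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s][a] = next-state distribution
              [[0.5, 0.5], [0.1, 0.9]]])
c = np.array([[1.0, 2.0],                 # c[s][a] = per-stage cost
              [0.5, 3.0]])

def relative_value_iteration(P, c, ref_state=0, iters=1000):
    """Relative VI: iterate h <- T h - (T h)(ref_state).  Under suitable
    accessibility conditions, the subtracted constant converges to the
    optimal average cost and h to a relative value function."""
    n = c.shape[0]
    h = np.zeros(n)
    for _ in range(iters):
        Th = (c + P @ h).min(axis=1)   # Bellman update (min over actions)
        g = Th[ref_state]              # estimate of the optimal average cost
        h = Th - g                     # keep the relative value function bounded
    policy = (c + P @ h).argmin(axis=1)
    return g, h, policy

g, h, policy = relative_value_iteration(P, c)
```

    At convergence, (g, h) satisfy the average-cost optimality equation g + h = min_a (c + P h), which is easy to check numerically.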

  • Regularization in MDPs

    Regularization is commonly used in many reinforcement learning algorithms. A potential term project on this topic would be to discuss the overview presented in

    Geist, M. et al. 2019. A theory of regularized Markov decision processes. International conference on machine learning, PMLR, 2160–2169. Available at: https://proceedings.mlr.press/v97/geist19a.html
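    As a concrete instance of a regularized Bellman operator, negative-entropy regularization replaces the max in the Bellman update by its smooth log-sum-exp ("soft" maximum) counterpart, and the greedy policy by a softmax. Below is a minimal sketch on a made-up two-state, reward-maximization example; the numbers are illustrative, not from the paper.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP with rewards (illustrative numbers).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.6, 0.4], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
beta, tau = 0.9, 0.5      # discount factor and entropy temperature

def soft_value_iteration(P, r, beta, tau, iters=2000):
    """With a negative-entropy regularizer of weight tau, the max in the
    Bellman operator becomes a log-sum-exp, which is still a contraction."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        Q = r + beta * (P @ V)                         # state-action values
        V = tau * np.log(np.exp(Q / tau).sum(axis=1))  # soft maximum
    pi = np.exp(Q / tau)                  # regularized-optimal policy: softmax
    pi /= pi.sum(axis=1, keepdims=True)
    return V, Q, pi

V, Q, pi = soft_value_iteration(P, r, beta, tau)
```

    As tau goes to 0, the soft maximum approaches the ordinary maximum and the softmax policy approaches the greedy one.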

  • Event driven optimization

    In the models discussed in class, decisions need to be made at every time step. One way to reduce the computational complexity is to consider settings where decisions are made only when certain events occur. An overview of this approach is presented in

    Cao, X.-R. 2005. Basic ideas for event-based optimization of Markov systems. Discrete Event Dynamic Systems 15, 2, 169–197. DOI: 10.1007/s10626-004-6211-4

    For a full treatment see

    Cao, X.-R. 2007. Stochastic learning and optimization. Springer. DOI: 10.1007/978-0-387-69082-7

  • Constrained MDPs

    In class, we briefly discussed constrained MDPs. A term project on this topic would be to provide complete details for the discounted case, using, for example, Altman’s book:

    Altman, E. 1999. Constrained Markov decision processes. CRC Press. Available at: http://www-sop.inria.fr/members/Eitan.Altman/TEMP/h.pdf

    A variation would be to provide an overview of what is called the convex analytic approach, introduced in

    Borkar, V.S. 1988. A convex analytic approach to Markov decision processes. Probability Theory and Related Fields 78, 4, 583–602. DOI: 10.1007/bf00353877

    An accessible introduction is presented in

    Borkar, V.S. 2002. Convex analytic methods in Markov decision processes. In: International series in operations research & management science. Springer US, 347–375. DOI: 10.1007/978-1-4615-0805-2_11
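    The starting point of the convex analytic approach is that the expected discounted cost is linear in the (discounted) occupation measure of a policy, so the MDP and its constrained variants can be recast as linear programs over occupation measures. Below is a minimal numerical check of this linearity on a made-up two-state example (all numbers are illustrative).

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers).
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])
c = np.array([[1.0, 4.0],
              [2.0, 0.5]])
beta = 0.9
alpha = np.array([0.5, 0.5])          # initial state distribution
pi = np.array([[1.0, 0.0],            # a fixed randomized policy pi[s][a]
               [0.3, 0.7]])

# Transition matrix and per-stage cost under pi
Ppi = np.einsum('sa,saz->sz', pi, P)
cpi = (pi * c).sum(axis=1)

# Expected discounted cost: J = alpha' (I - beta Ppi)^{-1} cpi
J = alpha @ np.linalg.solve(np.eye(2) - beta * Ppi, cpi)

# Discounted state occupation: mu_s = sum_t beta^t P(S_t = s)
mu_s = np.linalg.solve(np.eye(2) - beta * Ppi.T, alpha)
mu = mu_s[:, None] * pi               # state-action occupation measure
```

    One can verify that J equals the inner product of mu with c, and that mu sums to 1/(1 - beta); the convex analytic approach optimizes this linear objective over the polytope of valid occupation measures.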

  • Periodic MDPs

    In class, we restricted attention to time-homogeneous models. In practice, especially in applications like power systems, the dynamics are often periodic. A potential term project on this topic would include an overview of the main ideas of periodic MDPs, introduced in

    Riis, J.O. 1965. Discounted Markov programming in a periodic process. Operations Research 13, 6, 920–929. DOI: 10.1287/opre.13.6.920

    An accessible overview is provided in

    Scherrer, B. 2016. On periodic Markov decision processes. Available at: https://ewrl.files.wordpress.com/2016/12/scherrer.pdf. EWRL Tutorial.

  • Risk-sensitive MDPs

    In the course, we primarily restricted attention to the risk-neutral setup. In many settings, one is interested in risk-sensitive utility functions. A potential term project could be to review the formulation of risk-sensitive MDPs (links to be added).

  • Distribution of returns

    In the course, we focused on expected returns. In some settings, one is interested in the distribution of returns. Surprisingly, it is possible to write a dynamic program for the distribution of returns. A potential term project on this topic could be to review Chapters 2 and 5 of the following book:

    Bellemare, M.G. et al. 2023. Distributional reinforcement learning. MIT Press. Available at: http://www.distributional-rl.org
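    To see why a distributional dynamic program is possible, note that the return-to-go satisfies G_t = r(S_t) + G_{t+1}, so its distribution can be computed by a backward recursion that shifts and mixes probability mass functions. A minimal sketch for a made-up two-state Markov reward process with integer rewards, so that the finite-horizon, undiscounted return distribution stays discrete (all numbers are illustrative):

```python
from collections import defaultdict

# Hypothetical 2-state Markov reward process (illustrative numbers).
P = {0: [(0, 0.5), (1, 0.5)], 1: [(0, 0.2), (1, 0.8)]}
r = {0: 0, 1: 1}           # reward earned in each state
T = 4                      # horizon

# eta[s] = pmf of the return-to-go from state s; backward recursion on
# the distributional Bellman equation G = r(s) + G' (no control here).
eta = {s: {0: 1.0} for s in P}           # zero steps left: return is 0
for _ in range(T):
    new = {}
    for s in P:
        pmf = defaultdict(float)
        for s2, p in P[s]:
            for g, q in eta[s2].items():
                pmf[r[s] + g] += p * q   # shift by r(s), mix over next states
        new[s] = dict(pmf)
    eta = new

# Expected T-step return from state 0, recovered from its distribution
mean0 = sum(g * q for g, q in eta[0].items())
```

    The mean of the computed distribution matches the ordinary (expected-value) dynamic program, but eta also carries the full shape of the return distribution.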

Structural properties of MDPs

  • Establishing properties beyond monotonicity

    In class, we studied sufficient conditions for the value function and the optimal policy to be monotone. But what about other properties, such as convexity?

    A general theory for when such structural properties can be established is presented in

    Smith, J.E. and McCardle, K.F. 2002. Structural properties of stochastic dynamic programs. Operations Research 50, 5, 796–809. DOI: 10.1287/opre.50.5.796.365

    A related result is the theory of comparative statics, which establishes monotonicity of the solution in model parameters (including the discount factor):

    Light, B. 2021. Stochastic comparative statics in Markov decision processes. Mathematics of Operations Research 46, 2, 797–810. DOI: 10.1287/moor.2020.1086

    A specific example of this type of result is the set of sufficient conditions under which optimal stopping problems have a “control band” structure: continue as long as the state is within an interval; otherwise, stop.

    Oh, S. and Özer, Ö. 2016. Characterizing the structure of optimal stopping policies. Production and Operations Management 25, 11, 1820–1838. DOI: 10.1111/poms.12579

  • Monotonicity for multi-dimensional MDPs

    In class, when discussing monotonicity results, we assumed that the state is one-dimensional. The general theory for the multi-dimensional setting is presented in

    Topkis, D.M. 1998. Supermodularity and complementarity. Princeton University Press.

    These conditions can be difficult to use. Easier-to-verify conditions (for a specific class of models) are presented in

    Papadaki, K. and Powell, W.B. 2007. Monotonicity in multidimensional Markov decision processes for the batch dispatch problem. Operations Research Letters 35, 2, 267–272. DOI: 10.1016/j.orl.2006.03.013

    A systematic treatment for queueing models is presented in

    Koole, G. 2006. Monotonicity in Markov reward and decision chains: Theory and applications. Foundations and Trends in Stochastic Systems 1, 1, 1–76. DOI: 10.1561/0900000002

  • Structural results for POMDPs

    In class, we only discussed structural results for MDPs. A general theory for POMDPs is presented in Part III of the following book:

    Krishnamurthy, V. 2025. Partially observed Markov decision processes: Filtering, learning and controlled sensing. Cambridge University Press, Cambridge.

  • Multi-armed bandits

    Multi-armed bandits1 are a special class of resource allocation problems which have a low-complexity solution. In general, scheduling problems can be modeled as MDPs, but the state space of such problems is exponential in the number of alternatives. This is sometimes referred to as the curse of dimensionality. A conceptual breakthrough in such problems was achieved when Gittins showed that a low-complexity solution exists.

    Gittins, J.C. and Jones, D.M. 1974. A dynamic allocation index for the discounted multiarmed bandit problem. In: Progress in statistics. North-Holland, Amsterdam, Netherlands, 241–266.

    Gittins, J.C. 1979. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological) 41, 2, 148–177.

    A potential term project on this topic is to review the basic result of Gittins (1979). The proof in Gittins (1979) uses an interchange argument, which we did not cover in the course. So the term project could cover the method of interchange arguments and then explain Gittins’ result using it.

    Another alternative is to present the dynamic programming proof of Gittins’ result, which was presented in

    Whittle, P. 1980. Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society. Series B (Methodological) 42, 2, 143–149.

    Since then, multiple proofs of Gittins’ result have been presented in the literature. A survey of the four most popular proofs is presented in

    Frostig, E. and Weiss, G. 2016. Four proofs of Gittins’ multiarmed bandit theorem. Annals of Operations Research 241, 1, 127–165. DOI: 10.1007/s10479-013-1523-0

    Another potential term project is to present the restart-in-state formulation of the multi-armed bandit problem presented in

    Katehakis, M.N. and Veinott, A.F. 1987. The multi-armed bandit problem: Decomposition and computation. Mathematics of Operations Research 12, 2, 262–268.

    An interesting generalization of this idea for multi-armed bandits with commitments is presented in

    Cowan, W. and Katehakis, M.N. 2015. Multi-armed bandits under general deprecation and commitment. Probability in the Engineering and Informational Sciences 29, 1, 51–76. DOI: 10.1017/s0269964814000217
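    The restart-in-state formulation of Katehakis and Veinott (1987) is easy to turn into a small computation: the Gittins index of state i equals (1 − β) times the optimal value, at i, of an MDP in which one may either continue the arm or restart it in state i. A minimal sketch on a made-up two-state arm (the transition matrix and rewards are illustrative):

```python
import numpy as np

# Hypothetical single bandit arm (illustrative numbers).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
r = np.array([1.0, 2.0])
beta = 0.9

def gittins_index(P, r, beta, i, iters=2000):
    """Restart-in-state-i MDP: in every state, either continue the arm or
    restart it in state i.  Then the Gittins index of state i is
    (1 - beta) * V_i(i)  (Katehakis and Veinott, 1987)."""
    V = np.zeros(len(r))
    for _ in range(iters):
        cont = r + beta * (P @ V)      # continue from the current state
        restart = cont[i]              # restarting behaves like being in i
        V = np.maximum(cont, restart)  # value iteration for the restart MDP
    return (1 - beta) * V[i]

indices = [gittins_index(P, r, beta, i) for i in range(2)]
```

    Each index lies between the smallest and largest one-step reward, and the state with the higher reward (and the better prospects) gets the higher index.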

  • Structure-aware reinforcement learning

    There are many papers which leverage the structural properties of MDPs to develop “structure-aware” RL algorithms, which in general converge much faster than generic RL algorithms. A non-exhaustive list is given below:

    Subramanian, J. and Mahajan, A. 2020. Renewal monte carlo: Renewal theory-based reinforcement learning. IEEE Transactions on Automatic Control 65, 8, 3663–3670. DOI: 10.1109/tac.2019.2953089

    Provides an RL algorithm for MDPs where the Markov chain under the optimal policy has a renewal structure. A term project on this topic could include an overview of renewal theory for stochastic processes and tie that to the results presented in the above paper.

    Roy, A. et al. 2022. Online reinforcement learning of optimal threshold policies for Markov decision processes. IEEE Transactions on Automatic Control 67, 7, 3722–3729. DOI: 10.1109/tac.2021.3108121

    Provides a variation of actor-critic algorithms for MDPs with a threshold structure.

    Jitani, A. et al. 2022. Structure-aware reinforcement learning for node-overload protection in mobile edge computing. IEEE Transactions on Cognitive Communications and Networking 8, 4, 1881–1897. DOI: 10.1109/tccn.2022.3195503

    An application of the results of Roy et al. (2022) for node overload protection in mobile edge computing.

    Fu, F. and Schaar, M. van der 2012. Structure-aware stochastic control for transmission scheduling. IEEE Transactions on Vehicular Technology 61, 9, 3931–3945. DOI: 10.1109/tvt.2012.2213850

    Provides a variation of Q-learning for a slight variation of the model presented in the notes on power-delay tradeoff.

    Kunnumkal, S. and Topaloglu, H. 2008. Exploiting the structural properties of the underlying Markov decision problem in the Q-learning algorithm. INFORMS Journal on Computing 20, 2, 288–301. DOI: 10.1287/ijoc.1070.0240

    Proposes a variation of Q-learning algorithm for monotone MDPs.

1 There are two classes of models that go under the heading of multi-armed bandits. One is the class of models that I have described here. The second is a different class of models, which builds on the work of Lai and Robbins

Lai, T.L. and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6, 1, 4–22. DOI: 10.1016/0196-8858(85)90002-8

which derived regret bounds for such models. These days, the latter class of models is more popular, primarily because it has been used extensively in algorithms that display online advertisements.

MDP Algorithms

  • Stochastic dual dynamic programming (SDDP)

    SDDP is an efficient way to solve MDPs in continuous spaces using ideas from mathematical programming. An accessible overview is presented in

    Shapiro, A. 2011. Analysis of stochastic dual dynamic programming method. European Journal of Operational Research 209, 1, 63–72. DOI: 10.1016/j.ejor.2010.08.007

  • Information relaxation

    The idea behind information relaxation is to assume that we can relax the non-anticipativity constraints (so, we assume that we can observe the future random variables at a cost). Such a relaxation provides a lower bound on optimal performance. By appropriately tuning the cost of observation, one can obtain tighter bounds. An overview of the main theory is presented in this monograph:

    Brown, D.B. and Smith, J.E. 2022. Information relaxations and duality in stochastic dynamic programs: A review and tutorial. Foundations and Trends in Optimization 5, 3, 246–339. DOI: 10.1561/2400000027
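    As a toy illustration of the idea (with no penalty, i.e., the future is revealed for free), consider a stopping problem where we see i.i.d. Uniform(0,1) offers and may keep one. Letting the decision maker see the whole path before stopping gives the pathwise optimum max over the offers, whose mean bounds the optimal non-anticipative value; in this reward-maximization convention the bound is from above. The example below is made up; adding a well-chosen penalty for using future information would tighten the bound.

```python
import random

T = 3                      # horizon: observe X_1..X_T, stop once, keep X_t
# Non-anticipative DP value for i.i.d. Uniform(0,1) offers:
# V_T = 1/2 and V_t = E[max(X, V_{t+1})] = (1 + V_{t+1}^2) / 2.
V = 0.5
for _ in range(T - 1):
    V = (1 + V * V) / 2

# Penalty-free information relaxation: the clairvoyant's pathwise optimum
# is max_t X_t, and its Monte Carlo mean bounds V from above.
random.seed(0)
N = 200_000
bound = sum(max(random.random() for _ in range(T)) for _ in range(N)) / N
```

    Here the DP value is about 0.695 while the relaxation bound is about 0.75 (the mean of the maximum of three uniforms); the gap is the price of clairvoyance that a penalty would recover.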

  • Improved MDP algorithms

    It is possible to improve value iteration based on bounds on the optimal value function. An instance of this idea is presented in:

    Porteus, E.L. 1975. Bounds and transformations for discounted finite Markov decision chains. Operations Research 23, 4, 761–784. DOI: 10.1287/opre.23.4.761

    Another instance is the reward balancing procedure presented in:

    Mustafin, A. et al. 2025. MDP geometry, normalization and reward balancing solvers. International conference on artificial intelligence and statistics, PMLR, 2476–2484. Available at: https://proceedings.mlr.press/v258/mustafin25a.html

    Another idea is to use operator splitting from linear algebra to modify value and policy iteration.

    Rakhsha, A. et al. 2022. Operator splitting value iteration. Advances in neural information processing systems, Curran Associates, Inc., 38373–38385. Available at: https://papers.nips.cc/paper_files/paper/2022/hash/fa809df3ec53cc5781e5078b7d500a5d-Abstract-Conference.html
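    As a small illustration of the first idea, MacQueen/Porteus-style bounds sandwich the optimal value function using the componentwise change of one value-iteration update, which gives a much tighter stopping rule than the plain contraction estimate. The sketch below uses a made-up two-state discounted-cost MDP; it shows the flavor of the bounds, not the paper's transformations.

```python
import numpy as np

# Hypothetical discounted-cost MDP (illustrative numbers).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
c = np.array([[2.0, 1.0],
              [0.5, 3.0]])
beta = 0.9

def vi_with_bounds(P, c, beta, tol=1e-6):
    """Value iteration with posterior bounds: if d = T V - V, then
    T V + beta/(1-beta) * min(d)  <=  V*  <=  T V + beta/(1-beta) * max(d)
    componentwise, so we can stop as soon as the sandwich is tight."""
    V = np.zeros(c.shape[0])
    while True:
        TV = (c + beta * (P @ V)).min(axis=1)   # one Bellman update
        d = TV - V
        lo = TV + beta / (1 - beta) * d.min()   # lower bound on V*
        hi = TV + beta / (1 - beta) * d.max()   # upper bound on V*
        V = TV
        if (hi - lo).max() < tol:
            return (lo + hi) / 2, lo, hi

V, lo, hi = vi_with_bounds(P, c, beta)
```

    The width of the sandwich depends on the span of d rather than its norm, so the stopping test fires much earlier than the classical beta^k/(1-beta) estimate suggests.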

  • Policy gradient algorithms

    A variation of policy iteration is to perform gradient ascent on the policy parameters. The performance is usually neither convex nor concave in the policy parameters, which makes it difficult to establish convergence of such algorithms. A comprehensive overview is provided in

    Agarwal, A. et al. 2020. Optimality and approximation with policy gradient methods in Markov decision processes. Proceedings of thirty third conference on learning theory, PMLR, 64–66. Available at: https://proceedings.mlr.press/v125/agarwal20a.html

    This is a long paper; a term project on it could focus on the tabular setting.
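    In the tabular setting with a softmax parameterization, the policy gradient can be computed exactly (no sampling), which makes the convergence behavior easy to experiment with. Below is a minimal sketch on a made-up two-state MDP, using the standard advantage form of the policy gradient theorem; all numbers are illustrative.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP with rewards (illustrative numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
beta, alpha = 0.9, np.array([0.5, 0.5])

def softmax(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def exact_pg_step(theta, lr=1.0):
    """One step of exact policy gradient ascent for a softmax policy:
    grad[s, a] = d(s) * pi(a|s) * advantage(s, a), where d is the
    (unnormalized) discounted state occupancy."""
    pi = softmax(theta)
    Ppi = np.einsum('sa,saz->sz', pi, P)
    rpi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(2) - beta * Ppi, rpi)      # policy evaluation
    Q = r + beta * (P @ V)
    d = np.linalg.solve(np.eye(2) - beta * Ppi.T, alpha)  # disc. occupancy
    grad = d[:, None] * pi * (Q - V[:, None])             # advantage form
    return theta + lr * grad, alpha @ V

theta = np.zeros((2, 2))
Js = []
for _ in range(500):
    theta, J = exact_pg_step(theta)
    Js.append(J)
```

    Despite the non-concave objective, on this small example the iterates climb from the uniform policy's value toward the optimal one, consistent with the global convergence results for softmax policy gradient in the tabular case.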

  • Algorithms for solving POMDPs

    See the remark at the end of the notes on POMDPs.