Potential Project Topics
The project must be done in groups of two. The idea is to review a paper or book chapter related to the material covered in class, give a 15-minute presentation, and write a 10–15 page report on the topic.
Below is a (non-exhaustive) list of potential project topics.
Variations of MDP models
Average cost MDPs
We restricted attention to the discounted cost setting in class. Another infinite-horizon objective is the average cost setting. A potential term project on this topic would be to cover the derivation of the average cost dynamic program under weak accessibility conditions, discuss the corresponding variations of value iteration, policy iteration, and linear programming, and present the vanishing discount approach. This material is covered in all standard books:
Kumar, P.R. and Varaiya, P. 1986. Stochastic systems: Estimation, identification, and adaptive control. Prentice Hall.
Puterman, M.L. 2014. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons. DOI: 10.1002/9780470316887
Bertsekas, D.P. 2011. Dynamic programming and optimal control. Athena Scientific. Available at: http://www.athenasc.com/dpbook.html.
Sennott, L.I. 1999. Stochastic dynamic programming and the control of queueing systems. Wiley, New York, NY, USA.
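To see what the average-cost algorithms look like in practice, here is a minimal sketch of relative value iteration on a hypothetical two-state, two-action model (all transition probabilities and costs below are illustrative, not from any of the references):

```python
import numpy as np

# Relative value iteration for average-cost MDPs: subtract the value at a
# reference state at every sweep so that the iterates stay bounded and the
# offset converges to the optimal average cost.  Model data are illustrative.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, s, s'] for action 0
              [[0.5, 0.5], [0.6, 0.4]]])   # ... and for action 1
c = np.array([[1.0, 2.0],    # per-stage cost c[a, s] for action 0
              [1.5, 0.5]])   # ... and for action 1

h = np.zeros(2)              # relative value function
for _ in range(2000):
    Q = c + np.einsum('aij,j->ai', P, h)   # Bellman backup, per action
    g = Q.min(axis=0)[0]                   # offset: value at reference state 0
    h = Q.min(axis=0) - g                  # h is normalized so h[0] = 0
policy = Q.argmin(axis=0)    # g approximates the optimal average cost
```

At convergence, `g` and `h` satisfy the average cost optimality equation, and `policy` is average-cost optimal (for this unichain, aperiodic example).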
Regularization in MDPs
Regularization is commonly used in many reinforcement learning algorithms. A potential term project on this topic would be to discuss the overview presented in
Geist, M. et al. 2019. A theory of regularized Markov decision processes. International conference on machine learning, PMLR, 2160–2169. Available at: https://proceedings.mlr.press/v97/geist19a.html
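As a concrete instance of the framework, here is a minimal sketch of entropy-regularized ("soft") value iteration, the special case obtained by taking the regularizer to be the (negative) Shannon entropy; the two-state model below is illustrative, not taken from the paper:

```python
import numpy as np

# Entropy-regularized ("soft") value iteration: the max in the Bellman
# operator becomes a temperature-scaled log-sum-exp, and the greedy policy
# becomes a softmax.  All model parameters are illustrative.
tau, gamma = 0.1, 0.9                      # temperature, discount factor
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, s, s']
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],                  # reward r[a, s]
              [0.0, 1.0]])

V = np.zeros(2)
for _ in range(500):
    Q = r + gamma * np.einsum('aij,j->ai', P, V)
    V = tau * np.log(np.exp(Q / tau).sum(axis=0))   # soft maximum over actions
pi = np.exp((Q - V) / tau)   # optimal regularized policy; columns sum to 1
```

As tau goes to zero, the soft maximum approaches the ordinary maximum and `pi` approaches the greedy policy, which is one way to see how the regularized and unregularized theories connect.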
Event driven optimization
In the models discussed in class, decisions need to be made at every time step. One way to reduce the computational complexity is to consider settings where decisions are made only when certain events occur. An overview of this approach is presented in
Cao, X.-R. 2005. Basic ideas for event-based optimization of Markov systems. Discrete Event Dynamic Systems 15, 2, 169–197. DOI: 10.1007/s10626-004-6211-4
For a full treatment see
Cao, X.-R. 2007. Stochastic learning and optimization. Springer. DOI: 10.1007/978-0-387-69082-7
Constrained MDPs
In class, we briefly discussed constrained MDPs. A term project on this topic would be to provide complete details for the discounted case, using, for example, Altman’s book:
Altman, Eitan. 1999. Constrained Markov decision processes. CRC Press. Available at: http://www-sop.inria.fr/members/Eitan.Altman/TEMP/h.pdf
A variation would be to provide an overview of what is called the convex analytic approach, introduced in
Borkar, V.S. 1988. A convex analytic approach to Markov decision processes. Probability Theory and Related Fields 78, 4, 583–602. DOI: 10.1007/bf00353877
An accessible introduction is presented in
Borkar, V.S. 2002. Convex analytic methods in Markov decision processes. In: International series in operations research & management science. Springer US, 347–375. DOI: 10.1007/978-1-4615-0805-2_11
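To make the convex analytic viewpoint concrete, here is a minimal sketch of the discounted occupation-measure linear program for a hypothetical two-state, two-action constrained MDP (it uses `scipy.optimize.linprog`; all model numbers, the budget, and the initial distribution are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

# Convex analytic approach: a discounted constrained MDP becomes a linear
# program over discounted occupation measures mu(a, s).  Data illustrative.
gamma = 0.9
nS, nA = 2, 2
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a, s, s']
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 0.0], [0.0, 1.0]])    # reward r[a, s]
d = np.array([[0.0, 1.0], [1.0, 0.0]])    # constraint cost d[a, s]
alpha = np.array([0.5, 0.5])              # initial state distribution
budget = 2.0                              # E[discounted constraint cost] <= budget

# Variables mu[a, s], flattened as a * nS + s.  Balance constraints:
#   sum_a mu(a, s') - gamma * sum_{a, s} P(s'|s, a) mu(a, s) = alpha(s')
A_eq = np.zeros((nS, nA * nS))
for a in range(nA):
    for s in range(nS):
        for sp in range(nS):
            A_eq[sp, a * nS + s] += (s == sp) - gamma * P[a, s, sp]

res = linprog(c=-r.flatten(),             # maximize discounted reward
              A_ub=[d.flatten()], b_ub=[budget],
              A_eq=A_eq, b_eq=alpha, bounds=(0, None))
mu = res.x.reshape(nA, nS)
pi = mu / mu.sum(axis=0, keepdims=True)   # optimal (possibly randomized) policy
```

The balance constraints force `mu` to be a valid discounted occupation measure (its total mass is 1/(1 − gamma)), and disintegrating `mu` state by state recovers a stationary policy, which is the essence of the convex analytic argument.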
Periodic MDPs
In class, we restricted attention to time-homogeneous models. In practice, especially in applications such as power systems, the dynamics are often periodic. A potential term project on this topic would include an overview of the main ideas of periodic MDPs, introduced in
Riis, J.O. 1965. Discounted Markov programming in a periodic process. Operations Research 13, 6, 920–929. DOI: 10.1287/opre.13.6.920
An accessible overview is provided in
Scherrer, B. 2016. On periodic Markov decision processes. Available at: https://ewrl.files.wordpress.com/2016/12/scherrer.pdf. EWRL Tutorial.
Risk-sensitive MDPs
In the course, we primarily restricted attention to the risk-neutral setup. In many settings, however, one is interested in risk-sensitive utility functions. A potential term project could be to review the formulation of risk-sensitive MDPs (links to be added).
Distribution of returns
In the course, we focused on expected returns. In some settings, one is interested in the distribution of returns. Surprisingly, it is possible to write a dynamic program for the distribution of returns. A potential term project on this topic could be to review Chapters 2 and 5 of the following book:
Bellemare, M.G. et al. 2023. Distributional reinforcement learning. MIT Press. Available at: http://www.distributional-rl.org
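As a taste of the subject, here is a minimal sketch of a categorical (fixed-grid) distributional Bellman update for policy evaluation, in the spirit of the book's treatment of categorical representations; the chain, rewards, and grid below are all illustrative:

```python
import numpy as np

# Categorical distributional policy evaluation: each state's return
# distribution lives on a common grid z; one Bellman update shifts and
# scales the grid by (r, gamma) and projects back by linear interpolation.
gamma = 0.9
P = np.array([[0.9, 0.1], [0.2, 0.8]])    # fixed-policy transition matrix
r = np.array([0.0, 1.0])                  # reward per state
z = np.linspace(0.0, 10.0, 51)            # support grid (r_max/(1-gamma) = 10)
dz = z[1] - z[0]

def project(atoms, probs):
    """Project a distribution on (possibly off-grid) atoms back onto z."""
    out = np.zeros_like(z)
    b = np.clip((atoms - z[0]) / dz, 0, len(z) - 1)
    lo = np.floor(b).astype(int)
    hi = np.minimum(lo + 1, len(z) - 1)
    np.add.at(out, lo, probs * (1 - (b - lo)))   # split mass between the
    np.add.at(out, hi, probs * (b - lo))         # two neighbouring atoms
    return out

eta = np.full((2, len(z)), 1 / len(z))    # initial return distributions
for _ in range(200):
    new = np.zeros_like(eta)
    for s in range(2):
        for sp in range(2):
            new[s] += P[s, sp] * project(r[s] + gamma * z, eta[sp])
    eta = new
mean_return = eta @ z                     # recovers the usual value function
```

A useful sanity check: the linear-interpolation projection preserves means (when no clipping occurs), so `mean_return` reproduces the ordinary value function exactly, while `eta` additionally carries variance, quantiles, and tail information.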
Structural properties of MDPs
Establishing properties beyond monotonicity
In class, we studied sufficient conditions for the value function and the optimal policy to be monotone. But what about other properties, such as convexity?
A general theory for when such structural properties can be established is presented in
Smith, J.E. and McCardle, K.F. 2002. Structural properties of stochastic dynamic programs. Operations Research 50, 5, 796–809. DOI: 10.1287/opre.50.5.796.365
A related result is the theory of comparative statics, which establishes monotonicity in any model parameter (including the discount factor):
Light, B. 2021. Stochastic comparative statics in Markov decision processes. Mathematics of Operations Research 46, 2, 797–810. DOI: 10.1287/moor.2020.1086
A specific example of this type of result is the set of sufficient conditions under which optimal stopping problems have a “control band” structure: continue as long as the state is within an interval; otherwise, stop.
Oh, S. and Özer, Ö. 2016. Characterizing the structure of optimal stopping policies. Production and Operations Management 25, 11, 1820–1838. DOI: 10.1111/poms.12579
Monotonicity for multi-dimensional MDPs
In class, when discussing monotonicity results, we assumed that the state is one-dimensional. The general theory for the multi-dimensional setting is presented in
Topkis, D.M. 1998. Supermodularity and complementarity. Princeton University Press.
These conditions can be difficult to use. Easier-to-verify conditions (for a specific class of models) are presented in
Papadaki, K. and Powell, W.B. 2007. Monotonicity in multidimensional Markov decision processes for the batch dispatch problem. Operations Research Letters 35, 2, 267–272. DOI: 10.1016/j.orl.2006.03.013
A systematic treatment for queueing models is presented in
Koole, G. 2006. Monotonicity in Markov reward and decision chains: Theory and applications. Foundations and Trends in Stochastic Systems 1, 1, 1–76. DOI: 10.1561/0900000002
Structural results for POMDPs
In class, we only discussed structural results for MDPs. A general theory for POMDPs is presented in Part III of the following book:
Krishnamurthy, V. 2025. Partially observed Markov decision processes: Filtering, learning and controlled sensing. Cambridge University Press, Cambridge.
Multi-armed bandits
Multi-armed bandits1 are a special class of resource allocation problems that admit a low-complexity solution. In general, scheduling problems can be modeled as MDPs, but the state space of such problems is exponential in the number of alternatives; this is sometimes referred to as the curse of dimensionality. A conceptual breakthrough was achieved when Gittins showed that a low-complexity solution exists.
Gittins, J.C. and Jones, D.M. 1974. A dynamic allocation index for the discounted multiarmed bandit problem. In: Progress in statistics. North-Holland, Amsterdam, Netherlands, 241–266.
Gittins, J.C. 1979. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological) 41, 2, 148–177.
A potential term project on this topic is to review the basic result of Gittins (1979). The proof in Gittins (1979) uses an interchange argument, which we did not cover in the course. So the term project could first cover the method of interchange arguments and then explain Gittins’ result using it.
An alternative is to present the dynamic programming proof of Gittins’ result, which was given in
Whittle, P. 1980. Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society. Series B (Methodological) 42, 2, 143–149.
Since then, multiple proofs of Gittins’ result have been presented in the literature. A survey of the four most popular proofs is presented in
Frostig, E. and Weiss, G. 2016. Four proofs of Gittins’ multiarmed bandit theorem. Annals of Operations Research 241, 1, 127–165. DOI: 10.1007/s10479-013-1523-0
Another potential term project is to present the restart-in-state formulation of the multi-armed bandit problem presented in
Katehakis, M.N. and Veinott, A.F. 1987. The multi-armed bandit problem: Decomposition and computation. Mathematics of Operations Research 12, 2, 262–268.
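The restart-in-state idea is easy to sketch: the Gittins index of state s equals (1 − γ) times the value, at s, of an auxiliary MDP in which every state offers the choice of continuing or acting as if the arm were restarted in s. A minimal sketch on a hypothetical three-state arm (transition matrix and rewards are illustrative):

```python
import numpy as np

# Gittins indices via the restart-in-state formulation: for each state s,
# solve an MDP where one may either continue from the current state or act
# as if the arm were reset to s.  The three-state arm is illustrative.
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],     # arm transition matrix
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
r = np.array([2.0, 1.0, 0.0])      # reward earned in each arm state

def gittins_index(s, iters=2000):
    V = np.zeros(len(r))
    for _ in range(iters):
        cont = r + gamma * P @ V           # keep playing from current state
        restart = r[s] + gamma * P[s] @ V  # pretend the arm is reset to s
        V = np.maximum(cont, restart)
    return (1 - gamma) * V[s]

indices = [gittins_index(s) for s in range(len(r))]
```

For this chain the rewards deteriorate along the state space, so the indices come out decreasing, as one would expect; each index requires solving only a single-arm MDP, which is the source of the low complexity.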
An interesting generalization of this idea for multi-armed bandits with commitments is presented in
Cowan, W. and Katehakis, M.N. 2015. Multi-armed bandits under general deprecation and commitment. Probability in the Engineering and Informational Sciences 29, 1, 51–76. DOI: 10.1017/s0269964814000217
Structure-aware reinforcement learning
There are many papers that leverage the structural properties of MDPs to develop “structure-aware” RL algorithms, which in general converge much faster than generic RL algorithms. A non-exhaustive list is given below:
Subramanian, J. and Mahajan, A. 2020. Renewal monte carlo: Renewal theory-based reinforcement learning. IEEE Transactions on Automatic Control 65, 8, 3663–3670. DOI: 10.1109/tac.2019.2953089
Provides an RL algorithm for MDPs in which the Markov chain under the optimal policy has a renewal structure. A term project on this topic could include an overview of renewal theory for stochastic processes and tie it to the results presented in the above paper.
Roy, A. et al. 2022. Online reinforcement learning of optimal threshold policies for Markov decision processes. IEEE Transactions on Automatic Control 67, 7, 3722–3729. DOI: 10.1109/tac.2021.3108121
Provides a variation of actor-critic algorithms for MDPs with a threshold structure.
Jitani, A. et al. 2022. Structure-aware reinforcement learning for node-overload protection in mobile edge computing. IEEE Transactions on Cognitive Communications and Networking 8, 4, 1881–1897. DOI: 10.1109/tccn.2022.3195503
An application of the results of Roy et al. (2022) for node overload protection in mobile edge computing.
Fu, F. and Schaar, M. van der 2012. Structure-aware stochastic control for transmission scheduling. IEEE Transactions on Vehicular Technology 61, 9, 3931–3945. DOI: 10.1109/tvt.2012.2213850
Provides a variation of Q-learning for a slight variation of the power-delay tradeoff model discussed in class.
Kunnumkal, S. and Topaloglu, H. 2008. Exploiting the structural properties of the underlying Markov decision problem in the Q-learning algorithm. INFORMS Journal on Computing 20, 2, 288–301. DOI: 10.1287/ijoc.1070.0240
Proposes a variation of the Q-learning algorithm for monotone MDPs.
1 There are two classes of models that go under the heading of multi-armed bandits. One is the class of models described here. The second is a different class of models that builds on the work of Lai and Robbins,
Lai, T.L. and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6, 1, 4–22. DOI: http://dx.doi.org/10.1016/0196-8858(85)90002-8
which derived regret bounds for such models. These days, the latter class of models is more popular, primarily because it has been used extensively in algorithms for displaying online advertisements.
MDP Algorithms
Stochastic dual dynamic programming (SDDP)
SDDP is an efficient way to solve MDPs with continuous state spaces using ideas from mathematical programming. An accessible overview is presented in
Shapiro, A. 2011. Analysis of stochastic dual dynamic programming method. European Journal of Operational Research 209, 1, 63–72. DOI: 10.1016/j.ejor.2010.08.007
Information relaxation
The idea behind information relaxation is to relax the non-anticipativity constraints (that is, to assume that the future random variables can be observed, at a cost). Such a relaxation provides a bound on the optimal performance: a lower bound for cost minimization or an upper bound for reward maximization. By appropriately tuning the cost of observation, one can obtain tighter bounds. An overview of the main theory is presented in this monograph:
Brown, D.B. and Smith, J.E. 2022. Information relaxations and duality in stochastic dynamic programs: A review and tutorial. Foundations and Trends in Optimization 5, 3, 246–339. DOI: 10.1561/2400000027
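The zero-penalty version of the idea (no cost of observation at all) is easy to demonstrate on a toy reward-maximization problem; everything below is an illustrative example, not from the monograph. A decision maker sees T i.i.d. U[0,1] offers one at a time and may stop at most once; revealing the whole sample path in advance yields the clairvoyant bound E[max], which upper-bounds the value of every non-anticipative stopping rule:

```python
import numpy as np

# Zero-penalty information relaxation on a toy optimal stopping problem:
# T iid U[0,1] offers, stop at most once, collect the current offer.
# Relaxing non-anticipativity (the clairvoyant sees the whole path)
# upper-bounds the value of this maximization problem.  Numbers illustrative.
rng = np.random.default_rng(0)
T, n_paths = 5, 100_000
X = rng.uniform(0.0, 1.0, size=(n_paths, T))

# Exact value of the non-anticipative problem by backward induction:
# v_t = E[max(X, v_{t+1})] = (1 + v_{t+1}**2) / 2 for X ~ U[0, 1].
v = 0.0
for _ in range(T):
    v = 0.5 * (1.0 + v ** 2)

# Perfect-information (clairvoyant) bound: just take the best offer.
bound = X.max(axis=1).mean()              # E[max of T uniforms] = T/(T+1)
gap = bound - v                           # the price of clairvoyance
```

The gap between `bound` and `v` is what the penalty term in the duality theory is designed to close: charging the clairvoyant an appropriate cost for using future information tightens the bound toward the true value.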
Improved MDP algorithms
It is possible to improve value iteration by using bounds on the optimal value function. An instance of this idea is presented in:
Porteus, E.L. 1975. Bounds and transformations for discounted finite Markov decision chains. Operations Research 23, 4, 761–784. DOI: 10.1287/opre.23.4.761
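A sketch of the idea, on an illustrative two-state cost model (the model data are made up for illustration): after each value-iteration sweep, the optimal value function can be sandwiched using the minimum and maximum of the one-step update, which yields a much earlier stopping rule than waiting for the iterates themselves to converge.

```python
import numpy as np

# Value iteration with McQueen--Porteus-style bounds: if delta = TV - V,
# then V* lies (elementwise) between TV + gamma/(1-gamma) * min(delta)
# and TV + gamma/(1-gamma) * max(delta).  Model data are illustrative.
gamma = 0.95
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a, s, s']
              [[0.5, 0.5], [0.6, 0.4]]])
c = np.array([[1.0, 2.0], [1.5, 0.5]])    # per-stage cost c[a, s]

V = np.zeros(2)
for k in range(10_000):
    TV = (c + gamma * np.einsum('aij,j->ai', P, V)).min(axis=0)
    delta = TV - V
    lo = TV + gamma / (1 - gamma) * delta.min()   # lower bound on V*
    hi = TV + gamma / (1 - gamma) * delta.max()   # upper bound on V*
    V = TV
    if (hi - lo).max() < 1e-8:                    # span-based stopping rule
        break
V_star = (lo + hi) / 2                            # midpoint estimate of V*
```

Since the stopping rule depends on the span of `delta` rather than its norm, it can fire long before `V` itself has converged, which is the computational payoff of tracking the bounds.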
Another instance is the reward balancing procedure presented in:
Mustafin, A. et al. 2025. MDP geometry, normalization and reward balancing solvers. International conference on artificial intelligence and statistics, PMLR, 2476–2484. Available at: https://proceedings.mlr.press/v258/mustafin25a.html
Another idea is to use operator splitting from numerical linear algebra to modify value and policy iteration:
Rakhsha, A. et al. 2022. Operator splitting value iteration. Advances in neural information processing systems, Curran Associates, Inc., 38373–38385. Available at: https://papers.nips.cc/paper_files/paper/2022/hash/fa809df3ec53cc5781e5078b7d500a5d-Abstract-Conference.html
Policy gradient algorithms
A variation of policy iteration is to use gradient descent on the policy parameters. The performance objective is usually not convex in the policy parameters, which makes it difficult to establish the convergence of such algorithms. A comprehensive overview is provided in
Agarwal, A. et al. 2020. Optimality and approximation with policy gradient methods in Markov decision processes. Proceedings of thirty third conference on learning theory, PMLR, 64–66. Available at: https://proceedings.mlr.press/v125/agarwal20a.html
This is a long paper; a term project on it could focus on the tabular setting.
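A minimal sketch of that tabular setting, with exact (rather than sampled) gradients on an illustrative two-state model: for a softmax policy with parameters theta(s, a), the gradient of the discounted return is d(s) pi(a|s) A(s, a), where d is the unnormalized discounted state-visitation measure and A the advantage. All model numbers and the step size below are assumptions for illustration.

```python
import numpy as np

# Exact policy gradient ascent with a tabular softmax policy.  The
# gradient of J(theta) = E_rho[V_pi] w.r.t. theta[s, a] is
#     d_pi(s) * pi(a|s) * A_pi(s, a),
# with d_pi the unnormalized discounted visitation.  Model is illustrative.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a, s, s']
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 0.0], [0.0, 1.0]])    # reward r[a, s]
rho = np.array([0.5, 0.5])                # initial state distribution

theta = np.zeros((2, 2))                  # softmax parameters theta[s, a]
for _ in range(20_000):
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    P_pi = np.einsum('sa,asj->sj', pi, P)          # policy transition matrix
    r_pi = np.einsum('sa,as->s', pi, r)            # policy reward
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)   # policy evaluation
    Q = r + gamma * np.einsum('aij,j->ai', P, V)   # Q[a, s]
    A = Q.T - V[:, None]                           # advantage A[s, a]
    d = np.linalg.solve(np.eye(2) - gamma * P_pi.T, rho)  # discounted visitation
    theta += 0.1 * d[:, None] * pi * A             # exact gradient ascent step
J = rho @ V                                        # discounted return achieved
```

Even with exact gradients, convergence is slow (the softmax parameters of the optimal actions must diverge for the policy to become deterministic), which is one reason the convergence analysis in the paper is delicate.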
Algorithms for solving POMDPs
See the remark at the end of the notes on POMDPs.