References
Afshari, M. and Mahajan, A. 2023.
Decentralized linear quadratic systems with major and minor agents and
non-Gaussian noise. IEEE Transactions on Automatic
Control 68, 8, 4666–4681. DOI: 10.1109/tac.2022.3210049.
Altman, E. 1999. Constrained
Markov decision processes. CRC Press. Available at: http://www-sop.inria.fr/members/Eitan.Altman/TEMP/h.pdf.
Altman, E. and Nain, P. 1992. Closed-loop
control with delayed information. ACM SIGMETRICS Performance
Evaluation Review 20, 1, 193–204. DOI: 10.1145/149439.133106.
Arabneydi, J. and Mahajan, A. 2015.
Reinforcement learning in decentralized stochastic control systems with
partial history sharing. 2015 American Control Conference
(ACC), IEEE. DOI: 10.1109/acc.2015.7172192.
Arabneydi, J. and Mahajan, A. 2016.
Linear quadratic mean field teams: Optimal and approximately optimal
decentralized solutions. Available at: https://arxiv.org/abs/1609.00056v2.
Arrow, K.J., Blackwell, D., and Girshick,
M.A. 1949. Bayes and minimax solutions of sequential decision
problems. Econometrica 17, 3/4, 213. DOI: 10.2307/1905525.
Arrow, K.J., Harris, T., and Marschak, J.
1952. Optimal inventory policy. Econometrica 20, 1,
250–272. DOI: 10.2307/1907830.
Arthur, W.B. 1994. Increasing returns
and path dependence in the economy. University of Michigan Press.
DOI: 10.3998/mpub.10029.
Artzrouni, M. 1986. On the convergence of
infinite products of matrices. Linear Algebra and its
Applications 74, 11–21. DOI: 10.1016/0024-3795(86)90112-6.
Asadi, K., Misra, D., and Littman, M.
2018. Lipschitz continuity in model-based reinforcement
learning. Proceedings of the 35th international conference on
machine learning, PMLR, 264–273. Available at: https://proceedings.mlr.press/v80/asadi18a.html.
Åström, K.J. 1970. Introduction to
stochastic control theory. Dover.
Athans, M. 1971. The role and use of the
stochastic linear-quadratic-Gaussian problem in control system design.
IEEE Transactions on Automatic Control
16, 6, 529–552. DOI: 10.1109/tac.1971.1099818.
Bai, C.-Z., Katewa, V., Gupta, V., and Huang,
Y.-F. 2015. A stochastic sensor selection scheme for sequential
hypothesis testing with multiple sensors. IEEE Transactions on
Signal Processing 63, 14, 3687–3699.
Bander, J.L. and White, C.C. 1999. Markov
decision processes with noise-corrupted and delayed state observations.
Journal of the Operational Research Society 50, 6,
660–668. DOI: 10.1057/palgrave.jors.2600745.
Baras, J.S., Dorsey, A.J., and Makowski,
A.M. 1984. Two competing queues with linear costs: The μc-rule is
often optimal. Advances in Applied Probability 16, 1,
8–8. DOI: 10.1017/s000186780002187x.
Bellman, R. 1957. Dynamic
programming. Princeton University Press.
Bellman, R., Glicksberg, I., and Gross,
O. 1955. On the optimal inventory equation. Management
Science 2, 1, 83–104. DOI: 10.1287/mnsc.2.1.83.
Berry, R.A. 2000. Power and delay
trade-offs in fading channels. PhD thesis, Massachusetts Institute of
Technology. Available at: https://dspace.mit.edu/handle/1721.1/9290.
Berry, R.A. 2013. Optimal power-delay
tradeoffs in fading channels—small-delay asymptotics. IEEE
Transactions on Information Theory 59, 6, 3939–3952. DOI:
10.1109/TIT.2013.2253194.
Berry, R.A. and Gallager, R.G. 2002.
Communication over fading channels with delay constraints.
IEEE Transactions on Information Theory
48, 5, 1135–1149. DOI: 10.1109/18.995554.
Berry, R., Modiano, E., and Zafer, M.
2012. Energy-efficient scheduling under delay constraints for wireless
networks. Synthesis Lectures on Communication Networks
5, 2, 1–96. DOI: 10.2200/S00443ED1V01Y201208CNT011.
Bertsekas, D.P. 2011. Dynamic
programming and optimal control. Athena Scientific. Available at:
http://www.athenasc.com/dpbook.html.
Bertsekas, D.P. 2013. Abstract
dynamic programming. Athena Scientific. Available at: https://web.mit.edu/dimitrib/www/abstractdp_MIT.html.
Bertsekas, D.P. and Tsitsiklis, J.N.
1996. Neuro-dynamic programming. Athena Scientific.
Bertsekas, D.P. and Tsitsiklis, J.N.
2000. Gradient convergence in gradient methods with errors.
SIAM Journal on Optimization 10, 3,
627–642. DOI: 10.1137/s1052623497331063.
Bitar, E., Poolla, K., Khargonekar, P.,
Rajagopal, R., Varaiya, P., and Wu, F. 2012. Selling random wind.
Hawaii international conference on system sciences, IEEE,
1931–1937.
Blackwell, D. 1964. Memoryless strategies
in finite-stage dynamic programming. The Annals of Mathematical
Statistics 35, 2, 863–865. DOI: 10.1214/aoms/1177703586.
Blackwell, D. 1965. Discounted dynamic
programming. The Annals of Mathematical Statistics 36,
1, 226–235. DOI: 10.1214/aoms/1177700285.
Blackwell, D. 1970. On stationary
policies. Journal of the Royal Statistical Society. Series A
(General) 133, 1, 33. DOI: 10.2307/2343810.
Blum, J.R. 1954. Multidimensional
stochastic approximation methods. The Annals of Mathematical
Statistics 25, 4, 737–744. DOI: 10.1214/aoms/1177728659.
Bogdan, K. and Więcek, M. 2022.
Burkholder inequality by Bregman divergence. Available at: http://arxiv.org/pdf/2103.06358v3.
Bohlin, T. 1970. Information pattern for
linear discrete-time models with stochastic coefficients. IEEE
Transactions on Automatic Control 15, 1, 104–106.
Borkar, V.S. 2008. Stochastic
approximation. Hindustan Book Agency. DOI: 10.1007/978-93-86279-38-5.
Borkar, V.S. and Meyn, S.P. 2000. The
O.D.E. method for convergence of stochastic approximation and
reinforcement learning. SIAM Journal on Control and
Optimization 38, 2, 447–469. DOI: 10.1137/s0363012997331639.
Bozkurt, B., Mahajan, A., Nayyar, A., and
Ouyang, Y. 2023. Weighted norm bounds in MDPs with unbounded
per-step cost.
Burda, Y., Edwards, H., Storkey, A., and Klimov,
O. 2019. Exploration by random network distillation.
International conference on learning representations. Available
at: https://openreview.net/forum?id=H1lJJnR5Ym.
Burkholder, D.L. 1966. Martingale
transforms. The Annals of Mathematical Statistics 37,
6, 1494–1504. DOI: 10.1214/aoms/1177699141.
Buyukkoc, C., Varaiya, P., and Walrand,
J. 1985. The cμ rule revisited. Advances in Applied
Probability 17, 1, 237–238. DOI: 10.2307/1427064.
Cassandra, A., Littman, M.L., and Zhang,
N.L. 1997. Incremental pruning: A simple, fast, exact method for
partially observable Markov decision processes.
Proceedings of the thirteenth conference on uncertainty
in artificial intelligence.
Cassandra, A.R., Kaelbling, L.P., and Littman,
M.L. 1994. Acting optimally in partially observable stochastic
domains. AAAI, 1023–1028.
Chakravorty, J. and Mahajan, A. 2018.
Sufficient conditions for the value function and optimal strategy to be
even and quasi-convex. IEEE Transactions on Automatic Control
63, 11, 3858–3864. DOI: 10.1109/TAC.2018.2800796.
Chang, J.T. 2007. Stochastic processes.
Available at: http://www.stat.yale.edu/~pollard/Courses/251.spring2013/Handouts/Chang-notes.pdf.
Chen, H.-F. and Guo, L. 1991.
Identification and stochastic adaptive control. Birkhäuser
Boston. DOI: 10.1007/978-1-4612-0429-9.
Chen, X. 2017. L♮-convexity and
its applications in operations. Frontiers of Engineering
Management 4, 3, 283. DOI: 10.15302/j-fem-2017057.
Cheng, H.-T. 1988. Algorithms for
partially observable Markov decision processes. PhD thesis, University
of British Columbia, Vancouver, BC.
Daley, D.J. 1968. Stochastically monotone
Markov chains. Zeitschrift für
Wahrscheinlichkeitstheorie und verwandte Gebiete 10, 4,
305–317. DOI: 10.1007/BF00531852.
Davis, M.H.A. 1979. Martingale methods in
stochastic control. In: Stochastic control theory and stochastic
differential systems. Springer-Verlag, 85–117. DOI: 10.1007/bfb0009377.
Davis, M.H.A. and Varaiya, P.P. 1972.
Information states for linear stochastic systems. Journal of
Mathematical Analysis and Applications 37, 2, 384–402.
DeGroot, M. 1970. Optimal statistical
decisions. Wiley-Interscience, Hoboken, N.J.
Dellacherie, C. and Meyer, P.-A. 1982.
Probabilities and potential B: Theory of
martingales. North-Holland Mathematical Studies.
Devlin, S. 2014. Potential based reward
shaping tutorial. Available at: http://www-users.cs.york.ac.uk/~devlin/presentations/pbrs-tut.pdf.
Devlin, S. and Kudenko, D. 2012. Dynamic
potential-based reward shaping. Proceedings of the 11th
international conference on autonomous agents and multiagent
systems, International Foundation for Autonomous Agents; Multiagent
Systems, 433–440.
Dibangoye, J.S., Amato, C., Buffet, O., and
Charpillet, F. 2016. Optimally solving dec-POMDPs as
continuous-state MDPs. Journal of Artificial Intelligence
Research 55, 443–497. DOI: 10.1613/jair.4623.
Ding, N., Sadeghi, P., and Kennedy, R.A.
2016. On monotonicity of the optimal transmission policy in cross-layer
adaptive M-QAM modulation.
IEEE Transactions on Communications 64, 9, 3771–3785.
DOI: 10.1109/TCOMM.2016.2590427.
Doob, J.L. 1971. What is a martingale?
The American Mathematical Monthly 78, 5, 451. DOI: 10.2307/2317751.
Dorato, P. and Levis, A. 1971. Optimal
linear regulators: The discrete-time case. IEEE
Transactions on Automatic Control 16, 6, 613–620. DOI: 10.1109/tac.1971.1099832.
Dubins, L.E. and Savage, L.J. 2014.
How to gamble if you must: Inequalities for stochastic
processes. Dover Publications.
Durrett, R. 2019. Probability: Theory
and examples. Cambridge University Press. DOI: 10.1017/9781108591034.
Dutta, M. and Singh, R. 2024. Optimal
risk-sensitive scheduling policies for remote estimation of
autoregressive Markov processes. Available at: http://arxiv.org/pdf/2403.13898v1.
Dvoretzky, A., Kiefer, J., and Wolfowitz,
J. 1953. On the optimal character of the (s, S) policy in
inventory theory. Econometrica 21, 4, 586. DOI: 10.2307/1907924.
Edgeworth, F.Y. 1888. The mathematical
theory of banking. Journal of the Royal Statistical Society
51, 1, 113–127. Available at: https://www.jstor.org/stable/2979084.
Elliott, R., Li, X., and Ni, Y.-H. 2013.
Discrete time mean-field stochastic linear-quadratic optimal control
problems. Automatica 49, 11, 3222–3233. DOI: 10.1016/j.automatica.2013.08.017.
Ellis, R.S. 1985. Entropy, large
deviations, and statistical mechanics. Springer New York. DOI: 10.1007/978-1-4613-8533-2.
Feinberg, E.A. 2005. On essential
information in sequential decision processes. Mathematical Methods
of Operations Research 62, 3, 399–410. DOI: 10.1007/s00186-005-0035-3.
Feinberg, E.A. 2016. Optimality
conditions for inventory control. In: Optimization challenges in
complex, networked and risky systems. INFORMS, 14–45. DOI: 10.1287/educ.2016.0145.
Feinberg, E.A. and He, G. 2020.
Complexity bounds for approximately solving discounted MDPs
by value iterations. Operations Research Letters. DOI: 10.1016/j.orl.2020.07.001.
Ferguson, T.S. 1989. Who solved the
secretary problem? Statistical Science, 282–289.
Ferguson, T.S. and Gilstein, C.Z. 2004.
Optimal investment policies for the horse race model. Available at: https://www.math.ucla.edu/~tom/papers/unpublished/Zach2.pdf.
Föllmer, H. and Schied, A. 2010. Convex
risk measures. In: Encyclopedia of quantitative finance.
John Wiley &amp; Sons. DOI: 10.1002/9780470061602.eqf15003.
Freeman, P.R. 1983. The secretary problem
and its extensions: A review. International Statistical Review /
Revue Internationale de Statistique 51, 2, 189. DOI: 10.2307/1402748.
Fu, F. and Schaar, M. van der. 2012.
Structure-aware stochastic control for transmission scheduling. IEEE
Transactions on Vehicular Technology 61, 9, 3931–3945.
DOI: 10.1109/tvt.2012.2213850.
Fu, M.C. 2018. Monte Carlo tree search: A
tutorial. 2018 winter simulation conference (WSC), IEEE. DOI:
10.1109/wsc.2018.8632344.
Gao, S. and Mahajan, A. 2022. Optimal
control of network-coupled subsystems: Spectral decomposition and
low-dimensional solutions. IEEE Transactions on Control
of Network Systems 9, 2, 657–669. DOI: 10.1109/tcns.2021.3124259.
Geiss, S. and Scheutzow, M. 2021.
Sharpness of Lenglart’s domination
inequality and a sharp monotone version. Electronic Communications
in Probability 26, 1–8. DOI: 10.1214/21-ECP413.
Geist, M., Scherrer, B., and Pietquin, O.
2019. A theory of regularized Markov decision processes.
Proceedings of the 36th international conference on machine
learning, PMLR, 2160–2169. Available at: https://proceedings.mlr.press/v97/geist19a.html.
Gelada, C., Kumar, S., Buckman, J., Nachum, O.,
and Bellemare, M.G. 2019. DeepMDP:
Learning continuous latent space models for representation learning.
Proceedings of the 36th international conference on machine
learning, PMLR, 2170–2179. Available at: http://proceedings.mlr.press/v97/gelada19a.html.
Gladyshev, E.G. 1965. On stochastic
approximation. Theory of Probability and Its Applications
10, 2, 275–278. DOI: 10.1137/1110031.
Grzes, M. and Kudenko, D. 2009.
Theoretical and empirical analysis of reward shaping in reinforcement
learning. International conference on machine learning and
applications, 337–344. DOI: 10.1109/ICMLA.2009.33.
Hardy, G.H., Littlewood, J.E., and Pólya,
G. 1952. Inequalities. Cambridge University Press.
Harris, F.W. 1913. How many parts to make
at once. The Magazine of Management 10, 2, 135–152.
DOI: 10.1287/opre.38.6.947.
Hay, N., Russell, S., Tolpin, D., and Shimony,
S.E. 2012. Selecting computations: Theory and applications.
UAI. Available at: http://www.auai.org/uai2012/papers/123.pdf.
Hernández-Hernández, D. and Marcus, S.I.
1996. Risk sensitive control of Markov processes in countable state
space. Systems & Control Letters 29,
3, 147–155. DOI: 10.1016/s0167-6911(96)00051-5.
Hernández-Hernández, D. 1999. Existence
of risk-sensitive optimal stationary policies for controlled Markov
processes. Applied Mathematics and Optimization 40, 3,
273–285. DOI: 10.1007/s002459900126.
Hernández-Lerma, O. and Lasserre, J.B.
1996. Discrete-time Markov control processes. Springer New
York. DOI: 10.1007/978-1-4612-0729-0.
Hernández-Lerma, O. and Lasserre, J.B.
1999. Further topics on discrete-time Markov control processes.
Springer New York. DOI: 10.1007/978-1-4612-0561-6.
Hinderer, K. 2005. Lipschitz continuity
of value functions in Markovian decision processes.
Mathematical Methods of Operations Research 62, 1,
3–22. DOI: 10.1007/s00186-005-0438-1.
Hopcroft, J. and Kannan, R. 2012.
Computer science theory for the information age. Available at: https://www.cs.cmu.edu/~venkatg/teaching/CStheory-infoage/hopcroft-kannan-feb2012.pdf.
Howard, R.A. 1960. Dynamic
programming and Markov processes. The M.I.T. Press.
Howard, R.A. and Matheson, J.E. 1972.
Risk-sensitive Markov decision processes. Management Science
18, 7, 356–369. DOI: 10.1287/mnsc.18.7.356.
Jenner, E., Hoof, H. van, and Gleave, A.
2022. Calculus on MDPs: Potential shaping as a gradient. Available at:
https://arxiv.org/abs/2208.09570v1.
Kalman, R.E. 1960. Contributions to the
theory of optimal control. Boletin de la Sociedad Matematica
Mexicana 5, 102–119.
Karatzas, I. and Sudderth, W.D. 2010. Two
characterizations of optimality in dynamic programming. Applied
Mathematics and Optimization 61, 3, 421–434. DOI: 10.1007/s00245-009-9093-x.
Keilson, J. and Kester, A. 1977. Monotone
matrices and monotone Markov processes. Stochastic Processes and
their Applications 5, 3, 231–241.
Kelly, J.L., Jr. 1956. A new
interpretation of information rate. Bell System Technical
Journal 35, 4, 917–926. DOI: 10.1002/j.1538-7305.1956.tb03809.x.
Kennerly, S. 2011. A graphical derivation
of the Legendre transform. Available at: http://einstein.drexel.edu/~skennerly/maths/Legendre.pdf.
Koole, G. 2006. Monotonicity in Markov
reward and decision chains: Theory and applications. Foundations and
Trends in Stochastic Systems 1, 1, 1–76. DOI:
10.1561/0900000002.
Kuhn, H.W. 1950. Extensive games.
Proceedings of the National Academy of Sciences 36,
10, 570–576. DOI: 10.1073/pnas.36.10.570.
Kuhn, H.W. 1953. Extensive games and the
problem of information. In: H.W. Kuhn and A.W. Tucker, eds.,
Contributions to the theory of games. Princeton University
Press, 193–216.
Kumar, P.R. and Varaiya, P. 1986.
Stochastic systems: Estimation, identification and adaptive
control. Prentice Hall.
Kunnumkal, S. and Topaloglu, H. 2008.
Exploiting the structural properties of the underlying Markov decision
problem in the Q-learning algorithm. INFORMS Journal on
Computing 20, 2, 288–301. DOI: 10.1287/ijoc.1070.0240.
Kushner, H.J. and Yin, G.G. 1997.
Stochastic approximation algorithms and applications. Springer
New York. DOI: 10.1007/978-1-4899-2696-8.
Kwakernaak, H. 1965. Theory of
self-adaptive control systems. In: Springer, 14–18.
Lai, T.L. 2003. Stochastic approximation:
Invited paper. The Annals of Statistics 31, 2. DOI: 10.1214/aos/1051027873.
Lenglart, É. 1977. Relation de domination
entre deux processus. Annales de l’Institut Henri
Poincaré, Section B: Calcul des probabilités
et statistiques, 171–179.
Levy, H. 1992. Stochastic dominance and
expected utility: Survey and analysis. Management Science
38, 4, 555–593. DOI: 10.1287/mnsc.38.4.555.
Levy, H. 2015. Stochastic dominance:
Investment decision making under uncertainty. Springer. DOI: 10.1007/978-3-319-21708-6.
Lewis, F.L., Vrabie, D., and Syrmos, V.L.
2012. Optimal control. John Wiley & Sons.
Lindley, D.V. 1961. Dynamic programming
and decision theory. Applied Statistics 10, 1, 39.
DOI: 10.2307/2985407.
Lu, X., Roy, B.V., Dwaracherla, V., Ibrahimi,
M., Osband, I., and Wen, Z. 2023. Reinforcement learning, bit by
bit. Foundations and Trends in Machine Learning
16, 6, 733–865. DOI: 10.1561/2200000097.
Mahajan, A. 2008. Sequential
decomposition of sequential dynamic teams: Applications to real-time
communication and networked control systems. PhD thesis, University of
Michigan, Ann Arbor, MI.
Mahajan, A., Niculescu, S.-I., and Vidyasagar,
M. 2024. A vector almost-sure supermartingale theorem and its
applications. In: IEEE conference on decision and control.
IEEE.
Marshall, A.W., Olkin, I., and Arnold,
B.C. 2011. Inequalities: Theory of majorization and its
applications. Springer New York. DOI: 10.1007/978-0-387-68276-1.
Mazliak, L. and Shafer, G., eds. 2022.
The splendors and miseries of martingales: Their history from the
casino to mathematics. Springer International Publishing. DOI: 10.1007/978-3-031-05988-9.
Morse, P. and Kimball, G. 1951.
Methods of operations research. Technology Press of MIT.
Müller, A. 1997a. Integral probability
metrics and their generating classes of functions. Advances in
Applied Probability 29, 2, 429–443. DOI: 10.2307/1428011.
Müller, A. 1997b. How does the value
function of a Markov decision process depend on the transition
probabilities? Mathematics of Operations Research 22,
4, 872–885. DOI: 10.1287/moor.22.4.872.
Murota, K. 1998. Discrete convex
analysis. Mathematical Programming 83, 1–3, 313–371.
DOI: 10.1007/bf02680565.
Nain, P., Tsoucas, P., and Walrand, J.
1989. Interchange arguments in stochastic scheduling. Journal of
Applied Probability 26, 4, 815–826. DOI: 10.2307/3214386.
Nerode, A. 1958. Linear automaton
transformations. Proceedings of the American Mathematical
Society 9, 541–544.
Neveu, J. 1975. Discrete parameter
martingales. North Holland.
Ng, A.Y., Harada, D., and Russell, S.
1999. Policy invariance under reward transformations: Theory and
application to reward shaping. ICML, 278–287. Available at: http://aima.eecs.berkeley.edu/~russell/papers/icml99-shaping.pdf.
Norris, J.R. 1998. Markov
chains. Cambridge University Press.
Oh, S. and Özer, Ö. 2016. Characterizing
the structure of optimal stopping policies. Production and
Operations Management 25, 11, 1820–1838. DOI: 10.1111/poms.12579.
Picard, J. 2007. Concentration
inequalities and model selection. Springer Berlin Heidelberg. DOI:
10.1007/978-3-540-48503-2.
Piunovskiy, A.B. 2011. Examples in
Markov decision processes. Imperial College Press. DOI: 10.1142/p809.
Pollard, D. 2002. A user’s guide to
measure theoretic probability. Cambridge University Press.
Pomatto, L., Strack, P., and Tamuz, O.
2020. Stochastic dominance under independent noise. Journal of
Political Economy 128, 5, 1877–1900. DOI: 10.1086/705555.
Porteus, E.L. 1975. Bounds and
transformations for discounted finite Markov decision chains.
Operations Research 23, 4, 761–784. DOI: 10.1287/opre.23.4.761.
Porteus, E.L. 2008. Building intuition:
Insights from basic operations management models and principles. In: D.
Chhajed and T.J. Lowe, eds., Springer, 115–134. DOI: 10.1007/978-0-387-73699-0.
Puterman, M.L. 2014. Markov decision
processes: Discrete stochastic dynamic programming. John Wiley
& Sons. DOI: 10.1002/9780470316887.
Qin, Y., Cao, M., and Anderson, B.D.O.
2020. Lyapunov criterion for stochastic systems and its applications in
distributed computation. IEEE Transactions on Automatic
Control 65, 2, 546–560. DOI: 10.1109/tac.2019.2910948.
Rachelson, E. and Lagoudakis, M.G. 2010.
On the locality of action domination in sequential decision making.
Proceedings of 11th international symposium on artificial
intelligence and mathematics. Available at: https://oatao.univ-toulouse.fr/17977/.
Rachev, S.T. 1991. Probability
metrics and the stability of stochastic models. Wiley, New York.
Rigollet, P. 2015. High-dimensional
statistics. Available at: https://ocw.mit.edu/courses/mathematics/18-s997-high-dimensional-statistics-spring-2015/lecture-notes/.
Riis, J.O. 1965. Discounted
Markov programming in a periodic process. Operations
Research 13, 6, 920–929. DOI: 10.1287/opre.13.6.920.
Rivasplata, O. 2012. Subgaussian random
variables: An expository note. Available at: http://stat.cmu.edu/~arinaldo/36788/subgaussians.pdf.
Robbins, H. and Monro, S. 1951. A
stochastic approximation method. The Annals of Mathematical
Statistics 22, 3, 400–407. DOI: 10.1214/aoms/1177729586.
Robbins, H. and Siegmund, D. 1971. A
convergence theorem for non-negative almost supermartingales and some
applications. In: Optimizing methods in statistics. Elsevier,
233–257. DOI: 10.1016/b978-0-12-604550-5.50015-8.
Rockafellar, R.T. and Wets, R.J.-B. 2009.
Variational analysis. Springer Science & Business Media.
Ross, S.M. 1974. Dynamic programming and
gambling models. Advances in Applied Probability 6, 3,
593–606. DOI: 10.2307/1426236.
Roy, A., Borkar, V., Karandikar, A., and
Chaporkar, P. 2022. Online reinforcement learning of optimal
threshold policies for Markov decision processes. IEEE
Transactions on Automatic Control 67, 7, 3722–3729. DOI:
10.1109/tac.2021.3108121.
Saldi, N., Linder, T., and Yüksel, S.
2018. Finite approximations in discrete-time stochastic
control. Springer International Publishing. DOI: 10.1007/978-3-319-79033-6.
Sandell, N.R., Jr. 1974. Control of
finite-state, finite-memory stochastic systems. PhD thesis,
Massachusetts Institute of Technology, Cambridge, MA.
Sayedana, B. and Mahajan, A. 2020.
Counterexamples on the monotonicity of delay optimal strategies for
energy harvesting transmitters. IEEE Wireless
Communications Letters, 1–1. DOI: 10.1109/lwc.2020.2981066.
Sayedana, B., Mahajan, A., and Yeh, E.
2020. Cross-layer communication over fading channels with adaptive
decision feedback. International symposium on modeling and
optimization in mobile, ad hoc, and wireless networks (WiOPT), 1–8.
Scarf, H. 1960. Mathematical methods in
the social sciences. In: K.J. Arrow, S. Karlin, and P. Suppes, eds., Stanford
University Press, Stanford, CA, 49–56. Available at: http://dido.wss.yale.edu/~hes/pub/ss-policies.pdf.
Scherrer, B. 2016. On periodic Markov
decision processes. Available at: https://ewrl.files.wordpress.com/2016/12/scherrer.pdf.
Serfozo, R.F. 1976. Monotone optimal
policies for Markov decision processes. In: Mathematical programming
studies. Springer Berlin Heidelberg, 202–215. DOI: 10.1007/bfb0120752.
Shebrawi, K. and Albadawi, H. 2012.
Trace inequalities for matrices. Bulletin of the Australian
Mathematical Society 87, 1, 139–148. DOI: 10.1017/s0004972712000627.
Shwartz, A. 2001. Death and discounting.
IEEE Transactions on Automatic Control
46, 4, 644–647. DOI: 10.1109/9.917668.
Simon, H.A. 1956. Dynamic programming
under uncertainty with a quadratic criterion function.
Econometrica 24, 1, 74–81. DOI: 10.2307/1905261.
Singh, S.P. and Yee, R.C. 1994. An upper
bound on the loss from approximate optimal-value functions. Machine
Learning 16, 3, 227–233. DOI: 10.1007/bf00993308.
Sinha, A. and Mahajan, A. 2024. On the
sensitivity of restless bandit solutions to uncertainty in the model of
the arms.
Skinner, B.F. 1938. Behavior of
organisms. Appleton-Century.
Smallwood, R.D. and Sondik, E.J. 1973.
The optimal control of partially observable Markov processes over a
finite horizon. Operations Research 21, 5, 1071–1088.
DOI: 10.1287/opre.21.5.1071.
Smith, J.E. and McCardle, K.F. 2002.
Structural properties of stochastic dynamic programs. Operations
Research 50, 5, 796–809. DOI: 10.1287/opre.50.5.796.365.
Stout, W.F. 1974. Almost sure
convergence. Academic Press.
Striebel, C. 1965. Sufficient statistics
in the optimal control of stochastic systems. Journal of
Mathematical Analysis and Applications 12, 576–592.
Strusevich, V.A. and Rustogi, K. 2016.
Pairwise interchange argument and priority rules. In: Scheduling
with time-changing effects and rate-modifying activities. Springer
International Publishing, 19–36. DOI: 10.1007/978-3-319-39574-6_2.
Subramanian, J., Sinha, A., Seraj, R., and
Mahajan, A. 2022. Approximate information state for approximate
planning and reinforcement learning in partially observed systems.
Journal of Machine Learning Research 23, 12, 1–83.
Available at: http://jmlr.org/papers/v23/20-1165.html.
Sutton, R.S. and Barto, A.G. 2018.
Reinforcement learning: An introduction. MIT Press.
Taylor, H.M. 1967. Evaluating a call
option and optimal timing strategy in the stock market. Management
Science 14, 1, 111–120. Available at: http://www.jstor.org/stable/2628546.
Theil, H. 1954. Econometric models and
welfare maximization. Weltwirtschaftliches Archiv 72,
60–83. DOI: 10.1007/978-94-011-2410-2_1.
Theil, H. 1957. A note on certainty
equivalence in dynamic planning. Econometrica, 346–349. DOI: 10.1007/978-94-011-2410-2_3.
Topkis, D.M. 1998. Supermodularity
and complementarity. Princeton University Press.
Trench, W.F. 1999. Invertibly convergent
infinite products of matrices. Journal of Computational and Applied
Mathematics 101, 1–2, 255–263. DOI: 10.1016/s0377-0427(98)00191-5.
Tsitsiklis, J.N. 1984. Periodic review
inventory systems with continuous demand and discrete order sizes.
Management Science 30, 10, 1250–1254. DOI: 10.1287/mnsc.30.10.1250.
Tsitsiklis, J.N. and Roy, B. van. 1996.
Feature-based methods for large scale dynamic programming. Machine
Learning 22, 1-3, 59–94. DOI: 10.1007/bf00114724.
Urgaonkar, R., Wang, S., He, T., Zafer, M.,
Chan, K., and Leung, K.K. 2015. Dynamic service migration and
workload scheduling in edge-clouds. Performance Evaluation
91, 205–228. DOI: 10.1016/j.peva.2015.06.013.
Veinott, A.F. 1965. The optimal inventory
policy for batch ordering. Operations Research 13, 3,
424–432. DOI: 10.1287/opre.13.3.424.
Veinott, A.F., Jr. 1966. On the optimality
of (s, S) inventory policies: New conditions and a new proof. SIAM
Journal on Applied Mathematics 14, 5, 1067–1083. DOI: 10.1137/0114086.
Vidyasagar, M. 2023. Convergence of
stochastic approximation via martingale and converse
Lyapunov methods. Mathematics of Control, Signals, and
Systems 35, 2, 351–374. DOI: 10.1007/s00498-023-00342-9.
Villani, C. 2008. Optimal
transport: Old and new. Springer.
Wainwright, M.J. 2019.
High-dimensional statistics. Cambridge University Press. DOI:
10.1017/9781108627771.
Wald, A. 1945. Sequential tests of
statistical hypotheses. The Annals of Mathematical Statistics
16, 2, 117–186. DOI: 10.1214/aoms/1177731118.
Wald, A. and Wolfowitz, J. 1948. Optimum
character of the sequential probability ratio test. The Annals of
Mathematical Statistics 19, 3, 326–339. DOI: 10.1214/aoms/1177730197.
Walrand, J. 1988. An introduction to
queueing networks. Prentice Hall.
Wang, S., Urgaonkar, R., Zafer, M., He, T.,
Chan, K., and Leung, K.K. 2019. Dynamic service migration in
mobile edge computing based on Markov decision process.
IEEE/ACM Transactions on Networking
27, 3, 1272–1288. DOI: 10.1109/tnet.2019.2916577.
Whitin, T.M. 1953. The theory of
inventory management. Princeton University Press.
Whitt, W. 1978. Approximations of dynamic
programs, I. Mathematics of Operations Research
3, 3, 231–243. DOI: 10.1287/moor.3.3.231.
Whitt, W. 1979. Approximations of dynamic
programs, II. Mathematics of Operations Research
4, 2, 179–185. DOI: 10.1287/moor.4.2.179.
Whittle, P. 1982. Optimization over
time: Dynamic programming and stochastic control. Vol. 1 and 2.
Wiley.
Whittle, P. 1996. Optimal control:
Basics and beyond. Wiley.
Whittle, P. 2002. Risk sensitivity,
a strangely pervasive concept. Macroeconomic
Dynamics 6, 1, 5–18. DOI: 10.1017/s1365100502027025.
Whittle, P. and Komarova, N. 1988. Policy
improvement and the Newton-Raphson algorithm. Probability in the
Engineering and Informational Sciences 2, 2, 249–255. DOI:
10.1017/s0269964800000760.
Wiewiora, E. 2003. Potential-based
shaping and Q-value initialization are equivalent. Journal of
Artificial Intelligence Research 19, 1, 205–208.
Witsenhausen, H.S. 1969. Inequalities for
the performance of suboptimal uncertain systems. Automatica
5, 4, 507–512. DOI: 10.1016/0005-1098(69)90112-5.
Witsenhausen, H.S. 1970. On performance
bounds for uncertain systems. SIAM Journal on Control
8, 1, 55–89. DOI: 10.1137/0308004.
Witsenhausen, H.S. 1973. A standard form
for sequential stochastic control. Mathematical Systems Theory
7, 1, 5–11. DOI: 10.1007/bf01824800.
Witsenhausen, H.S. 1975. On policy
independence of conditional expectation. Information and
Control 28, 65–75.
Witsenhausen, H.S. 1976. Some remarks on
the concept of state. In: Y.C. Ho and S.K. Mitter, eds., Directions
in large-scale systems. Plenum, 69–75.
Witsenhausen, H.S. 1979. On the structure
of real-time source coders. Bell System Technical Journal
58, 6, 1437–1451.
Wittenmark, B., Åström, K.J., and Årzén,
K.-E. 2002. Computer control: An overview. In: IFAC
professional brief. IFAC. Available at: https://www.ifac-control.org/publications/list-of-professional-briefs/pb_wittenmark_etal_final.pdf.
Wonham, W.M. 1968. On a matrix Riccati
equation of stochastic control. SIAM Journal on Control
6, 4, 681–697. DOI: 10.1137/0306044.
Woodall, W.H. and Reynolds, M.R. 1983. A
discrete Markov chain representation of the sequential probability ratio
test. Communications in Statistics. Part C: Sequential Analysis
2, 1, 27–44. DOI: 10.1080/07474948308836025.
Yang, Z.P. and Feng, X.X. 2002. A note on
the trace inequality for products of Hermitian matrix power. JIPAM.
Journal of Inequalities in Pure &amp; Applied Mathematics
3, 5, Paper No. 78, 12 pp. Available at: http://eudml.org/doc/123245.
Yeh, E.M. 2012. Fundamental performance
limits in cross-layer wireless optimization: Throughput, delay, and
energy. Foundations and Trends in Communications and Information
Theory 9, 1, 1–112. DOI: 10.1561/0100000014.
Zhang, H. 2009. Partially observable
Markov decision processes: A geometric technique and
analysis. Operations Research.
Zhang, N. and Liu, W. 1996. Planning
in stochastic domains: Problem characteristics and approximation.
Hong Kong University of Science and Technology.
Zheng, Y.-S. and Federgruen, A. 1991.
Finding optimal (s, S) policies is about as simple as evaluating a
single policy. Operations Research 39, 4, 654–665.
DOI: 10.1287/opre.39.4.654.
Zipkin, P.H. 2000. Foundations of
inventory management. McGraw-Hill.
Zolotarev, V.M. 1984. Probability
metrics. Theory of Probability & Its Applications
28, 2, 278–302. DOI: 10.1137/1128025.