References
Adell, J.A. and Jodrá, P. 2006. Exact
Kolmogorov and total variation distances between some familiar discrete
distributions. Journal of Inequalities and Applications
2006, 1–8. DOI: 10.1155/jia/2006/64307.
Afshari, M. and Mahajan, A. 2023.
Decentralized linear quadratic systems with major and minor agents and
non-Gaussian noise. IEEE Transactions on Automatic
Control 68, 8, 4666–4681. DOI: 10.1109/tac.2022.3210049.
Altman, E. 1999. Constrained
Markov decision processes. CRC Press. Available at: http://www-sop.inria.fr/members/Eitan.Altman/TEMP/h.pdf.
Altman, E. and Nain, P. 1992. Closed-loop
control with delayed information. ACM SIGMETRICS Performance
Evaluation Review 20, 1, 193–204. DOI: 10.1145/149439.133106.
Arabneydi, J. and Mahajan, A. 2015.
Reinforcement learning in decentralized stochastic control systems with
partial history sharing. 2015 American control conference
(ACC), IEEE. DOI: 10.1109/acc.2015.7172192.
Arabneydi, J. and Mahajan, A. 2016.
Linear quadratic mean field teams: Optimal and approximately optimal
decentralized solutions. Available at: https://arxiv.org/abs/1609.00056v2.
Arrow, K.J., Blackwell, D., and Girshick,
M.A. 1949. Bayes and minimax solutions of sequential decision
problems. Econometrica 17, 3/4, 213. DOI: 10.2307/1905525.
Arrow, K.J., Harris, T., and Marschak, J.
1952. Optimal inventory policy. Econometrica 20, 1,
250–272. DOI: 10.2307/1907830.
Arthur, W.B. 1994. Increasing returns
and path dependence in the economy. University of Michigan Press.
DOI: 10.3998/mpub.10029.
Artzrouni, M. 1986. On the convergence of
infinite products of matrices. Linear Algebra and its
Applications 74, 11–21. DOI: 10.1016/0024-3795(86)90112-6.
Asadi, K., Misra, D., and Littman, M.
2018. Lipschitz continuity in model-based reinforcement
learning. Proceedings of the 35th international conference on
machine learning, PMLR, 264–273. Available at: https://proceedings.mlr.press/v80/asadi18a.html.
Åström, K.J. 1970. Introduction to
stochastic control theory. Dover.
Athans, M. 1971. The role and use of the
stochastic linear-quadratic-Gaussian problem in control system design.
IEEE Transactions on Automatic Control
16, 6, 529–552. DOI: 10.1109/tac.1971.1099818.
Bai, C.-Z., Katewa, V., Gupta, V., and Huang,
Y.-F. 2015. A stochastic sensor selection scheme for sequential
hypothesis testing with multiple sensors. IEEE Transactions on
Signal Processing 63, 14, 3687–3699.
Bander, J.L. and White, C.C. 1999. Markov
decision processes with noise-corrupted and delayed state observations.
Journal of the Operational Research Society 50, 6,
660–668. DOI: 10.1057/palgrave.jors.2600745.
Baras, J.S., Dorsey, A.J., and Makowski,
A.M. 1984. Two competing queues with linear costs: The μc-rule is
often optimal. Advances in Applied Probability 16, 1,
8–8. DOI: 10.1017/s000186780002187x.
Bellman, R. 1957. Dynamic
programming. Princeton University Press.
Bellman, R., Glicksberg, I., and Gross,
O. 1955. On the optimal inventory equation. Management
Science 2, 1, 83–104. DOI: 10.1287/mnsc.2.1.83.
Berry, R.A. 2000. Power and delay
trade-offs in fading channels. PhD thesis, Massachusetts Institute of
Technology. Available at: https://dspace.mit.edu/handle/1721.1/9290.
Berry, R.A. 2013. Optimal power-delay
tradeoffs in fading channels—small-delay asymptotics. IEEE
Transactions on Information Theory 59, 6, 3939–3952. DOI:
10.1109/TIT.2013.2253194.
Berry, R.A. and Gallager, R.G. 2002.
Communication over fading channels with delay constraints.
IEEE Transactions on Information Theory
48, 5, 1135–1149. DOI: 10.1109/18.995554.
Berry, R., Modiano, E., and Zafer, M.
2012. Energy-efficient scheduling under delay constraints for wireless
networks. Synthesis Lectures on Communication Networks
5, 2, 1–96. DOI: 10.2200/S00443ED1V01Y201208CNT011.
Bertsekas, D.P. 2011. Dynamic
programming and optimal control. Athena Scientific. Available at:
http://www.athenasc.com/dpbook.html.
Bertsekas, D.P. 2013. Abstract
dynamic programming. Athena Scientific, Belmont, MA. Available at: https://web.mit.edu/dimitrib/www/abstractdp_MIT.html.
Bertsekas, D.P. and Tsitsiklis, J.N.
1996. Neuro-dynamic programming. Athena Scientific.
Bertsekas, D.P. and Tsitsiklis, J.N.
2000. Gradient convergence in gradient methods with errors.
SIAM Journal on Optimization 10, 3,
627–642. DOI: 10.1137/s1052623497331063.
Bitar, E., Poolla, K., Khargonekar, P.,
Rajagopal, R., Varaiya, P., and Wu, F. 2012. Selling random wind.
Hawaii international conference on system sciences, IEEE,
1931–1937.
Blackwell, D. 1964. Memoryless strategies
in finite-stage dynamic programming. The Annals of Mathematical
Statistics 35, 2, 863–865. DOI: 10.1214/aoms/1177703586.
Blackwell, D. 1965. Discounted dynamic
programming. The Annals of Mathematical Statistics 36,
1, 226–235. DOI: 10.1214/aoms/1177700285.
Blackwell, D. 1970. On stationary
policies. Journal of the Royal Statistical Society. Series A
(General) 133, 1, 33. DOI: 10.2307/2343810.
Blum, J.R. 1954. Multidimensional
stochastic approximation methods. The Annals of Mathematical
Statistics 25, 4, 737–744. DOI: 10.1214/aoms/1177728659.
Bogdan, K. and Więcek, M. 2022.
Burkholder inequality by Bregman divergence. Available at: http://arxiv.org/pdf/2103.06358v3.
Bohlin, T. 1970. Information pattern for
linear discrete-time models with stochastic coefficients. IEEE
Transactions on Automatic Control 15, 1, 104–106.
Borkar, V.S. 2008. Stochastic
approximation. Hindustan Book Agency. DOI: 10.1007/978-93-86279-38-5.
Borkar, V.S. and Meyn, S.P. 2000. The
o.d.e. method for convergence of stochastic approximation and
reinforcement learning. SIAM Journal on Control and
Optimization 38, 2, 447–469. DOI: 10.1137/s0363012997331639.
Bozkurt, B., Mahajan, A., Nayyar, A., and
Ouyang, Y. 2023. Weighted norm bounds in MDPs with unbounded
per-step cost.
Burda, Y., Edwards, H., Storkey, A., and Klimov,
O. 2019. Exploration by random network distillation.
International conference on learning representations. Available
at: https://openreview.net/forum?id=H1lJJnR5Ym.
Burkholder, D.L. 1966. Martingale
transforms. The Annals of Mathematical Statistics 37,
6, 1494–1504. DOI: 10.1214/aoms/1177699141.
Buyukkoc, C., Varaiya, P., and Walrand,
J. 1985. The cμ rule revisited. Advances in Applied
Probability 17, 1, 237–238. DOI: 10.2307/1427064.
Cassandra, A., Littman, M.L., and Zhang,
N.L. 1997. Incremental pruning: A simple, fast, exact method for
partially observable Markov decision processes.
Proceedings of the thirteenth conference on uncertainty
in artificial intelligence.
Cassandra, A.R., Kaelbling, L.P., and Littman,
M.L. 1994. Acting optimally in partially observable stochastic
domains. AAAI, 1023–1028.
Chakravorty, J. and Mahajan, A. 2018.
Sufficient conditions for the value function and optimal strategy to be
even and quasi-convex. IEEE Transactions on Automatic Control
63, 11, 3858–3864. DOI: 10.1109/TAC.2018.2800796.
Chang, J.T. 2007. Stochastic processes.
Available at: http://www.stat.yale.edu/~pollard/Courses/251.spring2013/Handouts/Chang-notes.pdf.
Chen, H.-F. and Guo, L. 1991.
Identification and stochastic adaptive control. Birkhäuser
Boston. DOI: 10.1007/978-1-4612-0429-9.
Chen, X. 2017. L♮-convexity and
its applications in operations. Frontiers of Engineering
Management 4, 3, 283. DOI: 10.15302/j-fem-2017057.
Cheng, H.-T. 1988. Algorithms for
partially observable Markov decision processes. PhD thesis, University
of British Columbia, Vancouver, BC.
Daley, D.J. 1968. Stochastically monotone
Markov chains. Zeitschrift für
Wahrscheinlichkeitstheorie und verwandte Gebiete 10, 4,
305–317. DOI: 10.1007/BF00531852.
Davis, M.H.A. 1979. Martingale methods in
stochastic control. In: Stochastic control theory and stochastic
differential systems. Springer-Verlag, 85–117. DOI: 10.1007/bfb0009377.
Davis, M.H.A. and Varaiya, P.P. 1972.
Information states for linear stochastic systems. Journal of
Mathematical Analysis and Applications 37, 2, 384–402.
DeGroot, M. 1970. Optimal statistical
decisions. Wiley-Interscience, Hoboken, N.J.
Dellacherie, C. and Meyer, P.-A. 1982.
Probabilities and potential B: Theory of
martingales. North-Holland Mathematical Studies.
Devlin, S. 2014. Potential based reward
shaping tutorial. Available at: http://www-users.cs.york.ac.uk/~devlin/presentations/pbrs-tut.pdf.
Devlin, S. and Kudenko, D. 2012. Dynamic
potential-based reward shaping. Proceedings of the 11th
international conference on autonomous agents and multiagent
systems, International Foundation for Autonomous Agents and Multiagent
Systems, 433–440.
Dibangoye, J.S., Amato, C., Buffet, O., and
Charpillet, F. 2016. Optimally solving dec-POMDPs as
continuous-state MDPs. Journal of Artificial Intelligence
Research 55, 443–497. DOI: 10.1613/jair.4623.
Ding, N., Sadeghi, P., and Kennedy, R.A.
2016. On monotonicity of the optimal transmission policy in cross-layer
adaptive m-QAM modulation.
IEEE Transactions on Communications 64, 9, 3771–3785.
DOI: 10.1109/TCOMM.2016.2590427.
Doob, J.L. 1971. What is a martingale?
The American Mathematical Monthly 78, 5, 451. DOI: 10.2307/2317751.
Dorato, P. and Levis, A. 1971. Optimal
linear regulators: The discrete-time case. IEEE
Transactions on Automatic Control 16, 6, 613–620. DOI: 10.1109/tac.1971.1099832.
Dubins, L.E. and Savage, L.J. 2014.
How to gamble if you must: Inequalities for stochastic
processes. Dover Publications.
Durrett, R. 2019. Probability: Theory
and examples. Cambridge University Press. DOI: 10.1017/9781108591034.
Dutta, M. and Singh, R. 2024. Optimal
risk-sensitive scheduling policies for remote estimation of
autoregressive Markov processes. Available at: http://arxiv.org/pdf/2403.13898v1.
Dvoretzky, A., Kiefer, J., and Wolfowitz,
J. 1953. On the optimal character of the (s, S) policy in
inventory theory. Econometrica 21, 4, 586. DOI: 10.2307/1907924.
Edgeworth, F.Y. 1888. The mathematical
theory of banking. Journal of the Royal Statistical Society
51, 1, 113–127. Available at: https://www.jstor.org/stable/2979084.
Elliott, R., Li, X., and Ni, Y.-H. 2013.
Discrete time mean-field stochastic linear-quadratic optimal control
problems. Automatica 49, 11, 3222–3233. DOI: 10.1016/j.automatica.2013.08.017.
Ellis, R.S. 1985. Entropy, large
deviations, and statistical mechanics. Springer New York. DOI: 10.1007/978-1-4613-8533-2.
Feinberg, E.A. 2005. On essential
information in sequential decision processes. Mathematical Methods
of Operations Research 62, 3, 399–410. DOI: 10.1007/s00186-005-0035-3.
Feinberg, E.A. 2016. Optimality
conditions for inventory control. In: Optimization challenges in
complex, networked and risky systems. INFORMS, 14–45. DOI: 10.1287/educ.2016.0145.
Feinberg, E.A. and He, G. 2020.
Complexity bounds for approximately solving discounted MDPs
by value iterations. Operations Research Letters. DOI: 10.1016/j.orl.2020.07.001.
Ferguson, T.S. 1989. Who solved the
secretary problem? Statistical Science, 282–289.
Ferguson, T.S. and Gilstein, C.Z. 2004.
Optimal investment policies for the horse race model. Available at: https://www.math.ucla.edu/~tom/papers/unpublished/Zach2.pdf.
Föllmer, H. and Schied, A. 2010. Convex
risk measures. In: Encyclopedia of quantitative finance.
Wiley. DOI: 10.1002/9780470061602.eqf15003.
Freeman, P.R. 1983. The secretary problem
and its extensions: A review. International Statistical Review /
Revue Internationale de Statistique 51, 2, 189. DOI: 10.2307/1402748.
Fu, F. and Schaar, M. van der. 2012.
Structure-aware stochastic control for transmission scheduling. IEEE
Transactions on Vehicular Technology 61, 9, 3931–3945.
DOI: 10.1109/tvt.2012.2213850.
Fu, M.C. 2018. Monte Carlo tree search: A
tutorial. 2018 winter simulation conference (WSC), IEEE. DOI:
10.1109/wsc.2018.8632344.
Gao, S. and Mahajan, A. 2022. Optimal
control of network-coupled subsystems: Spectral decomposition and
low-dimensional solutions. IEEE Transactions on Control
of Network Systems 9, 2, 657–669. DOI: 10.1109/tcns.2021.3124259.
Geiss, S. and Scheutzow, M. 2021.
Sharpness of Lenglart’s domination
inequality and a sharp monotone version. Electronic Communications
in Probability 26, 1–8. DOI: 10.1214/21-ECP413.
Geist, M., Scherrer, B., and Pietquin, O.
2019. A theory of regularized Markov decision processes.
Proceedings of the 36th international conference on machine
learning, PMLR, 2160–2169. Available at: https://proceedings.mlr.press/v97/geist19a.html.
Gelada, C., Kumar, S., Buckman, J., Nachum, O.,
and Bellemare, M.G. 2019. DeepMDP:
Learning continuous latent space models for representation learning.
Proceedings of the 36th international conference on machine
learning, PMLR, 2170–2179. Available at: http://proceedings.mlr.press/v97/gelada19a.html.
Gladyshev, E.G. 1965. On stochastic
approximation. Theory of Probability and Its Applications
10, 2, 275–278. DOI: 10.1137/1110031.
Grzes, M. and Kudenko, D. 2009.
Theoretical and empirical analysis of reward shaping in reinforcement
learning. International conference on machine learning and
applications, 337–344. DOI: 10.1109/ICMLA.2009.33.
Hardy, G.H., Littlewood, J.E., and Pólya,
G. 1952. Inequalities. Cambridge University Press.
Harris, F.W. 1913. How many parts to make
at once. The Magazine of Management 10, 2, 135–152.
DOI: 10.1287/opre.38.6.947.
Hay, N., Russell, S., Tolpin, D., and Shimony,
S.E. 2012. Selecting computations: Theory and applications.
UAI. Available at: http://www.auai.org/uai2012/papers/123.pdf.
Hernandez-Hernández, D. and Marcus, S.I.
1996. Risk sensitive control of Markov processes in countable state
space. Systems & Control Letters 29,
3, 147–155. DOI: 10.1016/s0167-6911(96)00051-5.
Hernández-Hernández, D. 1999. Existence
of risk-sensitive optimal stationary policies for controlled Markov
processes. Applied Mathematics and Optimization 40, 3,
273–285. DOI: 10.1007/s002459900126.
Hernández-Lerma, O. and Lasserre, J.B.
1996. Discrete-time Markov control processes. Springer New
York. DOI: 10.1007/978-1-4612-0729-0.
Hernández-Lerma, O. and Lasserre, J.B.
1999. Further topics on discrete-time Markov control processes.
Springer New York. DOI: 10.1007/978-1-4612-0561-6.
Hinderer, K. 2005. Lipschitz continuity
of value functions in Markovian decision processes.
Mathematical Methods of Operations Research 62, 1,
3–22. DOI: 10.1007/s00186-005-0438-1.
Hopcroft, J. and Kannan, R. 2012.
Computer science theory for the information age. Available at: https://www.cs.cmu.edu/~venkatg/teaching/CStheory-infoage/hopcroft-kannan-feb2012.pdf.
Howard, R.A. 1960. Dynamic
programming and Markov processes. The M.I.T. Press.
Howard, R.A. and Matheson, J.E. 1972.
Risk-sensitive Markov decision processes. Management Science
18, 7, 356–369. DOI: 10.1287/mnsc.18.7.356.
Jenner, E., Hoof, H. van, and Gleave, A.
2022. Calculus on MDPs: Potential shaping as a gradient. Available at:
https://arxiv.org/abs/2208.09570v1.
Kalman, R.E. 1960. Contributions to the
theory of optimal control. Boletin de la Sociedad Matematica
Mexicana 5, 102–119.
Karatzas, I. and Sudderth, W.D. 2010. Two
characterizations of optimality in dynamic programming. Applied
Mathematics and Optimization 61, 3, 421–434. DOI: 10.1007/s00245-009-9093-x.
Keilson, J. and Kester, A. 1977. Monotone
matrices and monotone Markov processes. Stochastic Processes and
their Applications 5, 3, 231–241.
Kelly, J.L., Jr. 1956. A new
interpretation of information rate. Bell System Technical
Journal 35, 4, 917–926. DOI: 10.1002/j.1538-7305.1956.tb03809.x.
Kennerly, S. 2011. A graphical derivation
of the Legendre transform. Available at: http://einstein.drexel.edu/~skennerly/maths/Legendre.pdf.
Koole, G. 2006. Monotonicity in Markov
reward and decision chains: Theory and applications. Foundations and
Trends in Stochastic Systems 1, 1, 1–76. DOI:
10.1561/0900000002.
Kuhn, H.W. 1950. Extensive games.
Proceedings of the National Academy of Sciences 36,
10, 570–576. DOI: 10.1073/pnas.36.10.570.
Kuhn, H.W. 1953. Extensive games and the
problem of information. In: H.W. Kuhn and A.W. Tucker, eds.,
Contributions to the theory of games. Princeton University
Press, 193–216.
Kumar, P.R. and Varaiya, P. 1986.
Stochastic systems: Estimation, identification, and adaptive
control. Prentice Hall.
Kunnumkal, S. and Topaloglu, H. 2008.
Exploiting the structural properties of the underlying Markov decision
problem in the Q-learning algorithm. INFORMS Journal on
Computing 20, 2, 288–301. DOI: 10.1287/ijoc.1070.0240.
Kushner, H.J. and Yin, G.G. 1997.
Stochastic approximation algorithms and applications. Springer
New York. DOI: 10.1007/978-1-4899-2696-8.
Kwakernaak, H. 1965. Theory of
self-adaptive control systems. In: Springer, 14–18.
Lai, T.L. 2003. Stochastic approximation:
Invited paper. The Annals of Statistics 31, 2. DOI: 10.1214/aos/1051027873.
Lenglart, É. 1977. Relation de domination
entre deux processus. Annales de l’Institut Henri
Poincaré, Section B: Calcul des Probabilités et Statistiques, 171–179.
Levy, H. 1992. Stochastic dominance and
expected utility: Survey and analysis. Management Science
38, 4, 555–593. DOI: 10.1287/mnsc.38.4.555.
Levy, H. 2015. Stochastic dominance:
Investment decision making under uncertainty. Springer. DOI: 10.1007/978-3-319-21708-6.
Lewis, F.L., Vrabie, D., and Syrmos, V.L.
2012. Optimal control. John Wiley & Sons.
Lindley, D.V. 1961. Dynamic programming
and decision theory. Applied Statistics 10, 1, 39.
DOI: 10.2307/2985407.
Lu, X., Roy, B.V., Dwaracherla, V., Ibrahimi,
M., Osband, I., and Wen, Z. 2023. Reinforcement learning, bit by
bit. Foundations and Trends in Machine Learning
16, 6, 733–865. DOI: 10.1561/2200000097.
Mahajan, A. 2008. Sequential
decomposition of sequential dynamic teams: Applications to real-time
communication and networked control systems. PhD thesis, University of
Michigan, Ann Arbor, MI.
Mahajan, A., Niculescu, S.-I., and Vidyasagar,
M. 2024. A vector almost-sure supermartingale theorem and its
applications. In: IEEE conference on decision and control.
IEEE.
Marshall, A.W., Olkin, I., and Arnold,
B.C. 2011. Inequalities: Theory of majorization and its
applications. Springer New York. DOI: 10.1007/978-0-387-68276-1.
Mazliak, L. and Shafer, G., eds. 2022.
The splendors and miseries of martingales: Their history from the
casino to mathematics. Springer International Publishing. DOI: 10.1007/978-3-031-05988-9.
Morse, P. and Kimball, G. 1951.
Methods of operations research. Technology Press of MIT.
Müller, A. 1997a. Integral probability
metrics and their generating classes of functions. Advances in
Applied Probability 29, 2, 429–443. DOI: 10.2307/1428011.
Müller, A. 1997b. How does the value
function of a Markov decision process depend on the transition
probabilities? Mathematics of Operations Research 22,
4, 872–885. DOI: 10.1287/moor.22.4.872.
Murota, K. 1998. Discrete convex
analysis. Mathematical Programming 83, 1–3, 313–371.
DOI: 10.1007/bf02680565.
Nain, P., Tsoucas, P., and Walrand, J.
1989. Interchange arguments in stochastic scheduling. Journal of
Applied Probability 26, 4, 815–826. DOI: 10.2307/3214386.
Nerode, A. 1958. Linear automaton
transformations. Proceedings of the American Mathematical
Society 9, 541–544.
Neveu, J. 1975. Discrete parameter
martingales. North Holland.
Ng, A.Y., Harada, D., and Russell, S.
1999. Policy invariance under reward transformations: Theory and
application to reward shaping. ICML, 278–287. Available at: http://aima.eecs.berkeley.edu/~russell/papers/icml99-shaping.pdf.
Norris, J.R. 1998. Markov
chains. Cambridge University Press.
Oh, S. and Özer, Ö. 2016. Characterizing
the structure of optimal stopping policies. Production and
Operations Management 25, 11, 1820–1838. DOI: 10.1111/poms.12579.
Picard, J. 2007. Concentration
inequalities and model selection. Springer Berlin Heidelberg. DOI:
10.1007/978-3-540-48503-2.
Piunovskiy, A.B. 2011. Examples in
Markov decision processes. Imperial College Press. DOI: 10.1142/p809.
Pollard, D. 2002. A user’s guide to
measure theoretic probability. Cambridge University Press.
Pomatto, L., Strack, P., and Tamuz, O.
2020. Stochastic dominance under independent noise. Journal of
Political Economy 128, 5, 1877–1900. DOI: 10.1086/705555.
Porteus, E.L. 1975. Bounds and
transformations for discounted finite Markov decision chains.
Operations Research 23, 4, 761–784. DOI: 10.1287/opre.23.4.761.
Porteus, E.L. 2008. Building intuition:
Insights from basic operations management models and principles. In: D.
Chhajed and T.J. Lowe, eds., Springer, 115–134. DOI: 10.1007/978-0-387-73699-0.
Puterman, M.L. 2014. Markov decision
processes: Discrete stochastic dynamic programming. John Wiley
& Sons. DOI: 10.1002/9780470316887.
Qin, Y., Cao, M., and Anderson, B.D.O.
2020. Lyapunov criterion for stochastic systems and its applications in
distributed computation. IEEE Transactions on Automatic
Control 65, 2, 546–560. DOI: 10.1109/tac.2019.2910948.
Rachelson, E. and Lagoudakis, M.G. 2010.
On the locality of action domination in sequential decision making.
Proceedings of 11th international symposium on artificial
intelligence and mathematics. Available at: https://oatao.univ-toulouse.fr/17977/.
Rachev, S.T. 1991. Probability
metrics and the stability of stochastic models. Wiley, New York.
Rigollet, P. 2015. High-dimensional
statistics. Available at: https://ocw.mit.edu/courses/mathematics/18-s997-high-dimensional-statistics-spring-2015/lecture-notes/.
Riis, J.O. 1965. Discounted
Markov programming in a periodic process. Operations
Research 13, 6, 920–929. DOI: 10.1287/opre.13.6.920.
Rivasplata, O. 2012. Subgaussian random
variables: An expository note. Available at: http://stat.cmu.edu/~arinaldo/36788/subgaussians.pdf.
Robbins, H. and Monro, S. 1951. A
stochastic approximation method. The Annals of Mathematical
Statistics 22, 3, 400–407. DOI: 10.1214/aoms/1177729586.
Robbins, H. and Siegmund, D. 1971. A
convergence theorem for non-negative almost supermartingales and some
applications. In: Optimizing methods in statistics. Elsevier,
233–257. DOI: 10.1016/b978-0-12-604550-5.50015-8.
Rockafellar, R.T. and Wets, R.J.-B. 2009.
Variational analysis. Springer Science & Business Media.
Ross, S.M. 1974. Dynamic programming and
gambling models. Advances in Applied Probability 6, 3,
593–606. DOI: 10.2307/1426236.
Roy, A., Borkar, V., Karandikar, A., and
Chaporkar, P. 2022. Online reinforcement learning of optimal
threshold policies for Markov decision processes. IEEE
Transactions on Automatic Control 67, 7, 3722–3729. DOI:
10.1109/tac.2021.3108121.
Saldi, N., Linder, T., and Yüksel, S.
2018. Finite approximations in discrete-time stochastic
control. Springer International Publishing. DOI: 10.1007/978-3-319-79033-6.
Sandell, N.R., Jr. 1974. Control of
finite-state, finite-memory stochastic systems. PhD thesis,
Massachusetts Institute of Technology, Cambridge, MA.
Sayedana, B. and Mahajan, A. 2020.
Counterexamples on the monotonicity of delay optimal strategies for
energy harvesting transmitters. IEEE Wireless
Communications Letters, 1–1. DOI: 10.1109/lwc.2020.2981066.
Sayedana, B., Mahajan, A., and Yeh, E.
2020. Cross-layer communication over fading channels with adaptive
decision feedback. International symposium on modeling and
optimization in mobile, ad hoc, and wireless networks (WiOPT), 1–8.
Scarf, H. 1960. Mathematical methods in
social sciences. In: K.J. Arrow, S. Karlin, and P. Suppes, eds., Stanford
University Press, Stanford CA, 49–56. Available at: http://dido.wss.yale.edu/~hes/pub/ss-policies.pdf.
Scherrer, B. 2016. On periodic Markov
decision processes. Available at: https://ewrl.files.wordpress.com/2016/12/scherrer.pdf.
Serfozo, R.F. 1976. Monotone optimal
policies for Markov decision processes. In: Mathematical programming
studies. Springer Berlin Heidelberg, 202–215. DOI: 10.1007/bfb0120752.
Shebrawi, K. and Albadawi, H. 2012.
Trace inequalities for matrices. Bulletin of the Australian
Mathematical Society 87, 1, 139–148. DOI: 10.1017/s0004972712000627.
Shwartz, A. 2001. Death and discounting.
IEEE Transactions on Automatic Control
46, 4, 644–647. DOI: 10.1109/9.917668.
Simon, H.A. 1956. Dynamic programming
under uncertainty with a quadratic criterion function.
Econometrica 24, 1, 74–81. DOI: 10.2307/1905261.
Singh, S.P. and Yee, R.C. 1994. An upper
bound on the loss from approximate optimal-value functions. Machine
Learning 16, 3, 227–233. DOI: 10.1007/bf00993308.
Sinha, A. and Mahajan, A. 2024. On the
sensitivity of restless bandit solutions to uncertainty in the model of
the arms.
Skinner, B.F. 1938. Behavior of
organisms. Appleton-Century.
Smallwood, R.D. and Sondik, E.J. 1973.
The optimal control of partially observable Markov processes over a
finite horizon. Operations Research 21, 5, 1071–1088.
DOI: 10.1287/opre.21.5.1071.
Smith, J.E. and McCardle, K.F. 2002.
Structural properties of stochastic dynamic programs. Operations
Research 50, 5, 796–809. DOI: 10.1287/opre.50.5.796.365.
Stout, W.F. 1974. Almost sure
convergence. Academic Press.
Striebel, C. 1965. Sufficient statistics
in the optimal control of stochastic systems. Journal of
Mathematical Analysis and Applications 12, 576–592.
Strusevich, V.A. and Rustogi, K. 2016.
Pairwise interchange argument and priority rules. In: Scheduling
with time-changing effects and rate-modifying activities. Springer
International Publishing, 19–36. DOI: 10.1007/978-3-319-39574-6_2.
Subramanian, J., Sinha, A., Seraj, R., and
Mahajan, A. 2022. Approximate information state for approximate
planning and reinforcement learning in partially observed systems.
Journal of Machine Learning Research 23, 12, 1–83.
Available at: http://jmlr.org/papers/v23/20-1165.html.
Sutton, R.S. and Barto, A.G. 2018.
Reinforcement learning: An introduction. MIT Press.
Taylor, H.M. 1967. Evaluating a call
option and optimal timing strategy in the stock market. Management
Science 14, 1, 111–120. Available at: http://www.jstor.org/stable/2628546.
Theil, H. 1954. Econometric models and
welfare maximization. Weltwirtschaftliches Archiv 72,
60–83. DOI: 10.1007/978-94-011-2410-2_1.
Theil, H. 1957. A note on certainty
equivalence in dynamic planning. Econometrica, 346–349. DOI: 10.1007/978-94-011-2410-2_3.
Topkis, D.M. 1998. Supermodularity
and complementarity. Princeton University Press.
Trench, W.F. 1999. Invertibly convergent
infinite products of matrices. Journal of Computational and Applied
Mathematics 101, 1–2, 255–263. DOI: 10.1016/s0377-0427(98)00191-5.
Tsitsiklis, J.N. 1984. Periodic review
inventory systems with continuous demand and discrete order sizes.
Management Science 30, 10, 1250–1254. DOI: 10.1287/mnsc.30.10.1250.
Tsitsiklis, J.N. and Roy, B. van. 1996.
Feature-based methods for large scale dynamic programming. Machine
Learning 22, 1-3, 59–94. DOI: 10.1007/bf00114724.
Urgaonkar, R., Wang, S., He, T., Zafer, M.,
Chan, K., and Leung, K.K. 2015. Dynamic service migration and
workload scheduling in edge-clouds. Performance Evaluation
91, 205–228. DOI: 10.1016/j.peva.2015.06.013.
Veinott, A.F. 1965. The optimal inventory
policy for batch ordering. Operations Research 13, 3,
424–432. DOI: 10.1287/opre.13.3.424.
Veinott, A.F., Jr. 1966. On the optimality
of (s, S) inventory policies: New conditions and a new proof. SIAM
Journal on Applied Mathematics 14, 5, 1067–1083. DOI: 10.1137/0114086.
Vidyasagar, M. 2023. Convergence of
stochastic approximation via martingale and converse
Lyapunov methods. Mathematics of Control, Signals, and
Systems 35, 2, 351–374. DOI: 10.1007/s00498-023-00342-9.
Villani, C. 2008. Optimal
transport: Old and new. Springer.
Wainwright, M.J. 2019.
High-dimensional statistics. Cambridge University Press. DOI:
10.1017/9781108627771.
Wald, A. 1945. Sequential tests of
statistical hypotheses. The Annals of Mathematical Statistics
16, 2, 117–186. DOI: 10.1214/aoms/1177731118.
Wald, A. and Wolfowitz, J. 1948. Optimum
character of the sequential probability ratio test. The Annals of
Mathematical Statistics 19, 3, 326–339. DOI: 10.1214/aoms/1177730197.
Walrand, J. 1988. An introduction to
queueing networks. Prentice Hall.
Wang, S., Urgaonkar, R., Zafer, M., He, T.,
Chan, K., and Leung, K.K. 2019. Dynamic service migration in
mobile edge computing based on Markov decision process.
IEEE/ACM Transactions on Networking
27, 3, 1272–1288. DOI: 10.1109/tnet.2019.2916577.
Whitin, T.M. 1953. The theory of
inventory management. Princeton University Press.
Whitt, W. 1978. Approximations of dynamic
programs, I. Mathematics of Operations Research
3, 3, 231–243. DOI: 10.1287/moor.3.3.231.
Whitt, W. 1979. Approximations of dynamic
programs, II. Mathematics of Operations Research
4, 2, 179–185. DOI: 10.1287/moor.4.2.179.
Whittle, P. 1982. Optimization over
time: Dynamic programming and stochastic control. Vol. 1 and 2.
Wiley.
Whittle, P. 1996. Optimal control:
Basics and beyond. Wiley.
Whittle, P. 2002. Risk sensitivity,
a strangely pervasive concept. Macroeconomic
Dynamics 6, 1, 5–18. DOI: 10.1017/s1365100502027025.
Whittle, P. and Komarova, N. 1988. Policy
improvement and the Newton-Raphson algorithm. Probability in the
Engineering and Informational Sciences 2, 2, 249–255. DOI:
10.1017/s0269964800000760.
Wiewiora, E. 2003. Potential-based
shaping and Q-value initialization are equivalent. Journal of
Artificial Intelligence Research 19, 1, 205–208.
Witsenhausen, H.S. 1969. Inequalities for
the performance of suboptimal uncertain systems. Automatica
5, 4, 507–512. DOI: 10.1016/0005-1098(69)90112-5.
Witsenhausen, H.S. 1970. On performance
bounds for uncertain systems. SIAM Journal on Control
8, 1, 55–89. DOI: 10.1137/0308004.
Witsenhausen, H.S. 1973. A standard form
for sequential stochastic control. Mathematical Systems Theory
7, 1, 5–11. DOI: 10.1007/bf01824800.
Witsenhausen, H.S. 1975. On policy
independence of conditional expectation. Information and
Control 28, 65–75.
Witsenhausen, H.S. 1976. Some remarks on
the concept of state. In: Y.C. Ho and S.K. Mitter, eds., Directions
in large-scale systems. Plenum, 69–75.
Witsenhausen, H.S. 1979. On the structure
of real-time source coders. Bell System Technical Journal
58, 6, 1437–1451.
Wittenmark, B., Åström, K.J., and Årzén,
K.-E. 2002. Computer control: An overview. In: IFAC
professional brief. IFAC. Available at: https://www.ifac-control.org/publications/list-of-professional-briefs/pb_wittenmark_etal_final.pdf.
Wonham, W.M. 1968. On a matrix Riccati
equation of stochastic control. SIAM Journal on Control
6, 4, 681–697. DOI: 10.1137/0306044.
Woodall, W.H. and Reynolds, M.R. 1983. A
discrete markov chain representation of the sequential probability ratio
test. Communications in Statistics. Part C: Sequential Analysis
2, 1, 27–44. DOI: 10.1080/07474948308836025.
Yang, Z.P. and Feng, X.X. 2002. A note on
the trace inequality for products of Hermitian matrix power. JIPAM.
Journal of Inequalities in Pure & Applied Mathematics
3, 5, Paper No. 78, 12 p., electronic only. Available at: http://eudml.org/doc/123245.
Yeh, E.M. 2012. Fundamental performance
limits in cross-layer wireless optimization: Throughput, delay, and
energy. Foundations and Trends in Communications and Information
Theory 9, 1, 1–112. DOI: 10.1561/0100000014.
Zhang, H. 2009. Partially observable
Markov decision processes: A geometric technique and
analysis. Operations Research.
Zhang, N. and Liu, W. 1996. Planning
in stochastic domains: Problem characteristics and approximation.
Hong Kong University of Science and Technology.
Zheng, Y.-S. and Federgruen, A. 1991.
Finding optimal (s, S) policies is about as simple as evaluating a
single policy. Operations Research 39, 4, 654–665.
DOI: 10.1287/opre.39.4.654.
Zhou, Y., Song, Y., and Yüksel, S. 2024.
Robustness to model approximation, empirical model learning, and sample
complexity in Wasserstein regular MDPs. Available at: https://arxiv.org/abs/2410.14116.
Zipkin, P.H. 2000. Foundations of
inventory management. McGraw-Hill.
Zolotarev, V.M. 1984. Probability
metrics. Theory of Probability & Its Applications
28, 2, 278–302. DOI: 10.1137/1128025.