Authors
Sridhar Mahadevan
Publication date
1996/1
Journal
Machine Learning
Volume
22
Issue
1
Pages
159-195
Publisher
Kluwer Academic Publishers-Plenum Publishers
Description
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal …
Total citations
[Chart: citations per year, 1996 to 2024]