Authors
David Tolpin, Solomon Shimony
Publication date
2012
Journal
Proceedings of the AAAI Conference on Artificial Intelligence
Volume
26
Issue
1
Pages
570-576
Description
UCT, a state-of-the-art algorithm for Monte Carlo tree search (MCTS) in games and Markov decision processes, is based on UCB, a sampling policy for the Multi-armed Bandit problem (MAB) that minimizes the cumulative regret. However, search differs from MAB in that in MCTS it is usually only the final "arm pull" (the actual move selection) that collects a reward, rather than all "arm pulls". Therefore, it makes more sense to minimize the simple regret, as opposed to the cumulative regret. We begin by introducing policies for multi-armed bandits with lower finite-time and asymptotic simple regret than UCB, using them to develop a two-stage scheme (SR+CR) for MCTS which outperforms UCT empirically. Optimizing the sampling process is itself a metareasoning problem, a solution of which can use value of information (VOI) techniques. Although the theory of VOI for search exists, applying it to MCTS is non-trivial, as typical myopic assumptions fail. Lacking a complete working VOI theory for MCTS, we nevertheless propose a sampling scheme that is "aware" of VOI, achieving an algorithm that in empirical evaluation outperforms both UCT and the other proposed algorithms.
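The abstract's central claim, that a policy tuned for simple regret recommends better final moves under the same sampling budget than cumulative-regret UCB1, can be illustrated with a small bandit experiment. The sketch below is hypothetical: the arm means, the ε = 0.5 exploration rate, the budget, and all function names are invented for illustration, and a plain ε-greedy policy stands in for a simple-regret sampler rather than implementing the paper's actual SR+CR scheme.

```python
import math
import random

def ucb1(counts, means, t):
    # UCB1: minimizes cumulative regret via exploration term sqrt(2 ln t / n_i).
    return max(range(len(counts)),
               key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))

def eps_greedy(counts, means, t, eps=0.5):
    # Hypothetical simple-regret-oriented stand-in: heavy uniform exploration,
    # otherwise pull the empirically best arm. NOT the paper's SR+CR policy.
    if random.random() < eps:
        return random.randrange(len(counts))
    return max(range(len(counts)), key=lambda i: means[i])

def run(policy, arms, budget):
    # Sample arms for `budget` pulls, then recommend the arm with the best
    # empirical mean; return the simple regret of that final recommendation.
    k = len(arms)
    counts = [1] * k
    sums = [random.gauss(arms[i], 1.0) for i in range(k)]  # one initial pull each
    for t in range(k + 1, budget + 1):
        means = [sums[i] / counts[i] for i in range(k)]
        i = policy(counts, means, t)
        counts[i] += 1
        sums[i] += random.gauss(arms[i], 1.0)
    means = [sums[i] / counts[i] for i in range(k)]
    best = max(range(k), key=lambda i: means[i])  # the single "final arm pull"
    return max(arms) - arms[best]

random.seed(0)
arms = [0.0, 0.1, 0.2, 0.3, 0.5]  # invented true arm means
trials = 2000
for name, policy in [("UCB1", ucb1), ("eps-greedy", eps_greedy)]:
    sr = sum(run(policy, arms, budget=200) for _ in range(trials)) / trials
    print(f"{name}: mean simple regret = {sr:.3f}")
```

With these made-up parameters, the heavier exploration of ε-greedy tends to identify the best arm more reliably by the end of the budget, mirroring the abstract's argument: UCB1 concentrates pulls on the current incumbent to keep cumulative regret low, which is the wrong objective when only the final recommendation collects a reward.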
Total citations
[per-year citation histogram, 2013–2024]