Authors
David Tolpin, Solomon Shimony
Publication date
2012
Journal
Proceedings of the AAAI Conference on Artificial Intelligence
Volume
26
Issue
1
Pages
570-576
Description
UCT, a state-of-the-art algorithm for Monte Carlo tree search (MCTS) in games and Markov decision processes, is based on UCB, a sampling policy for the Multi-armed Bandit problem (MAB) that minimizes the cumulative regret. However, search differs from MAB in that in MCTS it is usually only the final "arm pull" (the actual move selection) that collects a reward, rather than all "arm pulls". Therefore, it makes more sense to minimize the simple regret, as opposed to the cumulative regret. We begin by introducing policies for multi-armed bandits with lower finite-time and asymptotic simple regret than UCB, using them to develop a two-stage scheme (SR+CR) for MCTS which outperforms UCT empirically. Optimizing the sampling process is itself a metareasoning problem, a solution of which can use value of information (VOI) techniques. Although the theory of VOI for search exists, applying it to MCTS is non-trivial, as typical myopic assumptions fail. Lacking a complete working VOI theory for MCTS, we nevertheless propose a sampling scheme that is "aware" of VOI, achieving an algorithm that in empirical evaluation outperforms both UCT and the other proposed algorithms.
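The abstract's central claim, that a policy tuned for simple regret recommends better final moves under the same sampling budget than cumulative-regret UCB1, can be illustrated with a small bandit experiment. The sketch below is hypothetical: the arm means, the ε = 0.5 exploration rate, the budget, and all function names are invented for illustration, and a plain ε-greedy policy stands in for a simple-regret sampler rather than implementing the paper's actual SR+CR scheme.

```python
import math
import random

def ucb1(counts, means, t):
    # UCB1: minimizes cumulative regret via exploration term sqrt(2 ln t / n_i).
    return max(range(len(counts)),
               key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))

def eps_greedy(counts, means, t, eps=0.5):
    # Hypothetical simple-regret-oriented stand-in: heavy uniform exploration,
    # otherwise pull the empirically best arm. NOT the paper's SR+CR policy.
    if random.random() < eps:
        return random.randrange(len(counts))
    return max(range(len(counts)), key=lambda i: means[i])

def run(policy, arms, budget):
    # Sample arms for `budget` pulls, then recommend the arm with the best
    # empirical mean; return the simple regret of that final recommendation.
    k = len(arms)
    counts = [1] * k
    sums = [random.gauss(arms[i], 1.0) for i in range(k)]  # one initial pull each
    for t in range(k + 1, budget + 1):
        means = [sums[i] / counts[i] for i in range(k)]
        i = policy(counts, means, t)
        counts[i] += 1
        sums[i] += random.gauss(arms[i], 1.0)
    means = [sums[i] / counts[i] for i in range(k)]
    best = max(range(k), key=lambda i: means[i])  # the single "final arm pull"
    return max(arms) - arms[best]

random.seed(0)
arms = [0.0, 0.1, 0.2, 0.3, 0.5]  # invented true arm means
trials = 2000
for name, policy in [("UCB1", ucb1), ("eps-greedy", eps_greedy)]:
    sr = sum(run(policy, arms, budget=200) for _ in range(trials)) / trials
    print(f"{name}: mean simple regret = {sr:.3f}")
```

With these made-up parameters, the heavier exploration of ε-greedy tends to identify the best arm more reliably by the end of the budget, mirroring the abstract's argument: UCB1 concentrates pulls on the current incumbent to keep cumulative regret low, which is the wrong objective when only the final recommendation collects a reward.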
Total citations
[per-year citation histogram, 2013–2024]