Learning contextual bandits in a non-stationary environment

Q Wu, N Iyer, H Wang - The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018 - dl.acm.org
Multi-armed bandit algorithms have become a reference solution for handling the explore/exploit dilemma in recommender systems and many other important real-world problems, such as display advertising. However, such algorithms usually assume a stationary reward distribution, which hardly holds in practice as users' preferences are dynamic. This mismatch inevitably leads to consistently suboptimal performance for a recommender system. In this paper, we consider the situation where the underlying reward distribution remains unchanged over (possibly short) epochs and shifts at unknown time instants. Accordingly, we propose a contextual bandit algorithm that detects possible changes in the environment based on its reward estimation confidence and updates its arm selection strategy in response. A rigorous upper regret bound analysis of the proposed algorithm demonstrates its learning effectiveness in such a non-trivial environment. Extensive empirical evaluations on both synthetic and real-world datasets for recommendation confirm its practical utility in a changing environment.
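The mechanism the abstract describes, monitoring whether observed rewards stay within the current model's confidence interval and discarding the model when they repeatedly do not, can be sketched as follows. This is a minimal illustrative sketch of that general idea on top of a LinUCB-style linear reward model, not the authors' algorithm; the class and parameter names (PiecewiseStationaryLinUCB, alpha, window, drift_threshold) are hypothetical.

```python
import numpy as np

class PiecewiseStationaryLinUCB:
    """Sketch: LinUCB whose model is reset when observed rewards
    repeatedly fall outside the estimator's confidence band,
    signaling a possible shift in the reward distribution."""

    def __init__(self, dim, alpha=1.0, window=50, drift_threshold=0.5):
        self.dim = dim
        self.alpha = alpha                      # width of the UCB exploration bonus
        self.window = window                    # sliding window of recent "surprises"
        self.drift_threshold = drift_threshold  # surprise rate that triggers a reset
        self._reset_model()

    def _reset_model(self):
        # Ridge-regression sufficient statistics for the linear reward model
        self.A = np.eye(self.dim)
        self.b = np.zeros(self.dim)
        self.surprises = []                     # 1 if a reward left the confidence band

    def select_arm(self, arm_features):
        # arm_features: (n_arms, dim); pick the arm with the highest UCB score
        theta = np.linalg.solve(self.A, self.b)
        A_inv = np.linalg.inv(self.A)
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", arm_features, A_inv, arm_features))
        return int(np.argmax(arm_features @ theta + self.alpha * bonus))

    def update(self, x, reward):
        # Change test: does the observed reward fall inside the confidence
        # band of the current estimate? Persistent misses suggest a shift.
        theta = np.linalg.solve(self.A, self.b)
        width = self.alpha * np.sqrt(x @ np.linalg.solve(self.A, x))
        self.surprises.append(int(abs(reward - x @ theta) > width))
        self.surprises = self.surprises[-self.window:]
        if len(self.surprises) == self.window and \
                np.mean(self.surprises) > self.drift_threshold:
            self._reset_model()                 # discard the stale model and restart
            return
        # Standard LinUCB / ridge-regression update
        self.A += np.outer(x, x)
        self.b += reward * x
```

Discarding the surprising observation on reset is a simplification for the sketch; a fuller treatment would also have to tune the window and threshold against the noise level so that ordinary reward noise is not mistaken for a distribution change.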


[PDF] example.edu/paper.pdf