particular, sufficient optimism in the mean reward estimates is achieved by exploiting the
variance in the past observed rewards. We name the algorithm Capitalizing On Rewards
(CORe). The algorithm is general and can be easily applied to different bandit settings. The
main benefit of CORe is that its exploration is fully data-dependent. It does not rely on any
external noise and adapts to different problems without parameter tuning. We derive a $\tilde …
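As a concrete illustration of this principle, consider a generic variance-aware optimistic index in the spirit of UCB-V; this is stated here only for intuition and is not CORe's actual rule, whose exact form is developed later in the paper. Assume rewards bounded in $[0, 1]$ and an arm $i$ that has been pulled $s$ times by round $t$, with empirical mean $\hat{\mu}_{i}$ and empirical variance $\hat{\sigma}_{i}^{2}$ of its observed rewards. An optimistic estimate whose exploration width is driven entirely by the past observations is
\[
U_{i}(t) = \hat{\mu}_{i} + \sqrt{\frac{2\,\hat{\sigma}_{i}^{2}\log t}{s}} + \frac{3\log t}{s}\,.
\]
An arm whose past rewards vary widely receives a larger exploration bonus, while a consistently rewarding arm is exploited sooner; no external noise or tuned exploration parameter enters the bonus. The display above is meant only to convey how the variance of the observed rewards alone can supply the needed optimism.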