Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations

Y Gai, B Krishnamachari, R Jain - IEEE/ACM Transactions on …, 2012 - ieeexplore.ieee.org
We formulate the following combinatorial multi-armed bandit (MAB) problem: There are N
random variables with unknown means, each instantiated in an i.i.d. fashion over time.
At each time, multiple random variables can be selected, subject to an arbitrary constraint
on weights associated with the selected variables. All of the selected random variables are
observed individually at that time, and the reward is a linearly weighted combination of the
selected variables. The goal is to find a policy that minimizes regret, defined …
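
The formulation above lends itself to a UCB-style index policy. The sketch below is a minimal illustration, not the algorithm from the paper: it assumes Bernoulli arms, known unit weights, and a simple cardinality constraint (at most K arms per round) as a stand-in for the abstract's "arbitrary constraint on weights"; the exploration constant and all names are likewise illustrative.

```python
# Minimal sketch (assumptions noted above) of a UCB-style index policy for the
# combinatorial setting: select a feasible subset of arms, observe each selected
# arm individually, and earn the weighted sum of their realizations.
import math
import random

N = 8            # number of unknown random variables (arms)
K = 3            # at most K arms per round (assumed constraint, not the paper's)
T = 5000         # horizon
weights = [1.0] * N                                # linear reward weights (assumed known)
true_means = [random.random() for _ in range(N)]   # unknown to the policy

counts = [0] * N        # times each arm was observed  -> O(N) storage
est_means = [0.0] * N   # empirical means              -> O(N) storage

def ucb_index(i, t):
    """Optimistic index: empirical mean plus an exploration bonus (illustrative form)."""
    if counts[i] == 0:
        return float("inf")               # force at least one observation of each arm
    return est_means[i] + math.sqrt(2.0 * math.log(t + 1) / counts[i])

total_reward = 0.0
for t in range(T):
    # Maximize the linear objective under the cardinality constraint:
    # pick the K arms with the largest weighted indices.
    chosen = sorted(range(N), key=lambda i: weights[i] * ucb_index(i, t),
                    reverse=True)[:K]
    for i in chosen:
        x = 1.0 if random.random() < true_means[i] else 0.0   # individual observation
        counts[i] += 1
        est_means[i] += (x - est_means[i]) / counts[i]
        total_reward += weights[i] * x

best = sum(sorted((w * m for w, m in zip(weights, true_means)), reverse=True)[:K])
print("average reward per round:", total_reward / T, "best feasible mean reward:", best)
```

Note that the policy's state is just the per-arm counts and empirical means, so storage grows linearly in N even though the number of feasible subsets can grow combinatorially.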

Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards

Y Gai, B Krishnamachari, R Jain - arXiv preprint arXiv:1011.4748, 2010 - arxiv.org
In the classic multi-armed bandit problem, the goal is to have a policy for dynamically
operating arms that each yield stochastic rewards with unknown means. The key metric of
interest is regret, defined as the gap between the expected total reward accumulated by an
omniscient player that knows the reward mean of each arm and the expected total reward
accumulated by the given policy. The policies presented in prior work have storage,
computation, and regret all growing linearly with the number of arms, which is not scalable …
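
Both abstracts define regret in words as the gap between an omniscient player's expected total reward and the policy's. One compact way to write this for the linear-reward setting is sketched below; the symbols used here (feasible action set F, weights w_i, unknown means theta_i, selected subset a^pi(t), horizon n) are notation assumed for illustration and may differ from the papers' own.

```latex
% Regret of a policy \pi over horizon n in the linear-reward setting:
% the omniscient player always plays the best feasible weighted subset.
\[
  R^{\pi}(n)
  = n \cdot \max_{a \in \mathcal{F}} \sum_{i \in a} w_i \theta_i
  \;-\; \mathbb{E}\!\left[ \sum_{t=1}^{n} \sum_{i \in a^{\pi}(t)} w_i X_i(t) \right],
\]
% where \mathcal{F} is the set of feasible arm subsets, \theta_i = \mathbb{E}[X_i(t)]
% is the unknown mean of arm i, and a^{\pi}(t) is the subset selected at time t.
```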