random variables with unknown means that are each instantiated in an i.i.d. fashion over time.
At each time step, multiple random variables can be selected, subject to an arbitrary constraint on
weights associated with the selected variables. All of the selected individual random
variables are observed at that time, and a linearly weighted combination of these selected
variables is yielded as the reward. The goal is to find a policy that minimizes regret, defined …