in imitation learning. It operates by training an ensemble of policies on the expert
demonstration data, and using the variance of their predictions as a cost, which is minimized
with RL together with a supervised behavioral cloning cost. Unlike adversarial imitation
methods, it uses a fixed reward function, which is easy to optimize. We prove a regret bound
for the algorithm that is linear in the time horizon multiplied by a coefficient which we show …
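
To make the ensemble-disagreement cost described above concrete, the following is a minimal sketch, not the paper's exact implementation: it assumes discrete actions, policies represented as functions returning action-probability vectors, and omits details such as cost clipping or thresholding. The names `disagreement_cost`, `behavioral_cloning_loss`, and `policies` are illustrative.

```python
import numpy as np

def disagreement_cost(policies, state):
    """Variance of the ensemble's action predictions at `state`.

    `policies` is a list of functions mapping a state to a vector of action
    probabilities. High disagreement indicates a state far from the expert
    demonstrations, so an RL learner minimizing this cost is pushed back
    toward states the expert data covers.
    """
    probs = np.stack([pi(state) for pi in policies])  # shape: (ensemble, n_actions)
    return probs.var(axis=0).mean()

def behavioral_cloning_loss(policy, expert_states, expert_actions):
    """Standard supervised BC term: negative log-likelihood of expert actions."""
    nll = 0.0
    for s, a in zip(expert_states, expert_actions):
        nll -= np.log(policy(s)[a] + 1e-8)
    return nll / len(expert_states)

# Hypothetical usage: two dummy policies that disagree on every state,
# yielding a high disagreement cost.
policies = [lambda s: np.array([0.9, 0.1]), lambda s: np.array([0.2, 0.8])]
print(disagreement_cost(policies, state=None))
```

Under this reading, the learner's overall objective combines the supervised BC term on expert data with the expected disagreement cost along the learner's own trajectories, the latter minimized with any standard RL algorithm; because the ensemble is trained once on the demonstrations, the resulting reward is fixed rather than adversarially updated.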