with a continuous state space, finite actions, and an average-reward criterion. The ERVL
algorithm relies on nearest-neighbor function approximation and on minibatch samples for the
value function update. It is universal (it works for any MDP) and computationally simple,
yet it provides an arbitrarily good approximation with high probability in finite time. This is the
first such algorithm for non-parametric (and continuous state space) MDPs with average …