policy, where $n$ samples are drawn from a base policy, ranked by a reward
function, and the highest-ranking one is selected. A commonly used analytical expression in
the literature claims that the KL divergence between the best-of-$ n $ policy and the base
policy is equal to $\log(n) - (n-1)/n$. We disprove the validity of this claim and show that it is
an upper bound on the actual KL divergence. We also explore the tightness of this upper …
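As a minimal worked example of the gap (an illustrative sketch, assuming a base policy uniform over two outcomes with distinct rewards and $n = 2$): the best-of-$2$ policy selects the lower-reward outcome only when both draws land on it, so it places mass $(1/4,\, 3/4)$ on the two outcomes, giving a KL divergence of $\frac{1}{4}\log\frac{1}{2} + \frac{3}{4}\log\frac{3}{2} \approx 0.131$, strictly below the claimed value $\log(2) - \frac{1}{2} \approx 0.193$.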