Evaluating topic model output and determining the optimal number of topics have both been
longstanding challenges, with no effective automated solutions to date. This paper proposes
using large language models to evaluate such output. We find that large language models
appropriately assess the resulting topics, correlating more strongly with human judgments
than existing automated metrics. We then investigate whether we can use large language …