evaluate neural ranking models on complex topics. It builds on TREC Deep Learning (DL)
topics by extensively annotating them with question intent categories, answer types, wikified
entities, topic categories, and result type metadata from a commercial web search engine.
Based on this data, we introduce a framework for identifying challenging queries. DL-HARD
contains fifty topics from the official DL 2019/2020 evaluation benchmark, half of which are …