QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension

A Rogers, M Gardner, I Augenstein - ACM Computing Surveys, 2023 - dl.acm.org
Alongside huge volumes of research on deep learning models in NLP in recent years,
there has been much work on benchmark datasets needed to track modeling progress …

A survey of evaluation metrics used for NLG systems

AB Sai, AK Mohankumar, MM Khapra - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
In the last few years, a large number of automatic evaluation metrics have been proposed for
evaluating Natural Language Generation (NLG) systems. The rapid development and …

Benchmarking foundation models with language-model-as-an-examiner

Y Bai, J Ying, Y Cao, X Lv, Y He… - Advances in Neural Information Processing Systems, 2024 - proceedings.neurips.cc
Numerous benchmarks have been established to assess the performance of foundation
models on open-ended question answering, which serves as a comprehensive test of a …

Evaluating open-domain question answering in the era of large language models

E Kamalloo, N Dziri, CLA Clarke, D Rafiei - arXiv preprint arXiv …, 2023 - arxiv.org
Lexical matching remains the de facto evaluation method for open-domain question
answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate …

Human evaluation of automatically generated text: Current trends and best practice guidelines

C van der Lee, A Gatt, E van Miltenburg… - Computer Speech & Language, 2021 - Elsevier
Currently, there is little agreement as to how Natural Language Generation (NLG) systems
should be evaluated, with a particularly high degree of variation in the way that human …

Selective question answering under domain shift

A Kamath, R Jia, P Liang - arXiv preprint arXiv:2006.09462, 2020 - arxiv.org
To avoid giving wrong answers, question answering (QA) models need to know when to
abstain from answering. Moreover, users often ask questions that diverge from the model's …

Find or classify? Dual strategy for slot-value predictions on multi-domain dialog state tracking

JG Zhang, K Hashimoto, CS Wu, Y Wan, PS Yu… - arXiv preprint arXiv …, 2019 - arxiv.org
Dialog state tracking (DST) is a core component in task-oriented dialog systems. Existing
approaches for DST mainly fall into one of two categories, namely, ontology-based and …

Human evaluation of conversations is an open problem: comparing the sensitivity of various methods for evaluating dialogue agents

EM Smith, O Hsu, R Qian, S Roller, YL Boureau… - arXiv preprint arXiv …, 2022 - arxiv.org
At the heart of improving conversational AI is the open problem of how to evaluate
conversations. Issues with automatic metrics are well known (Liu et al., 2016, arXiv …

Tomayto, tomahto. Beyond token-level answer equivalence for question answering evaluation

J Bulian, C Buck, W Gajewski, B Boerschinger… - arXiv preprint arXiv …, 2022 - arxiv.org
The predictions of question answering (QA) systems are typically evaluated against
manually annotated finite sets of one or more answers. This leads to a coverage limitation …

A survey on machine reading comprehension systems

R Baradaran, R Ghiasi, H Amirkhani - Natural Language Engineering, 2022 - cambridge.org
Machine Reading Comprehension (MRC) is a challenging task and hot topic in Natural
Language Processing. The goal of this field is to develop systems for answering the …