QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension

A Rogers, M Gardner, I Augenstein - ACM Computing Surveys, 2023 - dl.acm.org
Alongside huge volumes of research on deep learning models in NLP in recent years,
there has been much work on benchmark datasets needed to track modeling progress …

A survey of evaluation metrics used for NLG systems

AB Sai, AK Mohankumar, MM Khapra - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
In the last few years, a large number of automatic evaluation metrics have been proposed for
evaluating Natural Language Generation (NLG) systems. The rapid development and …

Benchmarking foundation models with language-model-as-an-examiner

Y Bai, J Ying, Y Cao, X Lv, Y He… - Advances in Neural Information Processing Systems, 2024 - proceedings.neurips.cc
Numerous benchmarks have been established to assess the performance of foundation
models on open-ended question answering, which serves as a comprehensive test of a …

Evaluating open-domain question answering in the era of large language models

E Kamalloo, N Dziri, CLA Clarke, D Rafiei - arXiv preprint arXiv …, 2023 - arxiv.org
Lexical matching remains the de facto evaluation method for open-domain question
answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate …

Human evaluation of automatically generated text: Current trends and best practice guidelines

C van der Lee, A Gatt, E van Miltenburg… - Computer Speech & Language, 2021 - Elsevier
Currently, there is little agreement as to how Natural Language Generation (NLG) systems
should be evaluated, with a particularly high degree of variation in the way that human …

Selective question answering under domain shift

A Kamath, R Jia, P Liang - arXiv preprint arXiv:2006.09462, 2020 - arxiv.org
To avoid giving wrong answers, question answering (QA) models need to know when to
abstain from answering. Moreover, users often ask questions that diverge from the model's …

Find or classify? Dual strategy for slot-value predictions on multi-domain dialog state tracking

JG Zhang, K Hashimoto, CS Wu, Y Wan, PS Yu… - arXiv preprint arXiv …, 2019 - arxiv.org
Dialog state tracking (DST) is a core component in task-oriented dialog systems. Existing
approaches for DST mainly fall into one of two categories, namely, ontology-based and …

Human evaluation of conversations is an open problem: comparing the sensitivity of various methods for evaluating dialogue agents

EM Smith, O Hsu, R Qian, S Roller, YL Boureau… - arXiv preprint arXiv …, 2022 - arxiv.org
At the heart of improving conversational AI is the open problem of how to evaluate
conversations. Issues with automatic metrics are well known (Liu et al., 2016, arXiv …

Tomayto, tomahto. Beyond token-level answer equivalence for question answering evaluation

J Bulian, C Buck, W Gajewski, B Boerschinger… - arXiv preprint arXiv …, 2022 - arxiv.org
The predictions of question answering (QA) systems are typically evaluated against
manually annotated finite sets of one or more answers. This leads to a coverage limitation …

A survey on machine reading comprehension systems

R Baradaran, R Ghiasi, H Amirkhani - Natural Language Engineering, 2022 - cambridge.org
Machine Reading Comprehension (MRC) is a challenging task and hot topic in Natural
Language Processing. The goal of this field is to develop systems for answering the …