Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

Deep reinforcement and transfer learning for abstractive text summarization: A review

A Alomari, N Idris, AQM Sabri, I Alsmadi - Computer Speech & Language, 2022 - Elsevier
Automatic Text Summarization (ATS) is an important area in Natural Language
Processing (NLP) with the goal of shortening a long text into a more compact version by …

SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization

P Laban, T Schnabel, PN Bennett… - Transactions of the …, 2022 - direct.mit.edu
In the summarization domain, a key requirement for summaries is to be factually consistent
with the input document. Previous work has found that natural language inference (NLI) …
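
The NLI-based idea behind SummaC can be illustrated with an off-the-shelf entailment model: treat each summary sentence as a hypothesis, score it against the document's sentences, and aggregate the entailment probabilities. A minimal sketch follows, assuming an MNLI checkpoint and naive sentence splitting; SummaC itself adds granularity choices and, in SummaC-Conv, a learned aggregator over the score matrix.

    # Minimal sketch of NLI-based consistency scoring (assumed MNLI checkpoint).
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
    nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

    def entailment(premise: str, hypothesis: str) -> float:
        inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = nli(**inputs).logits.softmax(dim=-1)[0]
        # For roberta-large-mnli, label index 2 corresponds to ENTAILMENT.
        return probs[nli.config.label2id.get("ENTAILMENT", 2)].item()

    def consistency(document: str, summary: str) -> float:
        doc_sents = [s.strip() for s in document.split(".") if s.strip()]
        sum_sents = [s.strip() for s in summary.split(".") if s.strip()]
        # SummaC-ZS style aggregation: max over document sentences per summary
        # sentence, then mean over summary sentences.
        per_sent = [max(entailment(d, s) for d in doc_sents) for s in sum_sents]
        return sum(per_sent) / len(per_sent)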

SummEval: Re-evaluating summarization evaluation

AR Fabbri, W Kryściński, B McCann, C Xiong… - Transactions of the …, 2021 - direct.mit.edu
The scarcity of comprehensive up-to-date studies on evaluation metrics for text
summarization and the lack of consensus regarding evaluation protocols continue to inhibit …
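
Part of what SummEval re-examines is that the dominant reference-based metrics (the ROUGE family among them) reduce to n-gram overlap. A small illustration using the rouge_score package (the package choice and the example sentences are assumptions, not taken from the paper) shows why a factually wrong summary can still score highly.

    # Why lexical overlap can miss factual errors: one content word changes,
    # ROUGE barely moves. Uses Google's rouge_score reference implementation.
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

    reference = "The company reported a profit of 3 million dollars in 2020."
    faithful = "In 2020 the company reported a 3 million dollar profit."
    unfaithful = "The company reported a loss of 3 million dollars in 2020."

    for name, candidate in [("faithful", faithful), ("unfaithful", unfaithful)]:
        scores = scorer.score(reference, candidate)
        print(name, {k: round(v.fmeasure, 3) for k, v in scores.items()})
    # The unfaithful variant swaps "profit" for "loss", yet its ROUGE scores
    # stay close to those of the faithful one.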

Recursively summarizing books with human feedback

J Wu, L Ouyang, DM Ziegler, N Stiennon… - arXiv preprint arXiv …, 2021 - arxiv.org
A major challenge for scaling machine learning is training models to perform tasks that are
very difficult or time-consuming for humans to evaluate. We present progress on this …
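
The recursive decomposition at the core of that work, summarizing fixed-size chunks and then summarizing the concatenated summaries until a single summary remains, is easy to sketch without the human-feedback training loop. In the sketch below, the character-based chunking, chunk size, and summarization checkpoint are all assumptions; Wu et al. additionally fine-tune the summarizer with reinforcement learning from human feedback.

    # Sketch of recursive (hierarchical) summarization; the RLHF component
    # of the paper is omitted.
    from transformers import pipeline

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    def summarize_recursively(text: str, chunk_chars: int = 3000, max_len: int = 150) -> str:
        if len(text) <= chunk_chars:
            return summarizer(text, max_length=max_len, truncation=True)[0]["summary_text"]
        # Summarize each chunk, then recurse on the concatenated chunk summaries.
        chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
        partial = [summarizer(c, max_length=max_len, truncation=True)[0]["summary_text"]
                   for c in chunks]
        return summarize_recursively(" ".join(partial), chunk_chars, max_len)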

CTRL: A conditional transformer language model for controllable generation

NS Keskar, B McCann, LR Varshney, C Xiong… - arXiv preprint arXiv …, 2019 - arxiv.org
Large-scale language models show promising text generation capabilities, but users cannot
easily control particular aspects of the generated text. We release CTRL, a 1.63 billion …
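
CTRL's control mechanism is a prompting convention: a control code is prepended to the prompt and steers the domain or style of the continuation. A rough usage sketch with the transformers checkpoint is below; the specific control code and generation settings are assumptions, not prescriptions from the paper.

    # Sketch of conditional generation with CTRL: the leading control code
    # (e.g. "Books", "Reviews", "Links") selects the domain of the output.
    from transformers import CTRLLMHeadModel, CTRLTokenizer

    tokenizer = CTRLTokenizer.from_pretrained("Salesforce/ctrl")
    model = CTRLLMHeadModel.from_pretrained("Salesforce/ctrl")

    prompt = "Books In a hole in the ground there lived"  # control code + text
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=60, repetition_penalty=1.2,
                             do_sample=True)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))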

Asking and answering questions to evaluate the factual consistency of summaries

A Wang, K Cho, M Lewis - arXiv preprint arXiv:2004.04228, 2020 - arxiv.org
Practical applications of abstractive summarization models are limited by frequent factual
inconsistencies with respect to their input. Existing automatic evaluation metrics for …
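
The QA-based family of metrics (QAGS here, and QuestEval, FEQA, and QAFactEval below) shares one loop: generate questions about the summary, answer them from the summary and from the source document, and compare the answers. A stripped-down sketch follows; the question-generation checkpoint and the plain token-F1 comparison are assumptions standing in for the papers' more careful question filtering and answer-overlap components.

    # Sketch of QA-based factual-consistency scoring (QAGS-style loop).
    from collections import Counter
    from transformers import pipeline

    qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")  # assumed QG model
    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

    def token_f1(a: str, b: str) -> float:
        ta, tb = a.lower().split(), b.lower().split()
        common = sum((Counter(ta) & Counter(tb)).values())
        if not common:
            return 0.0
        p, r = common / len(ta), common / len(tb)
        return 2 * p * r / (p + r)

    def qa_consistency(document: str, summary: str) -> float:
        questions = [o["generated_text"] for o in qg(summary)]
        scores = []
        for q in questions:
            ans_from_summary = qa(question=q, context=summary)["answer"]
            ans_from_document = qa(question=q, context=document)["answer"]
            # Agreement between the two answers is taken as evidence that the
            # summary's claim is supported by the source.
            scores.append(token_f1(ans_from_summary, ans_from_document))
        return sum(scores) / len(scores) if scores else 0.0

The papers differ mainly in how the questions are generated and filtered and in how answer agreement is measured; QAFactEval, for instance, replaces the token-overlap comparison with a learned answer-equivalence scorer.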

QuestEval: Summarization asks for fact-based evaluation

T Scialom, PA Dray, P Gallinari, S Lamprier… - arXiv preprint arXiv …, 2021 - arxiv.org
Summarization evaluation remains an open research problem: current metrics such as
ROUGE are known to be limited and to correlate poorly with human judgments. To alleviate …

FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization

E Durmus, H He, M Diab - arXiv preprint arXiv:2005.03754, 2020 - arxiv.org
Neural abstractive summarization models are prone to generate content inconsistent with
the source document, i.e., unfaithful. Existing automatic metrics do not capture such mistakes …

QAFactEval: Improved QA-based factual consistency evaluation for summarization

AR Fabbri, CS Wu, W Liu, C Xiong - arXiv preprint arXiv:2112.08542, 2021 - arxiv.org
Factual consistency is an essential quality of text summarization models in practical settings.
Existing work in evaluating this dimension can be broadly categorized into two lines of …