Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

An empirical survey on long document summarization: Datasets, models, and metrics

HY Koh, J Ju, M Liu, S Pan - ACM Computing Surveys, 2022 - dl.acm.org
Long documents such as academic articles and business reports have been the standard
format for detailing important issues and complicated subjects that require extra attention. An …

Learning to summarize from human feedback

N Stiennon, L Ouyang, J Wu… - Advances in Neural Information Processing Systems, 2020 - proceedings.neurips.cc
As language models become more powerful, training and evaluation are increasingly
bottlenecked by the data and metrics used for a particular task. For example, summarization …

SummEval: Re-evaluating summarization evaluation

AR Fabbri, W Kryściński, B McCann, C Xiong… - Transactions of the Association for Computational Linguistics, 2021 - direct.mit.edu
The scarcity of comprehensive up-to-date studies on evaluation metrics for text
summarization and the lack of consensus regarding evaluation protocols continue to inhibit …
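
SummEval's central exercise is meta-evaluation: scoring many system summaries with automatic metrics and correlating those scores against expert human judgments. A minimal sketch of that procedure using the rouge_score and scipy packages; the summaries and human ratings below are invented for illustration:

```python
# A minimal sketch of SummEval-style metric meta-evaluation: score system
# summaries with an automatic metric, then correlate with human judgments.
# The rouge_score and scipy packages are real; the data below is invented.
from rouge_score import rouge_scorer
from scipy.stats import kendalltau

reference = "The council approved the new budget on Tuesday."
summaries = [
    "The council passed the budget Tuesday.",
    "A budget was discussed at some point.",
    "The new budget was approved by the council on Tuesday.",
]
human_scores = [4.0, 2.0, 5.0]  # hypothetical expert relevance ratings

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
metric_scores = [scorer.score(reference, s)["rougeL"].fmeasure for s in summaries]

tau, p_value = kendalltau(metric_scores, human_scores)
print(f"Kendall's tau between ROUGE-L and human judgments: {tau:.3f}")
```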

On faithfulness and factuality in abstractive summarization

J Maynez, S Narayan, B Bohnet… - arXiv preprint arXiv …, 2020 - arxiv.org
It is well known that the standard likelihood training and approximate decoding objectives in
neural text generation models lead to less human-like responses for open-ended tasks such …

PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization

J Zhang, Y Zhao, M Saleh, P Liu - International Conference on Machine Learning, 2020 - proceedings.mlr.press
Recent work pre-training Transformers with self-supervised objectives on large text corpora
has shown great success when fine-tuned on downstream NLP tasks including text …
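
The gap-sentence generation (GSG) objective named in the title removes whole sentences from a document and trains the model to generate them. A rough sketch of building one such training example, using a simple unigram-overlap proxy in place of the paper's ROUGE-based sentence selection; the mask token is a placeholder, not the paper's exact vocabulary:

```python
# A rough sketch of PEGASUS-style gap-sentence generation (GSG) data creation:
# the sentence that best summarizes the rest of the document (scored here by a
# simple unigram-overlap proxy for ROUGE-1) is removed and becomes the target.
# Illustrative only; the paper's selection strategies and tokens differ.
def unigram_f1(candidate: str, reference: str) -> float:
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(c & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(c), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def make_gsg_example(sentences: list[str], mask_token: str = "<mask_1>"):
    # Score each sentence against the concatenation of all the others.
    scores = [
        unigram_f1(s, " ".join(sentences[:i] + sentences[i + 1:]))
        for i, s in enumerate(sentences)
    ]
    gap = max(range(len(sentences)), key=scores.__getitem__)
    inputs = " ".join(mask_token if i == gap else s for i, s in enumerate(sentences))
    return inputs, sentences[gap]  # (masked document, generation target)

doc = [
    "The storm knocked out power across the region.",
    "Crews worked overnight to restore electricity.",
    "Officials said the storm caused regional power outages overnight.",
]
model_input, target = make_gsg_example(doc)
print(model_input)
print("TARGET:", target)
```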

Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics

A Pagnoni, V Balachandran, Y Tsvetkov - arXiv preprint arXiv:2104.13346, 2021 - arxiv.org
Modern summarization models generate highly fluent but often factually unreliable outputs.
This motivated a surge of metrics attempting to measure the factuality of automatically …
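
Many of the factuality metrics FRANK benchmarks are entailment-based: a summary counts as factual to the extent that the source document entails it. A hedged sketch of that idea using the public roberta-large-mnli checkpoint; the entailment label index is assumed from that model's card, and this is not FRANK's own implementation:

```python
# A sketch of one metric family FRANK benchmarks: NLI-based factuality, where
# the source document (premise) should entail the summary (hypothesis). Uses
# the public roberta-large-mnli checkpoint; the label order is assumed from
# its model card (0=contradiction, 1=neutral, 2=entailment).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entailment_score(source: str, summary: str) -> float:
    """Probability that the source entails the summary."""
    inputs = tokenizer(source, summary, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 2].item()  # index 2 = entailment

src = "The company reported a loss of $2 million in the third quarter."
print(entailment_score(src, "The company lost money in Q3."))     # high
print(entailment_score(src, "The company doubled its profits."))  # low
```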

CTRL: A conditional transformer language model for controllable generation

NS Keskar, B McCann, LR Varshney, C Xiong… - arXiv preprint arXiv …, 2019 - arxiv.org
Large-scale language models show promising text generation capabilities, but users cannot
easily control particular aspects of the generated text. We release CTRL, a 1.63 billion …
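
The control mechanism the abstract alludes to is a control code: plain text prepended to the prompt that steers the domain or style of generation. A sketch of that conditioning via the Hugging Face port of CTRL; the checkpoint name and generation settings are assumptions rather than details from the paper:

```python
# A sketch of CTRL's conditioning mechanism: a control code (here "Reviews",
# one of the codes listed in the paper) is prepended to the prompt so the
# model generates text in that domain. Checkpoint name and generation
# settings are assumptions, not taken from the paper.
from transformers import CTRLLMHeadModel, CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained("Salesforce/ctrl")
model = CTRLLMHeadModel.from_pretrained("Salesforce/ctrl")

prompt = "Reviews This laptop"  # control code + start of the text to continue
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(
    input_ids,
    max_length=50,
    repetition_penalty=1.2,  # the paper's penalized sampling uses theta ~= 1.2
    do_sample=False,
)
print(tokenizer.decode(output[0]))
```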

Fine-tuning language models from human preferences

DM Ziegler, N Stiennon, J Wu, TB Brown… - arXiv preprint arXiv …, 2019 - arxiv.org
Reward learning enables the application of reinforcement learning (RL) to tasks where
reward is defined by human judgment, building a model of reward by asking humans …
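
The reward-learning step the abstract describes fits a scalar reward model to pairwise human preferences. A minimal PyTorch sketch of that pairwise (Bradley-Terry-style) objective, with a toy linear scorer standing in for a language-model-backed reward model:

```python
# A minimal sketch of the reward-modeling step the abstract describes: given
# pairs of candidate outputs where a human preferred one, train a reward model
# so the preferred output scores higher (a logistic / Bradley-Terry loss).
# RewardModel is a stand-in for an LM encoder with a scalar head.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.scorer = nn.Linear(embed_dim, 1)  # placeholder for LM encoder + head

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.scorer(features).squeeze(-1)  # scalar reward per example

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Hypothetical features for (human-preferred, rejected) output pairs.
preferred = torch.randn(8, 64)
rejected = torch.randn(8, 64)

# Loss: -log sigmoid(r_preferred - r_rejected), pushing preferred rewards up.
loss = -torch.nn.functional.logsigmoid(model(preferred) - model(rejected)).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"pairwise preference loss: {loss.item():.4f}")
```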

Evaluation of text generation: A survey

A Celikyilmaz, E Clark, J Gao - arXiv preprint arXiv:2006.14799, 2020 - arxiv.org
This paper surveys evaluation methods for natural language generation (NLG) systems
developed over the last few years. We group NLG evaluation methods into three …