Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

FActScore: Fine-grained atomic evaluation of factual precision in long form text generation

S Min, K Krishna, X Lyu, M Lewis, W Yih… - arXiv preprint arXiv …, 2023 - arxiv.org
Evaluating the factuality of long-form text generated by large language models (LMs) is
non-trivial because (1) generations often contain a mixture of supported and unsupported pieces …

Enabling large language models to generate text with citations

T Gao, H Yen, J Yu, D Chen - arXiv preprint arXiv:2305.14627, 2023 - arxiv.org
Large language models (LLMs) have emerged as a widely-used tool for information
seeking, but their generated outputs are prone to hallucination. In this work, our aim is to …

LongEval: Guidelines for human evaluation of faithfulness in long-form summarization

K Krishna, E Bransom, B Kuehl, M Iyyer… - arXiv preprint arXiv …, 2023 - arxiv.org
While human evaluation remains best practice for accurately judging the faithfulness of
automatically-generated summaries, few solutions exist to address the increased difficulty …

How to train long-context language models (effectively)

T Gao, A Wettig, H Yen, D Chen - arXiv preprint arXiv:2410.02660, 2024 - arxiv.org
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to
make effective use of long-context information. We first establish a reliable evaluation …

Benchmarking generation and evaluation capabilities of large language models for instruction controllable summarization

Y Liu, AR Fabbri, J Chen, Y Zhao, S Han, S Joty… - arXiv preprint arXiv …, 2023 - arxiv.org
While large language models (LLMs) already achieve strong performance on standard
generic summarization benchmarks, their performance on more complex summarization …

Extractive is not faithful: An investigation of broad unfaithfulness problems in extractive summarization

S Zhang, D Wan, M Bansal - arXiv preprint arXiv:2209.03549, 2022 - arxiv.org
The problems of unfaithful summaries have been widely discussed under the context of
abstractive summarization. Though extractive summarization is less prone to the common …

On the limitations of reference-free evaluations of generated text

D Deutsch, R Dror, D Roth - arXiv preprint arXiv:2210.12563, 2022 - arxiv.org
There is significant interest in developing evaluation metrics which accurately estimate the
quality of generated text without the aid of a human-written reference text, which can be time …

HELMET: How to evaluate long-context language models effectively and thoroughly

H Yen, T Gao, M Hou, K Ding, D Fleischer… - arXiv preprint arXiv …, 2024 - arxiv.org
There have been many benchmarks for evaluating long-context language models (LCLMs),
but developers often rely on synthetic tasks like needle-in-a-haystack (NIAH) or arbitrary …

Molecular facts: Desiderata for decontextualization in LLM fact verification

A Gunjal, G Durrett - arXiv preprint arXiv:2406.20079, 2024 - arxiv.org
Automatic factuality verification of large language model (LLM) generations is becoming
more and more widely used to combat hallucinations. A major point of tension in the …