Survey of hallucination in natural language generation

Z Ji, N Lee, R Frieske, T Yu, D Su, Y Xu, E Ishii… - ACM Computing …, 2023 - dl.acm.org
Natural Language Generation (NLG) has improved exponentially in recent years thanks to
the development of sequence-to-sequence deep learning technologies such as Transformer …

A survey of natural language generation

C Dong, Y Li, H Gong, M Chen, J Li, Y Shen… - ACM Computing …, 2022 - dl.acm.org
This article offers a comprehensive review of the research on Natural Language Generation
(NLG) over the past two decades, especially in relation to data-to-text generation and text-to …

G-Eval: NLG evaluation using GPT-4 with better human alignment

Y Liu, D Iter, Y Xu, S Wang, R Xu, C Zhu - arXiv preprint arXiv:2303.16634, 2023 - arxiv.org
The quality of texts generated by natural language generation (NLG) systems is hard to
measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE …

Siren's song in the AI ocean: a survey on hallucination in large language models

Y Zhang, Y Li, L Cui, D Cai, L Liu, T Fu… - arXiv preprint arXiv …, 2023 - arxiv.org
While large language models (LLMs) have demonstrated remarkable capabilities across a
range of downstream tasks, a significant concern revolves around their propensity to exhibit …

Benchmarking large language models in retrieval-augmented generation

J Chen, H Lin, X Han, L Sun - Proceedings of the AAAI Conference on …, 2024 - ojs.aaai.org
Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the
hallucination of large language models (LLMs). However, existing research lacks rigorous …

SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization

P Laban, T Schnabel, PN Bennett… - Transactions of the …, 2022 - direct.mit.edu
In the summarization domain, a key requirement for summaries is to be factually consistent
with the input document. Previous work has found that natural language inference (NLI) …

Towards a unified multi-dimensional evaluator for text generation

M Zhong, Y Liu, D Yin, Y Mao, Y Jiao, P Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural
Language Generation (NLG), i.e., evaluating the generated text from multiple explainable …

RARR: Researching and revising what language models say, using language models

L Gao, Z Dai, P Pasupat, A Chen, AT Chaganty… - arXiv preprint arXiv …, 2022 - arxiv.org
Language models (LMs) now excel at many tasks such as few-shot learning, question
answering, reasoning, and dialog. However, they sometimes generate unsupported or …

Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Abstract Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

A culturally sensitive test to evaluate nuanced GPT hallucination

TR McIntosh, T Liu, T Susnjak, P Watters… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
The Generative Pre-trained Transformer (GPT) models, renowned for generating human-like
text, occasionally produce "hallucinations": outputs that diverge from human expectations …