Evaluating large language models: A comprehensive survey

Z Guo, R Jin, C Liu, Y Huang, D Shi, L Yu, Y Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have demonstrated remarkable capabilities across a broad
spectrum of tasks. They have attracted significant attention and been deployed in numerous …

AlignScore: Evaluating factual consistency with a unified alignment function

Y Zha, Y Yang, R Li, Z Hu - arXiv preprint arXiv:2305.16739, 2023 - arxiv.org
Many text generation applications require the generated text to be factually consistent with
input information. Automatic evaluation of factual consistency is challenging. Previous work …

LogiQA 2.0—an improved dataset for logical reasoning in natural language understanding

H Liu, J Liu, L Cui, Z Teng, N Duan… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
NLP research on logical reasoning regains momentum with the recent releases of a handful
of datasets, notably LogiQA and Reclor. Logical reasoning is exploited in many probing …

TrueTeacher: Learning factual consistency evaluation with large language models

Z Gekhman, J Herzig, R Aharoni, C Elkind… - arXiv preprint arXiv …, 2023 - arxiv.org
Factual consistency evaluation is often conducted using Natural Language Inference (NLI)
models, yet these models exhibit limited success in evaluating summaries. Previous work …

MENLI: Robust evaluation metrics from natural language inference

Y Chen, S Eger - Transactions of the Association for Computational …, 2023 - direct.mit.edu
Recently proposed BERT-based evaluation metrics for text generation perform well on
standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information …

FactKB: Generalizable factuality evaluation using language models enhanced with factual knowledge

S Feng, V Balachandran, Y Bai, Y Tsvetkov - arXiv preprint arXiv …, 2023 - arxiv.org
Evaluating the factual consistency of automatically generated summaries is essential for the
progress and adoption of reliable summarization systems. Despite recent advances, existing …

Stretching sentence-pair NLI models to reason over long documents and clusters

T Schuster, S Chen, S Buthpitiya, A Fabrikant… - arXiv preprint arXiv …, 2022 - arxiv.org
Natural Language Inference (NLI) has been extensively studied by the NLP community as a
framework for estimating the semantic relation between sentence pairs. While early work …

FineSurE: Fine-grained summarization evaluation using LLMs

H Song, H Su, I Shalyminov, J Cai… - arXiv preprint arXiv …, 2024 - arxiv.org
Automated evaluation is crucial for streamlining text summarization benchmarking and
model development, given the costly and time-consuming nature of human evaluation …

NonFactS: NonFactual summary generation for factuality evaluation in document summarization

A Soleimani, C Monz, M Worring - Findings of the Association for …, 2023 - aclanthology.org
Pre-trained abstractive summarization models can generate fluent summaries and achieve
high ROUGE scores. Previous research has found that these models often generate …

Zero-shot faithfulness evaluation for text summarization with foundation language model

Q Jia, S Ren, Y Liu, KQ Zhu - arXiv preprint arXiv:2310.11648, 2023 - arxiv.org
Despite tremendous improvements in natural language generation, summarization models
still suffer from the unfaithfulness issue. Previous work evaluates faithfulness either using …