Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

Efficient methods for natural language processing: A survey

M Treviso, JU Lee, T Ji, B van Aken, Q Cao… - Transactions of the …, 2023 - direct.mit.edu
Recent work in natural language processing (NLP) has yielded appealing results from
scaling model parameters and training data; however, using only scale to improve …

Beyond Mahalanobis distance for textual OOD detection

P Colombo, E Dadalto, G Staerman… - Advances in …, 2022 - proceedings.neurips.cc
As the number of AI systems keeps growing, it is fundamental to implement and develop
efficient control mechanisms to ensure the safe and proper functioning of machine learning …
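
The baseline this paper moves beyond is easy to state: fit a Gaussian to in-distribution sentence embeddings and flag inputs whose Mahalanobis distance to the fitted distribution is too large. A minimal NumPy sketch of that baseline (the embeddings and threshold below are placeholders, not the paper's setup):

```python
import numpy as np

def fit_gaussian(train_embs: np.ndarray):
    """Estimate mean and (regularized) precision of in-distribution embeddings."""
    mu = train_embs.mean(axis=0)
    cov = np.cov(train_embs, rowvar=False) + 1e-6 * np.eye(train_embs.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_score(x: np.ndarray, mu: np.ndarray, prec: np.ndarray) -> float:
    """Mahalanobis distance of one embedding to the fitted Gaussian."""
    d = x - mu
    return float(np.sqrt(d @ prec @ d))

# Toy usage: real embeddings would come from an encoder, e.g. a BERT [CLS] vector.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 32))      # in-distribution embeddings (placeholder)
mu, prec = fit_gaussian(train)
score = mahalanobis_score(rng.normal(size=32) * 3.0, mu, prec)
is_ood = score > 8.0                    # threshold would be tuned on held-out data
```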

MENLI: Robust evaluation metrics from natural language inference

Y Chen, S Eger - Transactions of the Association for Computational …, 2023 - direct.mit.edu
Recently proposed BERT-based evaluation metrics for text generation perform well on
standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information …
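
The core idea is to score a generated text by the entailment probability an NLI model assigns between it and the reference. A minimal sketch of that NLI component only (the full MENLI metric also combines entailment directions and mixes in BERT-based scores), assuming the Hugging Face `transformers` library and the `roberta-large-mnli` checkpoint, whose label order is contradiction/neutral/entailment:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # labels: 0=contradiction, 1=neutral, 2=entailment
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def entailment_score(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis`, used as a quality signal."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 2].item()

reference = "The company reported higher profits this quarter."
candidate = "Profits rose this quarter, the company said."
print(entailment_score(reference, candidate))  # near 1.0 for faithful outputs
```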

Do language models enjoy their own stories? Prompting large language models for automatic story evaluation

C Chhun, FM Suchanek, C Clavel - Transactions of the Association …, 2024 - direct.mit.edu
Storytelling is an integral part of human experience and plays a crucial role in social
interactions. Thus, Automatic Story Evaluation (ASE) and Generation (ASG) could benefit …

Learning disentangled textual representations via statistical measures of similarity

P Colombo, G Staerman, N Noiry… - arXiv preprint arXiv …, 2022 - arxiv.org
When working with textual data, a natural application of disentangled representations is fair
classification where the goal is to make predictions without being biased (or influenced) by …
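
Concretely, disentanglement here means training an encoder whose representations carry little information about a protected attribute, enforced by penalizing a statistical similarity measure between the representation distributions of the attribute groups. A minimal PyTorch sketch of one such penalty, an RBF-kernel MMD (the paper studies several similarity measures; the bandwidth and loss weight below are illustrative assumptions):

```python
import torch

def rbf_mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Squared MMD between two batches of representations under an RBF kernel."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# During training: split a batch by the protected attribute and add the penalty
# so that group representations become statistically indistinguishable.
torch.manual_seed(0)
z = torch.randn(64, 128, requires_grad=True)   # encoder outputs (placeholder)
attr = torch.randint(0, 2, (64,))              # binary protected attribute
penalty = rbf_mmd2(z[attr == 0], z[attr == 1])
loss = 0.1 * penalty                           # added to the task loss; weight is a guess
loss.backward()
```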

InfoLM: A new metric to evaluate summarization & data2text generation

PJA Colombo, C Clavel, P Piantanida - Proceedings of the AAAI …, 2022 - ojs.aaai.org
Assessing the quality of natural language generation (NLG) systems through human
annotation is very expensive. Additionally, human annotation campaigns are time …
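
The metric sidesteps human annotation by comparing candidate and reference through a masked language model: each sentence is summarized as an aggregated distribution over the vocabulary, and the two distributions are compared with an information measure. A rough sketch of that pipeline, assuming `bert-base-uncased` and using KL divergence as the measure (InfoLM itself supports a family of divergences and differs in aggregation details):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def bag_of_token_distribution(text: str) -> torch.Tensor:
    """Mask each position in turn and average the MLM's vocabulary distributions."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    dists = []
    for i in range(1, len(ids) - 1):            # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        dists.append(F.softmax(logits, dim=-1))
    return torch.stack(dists).mean(dim=0)

def infolm_like_score(candidate: str, reference: str) -> float:
    """KL divergence between aggregated distributions; lower means more similar."""
    p = bag_of_token_distribution(reference)
    q = bag_of_token_distribution(candidate)
    return F.kl_div(q.log(), p, reduction="sum").item()
```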

What are the best systems? New perspectives on NLP benchmarking

P Colombo, N Noiry, E Irurozki… - Advances in Neural …, 2022 - proceedings.neurips.cc
In Machine Learning, a benchmark refers to an ensemble of datasets associated
with one or multiple metrics together with a way to aggregate different systems …
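
Aggregation is where benchmarks can disagree with themselves: mean-averaging raw metric scores across tasks implicitly weights tasks by their score scales, whereas the ranking-based aggregation this paper advocates does not. A minimal Borda-count sketch of the ranking view (the systems and scores are invented for illustration):

```python
from collections import defaultdict

# Per-task scores for each system (higher is better); values are invented.
scores = {
    "task_A": {"sys1": 0.91, "sys2": 0.88, "sys3": 0.40},
    "task_B": {"sys1": 55.0, "sys2": 61.0, "sys3": 60.0},
}

borda = defaultdict(int)
for task_scores in scores.values():
    ranked = sorted(task_scores, key=task_scores.get, reverse=True)
    for rank, system in enumerate(ranked):
        borda[system] += len(ranked) - 1 - rank   # best gets n-1 points, worst 0

# Unlike mean aggregation, task_B's larger score scale cannot dominate the outcome.
print(sorted(borda.items(), key=lambda kv: -kv[1]))
```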

What makes a good story and how can we measure it? A comprehensive survey of story evaluation

D Yang, Q Jin - arXiv preprint arXiv:2408.14622, 2024 - arxiv.org
With the development of artificial intelligence, particularly the success of Large Language
Models (LLMs), the quantity and quality of automatically generated stories have significantly …

Unsupervised extractive opinion summarization using sparse coding

SBR Chowdhury, C Zhao, S Chaturvedi - arXiv preprint arXiv:2203.07921, 2022 - arxiv.org
Opinion summarization is the task of automatically generating summaries that encapsulate
information from multiple user reviews. We present Semantic Autoencoder (SemAE) to …
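
SemAE's premise is that sentence representations can be reconstructed as sparse combinations of learned latent semantic units, and that the resulting activation patterns identify representative review sentences. A minimal sketch of the sparse-coding step with scikit-learn's `SparseCoder` over a fixed random dictionary (SemAE learns its dictionary end-to-end; the embeddings and selection rule here are illustrative):

```python
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(64, 256))        # latent semantic units (placeholder)
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

sentence_embs = rng.normal(size=(100, 256))    # encoder outputs for review sentences

coder = SparseCoder(dictionary=dictionary,
                    transform_algorithm="lasso_lars",
                    transform_alpha=0.5)
codes = coder.transform(sentence_embs)         # sparse activations per sentence

# Pick sentences whose codes best match the corpus-level mean activation,
# a crude stand-in for SemAE's representativeness-based selection.
mean_code = codes.mean(axis=0)
similarity = codes @ mean_code
summary_idx = np.argsort(-similarity)[:5]
print(summary_idx)
```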