Pre-trained language models for text generation: A survey

J Li, T Tang, WX Zhao, JY Nie, JR Wen - ACM Computing Surveys, 2024 - dl.acm.org
Text Generation aims to produce plausible and readable text in human language from input
data. The resurgence of deep learning has greatly advanced this field, in particular, with the …

Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

COMET-22: Unbabel-IST 2022 submission for the metrics shared task

R Rei, JGC De Souza, D Alves, C Zerva… - Proceedings of the …, 2022 - aclanthology.org
In this paper, we present the joint contribution of Unbabel and IST to the WMT 2022 Metrics
Shared Task. Our primary submission, dubbed COMET-22, is an ensemble between a …

COMET: A neural framework for MT evaluation

R Rei, C Stewart, AC Farinha, A Lavie - arXiv preprint arXiv:2009.09025, 2020 - arxiv.org
We present COMET, a neural framework for training multilingual machine translation
evaluation models which obtains new state-of-the-art levels of correlation with human …

BLEURT: Learning robust metrics for text generation

T Sellam, D Das, AP Parikh - arXiv preprint arXiv:2004.04696, 2020 - arxiv.org
Text generation has made significant advances in the last few years. Yet, evaluation metrics
have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate …

How multilingual is Multilingual BERT?

T Pires - arXiv preprint arXiv:1906.01502, 2019 - fq.pkwyx.com
In this paper, we show that Multilingual BERT (M-BERT), released by Devlin et al. (2018) as
a single language model pre-trained from monolingual corpora in 104 languages, is …

BERTScore: Evaluating text generation with BERT

T Zhang, V Kishore, F Wu, KQ Weinberger… - arXiv preprint arXiv …, 2019 - arxiv.org
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to
common metrics, BERTScore computes a similarity score for each token in the candidate …

MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance

W Zhao, M Peyrard, F Liu, Y Gao, CM Meyer… - arXiv preprint arXiv …, 2019 - arxiv.org
A robust evaluation metric has a profound impact on the development of text generation
systems. A desirable metric compares system output against references based on their …

To ship or not to ship: An extensive evaluation of automatic metrics for machine translation

T Kocmi, C Federmann, R Grundkiewicz… - arXiv preprint arXiv …, 2021 - arxiv.org
Automatic metrics are commonly used as the exclusive tool for declaring the superiority of
one machine translation system's quality over another. The community choice of automatic …

Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain

M Freitag, R Rei, N Mathur, C Lo… - Proceedings of the …, 2021 - aclanthology.org
This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked
to score the outputs of the translation systems competing in the WMT21 News Translation …