Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted. This issue has become …
COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task
R Rei, JGC de Souza, D Alves, C Zerva, AC Farinha, T Glushkova, A Lavie, L Coheur, AFT Martins - Proceedings of the Seventh Conference on Machine Translation (WMT), 2022 - aclanthology.org
In this paper, we present the joint contribution of Unbabel and IST to the WMT 2022 Metrics Shared Task. Our primary submission, dubbed COMET-22, is an ensemble between a …
COMET: A Neural Framework for MT Evaluation
R Rei, C Stewart, AC Farinha, A Lavie - arXiv preprint arXiv:2009.09025, 2020 - arxiv.org
We present COMET, a neural framework for training multilingual machine translation evaluation models which obtains new state-of-the-art levels of correlation with human …
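For orientation, a minimal sketch of scoring translations with the open-source comet package; the checkpoint name Unbabel/wmt22-comet-da follows the COMET repository's documentation, and the exact predict-output fields are an assumption about recent library versions:

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Download a pretrained COMET checkpoint (name taken from the COMET docs).
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# COMET scores triples of source, machine translation, and reference.
data = [{
    "src": "Dem Feuer konnte Einhalt geboten werden",
    "mt": "The fire could be stopped",
    "ref": "They were able to control the fire.",
}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # per-segment quality scores
print(output.system_score)  # corpus-level average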
BLEURT: Learning Robust Metrics for Text Generation
T Sellam, D Das, AP Parikh - arXiv preprint arXiv:2004.04696, 2020 - arxiv.org
Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate …
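A minimal sketch of scoring with the bleurt package, assuming the API from the google-research/bleurt README; the checkpoint must be downloaded and unzipped separately, and BLEURT-20 here is a placeholder path:

```python
# pip install git+https://github.com/google-research/bleurt.git
from bleurt import score

# Path to an unzipped BLEURT checkpoint directory (assumed available locally).
checkpoint = "BLEURT-20"
scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(
    references=["The fire could be stopped."],
    candidates=["They were able to control the fire."],
)
print(scores)  # one learned quality score per candidate/reference pair
```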
How Multilingual is Multilingual BERT?
T Pires, E Schlinger, D Garrette - arXiv preprint arXiv:1906.01502, 2019 - arxiv.org
In this paper, we show that Multilingual BERT (M-BERT), released by Devlin et al. (2018) as a single language model pre-trained from monolingual corpora in 104 languages, is …
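A minimal sketch of loading the released M-BERT checkpoint; the Hugging Face transformers package is our choice of tooling, not the paper's, though bert-base-multilingual-cased is the released model:

```python
# pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer

# One shared encoder covering the 104 pretraining languages.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

# The same model embeds text in any of its languages, which is what
# enables the zero-shot cross-lingual transfer the paper studies.
for text in ["The fire could be stopped.", "Dem Feuer konnte Einhalt geboten werden."]:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    print(text, hidden.shape)
```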
BERTScore: Evaluating Text Generation with BERT
T Zhang, V Kishore, F Wu, KQ Weinberger, Y Artzi - arXiv preprint arXiv:1904.09675, 2019 - arxiv.org
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate …
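A minimal numpy sketch of BERTScore's greedy-matching core on precomputed token embeddings; it omits the paper's IDF weighting and baseline rescaling, and the random embeddings stand in for real BERT outputs:

```python
import numpy as np

def bertscore_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Greedy matching over contextual token embeddings.

    cand_emb: (m, d) embeddings of candidate tokens.
    ref_emb:  (n, d) embeddings of reference tokens.
    """
    # Cosine similarity between every candidate and reference token.
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T  # (m, n)

    precision = sim.max(axis=1).mean()  # each candidate token -> best reference match
    recall = sim.max(axis=0).mean()     # each reference token -> best candidate match
    return 2 * precision * recall / (precision + recall)

# Toy usage with random stand-ins for BERT embeddings.
rng = np.random.default_rng(0)
print(bertscore_f1(rng.normal(size=(5, 8)), rng.normal(size=(7, 8))))
```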
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
W Zhao, M Peyrard, F Liu, Y Gao, CM Meyer, S Eger - arXiv preprint arXiv:1909.02622, 2019 - arxiv.org
A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their …
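A simplified sketch of the transport idea behind MoverScore: with equal-length token sets and uniform weights, the minimum-cost one-to-one assignment coincides with the Earth Mover's Distance, so scipy's linear_sum_assignment serves as a stand-in here; the full metric uses weighted EMD over n-gram embeddings, which this deliberately simplifies:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mover_cost(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Minimum transport cost between two equal-length embedding sets."""
    # Pairwise Euclidean distances between token embeddings.
    dist = np.linalg.norm(cand_emb[:, None, :] - ref_emb[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(dist)  # minimum-cost matching
    return dist[rows, cols].mean()  # lower cost = semantically closer

# Toy usage with random stand-ins for contextual embeddings.
rng = np.random.default_rng(0)
print(mover_cost(rng.normal(size=(6, 8)), rng.normal(size=(6, 8))))
```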
To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation
T Kocmi, C Federmann, R Grundkiewicz, M Junczys-Dowmunt, H Matsushita, A Menezes - arXiv preprint arXiv:2107.10821, 2021 - arxiv.org
Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community choice of automatic …
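The meta-evaluation this paper centers on is pairwise system accuracy: how often the sign of a metric's difference between two systems agrees with the sign of the human difference. A minimal sketch with illustrative numbers (the scores below are made up):

```python
import numpy as np

def pairwise_accuracy(metric_scores, human_scores) -> float:
    """Fraction of system pairs where the metric's ranking of the two
    systems agrees with the human ranking (human ties are skipped)."""
    metric = np.asarray(metric_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    agree, total = 0, 0
    for i in range(len(metric)):
        for j in range(i + 1, len(metric)):
            if human[i] == human[j]:
                continue  # no human preference to agree with
            total += 1
            agree += np.sign(metric[i] - metric[j]) == np.sign(human[i] - human[j])
    return agree / total

# Four hypothetical MT systems: system-level metric vs. human scores.
print(pairwise_accuracy([0.82, 0.79, 0.85, 0.80], [71.2, 69.5, 73.0, 70.1]))
```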
Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-based Human Evaluations on TED and News Domain
M Freitag, R Rei, N Mathur, CK Lo, C Stewart, G Foster, A Lavie, O Bojar - Proceedings of the Sixth Conference on Machine Translation (WMT), 2021 - aclanthology.org
This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT21 News Translation …