A Survey of Evaluation Metrics Used for NLG Systems
AB Sai, AK Mohankumar, MM Khapra - ACM Computing Surveys, 2022 - dl.acm.org
In the last few years, a large number of automatic evaluation metrics have been proposed for evaluating Natural Language Generation (NLG) systems. The rapid development and …
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
S Min, K Krishna, X Lyu, M Lewis, W Yih, PW Koh… - arXiv preprint arXiv:2305.14251, 2023 - arxiv.org
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces …
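Once a generation has been decomposed into atomic facts, FActScore reduces to a precision computation: the fraction of atomic facts supported by a knowledge source. A minimal sketch of that final step, assuming fact extraction and the support judgment are supplied externally (the `is_supported` callback below is a hypothetical stand-in for the paper's retrieval-plus-LM verifier):

```python
from typing import Callable, Iterable

def factscore(atomic_facts: Iterable[str],
              is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic facts judged supported by the knowledge source."""
    facts = list(atomic_facts)
    if not facts:
        return 0.0
    supported = sum(is_supported(f) for f in facts)
    return supported / len(facts)

# Toy usage: in the paper the support check is retrieval over Wikipedia
# plus an LM verifier; here a set lookup fakes it.
known = {"Paris is the capital of France."}
score = factscore(
    ["Paris is the capital of France.", "Paris has 40 million residents."],
    lambda fact: fact in known,
)
print(score)  # 0.5
```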
BARTScore: Evaluating Generated Text as Text Generation
W Yuan, G Neubig, P Liu - Advances in Neural Information …, 2021 - proceedings.neurips.cc
A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate …
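BARTScore treats evaluation itself as a generation problem: a hypothesis is scored by the log-likelihood a pretrained seq2seq model assigns to it given the source or reference. A minimal sketch of that core computation with Hugging Face transformers (the released implementation adds direction variants and prompt/weighting options not shown here):

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

name = "facebook/bart-large-cnn"
tok = BartTokenizer.from_pretrained(name)
model = BartForConditionalGeneration.from_pretrained(name).eval()

def bart_score(x: str, y: str) -> float:
    """Average token log-probability of generating y from x."""
    src = tok(x, return_tensors="pt", truncation=True)
    tgt = tok(y, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(input_ids=src.input_ids,
                    attention_mask=src.attention_mask,
                    labels=tgt.input_ids)
    # out.loss is the mean token cross-entropy, so its negation is the
    # average log-probability of y given x.
    return -out.loss.item()

print(bart_score("The cat sat on the mat.", "A cat was sitting on a mat."))
```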
COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task
R Rei, JGC de Souza, D Alves, C Zerva, AC Farinha… - Proceedings of the Seventh Conference on Machine Translation (WMT), 2022 - aclanthology.org
In this paper, we present the joint contribution of Unbabel and IST to the WMT 2022 Metrics Shared Task. Our primary submission, dubbed COMET-22, is an ensemble between a …
COMET: A Neural Framework for MT Evaluation
R Rei, C Stewart, AC Farinha, A Lavie - arXiv preprint arXiv:2009.09025, 2020 - arxiv.org
We present COMET, a neural framework for training multilingual machine translation evaluation models which obtains new state-of-the-art levels of correlation with human …
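COMET checkpoints are distributed through the authors' unbabel-comet Python package; each segment is scored from a source, hypothesis, and reference triple. A usage sketch, assuming the package is installed and using the later Unbabel/wmt22-comet-da checkpoint as an example:

```python
from comet import download_model, load_from_checkpoint

# Download and load a COMET checkpoint (wmt22-comet-da is the
# reference-based model from the group's later COMET-22 submission).
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Der Hund bellte die ganze Nacht.",
    "mt":  "The dog barked all night.",
    "ref": "The dog was barking the whole night.",
}]

# predict returns per-segment scores and a corpus-level average.
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # segment-level scores
print(output.system_score)  # corpus-level score
```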
BERTScore: Evaluating Text Generation with BERT
T Zhang, V Kishore, F Wu, KQ Weinberger, Y Artzi - arXiv preprint arXiv:1904.09675, 2019 - arxiv.org
Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate …
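At its core, BERTScore greedily matches contextual token embeddings by cosine similarity: each candidate token takes its best-matching reference token for precision, each reference token its best candidate match for recall, and the two combine into an F1. A sketch of just that matching step (the actual metric uses BERT embeddings and optional IDF weighting; random vectors stand in here):

```python
import torch

def bertscore_f1(cand_emb: torch.Tensor, ref_emb: torch.Tensor) -> float:
    """Greedy-matching F1 over L2-normalized token embeddings.

    cand_emb: (num_cand_tokens, dim), ref_emb: (num_ref_tokens, dim).
    """
    c = torch.nn.functional.normalize(cand_emb, dim=-1)
    r = torch.nn.functional.normalize(ref_emb, dim=-1)
    sim = c @ r.T                             # pairwise cosine similarities
    precision = sim.max(dim=1).values.mean()  # best ref match per cand token
    recall = sim.max(dim=0).values.mean()     # best cand match per ref token
    return (2 * precision * recall / (precision + recall)).item()

# Toy usage with random "embeddings"; in practice these come from BERT.
torch.manual_seed(0)
print(bertscore_f1(torch.randn(5, 8), torch.randn(6, 8)))
```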
Findings of the 2019 Conference on Machine Translation (WMT19)
L Barrault, O Bojar, MR Costa-jussà, C Federmann… - Proceedings of the Fourth Conference on Machine Translation, 2019 - aclanthology.org
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine …
To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation
T Kocmi, C Federmann, R Grundkiewicz… - Proceedings of the Sixth Conference on Machine Translation, 2021 - aclanthology.org
Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community choice of automatic …
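A central tool in this kind of evaluation is system-level pairwise accuracy: across all pairs of MT systems, the fraction where the sign of the metric's score difference matches the sign of the human score difference. A minimal sketch of that computation (system names and scores below are illustrative):

```python
from itertools import combinations

def pairwise_accuracy(metric_scores: dict, human_scores: dict) -> float:
    """Fraction of system pairs where metric and human deltas agree in sign."""
    systems = sorted(metric_scores)
    agree = total = 0
    for a, b in combinations(systems, 2):
        m_delta = metric_scores[a] - metric_scores[b]
        h_delta = human_scores[a] - human_scores[b]
        total += 1
        if m_delta * h_delta > 0:  # same sign: metric ranks the pair correctly
            agree += 1
    return agree / total

metric = {"sysA": 0.82, "sysB": 0.79, "sysC": 0.85}
human = {"sysA": 71.0, "sysB": 74.0, "sysC": 76.5}
print(pairwise_accuracy(metric, human))  # 2 of 3 pairs agree here
```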
Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-based Human Evaluations on TED and News Domain
M Freitag, R Rei, N Mathur, C Lo, C Stewart, G Foster… - Proceedings of the Sixth Conference on Machine Translation, 2021 - aclanthology.org
This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT21 News Translation …