Pre-trained language models for text generation: A survey

J Li, T Tang, WX Zhao, JY Nie, JR Wen - ACM Computing Surveys, 2024 - dl.acm.org
Text Generation aims to produce plausible and readable text in human language from input
data. The resurgence of deep learning has greatly advanced this field, in particular, with the …

Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

COMET-22: Unbabel-IST 2022 submission for the metrics shared task

R Rei, JGC De Souza, D Alves, C Zerva… - Proceedings of the …, 2022 - aclanthology.org
In this paper, we present the joint contribution of Unbabel and IST to the WMT 2022 Metrics
Shared Task. Our primary submission, dubbed COMET-22, is an ensemble between a …

COMET: A neural framework for MT evaluation

R Rei, C Stewart, AC Farinha, A Lavie - arXiv preprint arXiv:2009.09025, 2020 - arxiv.org
We present COMET, a neural framework for training multilingual machine translation
evaluation models which obtains new state-of-the-art levels of correlation with human …

BLEURT: Learning robust metrics for text generation

T Sellam, D Das, AP Parikh - arXiv preprint arXiv:2004.04696, 2020 - arxiv.org
Text generation has made significant advances in the last few years. Yet, evaluation metrics
have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate …

How multilingual is Multilingual BERT?

T Pires - arXiv preprint arXiv:1906.01502, 2019 - fq.pkwyx.com
In this paper, we show that Multilingual BERT (M-BERT), released by Devlin et al. (2018) as
a single language model pre-trained from monolingual corpora in 104 languages, is …

BERTScore: Evaluating text generation with BERT

T Zhang, V Kishore, F Wu, KQ Weinberger… - arXiv preprint arXiv …, 2019 - arxiv.org
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to
common metrics, BERTScore computes a similarity score for each token in the candidate …

MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance

W Zhao, M Peyrard, F Liu, Y Gao, CM Meyer… - arXiv preprint arXiv …, 2019 - arxiv.org
A robust evaluation metric has a profound impact on the development of text generation
systems. A desirable metric compares system output against references based on their …

To ship or not to ship: An extensive evaluation of automatic metrics for machine translation

T Kocmi, C Federmann, R Grundkiewicz… - arXiv preprint arXiv …, 2021 - arxiv.org
Automatic metrics are commonly used as the exclusive tool for declaring the superiority of
one machine translation system's quality over another. The community choice of automatic …

Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain

M Freitag, R Rei, N Mathur, C Lo… - Proceedings of the …, 2021 - aclanthology.org
This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked
to score the outputs of the translation systems competing in the WMT21 News Translation …