Towards explainable evaluation metrics for machine translation

C Leiter, P Lertvittayakumjorn, M Fomicheva… - Journal of Machine Learning Research, 2024 - jmlr.org
Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics for
machine translation (for example, COMET or BERTScore) are based on black-box large …
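
As a rough illustration of the contrast this entry draws, the sketch below scores one hypothesis against one reference with a lexical overlap metric (sacreBLEU) and with a neural metric (BERTScore). The sentences, package choices, and default models are assumptions for illustration, not taken from the paper.

    # Minimal sketch: lexical overlap vs. neural similarity for one segment.
    # Assumes sacrebleu and bert-score are installed (pip install sacrebleu bert-score);
    # BERTScore downloads a pretrained model on first use. Example sentences are invented.
    import sacrebleu
    from bert_score import score as bert_score

    hypothesis = "The cat sits on the mat."
    reference = "A cat is sitting on the mat."

    # Lexical overlap: surface n-gram matching, no notion of synonymy.
    bleu = sacrebleu.sentence_bleu(hypothesis, [reference])
    print("BLEU:", bleu.score)

    # Neural metric: similarity of contextual embeddings from a pretrained model.
    P, R, F1 = bert_score([hypothesis], [reference], lang="en")
    print("BERTScore F1:", F1.item())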

Machine translation meta evaluation through translation accuracy challenge sets

N Moghe, A Fazla, C Amrhein, T Kocmi… - Computational Linguistics, 2024 - direct.mit.edu
Recent machine translation (MT) metrics calibrate their effectiveness by correlating with
human judgment. However, these results are often obtained by averaging predictions across …
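
For context, meta-evaluation of this kind usually means correlating metric scores with human judgments over translated segments. Below is a minimal sketch of such a segment-level correlation check, using invented toy scores and SciPy's Kendall tau; it is not the protocol used in the paper.

    # Minimal sketch of segment-level meta-evaluation: correlate metric scores
    # with human judgments. The numbers below are invented toy values.
    from scipy.stats import kendalltau

    human_scores = [0.2, 0.9, 0.5, 0.7, 0.1]    # e.g., normalized human judgments
    metric_scores = [0.3, 0.8, 0.4, 0.6, 0.2]   # scores from some MT metric

    tau, p_value = kendalltau(metric_scores, human_scores)
    print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")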

Metric score landscape challenge (MSLC23): Understanding metrics' performance on a wider landscape of translation quality

C Lo, S Larkin, R Knowles - Proceedings of the Eighth Conference on Machine Translation, 2023 - aclanthology.org
The Metric Score Landscape Challenge (MSLC23) dataset aims to gain insight into
metric scores on a wider landscape of machine translation (MT) quality. It provides a …

ACES: Translation accuracy challenge sets at WMT 2023

C Amrhein, N Moghe, L Guillou - arXiv preprint arXiv:2311.01153, 2023 - arxiv.org
We benchmark the performance of segment-level metrics submitted to WMT 2023 using the
ACES Challenge Set (Amrhein et al., 2022). The challenge set consists of 36K examples …

BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation

T Glushkova, C Zerva, AFT Martins - arXiv preprint arXiv:2305.19144, 2023 - arxiv.org
Although neural-based machine translation evaluation metrics, such as COMET or BLEURT,
have achieved strong correlations with human judgements, they are sometimes unreliable in …
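
As a generic illustration of combining a lexical and a neural score per segment (not necessarily the combination scheme proposed in this paper), one can rescale BLEU to [0, 1] and take a weighted average with a COMET-style score:

    # Generic illustration of combining a lexical and a neural metric score per
    # segment; the weight would normally be tuned against human judgments.
    def combined_score(bleu: float, comet: float, weight: float = 0.5) -> float:
        """Weighted average of a BLEU score rescaled to [0, 1] and a COMET-style score."""
        bleu_01 = bleu / 100.0  # sacreBLEU reports BLEU on a 0-100 scale
        return weight * bleu_01 + (1.0 - weight) * comet

    # Toy values: a fluent but low-overlap hypothesis may get high COMET, low BLEU.
    print(combined_score(bleu=12.0, comet=0.82))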

Multifaceted Challenge Set for Evaluating Machine Translation Performance

X Chen, D Wei, Z Wu, T Zhu, H Shang, Z Li… - Proceedings of the …, 2023 - aclanthology.org
Machine Translation Evaluation is critical to Machine Translation research, as the
evaluation results reflect the effectiveness of training strategies. As a result, a fair and …

Robustness Tests for Automatic Machine Translation Metrics with Adversarial Attacks

Y Huang, T Baldwin - arXiv preprint arXiv:2311.00508, 2023 - arxiv.org
We investigate MT evaluation metric performance on adversarially synthesized texts, to
shed light on metric robustness. We experiment with word- and character-level attacks on …
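
A character-level attack of the kind mentioned here can be as simple as perturbing a few characters of a hypothesis and checking how far a metric's score moves. The sketch below is an assumed minimal version of such a robustness probe (one adjacent-character swap, scored with sacreBLEU), not the attack suite used in the paper.

    # Minimal sketch of a character-level perturbation for metric robustness
    # testing: swap two adjacent characters and compare scores before and after.
    import random
    import sacrebleu

    def char_swap(text: str, rng: random.Random) -> str:
        """Swap one pair of adjacent characters at a random position."""
        if len(text) < 2:
            return text
        i = rng.randrange(len(text) - 1)
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]

    rng = random.Random(0)
    reference = "The committee approved the proposal yesterday."
    hypothesis = "The committee approved the proposal on Friday."

    perturbed = char_swap(hypothesis, rng)
    clean_bleu = sacrebleu.sentence_bleu(hypothesis, [reference]).score
    perturbed_bleu = sacrebleu.sentence_bleu(perturbed, [reference]).score
    print(f"clean: {clean_bleu:.1f}  perturbed: {perturbed_bleu:.1f}")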

Pulling Out All The Full Stops: Punctuation Sensitivity in Neural Machine Translation and Evaluation

P Jwalapuram - Findings of the Association for Computational Linguistics, 2023 - aclanthology.org
Much of the work testing machine translation systems for robustness and sensitivity has
been adversarial or tended towards testing noisy input such as spelling errors, or non …

Segment-level evaluation of machine translation metrics

N Moghe - 2024 - era.ed.ac.uk
Most metrics evaluating Machine Translation (MT) claim their effectiveness by demonstrating
their ability to distinguish the quality of different MT systems over a large corpus (system …

Evaluation of Pre-trained Metrics and ChatGPT as Document-level Machine Translation Metrics

N Bleiker - cl.uzh.ch
Automatic evaluation metrics play an important role in the development and optimization of
machine translation (MT) systems as they are the main method used for evaluating and …