Findings of the WMT 2018 shared task on parallel corpus filtering

P Koehn, H Khayrallah, K Heafield… - EMNLP 2018 Third …, 2018 - research.ed.ac.uk
We posed the shared task of assigning sentence-level quality scores for a very noisy corpus
of sentence pairs crawled from the web, with the goal of sub-selecting 1% and 10% of high …

BitextEdit: Automatic bitext editing for improved low-resource machine translation

E Briakou, SI Wang, L Zettlemoyer… - arXiv preprint arXiv …, 2021 - arxiv.org
Mined bitexts can contain imperfect translations that yield unreliable training signals for
Neural Machine Translation (NMT). While filtering such pairs out is known to improve final …

The ARC-NKUA submission for the English-Ukrainian General Machine Translation Shared Task at WMT22

D Roussis, V Papavassiliou - Proceedings of the Seventh …, 2022 - aclanthology.org
The ARC-NKUA (“Athena” Research Center-National and Kapodistrian University
of Athens) submission to the WMT22 General Machine Translation shared task concerns the …

Building End-to-End Neural Machine Translation Systems for Crisis Scenarios: The Case of COVID-19

DG Roussis - 2022 - core.ac.uk
Machine Translation is a crucial task of Natural Language Processing, as it aims to provide a
fast and automatic way of translating various types of texts. In recent years, the emergence of …

uniblock: Scoring and Filtering Corpus with Unicode Block Information

Y Gao, W Wang, H Ney - arXiv preprint arXiv:1908.09716, 2019 - arxiv.org
The preprocessing pipelines in Natural Language Processing usually involve a step of
removing sentences consisting of illegal characters. The definition of illegal characters and …