Findings of the WMT 2020 shared task on parallel corpus filtering and alignment

P Koehn, V Chaudhary, A El-Kishky… - Proceedings of the …, 2020 - aclanthology.org
Following two preceding WMT Shared Tasks on Parallel Corpus Filtering (Koehn et al., 2018,
2019), we posed again the challenge of assigning sentence-level quality scores for very …

Findings of the WMT 2018 shared task on parallel corpus filtering

P Koehn, H Khayrallah, K Heafield… - EMNLP 2018 Third …, 2018 - research.ed.ac.uk
We posed the shared task of assigning sentence-level quality scores for a very noisy corpus
of sentence pairs crawled from the web, with the goal of sub-selecting 1% and 10% of high …

Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions

P Koehn, F Guzmán, V Chaudhary… - Proceedings of the Fourth …, 2019 - aclanthology.org
Following the WMT 2018 Shared Task on Parallel Corpus Filtering, we posed the
challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs …

Prompsit's submission to WMT 2018 parallel corpus filtering shared task

VM Sánchez-Cartagena, M Bañón… - Proceedings of the …, 2018 - aclanthology.org
This paper describes Prompsit Language Engineering's submissions to the WMT
2018 parallel corpus filtering shared task. Our four submissions were based on an automatic …

Parallel corpus filtering via pre-trained language models

B Zhang, A Nagesh, K Knight - arXiv preprint arXiv:2005.06166, 2020 - arxiv.org
Web-crawled data provides a good source of parallel corpora for training machine
translation models. It is automatically obtained, but extremely noisy, and recent work shows …

Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: The NRC supervised submissions to the …

C Lo, M Simard, D Stewart, S Larkin… - Proceedings of the …, 2018 - aclanthology.org
We present our semantic textual similarity approach in filtering a noisy web crawled parallel
corpus using YiSi—a novel semantic machine translation evaluation metric. The systems …

OpusFilter: A configurable parallel corpus filtering toolbox

M Aulamo, S Virpioja… - … Annual Conference of …, 2020 - researchportal.helsinki.fi
This paper introduces OpusFilter, a flexible and modular toolbox for filtering parallel corpora.
It implements a number of components based on heuristic filters, language identification …

Microsoft's submission to the WMT2018 news translation task: How I learned to stop worrying and love the data

M Junczys-Dowmunt - arXiv preprint arXiv:1809.00196, 2018 - arxiv.org
This paper describes the Microsoft submission to the WMT2018 news translation shared
task. We participated in one language direction--English-German. Our system follows …

Not low-resource anymore: Aligner ensembling, batch filtering, and new datasets for Bengali-English machine translation

T Hasan, A Bhattacharjee, K Samin, M Hasan… - arXiv preprint arXiv …, 2020 - arxiv.org
Despite being the seventh most widely spoken language in the world, Bengali has received
much less attention in machine translation literature due to being low in resources. Most …

The impact of sentence alignment errors on phrase-based machine translation performance

C Goutte, M Carpuat, G Foster - … of the 10th conference of the …, 2012 - aclanthology.org
When parallel or comparable corpora are harvested from the web, there is typically a
tradeoff between the size and quality of the data. In order to improve quality, corpus …