Findings of the WMT 2018 shared task on parallel corpus filtering

P Koehn, H Khayrallah, K Heafield… - EMNLP 2018 Third …, 2018 - research.ed.ac.uk
We posed the shared task of assigning sentence-level quality scores for a very noisy corpus
of sentence pairs crawled from the web, with the goal of sub-selecting 1% and 10% of high …

Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions

P Koehn, F Guzmán, V Chaudhary… - Proceedings of the Fourth …, 2019 - aclanthology.org
Following the WMT 2018 Shared Task on Parallel Corpus Filtering, we posed the
challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs …

Findings of the WMT 2020 shared task on parallel corpus filtering and alignment

P Koehn, V Chaudhary, A El-Kishky… - Proceedings of the …, 2020 - aclanthology.org
Following two preceding WMT Shared Tasks on Parallel Corpus Filtering (Koehn et al., 2018,
2019), we posed again the challenge of assigning sentence-level quality scores for very …

Prompsit's submission to WMT 2018 parallel corpus filtering shared task

VM Sánchez-Cartagena, M Bañón… - Proceedings of the …, 2018 - aclanthology.org
This paper describes Prompsit Language Engineering's submissions to the WMT
2018 parallel corpus filtering shared task. Our four submissions were based on an automatic …

Parallel corpus filtering via pre-trained language models

B Zhang, A Nagesh, K Knight - arXiv preprint arXiv:2005.06166, 2020 - arxiv.org
Web-crawled data provides a good source of parallel corpora for training machine
translation models. It is automatically obtained, but extremely noisy, and recent work shows …

OpusFilter: A configurable parallel corpus filtering toolbox

M Aulamo, S Virpioja… - … Annual Conference of …, 2020 - researchportal.helsinki.fi
This paper introduces OpusFilter, a flexible and modular toolbox for filtering parallel corpora.
It implements a number of components based on heuristic filters, language identification …

Bifixer and bicleaner: two open-source tools to clean your parallel data

G Ramírez‐Sánchez… - Proceedings of the …, 2020 - aclanthology.org
This paper shows the utility of two open-source tools designed for parallel data cleaning:
Bifixer and Bicleaner. Already used to clean highly noisy parallel content from crawled …

Margin-based parallel corpus mining with multilingual sentence embeddings

M Artetxe, H Schwenk - arXiv preprint arXiv:1811.01136, 2018 - arxiv.org
Machine translation is highly sensitive to the size and quality of the training data, which has
led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we …

Microsoft's submission to the WMT2018 news translation task: How I learned to stop worrying and love the data

M Junczys-Dowmunt - arXiv preprint arXiv:1809.00196, 2018 - arxiv.org
This paper describes the Microsoft submission to the WMT2018 news translation shared
task. We participated in one language direction--English-German. Our system follows …

Dual conditional cross-entropy filtering of noisy parallel corpora

M Junczys-Dowmunt - arXiv preprint arXiv:1809.00197, 2018 - arxiv.org
In this work we introduce dual conditional cross-entropy filtering for noisy parallel data. For
each sentence pair of the noisy parallel corpus we compute cross-entropy scores according …