Margin-based parallel corpus mining with multilingual sentence embeddings

M Artetxe, H Schwenk - arXiv preprint arXiv:1811.01136, 2018 - arxiv.org
Machine translation is highly sensitive to the size and quality of the training data, which has
led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we …
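
A rough sketch of the margin criterion this line of work builds on, assuming L2-normalized sentence embeddings from a shared multilingual encoder; the function name, the neighbourhood size k, and any selection threshold are illustrative rather than taken from the paper:

```python
import numpy as np

def margin_scores(src_emb, tgt_emb, k=4):
    """Ratio-margin scores between all source/target sentence pairs.

    src_emb, tgt_emb: L2-normalized embeddings of shape (n, d) and (m, d)
    from a shared multilingual encoder (an assumption of this sketch).
    """
    sim = src_emb @ tgt_emb.T                          # cosine similarities (n, m)
    # average similarity to the k nearest neighbours, in both directions
    fwd = np.sort(sim, axis=1)[:, -k:].mean(axis=1)    # per source sentence (n,)
    bwd = np.sort(sim, axis=0)[-k:, :].mean(axis=0)    # per target sentence (m,)
    # margin = cosine of the candidate pair relative to its neighbourhood average
    return sim / (0.5 * (fwd[:, None] + bwd[None, :]))
```

Pairs whose margin score exceeds a tuned cut-off would then be extracted as candidate translations; the cut-off and k are hyperparameters.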

Filtering and mining parallel data in a joint multilingual space

H Schwenk - arXiv preprint arXiv:1805.09822, 2018 - arxiv.org
We learn a joint multilingual sentence embedding and use the distance between sentences
in different languages to filter noisy parallel data and to mine for parallel data in large news …
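
A minimal illustration of distance-based filtering in such a joint space, assuming pre-computed, L2-normalized embeddings for each candidate pair; the threshold value is hypothetical and would need tuning on held-out data:

```python
import numpy as np

def filter_pairs(src_emb, tgt_emb, pairs, threshold=0.8):
    """Keep candidate pairs whose two sides are close in the joint space.

    src_emb[i] and tgt_emb[i] embed the two sides of pairs[i]; embeddings are
    assumed L2-normalized, and `threshold` is an illustrative cut-off.
    """
    sims = (src_emb * tgt_emb).sum(axis=1)   # cosine similarity per pair
    return [pair for pair, s in zip(pairs, sims) if s >= threshold]
```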

Findings of the WMT 2018 shared task on parallel corpus filtering

P Koehn, H Khayrallah, K Heafield… - EMNLP 2018 Third …, 2018 - research.ed.ac.uk
We posed the shared task of assigning sentence-level quality scores for a very noisy corpus
of sentence pairs crawled from the web, with the goal of sub-selecting 1% and 10% of high …
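
To make the sub-selection step concrete, a small sketch that keeps the highest-scoring pairs up to a fixed word budget; counting the budget in target-side words is an assumption of the sketch, not the shared task's exact protocol:

```python
def subselect(scored_pairs, word_budget):
    """Greedy sub-selection of the highest-scoring pairs up to a word budget.

    scored_pairs: iterable of (score, src, tgt) triples; the budget is counted
    in target-side words here purely for illustration.
    """
    selected, words = [], 0
    for score, src, tgt in sorted(scored_pairs, key=lambda t: -t[0]):
        n = len(tgt.split())
        if words + n > word_budget:
            continue                 # skip pairs that would overshoot the budget
        selected.append((src, tgt))
        words += n
    return selected
```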

Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions

P Koehn, F Guzmán, V Chaudhary… - Proceedings of the Fourth …, 2019 - aclanthology.org
Following the WMT 2018 Shared Task on Parallel Corpus Filtering, we posed the
challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs …

Effective parallel corpus mining using bilingual sentence embeddings

M Guo, Q Shen, Y Yang, H Ge, D Cer… - arXiv preprint arXiv …, 2018 - arxiv.org
This paper presents an effective approach for parallel corpus mining using bilingual
sentence embeddings. Our embedding models are trained to produce similar …
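
One common way to train such embeddings is a dual encoder with an in-batch translation-ranking objective; the sketch below illustrates that idea only and does not reproduce the paper's exact model or loss:

```python
import numpy as np

def translation_ranking_loss(src_emb, tgt_emb, scale=10.0):
    """In-batch softmax ranking loss for a dual sentence encoder (a sketch).

    Each source sentence should score its aligned translation (the diagonal)
    higher than the other targets in the batch; `scale` is a hypothetical
    temperature, and the encoder producing the embeddings is assumed.
    """
    logits = scale * (src_emb @ tgt_emb.T)               # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # aligned pairs on the diagonal
```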

Findings of the WMT 2020 shared task on parallel corpus filtering and alignment

P Koehn, V Chaudhary, A El-Kishky… - Proceedings of the …, 2020 - aclanthology.org
Following the two preceding WMT Shared Tasks on Parallel Corpus Filtering (Koehn et al., 2018,
2019), we again posed the challenge of assigning sentence-level quality scores for very …

ParaCrawl: Web-scale acquisition of parallel corpora

M Bañón, P Chen, B Haddow, K Heafield, H Hoang… - 2020 - strathprints.strath.ac.uk
We report on methods to create the largest publicly available parallel corpora by crawling
the web, using open source software. We empirically compare alternative methods and …

Zipporah: a fast and scalable data cleaning system for noisy web-crawled parallel corpora

H Xu, P Koehn - Proceedings of the 2017 conference on empirical …, 2017 - aclanthology.org
We introduce Zipporah, a fast and scalable data cleaning system. We propose a novel type
of bag-of-words translation feature, and train logistic regression models to classify good data …
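
A minimal sketch of this kind of classifier-based filtering, assuming per-pair feature vectors (such as a bag-of-words translation score plus language-model scores) have already been computed; scikit-learn is used here purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_quality_classifier(good_feats, noisy_feats):
    """Logistic regression separating good from synthetic-noisy sentence pairs.

    good_feats / noisy_feats: arrays of per-pair feature vectors; computing
    the features themselves is outside this sketch.
    """
    X = np.vstack([good_feats, noisy_feats])
    y = np.concatenate([np.ones(len(good_feats)), np.zeros(len(noisy_feats))])
    return LogisticRegression(max_iter=1000).fit(X, y)

# clf = train_quality_classifier(good, noisy)
# clf.predict_proba(candidate_feats)[:, 1] then serves as a per-pair quality score.
```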

JParaCrawl: A large scale web-based English-Japanese parallel corpus

M Morishita, J Suzuki, M Nagata - arXiv preprint arXiv:1911.10668, 2019 - arxiv.org
Recent machine translation algorithms mainly rely on parallel corpora. However, since the
availability of parallel corpora remains limited, only some resource-rich language pairs can …

Dual conditional cross-entropy filtering of noisy parallel corpora

M Junczys-Dowmunt - arXiv preprint arXiv:1809.00197, 2018 - arxiv.org
In this work we introduce dual conditional cross-entropy filtering for noisy parallel data. For
each sentence pair of the noisy parallel corpus we compute cross-entropy scores according …
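
A small sketch of how two such scores might be combined, assuming word-normalized cross-entropies from a forward and an inverse translation model; the weighting below follows the commonly described form of the score and should be read as illustrative:

```python
import math

def dual_xent_score(h_fwd, h_bwd):
    """Combine cross-entropies from two inverse translation models (a sketch).

    h_fwd: word-normalized cross-entropy of the target given the source under
    a source-to-target model; h_bwd: the same for the inverse direction.
    A good pair needs the two models to agree (small difference) and to find
    the pair likely (small average); exp(-penalty) maps the result to (0, 1].
    """
    penalty = abs(h_fwd - h_bwd) + 0.5 * (h_fwd + h_bwd)
    return math.exp(-penalty)
```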