Parallel corpus filtering via pre-trained language models

B Zhang, A Nagesh, K Knight - arXiv preprint arXiv:2005.06166, 2020 - arxiv.org
Web-crawled data provides a good source of parallel corpora for training machine
translation models. It is automatically obtained, but extremely noisy, and recent work shows …

Filtering noisy parallel corpus using transformers with proxy task learning

H Açarçiçek, T Çolakoğlu, PEA Hatipoğlu… - Proceedings of the …, 2020 - aclanthology.org
This paper illustrates Huawei's submission to the WMT20 low-resource parallel corpus
filtering shared task. Our approach focuses on developing a proxy task learner on top of a …

Findings of the WMT 2018 shared task on parallel corpus filtering

P Koehn, H Khayrallah, K Heafield… - EMNLP 2018 Third …, 2018 - research.ed.ac.uk
We posed the shared task of assigning sentence-level quality scores for a very noisy corpus
of sentence pairs crawled from the web, with the goal of sub-selecting 1% and 10% of high …

JParaCrawl: A large scale web-based English-Japanese parallel corpus

M Morishita, J Suzuki, M Nagata - arXiv preprint arXiv:1911.10668, 2019 - arxiv.org
Recent machine translation algorithms mainly rely on parallel corpora. However, since the
availability of parallel corpora remains limited, only some resource-rich language pairs can …

Findings of the WMT 2020 shared task on parallel corpus filtering and alignment

P Koehn, V Chaudhary, A El-Kishky… - Proceedings of the …, 2020 - aclanthology.org
Following two preceding WMT Shared Tasks on Parallel Corpus Filtering (Koehn et al., 2018,
2019), we posed again the challenge of assigning sentence-level quality scores for very …

Pseudotext Injection and Advance Filtering of Low‐Resource Corpus for Neural Machine Translation

M Adjeisah, G Liu, DO Nyabuga… - Computational …, 2021 - Wiley Online Library
Scaling natural language processing (NLP) to low‐resourced languages to improve
machine translation (MT) performance remains enigmatic. This research contributes to the …

Tilde's parallel corpus filtering methods for WMT 2018

M Pinnis - Proceedings of the Third Conference on Machine …, 2018 - aclanthology.org
The paper describes parallel corpus filtering methods that allow reducing noise of noisy
“parallel” corpora from a level where the corpora are not usable for neural machine …

Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions

P Koehn, F Guzmán, V Chaudhary… - Proceedings of the Fourth …, 2019 - aclanthology.org
Following the WMT 2018 Shared Task on Parallel Corpus Filtering, we posed the
challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs …

Machine translation with weakly paired documents

L Wu, J Zhu, D He, F Gao, T Qin, J Lai… - Proceedings of the 2019 …, 2019 - aclanthology.org
Neural machine translation, which achieves near human-level performance in some
languages, strongly relies on the large amounts of parallel sentences, which hinders its …

The RWTH Aachen University filtering system for the WMT 2018 parallel corpus filtering task

N Rossenbach, J Rosendahl, Y Kim… - Proceedings of the …, 2018 - aclanthology.org
This paper describes the submission of RWTH Aachen University for the De→En parallel
corpus filtering task of the EMNLP 2018 Third Conference on Machine Translation (WMT …