ParaCrawl: Web-scale acquisition of parallel corpora

M Bañón, P Chen, B Haddow, K Heafield, H Hoang… - 2020 - strathprints.strath.ac.uk
We report on methods to create the largest publicly available parallel corpora by crawling
the web, using open source software. We empirically compare alternative methods and …

On the impact of various types of noise on neural machine translation

H Khayrallah, P Koehn - arXiv preprint arXiv:1805.12282, 2018 - arxiv.org
We examine how various types of noise in the parallel training data impact the quality of
neural machine translation systems. We create five types of artificial noise and analyze how …

Multi-domain neural machine translation

S Tars, M Fishel - arXiv preprint arXiv:1805.02282, 2018 - arxiv.org
We present an approach to neural machine translation (NMT) that supports multiple
domains in a single model and allows switching between the domains when translating. The …

Will Large-scale Generative Models Corrupt Future Datasets?

R Hataya, H Bao, H Arai - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Recently proposed large-scale text-to-image generative models such as DALL·E 2,
Midjourney, and StableDiffusion can generate high-quality and realistic images from users' …

Findings of the WMT 2018 shared task on parallel corpus filtering

P Koehn, H Khayrallah, K Heafield… - EMNLP 2018 Third …, 2018 - research.ed.ac.uk
We posed the shared task of assigning sentence-level quality scores for a very noisy corpus
of sentence pairs crawled from the web, with the goal of sub-selecting 1% and 10% of high …

Quality at a glance: An audit of web-crawled multilingual datasets

J Kreutzer, I Caswell, L Wang, A Wahab… - Transactions of the …, 2022 - direct.mit.edu
With the success of large-scale pre-training and multilingual modeling in Natural Language
Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets …

Fairness feedback loops: training on synthetic data amplifies bias

S Wyllie, I Shumailov, N Papernot - The 2024 ACM Conference on …, 2024 - dl.acm.org
Model-induced distribution shifts (MIDS) occur as previous model outputs pollute new model
training sets over generations of models. This is known as model collapse in the case of …

Findings of the WMT 2016 bilingual document alignment shared task

C Buck, P Koehn - Proceedings of the First Conference on …, 2016 - aclanthology.org
Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers,
pages 554–563 …

A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training

M Perełkiewicz, R Poświata - arXiv preprint arXiv:2407.07630, 2024 - arxiv.org
This article presents a comprehensive review of the challenges associated with using
massive web-mined corpora for the pre-training of large language models (LLMs). This …

Machine translation detection from monolingual web-text

Y Arase, M Zhou - Proceedings of the 51st Annual Meeting of the …, 2013 - aclanthology.org
We propose a method for automatically detecting low-quality Web-text translated by
statistical machine translation (SMT) systems. We focus on the phrase salad phenomenon …