MT detection in web-scraped parallel corpora

M Bañón, P Chen, B Haddow, K Heafield, H Hoang… - 2020 - strathprints.strath.ac.uk

We report on methods to create the largest publicly available parallel corpora by crawling
the web, using open source software. We empirically compare alternative methods and …

Cited by 239 Related articles All 17 versions

[PDF] arxiv.org

On the impact of various types of noise on neural machine translation

H Khayrallah, P Koehn - arXiv preprint arXiv:1805.12282, 2018 - arxiv.org

We examine how various types of noise in the parallel training data impact the quality of
neural machine translation systems. We create five types of artificial noise and analyze how …

Cited by 225 Related articles All 8 versions

[PDF] arxiv.org

Multi-domain neural machine translation

S Tars, M Fishel - arXiv preprint arXiv:1805.02282, 2018 - arxiv.org

We present an approach to neural machine translation (NMT) that supports multiple
domains in a single model and allows switching between the domains when translating. The …

Cited by 209 Related articles All 18 versions

[PDF] thecvf.com

Will Large-scale Generative Models Corrupt Future Datasets?

R Hataya, H Bao, H Arai - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com

Recently proposed large-scale text-to-image generative models such as DALLE 2,
Midjourney, and StableDiffusion can generate high-quality and realistic images from users' …

Cited by 28 Related articles All 5 versions

[PDF] ed.ac.uk

Findings of the WMT 2018 shared task on parallel corpus filtering

P Koehn, H Khayrallah, K Heafield… - EMNLP 2018 Third …, 2018 - research.ed.ac.uk

We posed the shared task of assigning sentence-level quality scores for a very noisy corpus
of sentence pairs crawled from the web, with the goal of sub-selecting 1% and 10% of high …

Cited by 120 Related articles All 12 versions

[HTML] mit.edu

Quality at a glance: An audit of web-crawled multilingual datasets

J Kreutzer, I Caswell, L Wang, A Wahab… - Transactions of the …, 2022 - direct.mit.edu

With the success of large-scale pre-training and multilingual modeling in Natural Language
Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets …

Cited by 111 Related articles All 19 versions

[PDF] arxiv.org

Fairness feedback loops: training on synthetic data amplifies bias

S Wyllie, I Shumailov, N Papernot - The 2024 ACM Conference on …, 2024 - dl.acm.org

Model-induced distribution shifts (MIDS) occur as previous model outputs pollute new model
training sets over generations of models. This is known as model collapse in the case of …

Cited by 2 Related articles All 4 versions

[PDF] aclanthology.org

[PDF] Findings of the WMT 2016 bilingual document alignment shared task

C Buck, P Koehn - Proceedings of the First Conference on …, 2016 - aclanthology.org

Findings of the WMT 2016 Bilingual Document Alignment Shared Task. Proceedings of
the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 554–563 …

Cited by 54 Related articles All 4 versions

[PDF] arxiv.org

A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training

M Perełkiewicz, R Poświata - arXiv preprint arXiv:2407.07630, 2024 - arxiv.org

This article presents a comprehensive review of the challenges associated with using
massive web-mined corpora for the pre-training of large language models (LLMs). This …

[PDF] Machine translation detection from monolingual web-text

Y Arase, M Zhou - Proceedings of the 51st Annual Meeting of the …, 2013 - aclanthology.org

We propose a method for automatically detecting low-quality Web-text translated by
statistical machine translation (SMT) systems. We focus on the phrase salad phenomenon …

Cited by 42 Related articles All 6 versions
