Dirt cheap web-scale parallel text from the Common Crawl

JR Smith, H Saint-Amand, M Plamada, P Koehn… - 2013 - zora.uzh.ch
Parallel text is the fuel that drives modern machine translation systems. The Web is a
comprehensive source of preexisting parallel text, but crawling the entire web is impossible …

JParaCrawl: A large scale web-based English-Japanese parallel corpus

M Morishita, J Suzuki, M Nagata - arXiv preprint arXiv:1911.10668, 2019 - arxiv.org
Recent machine translation algorithms mainly rely on parallel corpora. However, since the
availability of parallel corpora remains limited, only some resource-rich language pairs can …

ParaCrawl: Web-scale acquisition of parallel corpora

M Bañón, P Chen, B Haddow, K Heafield, H Hoang… - 2020 - strathprints.strath.ac.uk
We report on methods to create the largest publicly available parallel corpora by crawling
the web, using open source software. We empirically compare alternative methods and …

ParaCrawl: Web-scale parallel corpora for the languages of the EU

M Esplà-Gomis, ML Forcada… - … , Project and User …, 2019 - aclanthology.org
We describe two projects funded by the Connecting Europe Facility, Provision of Web-Scale
Parallel Corpora for Official European Languages (2016-EU-IA-0114, completed) and …

JParaCrawl v3.0: A large-scale English-Japanese parallel corpus

M Morishita, K Chousa, J Suzuki, M Nagata - arXiv preprint arXiv …, 2022 - arxiv.org
Most current machine translation models are mainly trained with parallel corpora, and their
translation accuracy largely depends on the quality and quantity of the corpora. Although …

CCAligned: A massive collection of cross-lingual web-document pairs

A El-Kishky, V Chaudhary, F Guzmán… - arXiv preprint arXiv …, 2019 - arxiv.org
Cross-lingual document alignment aims to identify pairs of documents in two distinct
languages that are of comparable content or translations of each other. In this paper, we …

Margin-based parallel corpus mining with multilingual sentence embeddings

M Artetxe, H Schwenk - arXiv preprint arXiv:1811.01136, 2018 - arxiv.org
Machine translation is highly sensitive to the size and quality of the training data, which has
led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we …

Parallel strands: A preliminary investigation into mining the web for bilingual text

P Resnik - Conference of the Association for Machine Translation …, 1998 - Springer
Parallel corpora are a valuable resource for machine translation, but at present their
availability and utility is limited by genre- and domain-specificity, licensing restrictions, and …

Bifixer and bicleaner: two open-source tools to clean your parallel data

G Ramírez-Sánchez… - Proceedings of the …, 2020 - aclanthology.org
This paper shows the utility of two open-source tools designed for parallel data cleaning:
Bifixer and Bicleaner. Already used to clean highly noisy parallel content from crawled …

Document-level machine translation with large-scale public parallel corpora

P Pal, A Birch-Mayne, K Heafield - The 62nd Annual Meeting of …, 2024 - research.ed.ac.uk
Despite the fact that document-level machine translation has inherent advantages over
sentence-level machine translation due to additional information available to a model from …