Related articles - Academic Resource Search

Dirt cheap web-scale parallel text from the common crawl

JR Smith, H Saint-Amand, M Plamada, P Koehn… - 2013 - zora.uzh.ch

Parallel text is the fuel that drives modern machine translation systems. The Web is a
comprehensive source of preexisting parallel text, but crawling the entire web is impossible …

Cited by 179 - Related articles - All 14 versions

[PDF] arxiv.org

JParaCrawl: A large scale web-based English-Japanese parallel corpus

M Morishita, J Suzuki, M Nagata - arXiv preprint arXiv:1911.10668, 2019 - arxiv.org

Recent machine translation algorithms mainly rely on parallel corpora. However, since the
availability of parallel corpora remains limited, only some resource-rich language pairs can …

Cited by 66 - Related articles - All 4 versions

[PDF] strath.ac.uk

ParaCrawl: Web-scale acquisition of parallel corpora

M Bañón, P Chen, B Haddow, K Heafield, H Hoang… - 2020 - strathprints.strath.ac.uk

We report on methods to create the largest publicly available parallel corpora by crawling
the web, using open source software. We empirically compare alternative methods and …

Cited by 274 - Related articles - All 17 versions

[PDF] aclanthology.org

[PDF] ParaCrawl: Web-scale parallel corpora for the languages of the EU

M Esplà-Gomis, ML Forcada… - … , Project and User …, 2019 - aclanthology.org

We describe two projects funded by the Connecting Europe Facility, Provision of Web-Scale
Parallel Corpora for Official European Languages (2016-EU-IA-0114, completed) and …

Cited by 124 - Related articles - All 3 versions

[PDF] arxiv.org

JParaCrawl v3.0: A large-scale English-Japanese parallel corpus

M Morishita, K Chousa, J Suzuki, M Nagata - arXiv preprint arXiv …, 2022 - arxiv.org

Most current machine translation models are mainly trained with parallel corpora, and their
translation accuracy largely depends on the quality and quantity of the corpora. Although …

Cited by 29 - Related articles - All 7 versions

[PDF] arxiv.org

CCAligned: A massive collection of cross-lingual web-document pairs

A El-Kishky, V Chaudhary, F Guzmán… - arXiv preprint arXiv …, 2019 - arxiv.org

Cross-lingual document alignment aims to identify pairs of documents in two distinct
languages that are of comparable content or translations of each other. In this paper, we …

Cited by 184 - Related articles - All 9 versions

[PDF] arxiv.org

Margin-based parallel corpus mining with multilingual sentence embeddings

M Artetxe, H Schwenk - arXiv preprint arXiv:1811.01136, 2018 - arxiv.org

Machine translation is highly sensitive to the size and quality of the training data, which has
led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we …

Cited by 228 - Related articles - All 5 versions

[PDF] arxiv.org

Parallel strands: A preliminary investigation into mining the web for bilingual text

P Resnik - Conference of the Association for Machine Translation …, 1998 - Springer

Parallel corpora are a valuable resource for machine translation, but at present their
availability and utility is limited by genre- and domain-specificity, licensing restrictions, and …

Cited by 187 - Related articles - All 21 versions

[PDF] aclanthology.org

Bifixer and bicleaner: two open-source tools to clean your parallel data

G Ramírez‐Sánchez… - Proceedings of the …, 2020 - aclanthology.org

This paper shows the utility of two open-source tools designed for parallel data cleaning:
Bifixer and Bicleaner. Already used to clean highly noisy parallel content from crawled …

Cited by 46 - Related articles - All 6 versions

[PDF] ed.ac.uk

Document-level machine translation with large-scale public parallel corpora

P Pal, A Birch-Mayne, K Heafield - The 62nd Annual Meeting of …, 2024 - research.ed.ac.uk

Despite the fact that document-level machine translation has inherent advantages over
sentence-level machine translation due to additional information available to a model from …

Cited by 2 - Related articles - All 5 versions
