Margin-based parallel corpus mining with multilingual sentence embeddings

M Artetxe, H Schwenk - arXiv preprint arXiv:1811.01136, 2018 - arxiv.org
Machine translation is highly sensitive to the size and quality of the training data, which has
led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we …
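
A rough sketch of the margin criterion this line of work builds on, assuming L2-normalized sentence embeddings from a shared multilingual encoder; the function name, the neighbourhood size k, and any selection threshold are illustrative rather than taken from the paper:

```python
import numpy as np

def margin_scores(src_emb, tgt_emb, k=4):
    """Ratio-margin scores between all source/target sentence pairs.

    src_emb, tgt_emb: L2-normalized embeddings of shape (n, d) and (m, d)
    from a shared multilingual encoder (an assumption of this sketch).
    """
    sim = src_emb @ tgt_emb.T                          # cosine similarities (n, m)
    # average similarity to the k nearest neighbours, in both directions
    fwd = np.sort(sim, axis=1)[:, -k:].mean(axis=1)    # per source sentence (n,)
    bwd = np.sort(sim, axis=0)[-k:, :].mean(axis=0)    # per target sentence (m,)
    # margin = cosine of the candidate pair relative to its neighbourhood average
    return sim / (0.5 * (fwd[:, None] + bwd[None, :]))
```

Pairs whose margin score exceeds a tuned cut-off would then be extracted as candidate translations; the cut-off and k are hyperparameters.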

Filtering and mining parallel data in a joint multilingual space

H Schwenk - arXiv preprint arXiv:1805.09822, 2018 - arxiv.org
We learn a joint multilingual sentence embedding and use the distance between sentences
in different languages to filter noisy parallel data and to mine for parallel data in large news …
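
A minimal illustration of distance-based filtering in such a joint space, assuming pre-computed, L2-normalized embeddings for each candidate pair; the threshold value is hypothetical and would need tuning on held-out data:

```python
import numpy as np

def filter_pairs(src_emb, tgt_emb, pairs, threshold=0.8):
    """Keep candidate pairs whose two sides are close in the joint space.

    src_emb[i] and tgt_emb[i] embed the two sides of pairs[i]; embeddings are
    assumed L2-normalized, and `threshold` is an illustrative cut-off.
    """
    sims = (src_emb * tgt_emb).sum(axis=1)   # cosine similarity per pair
    return [pair for pair, s in zip(pairs, sims) if s >= threshold]
```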

Findings of the WMT 2018 shared task on parallel corpus filtering

P Koehn, H Khayrallah, K Heafield… - EMNLP 2018 Third …, 2018 - research.ed.ac.uk
We posed the shared task of assigning sentence-level quality scores for a very noisy corpus
of sentence pairs crawled from the web, with the goal of sub-selecting 1% and 10% of high …
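
To make the sub-selection step concrete, a small sketch that keeps the highest-scoring pairs up to a fixed word budget; counting the budget in target-side words is an assumption of the sketch, not the shared task's exact protocol:

```python
def subselect(scored_pairs, word_budget):
    """Greedy sub-selection of the highest-scoring pairs up to a word budget.

    scored_pairs: iterable of (score, src, tgt) triples; the budget is counted
    in target-side words here purely for illustration.
    """
    selected, words = [], 0
    for score, src, tgt in sorted(scored_pairs, key=lambda t: -t[0]):
        n = len(tgt.split())
        if words + n > word_budget:
            continue                 # skip pairs that would overshoot the budget
        selected.append((src, tgt))
        words += n
    return selected
```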

Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions

P Koehn, F Guzmán, V Chaudhary… - Proceedings of the Fourth …, 2019 - aclanthology.org
Following the WMT 2018 Shared Task on Parallel Corpus Filtering, we posed the
challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs …

Effective parallel corpus mining using bilingual sentence embeddings

M Guo, Q Shen, Y Yang, H Ge, D Cer… - arXiv preprint arXiv …, 2018 - arxiv.org
This paper presents an effective approach for parallel corpus mining using bilingual
sentence embeddings. Our embedding models are trained to produce similar …
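
One common way to train such embeddings is a dual encoder with an in-batch translation-ranking objective; the sketch below illustrates that idea only and does not reproduce the paper's exact model or loss:

```python
import numpy as np

def translation_ranking_loss(src_emb, tgt_emb, scale=10.0):
    """In-batch softmax ranking loss for a dual sentence encoder (a sketch).

    Each source sentence should score its aligned translation (the diagonal)
    higher than the other targets in the batch; `scale` is a hypothetical
    temperature, and the encoder producing the embeddings is assumed.
    """
    logits = scale * (src_emb @ tgt_emb.T)               # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # aligned pairs on the diagonal
```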

Findings of the WMT 2020 shared task on parallel corpus filtering and alignment

P Koehn, V Chaudhary, A El-Kishky… - Proceedings of the …, 2020 - aclanthology.org
Following the two preceding WMT Shared Tasks on Parallel Corpus Filtering (Koehn et al., 2018,
2019), we again posed the challenge of assigning sentence-level quality scores for very …

ParaCrawl: Web-scale acquisition of parallel corpora

M Bañón, P Chen, B Haddow, K Heafield, H Hoang… - 2020 - strathprints.strath.ac.uk
We report on methods to create the largest publicly available parallel corpora by crawling
the web, using open source software. We empirically compare alternative methods and …

Zipporah: a fast and scalable data cleaning system for noisy web-crawled parallel corpora

H Xu, P Koehn - Proceedings of the 2017 conference on empirical …, 2017 - aclanthology.org
We introduce Zipporah, a fast and scalable data cleaning system. We propose a novel type
of bag-of-words translation feature, and train logistic regression models to classify good data …
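
A minimal sketch of this kind of classifier-based filtering, assuming per-pair feature vectors (such as a bag-of-words translation score plus language-model scores) have already been computed; scikit-learn is used here purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_quality_classifier(good_feats, noisy_feats):
    """Logistic regression separating good from synthetic-noisy sentence pairs.

    good_feats / noisy_feats: arrays of per-pair feature vectors; computing
    the features themselves is outside this sketch.
    """
    X = np.vstack([good_feats, noisy_feats])
    y = np.concatenate([np.ones(len(good_feats)), np.zeros(len(noisy_feats))])
    return LogisticRegression(max_iter=1000).fit(X, y)

# clf = train_quality_classifier(good, noisy)
# clf.predict_proba(candidate_feats)[:, 1] then serves as a per-pair quality score.
```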

JParaCrawl: A large scale web-based English-Japanese parallel corpus

M Morishita, J Suzuki, M Nagata - arXiv preprint arXiv:1911.10668, 2019 - arxiv.org
Recent machine translation algorithms mainly rely on parallel corpora. However, since the
availability of parallel corpora remains limited, only some resource-rich language pairs can …

Dual conditional cross-entropy filtering of noisy parallel corpora

M Junczys-Dowmunt - arXiv preprint arXiv:1809.00197, 2018 - arxiv.org
In this work we introduce dual conditional cross-entropy filtering for noisy parallel data. For
each sentence pair of the noisy parallel corpus we compute cross-entropy scores according …
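
A small sketch of how two such scores might be combined, assuming word-normalized cross-entropies from a forward and an inverse translation model; the weighting below follows the commonly described form of the score and should be read as illustrative:

```python
import math

def dual_xent_score(h_fwd, h_bwd):
    """Combine cross-entropies from two inverse translation models (a sketch).

    h_fwd: word-normalized cross-entropy of the target given the source under
    a source-to-target model; h_bwd: the same for the inverse direction.
    A good pair needs the two models to agree (small difference) and to find
    the pair likely (small average); exp(-penalty) maps the result to (0, 1].
    """
    penalty = abs(h_fwd - h_bwd) + 0.5 * (h_fwd + h_bwd)
    return math.exp(-penalty)
```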