WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia

H Schwenk, V Chaudhary, S Sun, H Gong… - arXiv preprint arXiv …, 2019 - arxiv.org
We present an approach based on multilingual sentence embeddings to automatically
extract parallel sentences from the content of Wikipedia articles in 85 languages, including …

CCMatrix: Mining billions of high-quality parallel sentences on the web

H Schwenk, G Wenzek, S Edunov, E Grave… - arXiv preprint arXiv …, 2019 - arxiv.org
We show that margin-based bitext mining in a multilingual sentence space can be applied to
monolingual corpora of billions of sentences. We are using ten snapshots of a curated …

[BOOK][B] Translation-driven corpora: Corpus resources for descriptive and applied translation studies

F Zanettin - 2014 - taylorfrancis.com
Electronic texts and text analysis tools have opened up a wealth of opportunities to higher
education and language service providers, but learning to use these resources continues to …

Building the bridge: Topic modeling for comparative research

F Lind, JM Eberl, O Eisele, T Heidenreich… - Communication …, 2022 - Taylor & Francis
In communication research, topic modeling is primarily used for discovering systematic
patterns in monolingual text corpora. To advance the usage, we provide an overview of …

A factory of comparable corpora from Wikipedia

A Barrón-Cedeño, C España-Bonet… - Proceedings of the …, 2015 - upcommons.upc.edu
Multiple approaches to grab comparable data from the Web have been developed up to
date. Nevertheless, coming out with a high-quality comparable corpus of a specific topic is …

[PDF][PDF] Building comparable corpora based on bilingual LDA model

Z Zhu, M Li, L Chen, Z Yang - … of the 51st Annual Meeting of the …, 2013 - aclanthology.org
Comparable corpora are important basic resources in cross-language information
processing. However, the existing methods of building comparable corpora, which use …

A deep neural network approach to parallel sentence extraction

F Grégoire, P Langlais - arXiv preprint arXiv:1709.09783, 2017 - arxiv.org
Parallel sentence extraction is a task addressing the data sparsity problem found in
multilingual natural language processing applications. We propose an end-to-end deep …

[HTML][HTML] Research on high-performance English translation based on topic model

Y Shen, H Guo - Digital Communications and Networks, 2023 - Elsevier
Retelling extraction is an important branch of Natural Language Processing (NLP), and high-
quality retelling resources are very helpful to improve the performance of machine …

[PDF][PDF] Automatic building and using parallel resources for SMT from comparable corpora

S Pal, P Pakray, SK Naskar - Proceedings of the 3rd Workshop on …, 2014 - aclanthology.org
Building parallel resources for corpus based machine translation, especially Statistical
Machine Translation (SMT), from comparable corpora has recently received wide attention …

A simple yet robust algorithm for automatic extraction of parallel sentences: A case study on Arabic-English Wikipedia articles

MJ Althobaiti - IEEE Access, 2021 - ieeexplore.ieee.org
Parallel corpora are vital components in several applications of Natural Language
Processing (NLP), particularly in machine translation. In this paper, we present a novel …