CCAligned: A massive collection of cross-lingual web-document pairs

A El-Kishky, V Chaudhary, F Guzmán… - arXiv preprint arXiv …, 2019 - arxiv.org
Cross-lingual document alignment aims to identify pairs of documents in two distinct
languages that are of comparable content or translations of each other. In this paper, we …

Large scale parallel document mining for machine translation

J Uszkoreit, J Ponte, A Popat… - Proceedings of the 23rd …, 2010 - aclanthology.org
A distributed system is described that reliably mines parallel text from large corpora. The
approach can be regarded as cross-language near-duplicate detection, enabled by an …

Arabic information retrieval

K Darwish, W Magdy - Foundations and Trends® in …, 2014 - nowpublishers.com
In the past several years, Arabic Information Retrieval (IR) has garnered significant attention.
The main research interests have focused on retrieval of formal language, mostly in the …

Crisis MT: Developing a cookbook for MT in crisis situations

W Lewis, R Munro, S Vogel - … of the Sixth Workshop on Statistical …, 2011 - aclanthology.org
In this paper, we propose that MT is an important technology in crisis events, something that
can and should be an integral part of a rapid-response infrastructure. By integrating MT …

Low-resource machine transliteration using recurrent neural networks

NT Le, F Sadat, L Menard, D Dinh - ACM Transactions on Asian and Low …, 2019 - dl.acm.org
Grapheme-to-phoneme models are key components in automatic speech recognition and
text-to-speech systems. With low-resource language pairs that do not have available and …

Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance

A El-Kishky, F Guzmán - arXiv preprint arXiv:2002.00761, 2020 - arxiv.org
Document alignment aims to identify pairs of documents in two distinct languages that are of
comparable content or translations of each other. Such aligned data can be used for a …

Machine transliteration and transliterated text retrieval: a survey

DK Prabhakar, S Pal - Sādhanā, 2018 - Springer
Users of the WWW across the globe are increasing rapidly. According to Internet live stats
there are more than 3 billion Internet users worldwide today and the number of non-English …

Report of NEWS 2010 transliteration mining shared task

A Kumaran, MM Khapra, H Li - Proceedings of the 2010 Named …, 2010 - aclanthology.org
This report documents the details of the Transliteration Mining Shared Task that was run as
a part of the Named Entities Workshop (NEWS 2010), an ACL 2010 workshop. The shared …

Low-resource machine transliteration using recurrent neural networks of Asian languages

NT Le, F Sadat - Proceedings of the Seventh Named Entities …, 2018 - aclanthology.org
Grapheme-to-phoneme models are key components in automatic speech recognition and
text-to-speech systems. With low-resource language pairs that do not have available and …

Transliteration for resource-scarce languages

MK Chinnakotla, OP Damani, A Satoskar - ACM Transactions on Asian …, 2010 - dl.acm.org
Today, parallel corpus-based systems dominate the transliteration landscape. But the
resource-scarce languages do not enjoy the luxury of large parallel transliteration corpus …