ParaCrawl: Web-scale acquisition of parallel corpora

M Bañón, P Chen, B Haddow, K Heafield, H Hoang… - 2020 - strathprints.strath.ac.uk
We report on methods to create the largest publicly available parallel corpora by crawling
the web, using open source software. We empirically compare alternative methods and …

Findings of the WMT 2016 bilingual document alignment shared task

C Buck, P Koehn - Proceedings of the First Conference on …, 2016 - aclanthology.org
Findings of the WMT 2016 Bilingual Document Alignment Shared Task. Proceedings of
the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 554–563 …

Exploiting sentence order in document alignment

B Thompson, P Koehn - arXiv preprint arXiv:2004.14523, 2020 - arxiv.org
We present a simple document alignment method that incorporates sentence order
information in both candidate generation and candidate re-scoring. Our method results in …

Neural Methods for Aligning Large-Scale Parallel Corpora from the Web for South and East Asian Languages

P Koehn - Proceedings of the Ninth Conference on Machine …, 2024 - aclanthology.org
We introduce neural methods and a toxicity filtering step to the hierarchical web mining
approach of ParaCrawl (Bañón et al., 2020), showing large improvements. We apply these …

Efficient document alignment across scenarios

A Azpeitia, T Etchegoyhen - Machine Translation, 2019 - Springer
We present and evaluate an approach to document alignment meant for efficiency and
portability, as it relies on automatically extracted lexical translations and simple set-theoretic …

Detecting Fine-Grained Semantic Divergences to Improve Translation Understanding Across Languages

E Briakou - 2023 - search.proquest.com
One of the core goals of Natural Language Processing (NLP) is to develop computational
representations and methods to compare and contrast text meaning across languages. Such …

Machine translation of user-generated content

P Lohar - 2020 - doras.dcu.ie
The world of social media has undergone huge evolution during the last few years. With the
spread of social media and online forums, individual users actively participate in the …

The influence of lexical similarity between languages on the translatability of puns

ЕМ Александрова - Филологические науки. Вопросы теории и …, 2019 - cyberleninka.ru
The aim of the article is to investigate the problem of pun translatability in
Russian, English, and French. The study identifies …

Sentence Similarity and Machine Translation

B Thompson - 2020 - jscholarship.library.jhu.edu
Neural machine translation (NMT) systems encode an input sentence into an intermediate
representation and then decode that representation into the output sentence. Translation …

Word embedding based semantic cross-lingual document alignment in comparable corpora

D Ganguly, H Afli, D Roy - Proceedings of the 10th Annual Meeting of …, 2018 - dl.acm.org
Crosslingual information retrieval (CLIR) finds its application in aligning documents across
comparable corpora. However, traditional CLIR, due to the term independence assumption …