Neural machine translation for low-resource languages: A survey

S Ranathunga, ESA Lee, M Prifti Skenduli… - ACM Computing …, 2023 - dl.acm.org
Neural Machine Translation (NMT) has seen tremendous growth in the last ten years since
the early 2000s and has already entered a mature phase. While considered the most widely …

Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Abstract Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

No language left behind: Scaling human-centered machine translation

MR Costa-jussà, J Cross, O Çelebi, M Elbayad… - arXiv preprint arXiv …, 2022 - arxiv.org
Driven by the goal of eradicating language barriers on a global scale, machine translation
has solidified itself as a key focus of artificial intelligence research today. However, such …

Beyond english-centric multilingual machine translation

A Fan, S Bhosale, H Schwenk, Z Ma, A El-Kishky… - Journal of Machine …, 2021 - jmlr.org
Existing work in translation demonstrated the potential of massively multilingual machine
translation by training a single model able to translate between any pair of languages …

Findings of the 2019 conference on machine translation (WMT19)

L Barrault, O Bojar, MR Costa-Jussa, C Federmann… - 2019 - zora.uzh.ch
This paper presents the results of the premier shared task organized alongside the
Conference on Machine Translation (WMT) 2019. Participants were asked to build machine …

Survey of low-resource machine translation

B Haddow, R Bawden, AVM Barone, J Helcl… - Computational …, 2022 - direct.mit.edu
We present a survey covering the state of the art in low-resource machine translation (MT)
research. There are currently around 7,000 languages spoken in the world and almost all …

Wikimatrix: Mining 135m parallel sentences in 1620 language pairs from wikipedia

H Schwenk, V Chaudhary, S Sun, H Gong… - arXiv preprint arXiv …, 2019 - arxiv.org
We present an approach based on multilingual sentence embeddings to automatically
extract parallel sentences from the content of Wikipedia articles in 85 languages, including …

ParaCrawl: Web-scale acquisition of parallel corpora

M Bañón, P Chen, B Haddow, K Heafield, H Hoang… - 2020 - strathprints.strath.ac.uk
We report on methods to create the largest publicly available parallel corpora by crawling
the web, using open source software. We empirically compare alternative methods and …

Detecting hallucinated content in conditional neural sequence generation

C Zhou, G Neubig, J Gu, M Diab, P Guzman… - arXiv preprint arXiv …, 2020 - arxiv.org
Neural sequence models can generate highly fluent sentences, but recent studies have also
shown that they are also prone to hallucinate additional content not supported by the input …

CCMatrix: Mining billions of high-quality parallel sentences on the web

H Schwenk, G Wenzek, S Edunov, E Grave… - arXiv preprint arXiv …, 2019 - arxiv.org
We show that margin-based bitext mining in a multilingual sentence space can be applied to
monolingual corpora of billions of sentences. We are using ten snapshots of a curated …