JParaCrawl: A large scale web-based English-Japanese parallel corpus

M Morishita, J Suzuki, M Nagata - arXiv preprint arXiv:1911.10668, 2019 - arxiv.org
Recent machine translation algorithms mainly rely on parallel corpora. However, since the
availability of parallel corpora remains limited, only some resource-rich language pairs can …

JParaCrawl v3.0: A large-scale English-Japanese parallel corpus

M Morishita, K Chousa, J Suzuki, M Nagata - arXiv preprint arXiv …, 2022 - arxiv.org
Most current machine translation models are mainly trained with parallel corpora, and their
translation accuracy largely depends on the quality and quantity of the corpora. Although …

Unsupervised machine translation using monolingual corpora only

G Lample, A Conneau, L Denoyer… - arXiv preprint arXiv …, 2017 - arxiv.org
Machine translation has recently achieved impressive performance thanks to recent
advances in deep learning and the availability of large-scale parallel corpora. There have …

A large English–Thai parallel corpus from the web and machine-generated text

L Lowphansirikul, C Polpanumas… - Language Resources …, 2022 - Springer
The primary objective of our work is to build a large-scale English–Thai dataset for training
neural machine translation models. We construct scb-mt-en-th-2020, an English–Thai …

Parallel corpus filtering via pre-trained language models

B Zhang, A Nagesh, K Knight - arXiv preprint arXiv:2005.06166, 2020 - arxiv.org
Web-crawled data provides a good source of parallel corpora for training machine
translation models. It is automatically obtained, but extremely noisy, and recent work shows …

A data augmentation method for English-Vietnamese neural machine translation

NL Pham, TV Pham - IEEE Access, 2023 - ieeexplore.ieee.org
The translation quality of machine translation systems depends on the parallel corpus used
for training, particularly on the quantity and quality of the corpus. However, building a high …

Phrase-based & neural unsupervised machine translation

G Lample, M Ott, A Conneau, L Denoyer… - arXiv preprint arXiv …, 2018 - arxiv.org
Machine translation systems achieve near human-level performance on some languages,
yet their effectiveness strongly relies on the availability of large amounts of parallel …

Document-level machine translation with large-scale public parallel corpora

P Pal, A Birch-Mayne, K Heafield - The 62nd Annual Meeting of …, 2024 - research.ed.ac.uk
Despite the fact that document-level machine translation has inherent advantages over
sentence-level machine translation due to additional information available to a model from …

Improving low-resource neural machine translation with filtered pseudo-parallel corpus

A Imankulova, T Sato, M Komachi - … of the 4th Workshop on Asian …, 2017 - aclanthology.org
Large-scale parallel corpora are indispensable to train highly accurate machine translators.
However, manually constructed large-scale parallel corpora are not freely available in many …

Extract and edit: An alternative to back-translation for unsupervised neural machine translation

J Wu, X Wang, WY Wang - arXiv preprint arXiv:1904.02331, 2019 - arxiv.org
The overreliance on large parallel corpora significantly limits the applicability of machine
translation systems to the majority of language pairs. Back-translation has been dominantly …