Estonian-Centric Machine Translation: Data, Models, and Challenges

E Korotkova, M Fishel - … of the 25th Annual Conference of the …, 2024 - aclanthology.org
Proceedings of the 25th Annual Conference of the European Association …, 2024 - aclanthology.org
Abstract
Machine translation (MT) research is most typically English-centric. In recent years, massively multilingual translation systems have also been increasingly popular. However, efforts purposefully focused on less-resourced languages are less widespread. In this paper, we focus on MT from and into the Estonian language. First, emphasizing the importance of data availability, we generate and publicly release a back-translation corpus of over 2 billion sentence pairs. Second, using these novel data, we create MT models covering 18 translation directions, all either from or into Estonian. We re-use the encoder of the NLLB multilingual model and train modular decoders separately for each language, surpassing the original NLLB quality. Our resulting MT models largely outperform other open-source MT systems, including previous Estonian-focused efforts, and are released as part of this submission.