Unsupervised domain clusters in pretrained language models

R Aharoni, Y Goldberg - arXiv preprint arXiv:2004.02105, 2020 - arxiv.org
The notion of "in-domain data" in NLP is often over-simplistic and vague, as textual data
varies in many nuanced linguistic aspects such as topic, style or level of formality. In …

Dynamic data selection for neural machine translation

M Van Der Wees, A Bisazza, C Monz - arXiv preprint arXiv:1708.00712, 2017 - arxiv.org
Intelligent selection of training data has proven a successful technique to simultaneously
increase training efficiency and translation performance for phrase-based machine …

Incorporating Collaborative and Active Learning Strategies in the Design and Deployment of a Master Course on Computer-Assisted Scientific Translation

M Zappatore - Technology, Knowledge and Learning, 2024 - Springer
This research aims to address the current gaps in computer-assisted translation (CAT)
courses offered in bachelor's and master's programmes in scientific and technical translation …

Extracting in-domain training corpora for neural machine translation using data selection methods

C Cruz Silva, CH Liu, A Poncelas, A Way - 2018 - doras.dcu.ie
Data selection is a process used in selecting a subset of parallel data for the training of
machine translation (MT) systems, so that 1) resources for training might be reduced, 2) …

[PDF][PDF] Translation quality and productivity: A study on rich morphology languages

L Specia, K Harris, F Blain, A Burchardt… - … XVI: Research Track, 2017 - aclanthology.org
This paper introduces a unique large-scale machine translation dataset with various levels
of human annotation combined with automatically recorded productivity features such as …

Automatic document selection for efficient encoder pretraining

Y Feng, P Xia, B Van Durme, J Sedoc - arXiv preprint arXiv:2210.10951, 2022 - arxiv.org
Building pretrained language models is considered expensive and data-intensive, but must
we increase dataset size to achieve better performance? We propose an alternative to larger …

Active learning for neural machine translation

P Zhang, X Xu, D Xiong - 2018 International Conference on …, 2018 - ieeexplore.ieee.org
Neural machine translation (NMT) normally requires a large bilingual corpus to train a high-
translation-quality model. However, building such parallel corpora for many low-resource …
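The uncertainty-based selection that active-learning NMT work of this kind typically uses can be sketched as follows. This is a toy illustration, not the authors' method: the function names are hypothetical, and the per-token probabilities stand in for what a real system would read off the NMT decoder's softmax outputs.

```python
import math

def uncertainty(token_probs):
    """Length-normalized negative log-likelihood of the model's own
    translation; higher values mean the NMT model is less confident."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def pick_for_annotation(pool, budget):
    """pool: list of (source_sentence, token_probs) pairs, where
    token_probs are the model's probabilities for its output tokens
    (stand-ins here). Returns the `budget` least-confident sources,
    i.e. the ones most worth sending for human translation."""
    ranked = sorted(pool, key=lambda sp: uncertainty(sp[1]), reverse=True)
    return [src for src, _ in ranked[:budget]]
```

Under this sketch, a sentence translated with uniformly low token probabilities outranks a confidently translated one, so the annotation budget is spent where the model is weakest.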

Separating grains from the chaff: Using data filtering to improve multilingual translation for low-resourced African languages

I Abdulmumin, M Beukman, JO Alabi, C Emezue… - arXiv preprint arXiv …, 2022 - arxiv.org
We participated in the WMT 2022 Large-Scale Machine Translation Evaluation for the
African Languages Shared Task. This work describes our approach, which is based on …

Adaptive Modeling of Uncertainties for Traffic Forecasting

Y Wu, Y Ye, A Zeb, JJ Yu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Deep neural networks (DNNs) have emerged as a dominant approach for developing traffic
forecasting models. These models are typically trained to minimize error on averaged test …

Feature decay algorithms for neural machine translation

Neural Machine Translation (NMT) systems require a lot of data to be competitive. For this
reason, data selection techniques are used only for fine-tuning systems that have been …
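The feature-decay idea named in the title above can be sketched greedily: sentences are scored by their n-gram overlap with an in-domain seed, and each n-gram's value decays every time a selected sentence covers it, so the selection spreads over many features instead of repeating the same ones. A minimal sketch, assuming bigram features and a fixed decay factor (the function and parameter names are illustrative, not the paper's implementation):

```python
from collections import Counter

def fda_select(candidates, seed_ngrams, k, decay=0.5, n=2):
    """Greedily pick k sentences from `candidates` whose n-gram overlap
    with the in-domain `seed_ngrams` scores highest, halving each
    n-gram's contribution every time a selected sentence covers it."""
    counts = Counter()   # how often each seed n-gram has been covered
    pool = list(candidates)
    selected = []

    def ngrams(tokens):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    for _ in range(min(k, len(pool))):
        def score(sent):
            toks = sent.split()
            feats = [g for g in ngrams(toks) if g in seed_ngrams]
            # length-normalized sum of decayed feature values
            return sum(decay ** counts[g] for g in feats) / max(len(toks), 1)
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best)
        for g in ngrams(best.split()):
            if g in seed_ngrams:
                counts[g] += 1
    return selected
```

For example, with seed bigrams drawn from in-domain text about machine translation, a sentence sharing those bigrams is selected before unrelated sentences, and repeated coverage of the same bigrams is progressively discounted.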