BERTabaporu: assessing a genre-specific language model for Portuguese NLP

PB Costa, MC Pavan, WR Santos… - Proceedings of the …, 2023 - aclanthology.org
Transformer-based language models such as Bidirectional Encoder Representations from
Transformers (BERT) are now mainstream in the NLP field, but extensions to languages …
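
As a minimal sketch of how a genre-specific checkpoint like BERTabaporu would be queried for masked-token prediction, assuming the HuggingFace transformers library and a Hub identifier such as pablocosta/bertabaporu-base-uncased (the hosting location and name are assumptions, not stated in the snippet):

# Minimal sketch: fill-mask with a Portuguese BERT checkpoint.
# NOTE: the model identifier below is an assumption about where the
# BERTabaporu checkpoint is hosted; substitute the actual Hub name.
from transformers import pipeline

fill = pipeline("fill-mask", model="pablocosta/bertabaporu-base-uncased")

# Twitter-flavoured Portuguese input; [MASK] is BERT's mask token.
for pred in fill("hoje o jogo foi muito [MASK]")[:3]:
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")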

Hints on the data for language modeling of synthetic languages with transformers

R Zevallos, N Bel - Proceedings of the 61st Annual Meeting of the …, 2023 - aclanthology.org
Language Models (LMs) are increasingly useful for providing
representations upon which to train Natural Language Processing applications. However …

A Survey of Large Language Models for European Languages

W Ali, S Pyysalo - arXiv preprint arXiv:2408.15040, 2024 - arxiv.org
Large Language Models (LLMs) have gained significant attention due to their high
performance on a wide range of natural language tasks since the release of ChatGPT. The …

Emerging roots: Investigating early access to meaning in Maltese auditory word recognition

J Nieder, R van de Vijver, A Ussishkin - Cognitive Science, 2024 - Wiley Online Library
In Semitic languages, the consonantal root is central to morphology, linking form and
meaning. While psycholinguistic studies highlight its importance in language processing, the …

Disentangling Singlish discourse particles with task-driven representation

LTE Foo, LHX Ng - Proceedings of the 6th ACM International Conference …, 2024 - dl.acm.org
Singlish, formally Colloquial Singapore English, is an English-based creole language
originating from the Southeast Asian country of Singapore. The language contains influences …

Tokenisation in machine translation does matter: The impact of different tokenisation approaches for Maltese

K Abela, K Micallef, M Tanti, C Borg - Proceedings of the …, 2024 - aclanthology.org
In Machine Translation, various tokenisers are used to segment inputs before
training a model. Despite tokenisation being mostly considered a solved problem for …
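
To make the entry's premise concrete, here is a minimal sketch comparing how two off-the-shelf subword tokenisers segment the same Maltese sentence; the checkpoints are ordinary public models chosen for illustration, not necessarily those evaluated in the paper:

# Compare subword segmentations of one Maltese sentence across tokenisers.
from transformers import AutoTokenizer

sentence = "Il-lingwa Maltija hija unika"  # "The Maltese language is unique"

for name in ["bert-base-multilingual-cased", "xlm-roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize(sentence)
    print(f"{name}: {len(pieces)} subwords -> {pieces}")

Different vocabularies typically produce different segment counts for Maltese, which changes the sequence lengths and vocabulary coverage an MT model is trained on.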

Evaluating Language Model Vulnerability to Poisoning Attacks in Low-Resource Settings

R Plant, MV Giuffrida, N Pitropakis… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
Pre-trained language models are a highly effective source of knowledge transfer for natural
language processing tasks, as their development represents an investment of resources …

UOM-Constrained IWSLT 2024 Shared Task Submission - Maltese Speech Translation

K Abela, MAR Riyadh, M Galea, A Busuttil… - Proceedings of the …, 2024 - aclanthology.org
This paper presents our IWSLT-2024 shared task submission on the low-resource track. The
submission forms part of the constrained setup, which implies limited data for training. Following …

Understanding the Role of Input Token Characters in Language Models: How Does Information Loss Affect Performance?

A Alajrami, K Margatina, N Aletras - arXiv preprint arXiv:2310.17271, 2023 - arxiv.org
Understanding how and what pre-trained language models (PLMs) learn about language is
an open challenge in natural language processing. Previous work has focused on …

Exploring the impact of transliteration on NLP performance: Treating Maltese as an Arabic dialect

Multilingual models such as mBERT have been demonstrated to exhibit impressive
cross-lingual transfer for a number of languages. Despite this, the performance drops for …
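
The transliteration idea behind this entry can be illustrated with a toy sketch: map Maltese (Latin-script) characters onto Arabic script so that an Arabic-pretrained model sees familiar symbols. The character table and helper below are hypothetical simplifications for illustration only; the paper's actual pipeline handles many more cases (vowels, digraphs, ambiguity).

# Toy Latin-to-Arabic transliteration table; hypothetical and incomplete.
LATIN_TO_ARABIC = {
    "b": "ب", "t": "ت", "d": "د", "r": "ر", "s": "س",
    "k": "ك", "l": "ل", "m": "م", "n": "ن", "ħ": "ح",
    "għ": "ع", "x": "ش", "q": "ق", "a": "ا", "i": "ي", "u": "و",
}

def transliterate(word: str) -> str:
    """Greedy longest-match transliteration over the toy table."""
    out, i = [], 0
    while i < len(word):
        # Try two-character sequences (e.g. the digraph "għ") first.
        if word[i:i + 2] in LATIN_TO_ARABIC:
            out.append(LATIN_TO_ARABIC[word[i:i + 2]])
            i += 2
        else:
            # Fall back to a single character; pass unknowns through.
            out.append(LATIN_TO_ARABIC.get(word[i], word[i]))
            i += 1
    return "".join(out)

print(transliterate("ħanut"))  # Maltese "shop" -> an Arabic-script rendering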