Text generation models for Luxembourgish with limited data: A balanced multilingual strategy

A Plum, T Ranasinghe, C Purschke - arXiv preprint arXiv:2412.09415, 2024 - arxiv.org
This paper addresses the challenges in developing language models for less-represented
languages, with a focus on Luxembourgish. Despite its active development, Luxembourgish …

Not Enough Data to Pre-train Your Language Model? MT to the Rescue!

G Urbizu, I San Vicente, X Saralegi… - Findings of the …, 2023 - aclanthology.org
In recent years, pre-trained transformer-based language models (LM) have become a key
resource for implementing most NLP tasks. However, pre-training such models demands …

Neural Text Normalization for Luxembourgish using Real-Life Variation Data

AM Lutgen, A Plum, C Purschke, B Plank - arXiv preprint arXiv:2412.09383, 2024 - arxiv.org
Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-
fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult …

Scaling Laws for BERT in Low-Resource Settings

G Urbizu, I San Vicente, X Saralegi… - Findings of the …, 2023 - aclanthology.org
Large language models are very resource intensive, both financially and environmentally,
and require an amount of training data which is simply unobtainable for the majority of NLP …

LuxBank: The First Universal Dependency Treebank for Luxembourgish

A Plum, C Döhmer, E Milano, AM Lutgen… - arXiv preprint arXiv …, 2024 - arxiv.org
The Universal Dependencies (UD) project has significantly expanded linguistic coverage
across 161 languages, yet Luxembourgish, a West Germanic language spoken by …

Letz Translate: Low-Resource Machine Translation for Luxembourgish

Y Song, S Ezzini, J Klein, T Bissyande… - 2023 5th …, 2023 - ieeexplore.ieee.org
Natural language processing of Low-Resource Languages (LRL) is often challenged by the
lack of data. Therefore, achieving accurate machine translation (MT) in a low-resource …

Guided Distant Supervision for Multilingual Relation Extraction Data: Adapting to a New Language

A Plum, T Ranasinghe, C Purschke - arXiv preprint arXiv:2403.17143, 2024 - arxiv.org
Relation extraction is essential for extracting and understanding biographical information in
the context of digital humanities and related subjects. There is a growing interest in the …

LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings

F Philippy, S Guo, J Klein, TF Bissyandé - arXiv preprint arXiv:2412.03331, 2024 - arxiv.org
Sentence embedding models play a key role in various Natural Language Processing tasks,
such as in Topic Modeling, Document Clustering and Recommendation Systems. However …

Evaluating Data Augmentation Techniques for the Training of Luxembourgish Language Models

I Olariu, C Lothritz, TFA Bissyandé, J Klein - KONVENS, 2023 - orbilu.uni.lu
Training large language models is challenging when data availability is limited, as it is the
case for low-resource languages. We investigate different data augmentation techniques for …

Forget NLI, Use a Dictionary: Zero-Shot Topic Classification for Low-Resource Languages with Application to Luxembourgish

F Philippy, S Haddadan, S Guo - arXiv preprint arXiv:2404.03912, 2024 - arxiv.org
In NLP, zero-shot classification (ZSC) is the task of assigning labels to textual data without
any labeled examples for the target classes. A common method for ZSC is to fine-tune a …