NusaCrowd: Open source initiative for Indonesian NLP resources

S Cahyawijaya, H Lovenia, AF Aji… - Findings of the …, 2023 - aclanthology.org
We present NusaCrowd, a collaborative initiative to collect and unify existing resources for
Indonesian languages, including opening access to previously non-public resources …

BLOOM: A 176B-parameter open-access multilingual language model

T Le Scao, A Fan, C Akiki, E Pavlick, S Ilić, D Hesslow… - 2023 - inria.hal.science
Large language models (LLMs) have been shown to be able to perform new tasks based on
a few demonstrations or natural language instructions. While these capabilities have led to …

The BigScience ROOTS corpus: A 1.6TB composite multilingual dataset

H Laurençon, L Saulnier, T Wang… - Advances in …, 2022 - proceedings.neurips.cc
As language models grow ever larger, the need for large-scale high-quality text datasets has
never been more pressing, especially in multilingual settings. The BigScience workshop, a 1 …

The responsible foundation model development cheatsheet: A review of tools & resources

S Longpre, S Biderman, A Albalak… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation model development attracts a rapidly expanding body of contributors, scientists,
and applications. To help shape responsible development practices, we introduce the …

One country, 700+ languages: NLP challenges for underrepresented languages and dialects in Indonesia

AF Aji, GI Winata, F Koto, S Cahyawijaya… - arXiv preprint arXiv …, 2022 - arxiv.org
NLP research is impeded by a lack of resources and awareness of the challenges presented
by underrepresented languages and dialects. Focusing on the languages spoken in …

Trends and challenges of Arabic Chatbots: Literature review

Y Saoudi, MM Gammoudi - Jordanian Journal of Computers and …, 2023 - researchgate.net
Conversational systems have recently garnered increased attention due to advancements in
Large Language Models (LLMs) and Language Models for Dialogue Applications (LaMDA) …

Documenting geographically and contextually diverse data sources: The BigScience catalogue of language data and resources

A McMillan-Major, Z Alyafeai, S Biderman… - arXiv preprint arXiv …, 2022 - arxiv.org
In recent years, large-scale data collection efforts have prioritized the amount of data
collected in order to improve the modeling capabilities of large language models. This …

SAIDS: A novel approach for sentiment analysis informed of dialect and sarcasm

A Kaseb, M Farouk - arXiv preprint arXiv:2301.02521, 2023 - arxiv.org
Sentiment analysis becomes an essential part of every social network, as it enables decision-
makers to know more about users' opinions in almost all life aspects. Despite its importance …

Toxic language detection: A systematic review of Arabic datasets

I Bensalem, P Rosso, H Zitouni - Expert Systems, 2024 - Wiley Online Library
The detection of toxic language in the Arabic language has emerged as an active area of
research in recent years, and reviewing the existing datasets employed for training the …

Toxic language detection: a systematic review of Arabic datasets

I Bensalem, P Rosso, H Zitouni - arXiv preprint arXiv:2312.07228, 2023 - arxiv.org
The detection of toxic language in the Arabic language has emerged as an active area of
research in recent years, and reviewing the existing datasets employed for training the …