Masader: Metadata sourcing for Arabic text and speech data resources

S Cahyawijaya, H Lovenia, AF Aji… - Findings of the …, 2023 - aclanthology.org

We present NusaCrowd, a collaborative initiative to collect and unify existing resources for
Indonesian languages, including opening access to previously non-public resources …

Cited by 969 · Related articles · All 7 versions

BLOOM: A 176B-parameter open-access multilingual language model

T Le Scao, A Fan, C Akiki, E Pavlick, S Ilić, D Hesslow… - 2023 - inria.hal.science

Large language models (LLMs) have been shown to be able to perform new tasks based on
a few demonstrations or natural language instructions. While these capabilities have led to …

Cited by 1713 · Related articles · All 16 versions

The BigScience ROOTS corpus: A 1.6 TB composite multilingual dataset

H Laurençon, L Saulnier, T Wang… - Advances in …, 2022 - proceedings.neurips.cc

As language models grow ever larger, the need for large-scale high-quality text datasets has
never been more pressing, especially in multilingual settings. The BigScience workshop, a 1 …

Cited by 182 · Related articles · All 21 versions

The responsible foundation model development cheatsheet: A review of tools & resources

S Longpre, S Biderman, A Albalak… - arXiv preprint arXiv …, 2024 - arxiv.org

Foundation model development attracts a rapidly expanding body of contributors, scientists,
and applications. To help shape responsible development practices, we introduce the …

Cited by 5 · Related articles · All 3 versions

One country, 700+ languages: NLP challenges for underrepresented languages and dialects in Indonesia

AF Aji, GI Winata, F Koto, S Cahyawijaya… - arXiv preprint arXiv …, 2022 - arxiv.org

NLP research is impeded by a lack of resources and awareness of the challenges presented
by underrepresented languages and dialects. Focusing on the languages spoken in …

Cited by 80 · Related articles · All 10 versions

Trends and challenges of Arabic Chatbots: Literature review

Y Saoudi, MM Gammoudi - Jordanian Journal of Computers and …, 2023 - researchgate.net

Conversational systems have recently garnered increased attention due to advancements in
Large Language Models (LLMs) and Language Models for Dialogue Applications (LaMDA) …

Cited by 12 · Related articles · All 2 versions

Documenting geographically and contextually diverse data sources: The BigScience catalogue of language data and resources

A McMillan-Major, Z Alyafeai, S Biderman… - arXiv preprint arXiv …, 2022 - arxiv.org

In recent years, large-scale data collection efforts have prioritized the amount of data
collected in order to improve the modeling capabilities of large language models. This …

Cited by 16 · Related articles · All 4 versions

SAIDS: A novel approach for sentiment analysis informed of dialect and sarcasm

A Kaseb, M Farouk - arXiv preprint arXiv:2301.02521, 2023 - arxiv.org

Sentiment analysis becomes an essential part of every social network, as it enables decision-
makers to know more about users' opinions in almost all life aspects. Despite its importance …

Cited by 10 · Related articles · All 6 versions

Toxic language detection: A systematic review of Arabic datasets

I Bensalem, P Rosso, H Zitouni - Expert Systems, 2024 - Wiley Online Library

The detection of toxic language in the Arabic language has emerged as an active area of
research in recent years, and reviewing the existing datasets employed for training the …

Cited by 2 · Related articles · All 3 versions

Toxic language detection: a systematic review of Arabic datasets

I Bensalem, P Rosso, H Zitouni - arXiv preprint arXiv:2312.07228, 2023 - arxiv.org

The detection of toxic language in the Arabic language has emerged as an active area of
research in recent years, and reviewing the existing datasets employed for training the …

Cited by 2 · Related articles · All 2 versions
