We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources …
We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. The dataset contains …
J Jaavid, R Dabre, M Aswanth, J Gala… - Proceedings of the …, 2024 - aclanthology.org
This study addresses the challenge of extending Large Language Models (LLMs) to non-English languages, specifically those using non-Roman scripts. We propose an approach …
Automated mathematical problem-solving represents a unique intersection of natural language processing (NLP) and mathematical reasoning, posing significant challenges in …
We announce the initial release of "Airavata," an instruction-tuned LLM for Hindi. Airavata was created by fine-tuning OpenHathi with diverse, instruction-tuning Hindi datasets to make …
Question generation (QG), the task of generating questions from text or other forms of data, is a significant and challenging subject that has recently attracted more attention in natural language …
This paper introduces PMIndiaSum, a multilingual and massively parallel summarization corpus focused on languages in India. Our corpus provides a training and testing ground for …
We introduce mEdIT, a multi-lingual extension to CoEdIT, the recent state-of-the-art text editing models for writing assistance. mEdIT models are trained by fine-tuning multi-lingual …
T Santosh, C Weiss, M Grabmair - arXiv preprint arXiv:2410.09527, 2024 - arxiv.org
In the evolving NLP landscape, benchmarks serve as yardsticks for gauging progress. However, existing Legal NLP benchmarks only focus on predictive tasks, overlooking …