Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Abstract Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

A survey on data selection for language models

A Albalak, Y Elazar, SM Xie, S Longpre… - arXiv preprint arXiv …, 2024 - arxiv.org
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …

Crosslingual generalization through multitask finetuning

N Muennighoff, T Wang, L Sutawika, A Roberts… - arXiv preprint arXiv …, 2022 - arxiv.org
Multitask prompted finetuning (MTF) has been shown to help large language models
generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused …

OctoPack: Instruction tuning code large language models

N Muennighoff, Q Liu, A Zebaze, Q Zheng… - arXiv preprint arXiv …, 2023 - arxiv.org
Finetuning large language models (LLMs) on instructions leads to vast performance
improvements on natural language tasks. We apply instruction tuning using code …

Red teaming ChatGPT via jailbreaking: Bias, robustness, reliability and toxicity

TY Zhuo, Y Huang, C Chen, Z Xing - arXiv preprint arXiv:2301.12867, 2023 - arxiv.org
Recent breakthroughs in natural language processing (NLP) have permitted the synthesis
and comprehension of coherent text in an open-ended way, thereby translating the …

Measure and improve robustness in NLP models: A survey

X Wang, H Wang, D Yang - arXiv preprint arXiv:2112.08313, 2021 - arxiv.org
As NLP models have achieved state-of-the-art performance on benchmarks and gained wide
application, it has become increasingly important to ensure the safe deployment of these …

HRS-Bench: Holistic, reliable and scalable benchmark for text-to-image models

EM Bakr, P Sun, X Shen, FF Khan… - Proceedings of the …, 2023 - openaccess.thecvf.com
Designing robust text-to-image (T2I) models has been extensively explored in recent years,
especially with the emergence of diffusion models, which achieve state-of-the-art results on …

NLPositionality: Characterizing design biases of datasets and models

S Santy, JT Liang, RL Bras, K Reinecke… - arXiv preprint arXiv …, 2023 - arxiv.org
Design biases in NLP systems, such as performance differences across populations,
often stem from their creators' positionality, i.e., views and lived experiences shaped by …

The state of human-centered NLP technology for fact-checking

A Das, H Liu, V Kovatchev, M Lease - Information processing & …, 2023 - Elsevier
Misinformation threatens modern society by promoting distrust in science, changing
narratives in public health, heightening social polarization, and disrupting democratic …

Aya model: An instruction finetuned open-access multilingual language model

A Üstün, V Aryabumi, ZX Yong, WY Ko… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent breakthroughs in large language models (LLMs) have centered around a handful of
data-rich languages. What does it take to broaden access to breakthroughs beyond first …