A survey on data augmentation for text classification

M Bayer, MA Kaufhold, C Reuter - ACM Computing Surveys, 2022 - dl.acm.org
Data augmentation, the artificial creation of training data for machine learning by
transformations, is a widely studied research field across machine learning disciplines …

Explainable AI: A review of machine learning interpretability methods

P Linardatos, V Papastefanopoulos, S Kotsiantis - Entropy, 2020 - mdpi.com
Recent advances in artificial intelligence (AI) have led to its widespread industrial adoption,
with machine learning systems demonstrating superhuman performance in a significant …

Universal and transferable adversarial attacks on aligned language models

A Zou, Z Wang, N Carlini, M Nasr, JZ Kolter… - arXiv preprint arXiv …, 2023 - arxiv.org
Because" out-of-the-box" large language models are capable of generating a great deal of
objectionable content, recent work has focused on aligning these models in an attempt to …

Are aligned neural networks adversarially aligned?

N Carlini, M Nasr… - Advances in …, 2024 - proceedings.neurips.cc
Large language models are now tuned to align with the goals of their creators, namely to be
"helpful and harmless." These models should respond helpfully to user questions, but refuse …

Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery

Y Wen, N Jain, J Kirchenbauer… - Advances in …, 2024 - proceedings.neurips.cc
The strength of modern generative models lies in their ability to be controlled through
prompts. Hard prompts comprise interpretable words and tokens, and are typically hand …

Red teaming language models with language models

E Perez, S Huang, F Song, T Cai, R Ring… - arXiv preprint arXiv …, 2022 - arxiv.org
Language Models (LMs) often cannot be deployed because of their potential to harm users
in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Visual adversarial examples jailbreak aligned large language models

X Qi, K Huang, A Panda, P Henderson… - Proceedings of the …, 2024 - ojs.aaai.org
Warning: this paper contains data, prompts, and model outputs that are offensive in nature.
Recently, there has been a surge of interest in integrating vision into Large Language …

On the adversarial robustness of multi-modal foundation models

C Schlarmann, M Hein - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Multi-modal foundation models combining vision and language models such as Flamingo or
GPT-4 have recently gained enormous interest. Alignment of foundation models is used to …

Automatically auditing large language models via discrete optimization

E Jones, A Dragan, A Raghunathan… - International …, 2023 - proceedings.mlr.press
Auditing large language models for unexpected behaviors is critical to preempt catastrophic
deployments, yet remains challenging. In this work, we cast auditing as an optimization …