Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models

B Wang, C Xu, S Wang, Z Gan, Y Cheng, J Gao… - arXiv preprint arXiv …, 2021 - arxiv.org
Large-scale pre-trained language models have achieved tremendous success across a
wide range of natural language understanding (NLU) tasks, even surpassing human …

Defending pre-trained language models from adversarial word substitutions without performance sacrifice

R Bao, J Wang, H Zhao - arXiv preprint arXiv:2105.14553, 2021 - arxiv.org
Pre-trained contextualized language models (PrLMs) have led to strong performance gains
in downstream natural language understanding tasks. However, PrLMs can still be easily …

Better robustness by more coverage: Adversarial training with mixup augmentation for robust fine-tuning

C Si, Z Zhang, F Qi, Z Liu, Y Wang, Q Liu… - arXiv preprint arXiv …, 2020 - arxiv.org
Pretrained language models (PLMs) perform poorly under adversarial attacks. To improve
the adversarial robustness, adversarial data augmentation (ADA) has been widely adopted …

InfoBERT: Improving robustness of language models from an information theoretic perspective

B Wang, S Wang, Y Cheng, Z Gan, R Jia, B Li… - arXiv preprint arXiv …, 2020 - arxiv.org
Large-scale language models such as BERT have achieved state-of-the-art performance
across a wide range of NLP tasks. Recent studies, however, show that such BERT-based …

A LLM assisted exploitation of AI-Guardian

N Carlini - arXiv preprint arXiv:2307.15008, 2023 - arxiv.org
Large language models (LLMs) are now highly capable at a diverse range of tasks. This
paper studies whether or not GPT-4, one such LLM, is capable of assisting researchers in …

Bridge the gap between CV and NLP! A gradient-based textual adversarial attack framework

L Yuan, Y Zhang, Y Chen, W Wei - arXiv preprint arXiv:2110.15317, 2021 - arxiv.org
Despite recent success on various tasks, deep learning techniques still perform poorly on
adversarial examples with small perturbations. While optimization-based methods for …

Certified robustness against natural language attacks by causal intervention

H Zhao, C Ma, X Dong, AT Luu… - International …, 2022 - proceedings.mlr.press
Deep learning models have achieved great success in many fields, yet they are vulnerable
to adversarial examples. This paper follows a causal perspective to look into the adversarial …

Contextualized perturbation for textual adversarial attack

D Li, Y Zhang, H Peng, L Chen, C Brockett… - arXiv preprint arXiv …, 2020 - arxiv.org
Adversarial examples expose the vulnerabilities of natural language processing (NLP)
models, and can be used to evaluate and improve their robustness. Existing techniques of …

Defense against adversarial attacks in NLP via Dirichlet neighborhood ensemble

Y Zhou, X Zheng, CJ Hsieh, K Chang… - arXiv preprint arXiv …, 2020 - arxiv.org
Although neural networks have achieved prominent performance on many natural language
processing (NLP) tasks, they are vulnerable to adversarial examples. In this paper, we …

Baseline defenses for adversarial attacks against aligned language models

N Jain, A Schwarzschild, Y Wen, G Somepalli… - arXiv preprint arXiv …, 2023 - arxiv.org
As Large Language Models quickly become ubiquitous, their security vulnerabilities are
critical to understand. Recent work shows that text optimizers can produce jailbreaking …