SmoothLLM: Defending large language models against jailbreaking attacks

A Robey, E Wong, H Hassani, GJ Pappas - arXiv preprint arXiv …, 2023 - arxiv.org
Despite efforts to align large language models (LLMs) with human values, widely-used
LLMs such as GPT, Llama, Claude, and PaLM are susceptible to jailbreaking attacks …

Random features for kernel approximation: A survey on algorithms, theory, and beyond

F Liu, X Huang, Y Chen… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
The class of random features is one of the most popular techniques to speed up kernel
methods in large-scale problems. Related works have been recognized by the NeurIPS Test …

On the tradeoff between robustness and fairness

X Ma, Z Wang, W Liu - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Interestingly, recent experimental results [2, 26, 22] have identified a robust fairness
phenomenon in adversarial training (AT), namely that a robust model well-trained by AT …

Feature purification: How adversarial training performs robust deep learning

Z Allen-Zhu, Y Li - 2021 IEEE 62nd Annual Symposium on …, 2022 - ieeexplore.ieee.org
Despite the empirical success of using adversarial training to defend deep learning models
against adversarial perturbations, so far, it still remains rather unclear what the principles are …

A model of double descent for high-dimensional binary linear classification

Z Deng, A Kammoun… - Information and Inference …, 2022 - academic.oup.com
We consider a model for logistic regression where only a subset of features of size is used
for training a linear classifier over training samples. The classifier is obtained by running …

Enhanced accuracy and robustness via multi-teacher adversarial distillation

S Zhao, J Yu, Z Sun, B Zhang, X Wei - European Conference on Computer …, 2022 - Springer
Adversarial training is an effective approach for improving the robustness of deep neural
networks against adversarial attacks. Although bringing reliable robustness, adversarial …

The Triangular Trade-off between Robustness, Accuracy and Fairness in Deep Neural Networks: A Survey

J Li, G Li - ACM Computing Surveys, 2024 - dl.acm.org
With the rapid development of deep learning, AI systems are being used more in complex
and important domains, which necessitates the simultaneous fulfillment of multiple constraints …

Improving generalization of adversarial training via robust critical fine-tuning

K Zhu, X Hu, J Wang, X Xie… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Deep neural networks are susceptible to adversarial examples, posing a significant security
risk in critical applications. Adversarial Training (AT) is a well-established technique to …

Stability analysis and generalization bounds of adversarial training

J Xiao, Y Fan, R Sun, J Wang… - Advances in Neural …, 2022 - proceedings.neurips.cc
In adversarial machine learning, deep neural networks can fit the adversarial examples on
the training dataset but have poor generalization ability on the test set. This phenomenon is …

Randomized adversarial training via Taylor expansion

G Jin, X Yi, D Wu, R Mu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
In recent years, there has been an explosion of research into developing more robust deep
neural networks against adversarial examples. Adversarial training appears as one of the …