Soft prompt threats: Attacking safety alignment and unlearning in open-source LLMs through the embedding space

L Schwinn, D Dobre, S Xhonneux, G Gidel… - arXiv preprint arXiv …, 2024 - arxiv.org
Current research in adversarial robustness of LLMs focuses on discrete input manipulations
in the natural language space, which can be directly transferred to closed-source models …

Large-Scale Dataset Pruning in Adversarial Training through Data Importance Extrapolation

B Nieth, T Altstidl, L Schwinn, B Eskofier - arXiv preprint arXiv:2406.13283, 2024 - arxiv.org
Their vulnerability to small, imperceptible attacks limits the adoption of deep learning models
in real-world systems. Adversarial training has proven to be one of the most promising …