Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations. Z Wei, Y Wang, A Li, Y Mo, Y Wang. arXiv preprint arXiv:2310.06387, 2023 | 76 | 2023 |
CFA: Class-wise Calibrated Fair Adversarial Training. Z Wei, Y Wang, Y Guo, Y Wang. CVPR 2023, 2023 | 31 | 2023 |
Jatmo: Prompt injection defense by task-specific finetuning. J Piet, M Alrashed, C Sitawarin, S Chen, Z Wei, B Alomair, D Wagner. ESORICS 2024, 2024 | 19 | 2024 |
Sharpness-Aware Minimization Alone can Improve Adversarial Robustness. Z Wei✉️, J Zhu, Y Zhang. ICML 2023 Workshop on New Frontiers in Adversarial Machine Learning, 2023 | 12* | 2023 |
Fight back against jailbreaking via prompt adversarial tuning. Y Mo, Y Wang, Z Wei, Y Wang. ICLR 2024 Workshop on Secure and Trustworthy Large Language Models, 2024 | 8* | 2024 |
Extracting Weighted Finite Automata from Recurrent Neural Networks for Natural Languages. Z Wei, X Zhang, M Sun. ICFEM 2022, 2022 | 8 | 2022 |
Weighted Automata Extraction and Explanation of Recurrent Neural Networks for Natural Language Tasks. Z Wei, X Zhang, Y Zhang, M Sun. Journal of Logical and Algebraic Methods in Programming 136, 100907, 2023 | 7 | 2023 |
Using Z3 for Formal Modeling and Verification of FNN Global Robustness. Y Zhang, Z Wei, X Zhang, M Sun. SEKE 2023, 2023 | 6 | 2023 |
Architecture Matters: Uncovering Implicit Mechanisms in Graph Contrastive Learning. X Guo, Y Wang, Z Wei, Y Wang. NeurIPS 2023, 2023 | 5 | 2023 |
Boosting Jailbreak Attack with Momentum. Y Zhang, Z Wei✉️. ICLR 2024 Workshop on Reliable and Responsible Foundation Models, 2024 | 4 | 2024 |
On the Duality Between Sharpness-Aware Minimization and Adversarial Training. Y Zhang, H He, J Zhu, H Chen, Y Wang, Z Wei✉️. ICML 2024, 2024 | 4 | 2024 |
Exploring the Robustness of In-Context Learning with Noisy Labels. C Cheng, X Yu, H Wen, J Sun, G Yue, Y Zhang, Z Wei✉️. ICLR 2024 Workshop on Reliable and Responsible Foundation Models, 2024 | 3 | 2024 |
Characterizing Robust Overfitting in Adversarial Training via Cross-Class Features. Z Wei, Y Guo, Y Wang. OpenReview preprint, 2023 | 1 | 2023 |
Automata Extraction from Transformers. Y Zhang, Z Wei, M Sun. arXiv preprint arXiv:2406.05564, 2024 | | 2024 |
A Theoretical Understanding of Self-Correction through In-context Alignment. Y Wang, Y Wu, Z Wei, S Jegelka, Y Wang. ICML 2024 Workshop on In-Context Learning, 2024 | | 2024 |
Towards General Conceptual Model Editing via Adversarial Representation Engineering. Y Zhang, Z Wei, J Sun, M Sun. arXiv preprint arXiv:2404.13752, 2024 | | 2024 |