Open problems and fundamental limitations of reinforcement learning from human feedback S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ... arXiv preprint arXiv:2307.15217, 2023 | 239 | 2023 |
Toward transparent AI: A survey on interpreting the inner structures of deep neural networks T Räuker, A Ho, S Casper, D Hadfield-Menell 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 464-483, 2023 | 120 | 2023 |
Explore, establish, exploit: Red teaming language models from scratch S Casper, J Lin, J Kwon, G Culp, D Hadfield-Menell arXiv preprint arXiv:2306.09442, 2023 | 49 | 2023 |
Scalable and transferable black-box jailbreaks for language models via persona modulation R Shah, S Pour, A Tagade, S Casper, J Rando arXiv preprint arXiv:2311.03348, 2023 | 42 | 2023 |
Rethinking machine unlearning for large language models S Liu, Y Yao, J Jia, S Casper, N Baracaldo, P Hase, X Xu, Y Yao, H Li, ... arXiv preprint arXiv:2402.08787, 2024 | 34 | 2024 |
Foundational challenges in assuring alignment and safety of large language models U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ... arXiv preprint arXiv:2404.09932, 2024 | 29 | 2024 |
Frivolous units: Wider networks are not really that wide S Casper, X Boix, V D'Amario, L Guo, M Schrimpf, K Vinken, G Kreiman Proceedings of the AAAI Conference on Artificial Intelligence 35 (8), 6921-6929, 2021 | 28* | 2021 |
Clusterability in neural networks D Filan, S Casper, S Hod, C Wild, A Critch, S Russell arXiv preprint arXiv:2103.03386, 2021 | 27 | 2021 |
Red teaming deep neural networks with feature synthesis tools S Casper, T Bu, Y Li, J Li, K Zhang, K Hariharan, D Hadfield-Menell Advances in Neural Information Processing Systems 36, 80470-80516, 2023 | 25* | 2023 |
Robust feature-level adversaries are interpretability tools S Casper, M Nadeau, D Hadfield-Menell, G Kreiman Advances in Neural Information Processing Systems 35, 33093-33106, 2022 | 24 | 2022 |
Black-box access is insufficient for rigorous ai audits S Casper, C Ezell, C Siegmann, N Kolt, TL Curtis, B Bucknall, A Haupt, ... The 2024 ACM Conference on Fairness, Accountability, and Transparency, 2254-2272, 2024 | 18 | 2024 |
Probing neural dialog models for conversational understanding A Saleh, T Deutsch, S Casper, Y Belinkov, S Shieber arXiv preprint arXiv:2006.08331, 2020 | 15 | 2020 |
Eight methods to evaluate robust unlearning in LLMs A Lynch, P Guo, A Ewart, S Casper, D Hadfield-Menell arXiv preprint arXiv:2402.16835, 2024 | 12 | 2024 |
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? K Liu, S Casper, D Hadfield-Menell, J Andreas arXiv preprint arXiv:2312.03729, 2023 | 12 | 2023 |
Detecting modularity in deep neural networks S Hod, S Casper, D Filan, C Wild, A Critch, S Russell | 11* | 2021 |
Graphical clusterability and local specialization in deep neural networks S Casper, S Hod, D Filan, C Wild, A Critch, S Russell ICLR 2022 Workshop on PAIR^2Struct: Privacy …, 2022 | 9 | 2022 |
Diagnostics for deep neural networks with automated copy/paste attacks S Casper, K Hariharan, D Hadfield-Menell arXiv preprint arXiv:2211.10024, 2022 | 8 | 2022 |
Quantifying local specialization in deep neural networks S Hod, D Filan, S Casper, A Critch, S Russell arXiv preprint arXiv:2110.08058, 2021 | 8 | 2021 |
Defending Against Unforeseen Failure Modes with Latent Adversarial Training S Casper, L Schulze, O Patel, D Hadfield-Menell arXiv preprint arXiv:2403.05030, 2024 | 6 | 2024 |