Inference-time intervention: Eliciting truthful answers from a language model | K Li, O Patel, F Viégas, H Pfister, M Wattenberg | Advances in Neural Information Processing Systems 36, 2024 | 149 | 2024 |
The WMDP benchmark: Measuring and reducing malicious use with unlearning | N Li, A Pan, A Gopal, S Yue, D Berrios, A Gatti, JD Li, AK Dombrowski, ... | arXiv preprint arXiv:2403.03218, 2024 | 19 | 2024 |
Defending Against Unforeseen Failure Modes with Latent Adversarial Training | S Casper, L Schulze, O Patel, D Hadfield-Menell | arXiv preprint arXiv:2403.05030, 2024 | 6 | 2024 |
Designing a Dashboard for Transparency and Control of Conversational AI | Y Chen, A Wu, T DePodesta, C Yeh, K Li, NC Marin, O Patel, J Riecke, ... | arXiv preprint arXiv:2406.07882, 2024 | | 2024 |