Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought J Chua*, E Rees, H Batra, SR Bowman, J Michael, E Perez, M Turpin arXiv preprint arXiv:2403.05518, 2024 | 6 | 2024 |
When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? R Schaeffer, D Valentine, L Bailey, J Chua, C Eyzaguirre, Z Durante, ... arXiv preprint arXiv:2407.15211, 2024 | 4 | 2024 |
Looking Inward: Language Models Can Learn About Themselves by Introspection FJ Binder, J Chua*, T Korbak, H Sleight, J Hughes, R Long, E Perez, ... arXiv preprint arXiv:2410.13787, 2024 | | 2024 |
Language Models Can Articulate Their Implicit Goals J Betley, X Bao, M Soto, A Sztyber-Betley, J Chua, O Evans NeurIPS Safe Generative AI Workshop 2024 | | |