Representation Engineering: A Top-Down Approach to AI Transparency. A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, et al. arXiv preprint arXiv:2310.01405, 2023. Cited by 115.
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. A. Pan, K. Bhatia, J. Steinhardt. arXiv preprint arXiv:2201.03544, 2022. Cited by 97.
Do the Rewards Justify the Means? Measuring Trade-offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark. A. Pan, J. S. Chan, A. Zou, N. Li, S. Basart, T. Woodside, H. Zhang, S. Emmons, et al. International Conference on Machine Learning, 26837-26867, 2023. Cited by 82.
Foundational Challenges in Assuring Alignment and Safety of Large Language Models. U. Anwar, A. Saparov, J. Rando, D. Paleka, M. Turpin, P. Hase, E. S. Lubana, et al. arXiv preprint arXiv:2404.09932, 2024. Cited by 23.
The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning. N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A. K. Dombrowski, et al. arXiv preprint arXiv:2403.03218, 2024. Cited by 12.
Feedback Loops with Language Models Drive In-Context Reward Hacking. A. Pan, E. Jones, M. Jagadeesan, J. Steinhardt. arXiv preprint arXiv:2402.06627, 2024. Cited by 8.
Improving Robustness of Reinforcement Learning for Power System Control with Adversarial Training. A. Pan, Y. Lee, H. Zhang, Y. Chen, Y. Shi. arXiv preprint arXiv:2110.08956, 2021. Cited by 8.