Alexander Pan
Verified email at berkeley.edu
Title
Cited by
Year
Representation engineering: A top-down approach to AI transparency
A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ...
arXiv preprint arXiv:2310.01405, 2023
Cited by: 115; Year: 2023
The effects of reward misspecification: Mapping and mitigating misaligned models
A Pan, K Bhatia, J Steinhardt
arXiv preprint arXiv:2201.03544, 2022
Cited by: 97; Year: 2022
Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark
A Pan, JS Chan, A Zou, N Li, S Basart, T Woodside, H Zhang, S Emmons, ...
International Conference on Machine Learning, 26837-26867, 2023
Cited by: 82; Year: 2023
Foundational challenges in assuring alignment and safety of large language models
U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ...
arXiv preprint arXiv:2404.09932, 2024
Cited by: 23; Year: 2024
The WMDP benchmark: Measuring and reducing malicious use with unlearning
N Li, A Pan, A Gopal, S Yue, D Berrios, A Gatti, JD Li, AK Dombrowski, ...
arXiv preprint arXiv:2403.03218, 2024
Cited by: 12; Year: 2024
Feedback loops with language models drive in-context reward hacking
A Pan, E Jones, M Jagadeesan, J Steinhardt
arXiv preprint arXiv:2402.06627, 2024
Cited by: 8; Year: 2024
Improving robustness of reinforcement learning for power system control with adversarial training
A Pan, Y Lee, H Zhang, Y Chen, Y Shi
arXiv preprint arXiv:2110.08956, 2021
Cited by: 8; Year: 2021
Articles 1–7