Constitutional AI: Harmlessness from AI feedback Y Bai, S Kadavath, S Kundu, A Askell, J Kernion, A Jones, A Chen, ... arXiv preprint arXiv:2212.08073, 2022 | 1107 | 2022 |
Measuring progress on scalable oversight for large language models SR Bowman, J Hyun, E Perez, E Chen, C Pettit, S Heiner, K Lukošiūtė, ... arXiv preprint arXiv:2211.03540, 2022 | 86 | 2022 |
LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B S Lermen, C Rogers-Smith, J Ladish arXiv preprint arXiv:2310.20624, 2023 | 67 | 2023 |
BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B P Gade, S Lermen, C Rogers-Smith, J Ladish arXiv preprint arXiv:2311.00117, 2023 | 19 | 2023 |
Constitutional AI: harmlessness from AI feedback. 2022 Y Bai, S Kadavath, S Kundu, A Askell, J Kernion, A Jones, A Chen, ... arXiv preprint arXiv:2212.08073, 2022 | 18 | 2022 |
Open problems in technical AI governance A Reuel, B Bucknall, S Casper, T Fist, L Soder, O Aarne, L Hammond, ... arXiv preprint arXiv:2407.14981, 2024 | 11 | 2024 |
Constitutional AI: Harmlessness from AI feedback. arXiv 2022 Y Bai, S Kadavath, S Kundu, A Askell, J Kernion, A Jones, A Chen, ... arXiv preprint arXiv:2212.08073, 2023 | 10 | 2023 |
Hands-on cybersecurity exercises for introductory classes: tutorial presentation R Weiss, J Ladish, J Mache, ME Locasto Journal of Computing Sciences in Colleges 32 (1), 173-175, 2016 | 5 | 2016 |
Constitutional AI: Harmlessness from AI Feedback, December 2022 Y Bai, S Kadavath, S Kundu, A Askell, J Kernion, A Jones, A Chen, ... URL http://arxiv.org/abs/2212.08073 1, 0 | 5 | |
Information security considerations for AI and the long term future J Ladish, L Heim URL: https://blog.heim.xyz/information-securityconsiderations-for-ai …, 2022 | 4 | 2022 |
Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits A Draguns, A Gritsevskiy, SR Motwani, C Rogers-Smith, J Ladish, ... arXiv preprint arXiv:2406.02619, 2024 | | 2024 |