Red-teaming for generative AI: Silver bullet or security theater?

M Feffer, A Sinha, WH Deng, ZC Lipton… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
In response to rising concerns surrounding the safety, security, and trustworthiness of
Generative AI (GenAI) models, practitioners and regulators alike have pointed to AI red …

Privacy in large language models: Attacks, defenses and future directions

H Li, Y Chen, J Luo, J Wang, H Peng, Y Kang… - arXiv preprint arXiv …, 2023 - arxiv.org
The advancement of large language models (LLMs) has significantly enhanced the ability to
effectively tackle various downstream NLP tasks and unify these tasks into generative …

Jailbreaking and mitigation of vulnerabilities in large language models

B Peng, Z Bi, Q Niu, M Liu, P Feng, T Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have transformed artificial intelligence by advancing
natural language understanding and generation, enabling applications across fields beyond …

Safer-instruct: Aligning language models with automated preference data

T Shi, K Chen, J Zhao - arXiv preprint arXiv:2311.08685, 2023 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is a vital strategy for enhancing
model safety in language models. However, annotating preference data for RLHF is a …

Defending jailbreak prompts via in-context adversarial game

Y Zhou, Y Han, H Zhuang, K Guo, Z Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) demonstrate remarkable capabilities across diverse
applications. However, concerns regarding their security, particularly the vulnerability to …

Rethinking Machine Ethics: Can LLMs Perform Moral Reasoning through the Lens of Moral Theories?

J Zhou, M Hu, J Li, X Zhang, X Wu, I King… - arXiv preprint arXiv …, 2023 - arxiv.org
Making moral judgments is an essential step toward developing ethical AI systems.
Prevalent approaches are mostly implemented in a bottom-up manner, which uses a large …

See what LLMs cannot answer: A self-challenge framework for uncovering LLM weaknesses

Y Chen, Y Liu, J Yan, X Bai, M Zhong, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
The impressive performance of Large Language Models (LLMs) has consistently surpassed
numerous human-designed benchmarks, presenting new challenges in assessing the …

Eraser: Jailbreaking defense in large language models via unlearning harmful knowledge

W Lu, Z Zeng, J Wang, Z Lu, Z Chen, H Zhuang… - arXiv preprint arXiv …, 2024 - arxiv.org
Jailbreaking attacks can enable Large Language Models (LLMs) to bypass safeguards
and generate harmful content. Existing jailbreaking defense methods have failed to address …

Mission impossible: A statistical perspective on jailbreaking LLMs

J Su, J Kempe, K Ullrich - arXiv preprint arXiv:2408.01420, 2024 - arxiv.org
Large language models (LLMs) are trained on a deluge of text data with limited quality
control. As a result, LLMs can exhibit unintended or even harmful behaviours, such as …

Application of Large Language Models in Cybersecurity: A Systematic Literature Review

I Hasanov, S Virtanen, A Hakkala, J Isoaho - IEEE Access, 2024 - ieeexplore.ieee.org
The emergence of Large Language Models (LLMs) is currently creating a major paradigm
shift in societies and businesses in the way digital technologies are used. While the …