W Lu, Z Zeng, J Wang, Z Lu, Z Chen,
H Zhuang… - arXiv preprint arXiv …, 2024 - arxiv.org
Jailbreaking attacks can enable Large Language Models (LLMs) to bypass their safeguards
and generate harmful content. Existing jailbreaking defense methods have failed to address …