A survey on large language model (LLM) security and privacy: The good, the bad, and the ugly

Y Yao, J Duan, K Xu, Y Cai, Z Sun, Y Zhang - High-Confidence Computing, 2024 - Elsevier
Large Language Models (LLMs), such as ChatGPT and Bard, have revolutionized
natural language understanding and generation. They possess deep language …

Survey of vulnerabilities in large language models revealed by adversarial attacks

E Shayegani, MAA Mamun, Y Fu, P Zaree… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as
they integrate more deeply into complex systems, the urgency to scrutinize their security …

Visual adversarial examples jailbreak aligned large language models

X Qi, K Huang, A Panda, P Henderson… - Proceedings of the …, 2024 - ojs.aaai.org
Warning: this paper contains data, prompts, and model outputs that are offensive in nature.
Recently, there has been a surge of interest in integrating vision into Large Language …

Combating misinformation in the age of LLMs: Opportunities and challenges

C Chen, K Shu - arXiv preprint arXiv:2311.05656, 2023 - arxiv.org
Misinformation such as fake news and rumors is a serious threat to information ecosystems
and public trust. The emergence of Large Language Models (LLMs) has great potential to …

Explore, establish, exploit: Red teaming language models from scratch

S Casper, J Lin, J Kwon, G Culp… - arXiv preprint arXiv …, 2023 - arxiv.org
Deploying large language models (LLMs) can pose hazards from harmful outputs such as
toxic or dishonest speech. Prior work has introduced tools that elicit harmful outputs in order …

Defending against alignment-breaking attacks via robustly aligned LLM

B Cao, Y Cao, L Lin, J Chen - arXiv preprint arXiv:2309.14348, 2023 - arxiv.org
Recently, Large Language Models (LLMs) have made significant advancements and are
now widely used across various domains. Unfortunately, there has been a rising concern …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies

L Pan, M Saxon, W Xu, D Nathani, X Wang… - Transactions of the …, 2024 - direct.mit.edu
While large language models (LLMs) have shown remarkable effectiveness in various NLP
tasks, they are still prone to issues such as hallucination, unfaithful reasoning, and toxicity. A …

AutoDAN: Automatic and interpretable adversarial attacks on large language models

S Zhu, R Zhang, B An, G Wu, J Barrow, Z Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Safety alignment of Large Language Models (LLMs) can be compromised by manual
jailbreak attacks and (automatic) adversarial attacks. Recent work suggests that patching …

SafeDecoding: Defending against jailbreak attacks via safety-aware decoding

Z Xu, F Jiang, L Niu, J Jia, BY Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) become increasingly integrated into real-world
applications such as code generation and chatbot assistance, extensive efforts have been …