TrustLLM: Trustworthiness in large language models

Y Huang, L Sun, H Wang, S Wu, Q Zhang, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs), exemplified by ChatGPT, have gained considerable
attention for their excellent natural language processing capabilities. Nonetheless, these …

Position: TrustLLM: Trustworthiness in large language models

Y Huang, L Sun, H Wang, S Wu… - International …, 2024 - proceedings.mlr.press
Large language models (LLMs) have gained considerable attention for their excellent
natural language processing capabilities. Nonetheless, these LLMs present many …

Sleeper agents: Training deceptive LLMs that persist through safety training

E Hubinger, C Denison, J Mu, M Lambert… - arXiv preprint arXiv …, 2024 - arxiv.org
Humans are capable of strategically deceptive behavior: behaving helpfully in most
situations, but then behaving very differently in order to pursue alternative objectives when …

Universal vulnerabilities in large language models: Backdoor attacks for in-context learning

S Zhao, M Jia, LA Tuan, F Pan… - arXiv preprint arXiv …, 2024 - researchgate.net
In-context learning, a paradigm bridging the gap between pre-training and fine-tuning, has
demonstrated high efficacy in several NLP tasks, especially in few-shot settings. Despite …

Backdoor attacks and countermeasures in natural language processing models: A comprehensive security review

P Cheng, Z Wu, W Du, H Zhao, W Lu, G Liu - arXiv preprint arXiv …, 2023 - arxiv.org
Applying third-party data and models has become a new paradigm for language modeling
in NLP, which also introduces some potential security vulnerabilities because attackers can …

AgentPoison: Red-teaming LLM agents via poisoning memory or knowledge bases

Z Chen, Z Xiang, C Xiao, D Song, B Li - arXiv preprint arXiv:2407.12784, 2024 - arxiv.org
LLM agents have demonstrated remarkable performance across various applications,
primarily due to their advanced capabilities in reasoning, utilizing external knowledge and …

BackdoorLLM: A comprehensive benchmark for backdoor attacks on large language models

Y Li, H Huang, Y Zhao, X Ma, J Sun - arXiv preprint arXiv:2408.12798, 2024 - arxiv.org
Generative Large Language Models (LLMs) have made significant strides across various
tasks, but they remain vulnerable to backdoor attacks, where specific triggers in the prompt …

Transferring backdoors between large language models by knowledge distillation

P Cheng, Z Wu, T Ju, W Du, Z Zhang, G Liu - arXiv preprint arXiv:2408.09878, 2024 - arxiv.org
Backdoor attacks have been a serious vulnerability in Large Language Models
(LLMs). However, previous methods only reveal such risks in specific models, or present …

Watch out for your agents! Investigating backdoor threats to LLM-based agents

W Yang, X Bi, Y Lin, S Chen, J Zhou, X Sun - arXiv preprint arXiv …, 2024 - arxiv.org
Leveraging the rapid development of Large Language Models (LLMs), LLM-based agents
have been developed to handle various real-world applications, including finance …

Mitigating backdoor threats to large language models: Advancement and challenges

Q Liu, W Mo, T Tong, J Xu, F Wang, C Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
The advancement of Large Language Models (LLMs) has significantly impacted various
domains, including Web search, healthcare, and software development. However, as these …