Decoding-Time Language Model Alignment with Multiple Objectives

R Shi, Y Chen, Y Hu, AL Liu, N Smith… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning language models (LMs) to human preferences has emerged as a critical pursuit,
enabling these models to better serve diverse user needs. Existing methods primarily focus …

AttackER: Towards Enhancing Cyber-Attack Attribution with a Named Entity Recognition Dataset

P Deka, S Rajapaksha, R Rani, A Almutairi… - arXiv preprint arXiv …, 2024 - arxiv.org
Cyber-attack attribution is an important process that allows experts to put in place attacker-oriented countermeasures and legal actions. Analysts mainly perform attribution …

The Art of Refusal: A Survey of Abstention in Large Language Models

B Wen, J Yao, S Feng, C Xu, Y Tsvetkov… - arXiv preprint arXiv …, 2024 - arxiv.org
Abstention, the refusal of large language models (LLMs) to provide an answer, is
increasingly recognized for its potential to mitigate hallucinations and enhance safety in …

On the Vulnerability of Safety Alignment in Open-Access LLMs

J Yi, R Ye, Q Chen, B Zhu, S Chen, D Lian… - Findings of the …, 2024 - aclanthology.org
Large language models (LLMs) possess immense capabilities but are susceptible to
malicious exploitation. To mitigate the risk, safety alignment is employed to align LLMs with …

Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation

Y Li, Y Liu, Y Li, L Shi, G Deng, S Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have transformed the field of natural language processing,
but they remain susceptible to jailbreaking attacks that exploit their capabilities to generate …

Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning

T Huang, S Hu, F Ilhan, SF Tekin, L Liu - arXiv preprint arXiv:2405.18641, 2024 - arxiv.org
Recent studies show that Large Language Models (LLMs) with safety alignment can be jailbroken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we …

No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks

CT Leong, Y Cheng, K Xu, J Wang, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
The existing safety alignment of Large Language Models (LLMs) has been found to be fragile and can be easily attacked through different strategies, such as fine-tuning on a few harmful …

How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics

N Bhavsar, J Jordan, S Hakimov… - arXiv preprint arXiv …, 2024 - arxiv.org
What makes a good Large Language Model (LLM)? That it performs well on the relevant
benchmarks--which hopefully measure, with some validity, the presence of capabilities that …

Chained Tuning Leads to Biased Forgetting

M Ung, AY Sun, S Bell, L Sagun, A Williams - Trustworthy Multi-modal … - openreview.net
Large language models (LLMs) are often fine-tuned for use on downstream tasks, though
this can degrade capabilities learned during previous training. This phenomenon, often …