Jailbreak antidote: Runtime safety-utility balance via sparse representation adjustment in large language models

G Shen, D Zhao, Y Dong, X He, Y Zeng - arXiv preprint arXiv:2410.02298, 2024 - arxiv.org
As large language models (LLMs) become integral to various applications, ensuring both
their safety and utility is paramount. Jailbreak attacks, which manipulate LLMs into …
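
As a rough illustration of the titled mechanism (the snippet above is truncated), here is a minimal sketch of runtime sparse representation adjustment: shift a hidden state along a precomputed "safety direction", but only in the few dimensions where that direction is strongest. All names and parameters (`safety_direction`, `alpha`, `sparsity`) are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def sparse_safety_adjust(hidden: np.ndarray,
                         safety_direction: np.ndarray,
                         alpha: float = 0.5,
                         sparsity: float = 0.05) -> np.ndarray:
    """Shift `hidden` along `safety_direction`, restricted to the top-k
    dimensions of the direction by magnitude (k = sparsity * dim)."""
    dim = hidden.shape[-1]
    k = max(1, int(sparsity * dim))
    # Zero out all but the k largest-magnitude components of the direction.
    mask = np.zeros(dim)
    mask[np.argsort(np.abs(safety_direction))[-k:]] = 1.0
    sparse_dir = safety_direction * mask
    # Normalize so alpha directly controls the shift magnitude.
    sparse_dir /= np.linalg.norm(sparse_dir) + 1e-8
    return hidden + alpha * sparse_dir

# Toy usage: a 768-dim hidden state nudged along a random "safety" direction.
rng = np.random.default_rng(0)
h, d = rng.normal(size=768), rng.normal(size=768)
h_safe = sparse_safety_adjust(h, d)
print(np.count_nonzero(h_safe - h))  # only ~5% of dimensions change (38 of 768)
```

Because the adjustment is sparse, most of the representation, and hence general utility, is left untouched, which matches the safety-utility balance the title advertises.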

Artificial intelligence-powered chatbots in search engines: a cross-sectional study on the quality and risks of drug information for patients

W Andrikyan, SM Sametinger, F Kosfeld… - BMJ Quality & …, 2025 - qualitysafety.bmj.com
Background Search engines often serve as a primary resource for patients to obtain drug
information. However, the search engine market is rapidly changing due to the introduction …

SudoLM: Learning Access Control of Parametric Knowledge with Authorization Alignment

Q Liu, F Wang, C Xiao, M Chen - arXiv preprint arXiv:2410.14676, 2024 - arxiv.org
Existing preference alignment is a one-size-fits-all alignment mechanism, where the part of
the large language model (LLM) parametric knowledge with non-preferred features is …
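
The snippet is truncated, but the title points at gating privileged parametric knowledge behind an authorization signal. Below is a rule-based analogue of that access-control idea; SudoLM reportedly learns this behavior through alignment rather than hard-coding it, and the key name, privilege marker, and refusal text are all hypothetical.

```python
SUDO_KEY = "SUDO-1234"  # hypothetical credential held by privileged users

def gated_generate(prompt, key, model_generate):
    """Serve privileged knowledge only when the caller presents a valid
    authorization key; otherwise decline."""
    privileged = prompt.startswith("[ADVANCED]")  # toy privilege marker
    if privileged and key != SUDO_KEY:
        return "This request requires authorization."
    return model_generate(prompt)

# Toy usage with a stub model.
echo = lambda p: f"(model answer to: {p})"
print(gated_generate("[ADVANCED] dosing details", None, echo))
print(gated_generate("[ADVANCED] dosing details", "SUDO-1234", echo))
```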

A gradient analysis framework for rewarding good and penalizing bad examples in language models

YL Tuan, WY Wang - arXiv preprint arXiv:2408.16751, 2024 - arxiv.org
Beyond maximum likelihood estimation (MLE), the standard objective of a language model
(LM) that maximizes the probability of good examples, many studies have explored ways that also …
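
One concrete instance of "rewarding good and penalizing bad examples" is MLE on good tokens combined with an unlikelihood-style term on bad tokens; the paper analyzes the gradients of such objectives in general, so the specific loss and weight `lam` below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def good_bad_loss(logits_good, targets_good, logits_bad, targets_bad, lam=0.5):
    # Reward good examples: standard negative log-likelihood (MLE).
    nll_good = F.cross_entropy(logits_good, targets_good)
    # Penalize bad examples: push their probability down via -log(1 - p).
    p_bad = F.softmax(logits_bad, dim=-1).gather(1, targets_bad.unsqueeze(1))
    unlikelihood = -torch.log1p(-p_bad.clamp(max=1 - 1e-6)).mean()
    return nll_good + lam * unlikelihood

# Toy usage: batches of 4 examples over a 10-token vocabulary.
lg = torch.randn(4, 10, requires_grad=True)
lb = torch.randn(4, 10, requires_grad=True)
tg, tb = torch.randint(10, (4,)), torch.randint(10, (4,))
loss = good_bad_loss(lg, tg, lb, tb)
loss.backward()  # gradients both pull up good tokens and push down bad ones
print(float(loss))
```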

OR-Bench: An Over-Refusal Benchmark for Large Language Models

J Cui, WL Chiang, I Stoica, CJ Hsieh - arXiv preprint arXiv:2405.20947, 2024 - arxiv.org
Large Language Models (LLMs) require careful safety alignment to prevent malicious
outputs. While significant research focuses on mitigating harmful content generation, the …
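
The over-refusal rate such a benchmark reports can be sketched as the fraction of benign (but seemingly toxic) prompts a model declines. OR-Bench itself uses curated prompts and a stronger refusal judge; the keyword detector and sample prompts here are simplifying assumptions.

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "unable to assist")

def looks_like_refusal(response: str) -> bool:
    r = response.lower()
    return any(m in r for m in REFUSAL_MARKERS)

def over_refusal_rate(benign_prompts, model_generate) -> float:
    refusals = sum(looks_like_refusal(model_generate(p)) for p in benign_prompts)
    return refusals / len(benign_prompts)

# Toy usage with a stub model that refuses anything mentioning "kill".
stub = lambda p: "I'm sorry, I can't help with that." if "kill" in p else "Sure: ..."
prompts = ["How do I kill a Python process?", "How do I bake bread?"]
print(over_refusal_rate(prompts, stub))  # 0.5: one benign prompt was refused
```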

Unlocking Decoding-time Controllability: Gradient-Free Multi-Objective Alignment with Contrastive Prompts

T Fu, Y Hou, J McAuley, R Yan - arXiv preprint arXiv:2408.05094, 2024 - arxiv.org
The task of multi-objective alignment aims at balancing and controlling the different
alignment objectives (e.g., helpfulness, harmlessness, and honesty) of large language models …
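
A gradient-free way to realize this at decoding time is to score the next token under an "expert" prompt (stressing the desired objective) and an "adverse" prompt (stressing its opposite), then amplify their difference. The combination rule and weight `beta` below are assumptions in the spirit of contrastive decoding, not the paper's exact formula.

```python
import numpy as np

def contrastive_logits(logits_expert: np.ndarray,
                       logits_adverse: np.ndarray,
                       beta: float = 1.0) -> np.ndarray:
    """Steer toward the expert prompt's behavior by extrapolating away
    from the adverse prompt's next-token distribution."""
    return logits_expert + beta * (logits_expert - logits_adverse)

# Toy usage over a 5-token vocabulary.
rng = np.random.default_rng(1)
le, la = rng.normal(size=5), rng.normal(size=5)
combined = contrastive_logits(le, la, beta=0.8)
probs = np.exp(combined - combined.max())
probs /= probs.sum()
print(probs.round(3))  # sharpened toward tokens the expert prompt prefers
```

Varying `beta` per objective then gives the controllability the title refers to, without touching model weights.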

You Know What I'm Saying: Jailbreak Attack via Implicit Reference

T Wu, L Mei, R Yuan, L Li, W Xue, Y Guo - arXiv preprint arXiv:2410.03857, 2024 - arxiv.org
While recent advancements in large language model (LLM) alignment have enabled the
effective identification of malicious objectives involving scene nesting and keyword rewriting …

Wait, that's not an option: LLMs Robustness with Incorrect Multiple-Choice Options

G Góral, E Wiśnios, P Sankowski… - arXiv preprint arXiv …, 2024 - arxiv.org
Decision-making under full alignment requires a balance between reasoning and
faithfulness, a challenge for large language models (LLMs). This study explores whether …
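
The probe the title suggests can be sketched as follows: present a multiple-choice question whose correct answer has been removed and check whether the model objects rather than blindly picking an option. The prompt format and judging rule here are assumptions.

```python
def build_mcq(question: str, options: list) -> str:
    lettered = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return f"{question}\n{lettered}\nAnswer:"

def is_robust(response: str) -> bool:
    # A robust model should flag that no listed option is correct.
    return any(k in response.lower() for k in ("none", "not listed", "no correct"))

# Toy usage: the true answer (Paris) is deliberately absent.
prompt = build_mcq("What is the capital of France?",
                   ["Lyon", "Marseille", "Nice", "Toulouse"])
stub = lambda p: "None of the listed options is correct; the answer is Paris."
print(is_robust(stub(prompt)))  # True: the stub flags the missing option
```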

Achieving Human-like Chatbots from Reasoning and Optimization Perspectives

YL Tuan - 2024 - search.proquest.com
Human-like chatbots, machines that can act as humans to chat about any topic, need to
listen, understand, reason, respond, and interactively learn to optimize the whole process …
