arxiv papers 1 min read

Improving LLM Safety Alignment with Dual-Objective Optimization

Link: http://arxiv.org/abs/2503.03710v1

PDF Link: http://arxiv.org/pdf/2503.03710v1

Summary: Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.

Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts, as its loss function proves suboptimal for refusal learning.

Through gradient-based analysis, we identify these shortcomings and propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
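To make the disentangled objective concrete, here is a minimal PyTorch-style sketch that combines a refusal cross-entropy term with a gradient-ascent-style unlearning term. The abstract does not give the paper's exact formulation; the function name `dual_objective_loss`, the trade-off weights `alpha` and `beta`, and the use of a negated cross-entropy for unlearning are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dual_objective_loss(refusal_logits, refusal_labels,
                        harmful_logits, harmful_labels,
                        alpha=1.0, beta=0.5):
    """Sketch of a dual-objective safety loss (assumed form, not the paper's code).

    refusal_logits/labels: outputs and targets for refusal continuations,
        including cases where a partial unsafe generation is already in context.
    harmful_logits/labels: outputs and targets on harmful completions whose
        likelihood should be pushed down (targeted unlearning).
    alpha, beta: assumed trade-off weights between the two objectives.
    """
    # (1) Robust refusal training: standard next-token loss on refusal targets.
    refusal_loss = F.cross_entropy(
        refusal_logits.view(-1, refusal_logits.size(-1)),
        refusal_labels.view(-1),
        ignore_index=-100,
    )

    # (2) Targeted unlearning: raise the negative log-likelihood of harmful
    # continuations (i.e., gradient ascent on their cross-entropy).
    harmful_nll = F.cross_entropy(
        harmful_logits.view(-1, harmful_logits.size(-1)),
        harmful_labels.view(-1),
        ignore_index=-100,
    )
    unlearning_loss = -harmful_nll

    return alpha * refusal_loss + beta * unlearning_loss
```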

This approach significantly increases LLM robustness against a wide range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks, across both in-distribution and out-of-distribution scenarios.

Furthermore, we introduce a method to emphasize critical refusal tokens by incorporating a reward-based token-level weighting mechanism for refusal learning, which further improves robustness against adversarial exploits.
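The token-level weighting idea can likewise be sketched as a re-weighted cross-entropy in which higher-reward positions (for example, the opening tokens of a refusal) contribute more to the loss. How the per-token rewards are computed is not specified in the abstract; `token_rewards` is treated as a given tensor here, and the normalization scheme is an assumption.

```python
import torch
import torch.nn.functional as F

def weighted_refusal_loss(logits, labels, token_rewards):
    """Sketch of reward-weighted refusal loss (assumed form).

    logits: (batch, seq_len, vocab) model outputs.
    labels: (batch, seq_len) target token ids, -100 at ignored positions.
    token_rewards: (batch, seq_len) nonnegative scores emphasizing
        critical refusal tokens.
    """
    # Per-token negative log-likelihood; ignored positions contribute 0.
    per_token_nll = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
        reduction="none",
    ).view(labels.shape)

    # Weight each position by its reward, masking out ignored labels.
    mask = (labels != -100).float()
    weights = token_rewards * mask

    # Normalize so the weighted loss stays on a scale comparable to plain CE.
    return (weights * per_token_nll).sum() / weights.sum().clamp_min(1e-8)
```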

Our research also suggests that robustness to jailbreak attacks is correlated with token distribution shifts during training and with internal representations of refusal and harmful tokens, offering valuable directions for future research in LLM safety alignment.

The code is available at https://github.com/wicai24/DOOR-Alignment

Published on arXiv on: 2025-03-05T18:01:05Z