arxiv papers 1 min read

Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking

Link: http://arxiv.org/abs/2502.12970v1

PDF Link: http://arxiv.org/pdf/2502.12970v1

Summary: The reasoning abilities of Large Language Models (LLMs) have demonstrated remarkable advancement and exceptional performance across diverse domains.

However, leveraging these reasoning capabilities to enhance LLM safety against adversarial attacks and jailbreak queries remains largely unexplored.

To bridge this gap, we propose Reasoning-to-Defend (R2D), a novel training paradigm that integrates safety reflections of queries and responses into LLMs' generation process, unlocking a safety-aware reasoning mechanism.

This approach enables self-evaluation at each reasoning step to create safety pivot tokens as indicators of the response's safety status.
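The abstract does not give implementation details, but a minimal sketch of step-wise self-evaluation with pivot tokens might look like the following; the `[SAFE]`/`[UNSAFE]` token names and the `generate_step`/`judge_is_safe` helpers are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch: interleave safety "pivot tokens" with reasoning steps.
# Token names and model helper methods are assumptions for illustration only.

SAFE_PIVOT, UNSAFE_PIVOT = "[SAFE]", "[UNSAFE]"

def generate_with_safety_pivots(model, query, max_steps=8):
    """Generate reasoning steps, appending a safety pivot token after each one."""
    trace = [query]
    for _ in range(max_steps):
        step = model.generate_step("\n".join(trace))  # next reasoning step (assumed helper)
        is_safe = model.judge_is_safe("\n".join(trace + [step]))  # assumed helper
        pivot = SAFE_PIVOT if is_safe else UNSAFE_PIVOT
        trace.append(step + " " + pivot)
        if pivot == UNSAFE_PIVOT:
            # Once an unsafe status is flagged, redirect toward a refusal.
            trace.append("I can't help with that request.")
            break
    return "\n".join(trace)
```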

Furthermore, in order to improve the learning efficiency of pivot token prediction, we propose Contrastive Pivot Optimization (CPO), which enhances the model's ability to perceive the safety status of dialogues.
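The abstract does not state the CPO objective itself; as a rough sketch, a contrastive loss over the two pivot-token logits could take the form below, where restricting the comparison to a safe/unsafe token pair and using cross-entropy over that pair are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_pivot_loss(logits, safe_id, unsafe_id, target_is_safe):
    """Illustrative contrastive objective over the pivot-token logits.

    logits: (batch, vocab) next-token logits at the pivot position.
    target_is_safe: (batch,) bool tensor with the ground-truth safety label.
    This simply pushes probability mass toward the correct pivot token and
    away from the incorrect one; it is not the paper's exact formulation.
    """
    pair = logits[:, [safe_id, unsafe_id]]   # (batch, 2): safe vs. unsafe logit
    target = (~target_is_safe).long()        # 0 -> safe column, 1 -> unsafe column
    return F.cross_entropy(pair, target)
```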

Through this mechanism, LLMs dynamically adjust their response strategies during reasoning, significantly enhancing their defense capabilities against jailbreak attacks.

Extensive experimental results demonstrate that R2D effectively mitigates various attacks and improves overall safety, highlighting the substantial potential of safety-aware reasoning in strengthening LLMs' robustness against jailbreaks.

Published on arXiv on: 2025-02-18T15:48:46Z