
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning

Link: http://arxiv.org/abs/2505.16186v1

PDF Link: http://arxiv.org/pdf/2505.16186v1

Summary: Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering, leading to remarkable improvements in complex tasks.

However, they pose serious safety risks against harmful queries and adversarial attacks.

While the recent mainstream safety effort on LRMs, supervised fine-tuning (SFT), improves safety performance, we find that SFT-aligned models struggle to generalize to unseen jailbreak prompts.

After a thorough investigation of LRMs' generation, we identify a safety aha moment that can activate safety reasoning and lead to a safe response.

This aha moment typically appears in the "key sentence", which follows the model's query-understanding process and can indicate whether the model will proceed safely.

Based on these insights, we propose SafeKey, which includes two complementary objectives to better activate the safety aha moment in the key sentence: (1) a Dual-Path Safety Head to enhance the safety signal in the model's internal representations before the key sentence, and (2) a Query-Mask Modeling objective to improve the model's attention on its own query understanding, which carries important safety hints.
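The abstract describes the two objectives only at a high level; the PyTorch sketch below shows one plausible way they could be wired up. Everything here is an illustrative assumption, not the paper's actual implementation: the names (`DualPathSafetyHead`, `query_mask_lm_loss`, `key_start`, the span arguments), the mean-pooling choice, and the use of the attention mask to hide the query are all guesses consistent with the abstract's description.

```python
# Hypothetical sketch of SafeKey's two objectives as the abstract
# describes them. Not the authors' code; names and design details
# are assumptions for illustration only.
import torch
import torch.nn as nn


class DualPathSafetyHead(nn.Module):
    """Assumed design: a small probe over the hidden states that
    precede the key sentence, predicting a binary safety label to
    strengthen the safety signal in internal representations."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.probe = nn.Linear(hidden_size, 2)  # safe vs. unsafe logits

    def forward(self, hidden_states: torch.Tensor, key_start: int) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden). Mean-pool the
        # positions before the key sentence begins (assumed pooling).
        pooled = hidden_states[:, :key_start, :].mean(dim=1)
        return self.probe(pooled)


def query_mask_lm_loss(model, input_ids, attention_mask, query_span, target_span):
    """Assumed Query-Mask Modeling loss: hide the raw query via the
    attention mask so that predicting the target tokens must rely on
    the model's own query-understanding text instead."""
    masked_attn = attention_mask.clone()
    masked_attn[:, query_span[0]:query_span[1]] = 0  # query is invisible

    # Compute LM loss only on the target span (HuggingFace-style
    # causal LM: label -100 positions are ignored by the loss).
    labels = torch.full_like(input_ids, -100)
    labels[:, target_span[0]:target_span[1]] = input_ids[:, target_span[0]:target_span[1]]

    out = model(input_ids=input_ids, attention_mask=masked_attn, labels=labels)
    return out.loss
```

In this sketch the two losses would be added to the standard SFT objective during fine-tuning; how the paper actually weights or combines them is not stated in the abstract.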

Experiments across multiple safety benchmarks demonstrate that our methods significantly improve safety generalization to a wide range of jailbreak attacks and out-of-distribution harmful prompts, lowering the average harmfulness rate by 9.6%, while maintaining general abilities.

Our analysis reveals how SafeKey enhances safety by reshaping internal attention and improving the quality of hidden representations.

Published on arXiv: 2025-05-22T03:46:03Z