Link: http://arxiv.org/abs/2508.19697v1
PDF Link: http://arxiv.org/pdf/2508.19697v1
Summary: Current safety alignment for large language models (LLMs) continues to present vulnerabilities, given that adversarial prompting can effectively bypass their safety measures.
Our investigation shows that these safety mechanisms predominantly depend on a limited subset of attention heads: removing or ablating these heads can severely compromise model safety.
To identify and evaluate these safety-critical components, we introduce RDSHA, a targeted ablation method that leverages the model's refusal direction to pinpoint the attention heads most responsible for safety behaviors.
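The abstract does not spell out how RDSHA scores heads, so the following is a minimal sketch of one plausible reading: compute a difference-of-means refusal direction from harmful vs. harmless prompt activations, then rank heads by how strongly their residual-stream contributions project onto that direction. All function names, tensor layouts, and the toy data are assumptions for illustration, not the paper's released code.

```python
# Hedged sketch of a refusal-direction head-scoring/ablation pipeline.
# Shapes and names are hypothetical; real use would hook a model's forward pass.
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between harmful and harmless prompt activations.
    harmful_acts, harmless_acts: [num_prompts, d_model] residual-stream activations
    (layer and token position left unspecified here)."""
    d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return d / d.norm()

def head_scores(per_head_outputs: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Score each head by the mean projection of its output onto the refusal direction.
    per_head_outputs: [num_prompts, n_layers, n_heads, d_model], each head's
    contribution to the residual stream on harmful prompts (assumed layout)."""
    proj = torch.einsum("plhd,d->plh", per_head_outputs, direction)
    return proj.mean(dim=0)  # [n_layers, n_heads]

def top_k_heads(scores: torch.Tensor, k: int):
    """Return (layer, head) indices of the k heads most aligned with the direction."""
    n_heads = scores.shape[1]
    idx = torch.topk(scores.flatten(), k).indices
    return [(int(i // n_heads), int(i % n_heads)) for i in idx]

if __name__ == "__main__":
    # Toy demonstration with random tensors standing in for real model activations.
    torch.manual_seed(0)
    n_prompts, n_layers, n_heads, d_model = 8, 4, 6, 32
    direction = refusal_direction(torch.randn(n_prompts, d_model),
                                  torch.randn(n_prompts, d_model))
    per_head = torch.randn(n_prompts, n_layers, n_heads, d_model)
    print("candidate safety heads:", top_k_heads(head_scores(per_head, direction), k=3))
    # Ablation (not shown) would zero or project out these heads' outputs during
    # the forward pass and measure the resulting drop in refusal behavior.
```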
Further analysis shows that existing jailbreak attacks exploit this concentration by selectively bypassing or manipulating these critical attention heads.
To address this issue, we propose AHD, a novel training strategy designed to promote the distributed encoding of safety-related behaviors across numerous attention heads.
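The abstract does not describe AHD's training objective; one way to picture "distributing" safety behavior is an auxiliary penalty that discourages refusal-direction alignment from concentrating in a few heads. The entropy-style regularizer below is purely an illustrative assumption, not the paper's loss.

```python
# Hedged illustration: penalize concentration of per-head refusal-direction alignment.
import torch

def concentration_penalty(head_alignment: torch.Tensor) -> torch.Tensor:
    """head_alignment: [n_layers, n_heads] non-negative alignment scores
    (e.g., squared projections onto a refusal direction).
    Returns a scalar that is ~0 when alignment is spread uniformly across heads
    and grows when a handful of heads dominate."""
    p = head_alignment.flatten()
    p = p / (p.sum() + 1e-8)                 # normalize to a distribution over heads
    entropy = -(p * (p + 1e-8).log()).sum()  # high entropy = spread out
    max_entropy = torch.log(torch.tensor(float(p.numel())))
    return max_entropy - entropy

# During safety fine-tuning, such a term could be added to the task loss, e.g.:
#   loss = alignment_loss + lambda_reg * concentration_penalty(head_alignment)
```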
Experimental results demonstrate that AHD successfully distributes safety-related capabilities across more attention heads.
Moreover, evaluations under several mainstream jailbreak attacks show that models trained with AHD exhibit considerably stronger safety robustness, while maintaining overall functional utility.
Published on arXiv on: 2025-08-27T09:06:28Z