Link: http://arxiv.org/abs/2511.06852v1
PDF Link: http://arxiv.org/pdf/2511.06852v1
Summary: Safety alignment instills in Large Language Models (LLMs) a critical capacity to refuse malicious requests.
Prior works have modeled this refusal mechanism as a single linear direction in the activation space.
We posit that this is an oversimplification that conflates two functionally distinct neural processes: the detection of harm and the execution of a refusal.
In this work, we deconstruct this single representation into a Harm Detection Direction and a Refusal Execution Direction.
Leveraging this fine-grained model, we introduce Differentiated Bi-Directional Intervention (DBDI), a new white-box framework that precisely neutralizes the safety alignment at a critical layer.
DBDI applies adaptive projection nullification to the refusal execution direction while suppressing the harm detection direction via direct steering.
Extensive experiments demonstrate that DBDI outperforms prominent jailbreaking methods, achieving up to a 97.88% attack success rate on models such as Llama-2.
By providing a more granular and mechanistic framework, our work offers a new direction for the in-depth understanding of LLM safety alignment.
Published on arXiv on: 2025-11-10T08:52:34Z
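
Note: the abstract does not spell out the exact form of the intervention, but a minimal sketch of a bi-directional activation edit can illustrate the idea: project out the component of a layer's hidden states along a refusal-execution direction, and steer the activations against a harm-detection direction. The function name, the alpha/beta strengths, and the assumption that both directions are supplied as unit vectors are illustrative and not taken from the paper.

import torch

def dbdi_style_intervention(hidden, d_refusal, d_harm, alpha=1.0, beta=1.0):
    # hidden:      (batch, seq, d_model) activations at the intervened layer
    # d_refusal:   (d_model,) vector approximating the refusal-execution direction
    # d_harm:      (d_model,) vector approximating the harm-detection direction
    # alpha, beta: illustrative intervention strengths (not specified in the abstract)
    d_refusal = d_refusal / d_refusal.norm()
    d_harm = d_harm / d_harm.norm()

    # Projection nullification: remove the hidden states' component along
    # the refusal-execution direction.
    proj = (hidden @ d_refusal).unsqueeze(-1) * d_refusal
    hidden = hidden - alpha * proj

    # Direct steering: shift activations against the harm-detection direction
    # to suppress the harm signal.
    hidden = hidden - beta * d_harm

    return hidden

In practice such an edit would be applied as a forward hook on the residual stream of the layer the directions were extracted from; the adaptive scaling and layer selection used by DBDI are detailed in the paper itself.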