Link: http://arxiv.org/abs/2502.12893v1
PDF Link: http://arxiv.org/pdf/2502.12893v1
Summary: Large Reasoning Models (LRMs) have recently extended their powerful reasoning capabilities to safety checks, using chain-of-thought reasoning to decide whether a request should be answered.
While this new approach offers a promising route for balancing model utility and safety, its robustness remains underexplored.
To address this gap, we introduce Malicious-Educator, a benchmark that disguises extremely dangerous or malicious requests beneath seemingly legitimate educational prompts.
Our experiments reveal severe security flaws in popular commercial-grade LRMs, including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking.
For instance, although OpenAI's o1 model initially maintains a high refusal rate of about 98%, subsequent model updates significantly compromise its safety; and attackers can easily extract criminal strategies from DeepSeek-R1 and Gemini 2.0 Flash Thinking without any additional tricks.
To further highlight these vulnerabilities, we propose Hijacking Chain-of-Thought (H-CoT), a universal and transferable attack method that leverages the model's own displayed intermediate reasoning to jailbreak its safety reasoning mechanism.
Under H-CoT, refusal rates sharply decline, dropping from 98% to below 2%, and in some instances an initially cautious tone even shifts into one willing to provide harmful content.
We hope these findings underscore the urgent need for more robust safety mechanisms to preserve the benefits of advanced reasoning capabilities without compromising ethical standards.
Published on arXiv on: 2025-02-18T14:29:12Z