Link: http://arxiv.org/abs/2502.12659v1
PDF Link: http://arxiv.org/pdf/2502.12659v1
Summary: The rapid development of large reasoning models, such as OpenAI-o3 and DeepSeek-R1, has led to significant improvements in complex reasoning over non-reasoning large language models (LLMs). However, their enhanced capabilities, combined with the open-source access of models like DeepSeek-R1, raise serious safety concerns, particularly regarding their potential for misuse. In this work, we present a comprehensive safety assessment of these reasoning models, leveraging established safety benchmarks to evaluate their compliance with safety regulations. Furthermore, we investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications.
Through our multi-faceted analysis, we uncover four key findings: (1) There is a significant safety gap between the open-source R1 models and the o3-mini model, both on safety benchmarks and under attack, suggesting that more safety effort is needed on R1. (2) The distilled reasoning models show poorer safety performance compared to their safety-aligned base models. (3) The stronger the model's reasoning ability, the greater the potential harm it may cause when answering unsafe questions. (4) The thinking process in R1 models poses greater safety concerns than their final answers.
Our study provides insights into the security implications of reasoning models and highlights the need for further advancements in R1 models' safety to close the gap.
Published on arXiv: 2025-02-18T09:06:07Z