Link: http://arxiv.org/abs/2505.16241v1
PDF Link: http://arxiv.org/pdf/2505.16241v1
Summary: Recently, Large Reasoning Models (LRMs) have demonstrated superior logical capabilities compared to traditional Large Language Models (LLMs), gaining significant attention.
Despite their impressive performance, the potential for stronger reasoning abilities to introduce more severe security vulnerabilities remains largely underexplored.
Existing jailbreak methods often struggle to balance effectiveness with robustness against adaptive safety mechanisms.
In this work, we propose SEAL, a novel jailbreak attack that targets LRMs through an adaptive encryption pipeline designed to override their reasoning processes and evade potential adaptive alignment.
Specifically, SEAL introduces a stacked encryption approach that combines multiple ciphers to overwhelm the model's reasoning capabilities, effectively bypassing built-in safety mechanisms.
To further prevent LRMs from developing countermeasures, we incorporate two dynamic strategies - random and adaptive - that adjust the cipher length, order, and combination.
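The abstract does not give implementation details, but the stacked-encryption idea can be illustrated with a minimal Python sketch: a randomly sized, randomly ordered chain of simple ciphers applied to an input prompt. The specific ciphers (Caesar shift, ROT13, Base64, reversal), function names, and layer counts below are assumptions for illustration, not the paper's actual SEAL pipeline.

```python
import base64
import codecs
import random

def caesar_shift(text: str, shift: int = 3) -> str:
    # Shift alphabetic characters by a fixed offset, leaving other characters unchanged.
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical cipher layers; the paper's real cipher set is not specified in this abstract.
CIPHERS = {
    "caesar": lambda s: caesar_shift(s, 3),
    "rot13": lambda s: codecs.encode(s, "rot_13"),
    "base64": lambda s: base64.b64encode(s.encode()).decode(),
    "reverse": lambda s: s[::-1],
}

def stack_encrypt(prompt: str, min_layers: int = 2, max_layers: int = 4, seed=None):
    """Apply a randomly sized, randomly ordered stack of ciphers to a prompt.

    Returns the layered ciphertext and the order of ciphers applied, mimicking
    the 'random' strategy of varying cipher length, order, and combination.
    """
    rng = random.Random(seed)
    depth = rng.randint(min_layers, max_layers)       # random stack length
    order = rng.choices(list(CIPHERS), k=depth)       # random combination and order
    encoded = prompt
    for name in order:
        encoded = CIPHERS[name](encoded)
    return encoded, order

if __name__ == "__main__":
    payload, layers = stack_encrypt("example prompt", seed=0)
    print("layers applied:", layers)
    print("stacked ciphertext:", payload)
```

An adaptive variant would presumably choose the next cipher based on the target model's responses rather than uniformly at random, but that feedback loop is beyond what the abstract describes.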
Extensive experiments on real-world reasoning models, including DeepSeek-R1, Claude Sonnet, and OpenAI GPT-o4, validate the effectiveness of our approach.
Notably, SEAL achieves an attack success rate of 80.8% on GPT o4-mini, outperforming state-of-the-art baselines by a significant margin of 27.2%.
Warning: This paper contains examples of inappropriate, offensive, and harmful content.
Published on arXiv on: 2025-05-22T05:19:42Z