
Jailbreaking? One Step Is Enough!

Link: http://arxiv.org/abs/2412.12621v1

PDF Link: http://arxiv.org/pdf/2412.12621v1

Summary: Large language models (LLMs) excel in various tasks but remain vulnerable to jailbreak attacks, where adversaries manipulate prompts to generate harmful outputs.

Examining jailbreak prompts helps uncover the shortcomings of LLMs.

However, current jailbreak methods and the target model's defenses are engaged in an independent and adversarial process, resulting in the need for frequent attack iterations and for redesigning attacks for different models.

To address these gaps, we propose a Reverse Embedded Defense Attack (REDA) mechanism that disguises the attack intention as a "defense" intention against harmful content.

Specifically, REDA starts from the target response, guiding the model to embed harmful content within its defensive measures, thereby relegating harmful content to a secondary role and making the model believe it is performing a defensive task.

The attacking model believes it is guiding the target model to handle harmful content, while the target model believes it is performing a defensive task, creating an illusion of cooperation between the two.

Additionally, to enhance the model's confidence and guidance in "defensive" intentions, we adopt in-context learning (ICL) with a small number of attack examples and construct a corresponding dataset of attack examples.
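The abstract does not include code, so as a rough illustration of the ICL setup it describes, the sketch below assembles a generic few-shot prompt from demonstration pairs. The function name `build_icl_prompt` and all strings are hypothetical placeholders; the paper's actual REDA templates and attack-example dataset are not reproduced here.

```python
# Minimal sketch of generic few-shot (ICL) prompt assembly.
# All names and strings are hypothetical placeholders, not the
# paper's actual prompts or dataset.

def build_icl_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Concatenate (input, output) demonstration pairs ahead of the
    new query, the standard in-context-learning prompt layout."""
    parts = []
    for demo_input, demo_output in examples:
        parts.append(f"Input: {demo_input}\nOutput: {demo_output}")
    # The query is appended last, with its output left for the model.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

# Usage with neutral placeholders standing in for the demonstration set.
demos = [
    ("<example input 1>", "<example output 1>"),
    ("<example input 2>", "<example output 2>"),
]
print(build_icl_prompt(demos, "<new input>"))
```

The design point the paper leans on is simply that a handful of in-prompt demonstrations can steer the model's framing of the task, here toward treating the request as a defensive one.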

Extensive evaluations demonstrate that REDA enables cross-model attacks without redesigning attack strategies for different models, achieves successful jailbreaks in a single iteration, and outperforms existing methods on both open-source and closed-source models.

Published on arXiv: 2024-12-17T07:33:41Z