Link: http://arxiv.org/abs/2504.05652v1
PDF Link: http://arxiv.org/pdf/2504.05652v1
Summary: Large Language Models (LLMs) have become increasingly integral to a wide range of applications.
However, they remain vulnerable to jailbreak attacks, in which attackers craft prompts that elicit malicious outputs from the models.
Analyzing jailbreak methods helps us probe the weaknesses of LLMs and improve them.
In this paper, we reveal a vulnerability in large language models (LLMs), which we term Defense Threshold Decay (DTD), by analyzing the attention weights of the model's output on its input and of subsequent output on prior output: as the model generates substantial benign content, its attention weights shift from the input to the prior output, making it more susceptible to jailbreak attacks.
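The attention-shift measurement behind DTD can be illustrated with a toy sketch (this is an assumption-laden illustration, not the paper's actual procedure): for each generated position, split the softmax attention row into mass placed on the original input tokens versus mass placed on previously generated tokens. The function name `attention_mass_split` and the uniform-attention rows below are hypothetical.

```python
import numpy as np

def attention_mass_split(attn_row, n_input):
    """Split a 1-D softmax attention row over all prior tokens into
    (mass on the first n_input input tokens, mass on generated tokens)."""
    attn_row = np.asarray(attn_row, dtype=float)
    return attn_row[:n_input].sum(), attn_row[n_input:].sum()

# Simulated rows: as more tokens are generated, a larger share of the
# attended keys are generated tokens, so even uniform attention drifts
# toward prior output -- the kind of shift DTD hypothesizes.
n_input = 4
for k in (1, 8, 32):  # number of already-generated tokens
    row = np.full(n_input + k, 1.0 / (n_input + k))  # uniform attention
    inp, out = attention_mass_split(row, n_input)
    print(round(inp, 3), round(out, 3))
# -> 0.8 0.2, then 0.333 0.667, then 0.111 0.889
```

In a real measurement one would use the model's actual attention tensors rather than uniform rows; the sketch only shows the bookkeeping of input-mass versus output-mass.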
To demonstrate the exploitability of DTD, we propose a novel jailbreak attack method, Sugar-Coated Poison (SCP), which induces the model to generate substantial benign content through benign input and adversarial reasoning, subsequently producing malicious content.
To mitigate such attacks, we introduce a simple yet effective defense strategy, POSD, which significantly reduces jailbreak success rates while preserving the model's generalization capabilities.
Published on arXiv on: 2025-04-08T03:57:09Z