Link: http://arxiv.org/abs/2502.03052v1
PDF Link: http://arxiv.org/pdf/2502.03052v1
Summary: Jailbreaking attacks can effectively manipulate open-source large language models (LLMs) to produce harmful responses.
However, these attacks exhibit limited transferability, failing to disrupt proprietary LLMs consistently.
To reliably identify vulnerabilities in proprietary LLMs, this work investigates the transferability of jailbreaking attacks by analysing their impact on the model's intent perception.
By incorporating adversarial sequences, these attacks can redirect the source LLM's focus away from malicious-intent tokens in the original input, thereby obstructing the model's intent recognition and eliciting harmful responses.
Nevertheless, these adversarial sequences fail to mislead the target LLM's intent perception, allowing the target LLM to refocus on malicious-intent tokens and abstain from responding.
Our analysis further reveals the inherent distributional dependency within the generated adversarial sequences, whose effectiveness stems from overfitting the source LLM's parameters, resulting in limited transferability to target LLMs.
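To make the intent-perception analysis concrete, here is a minimal sketch of how one might probe which prompt tokens a source LLM perceives as important, with and without an appended adversarial suffix. The model name, the leave-one-out ablation, and the refusal-log-probability proxy are illustrative assumptions, not the paper's actual metric.

```python
# Hypothetical sketch: leave-one-out probing of how much each prompt token
# drives the model's refusal behaviour. The refusal proxy (log-prob of a
# refusal opener such as "Sorry") and the model name are assumptions for
# illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed open-source "source" LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

REFUSAL_OPENER = "Sorry"  # crude proxy for the model recognising malicious intent


@torch.no_grad()
def refusal_logprob(prompt: str) -> float:
    """Log-probability that the model starts its reply with the refusal opener."""
    ids = tokenizer(prompt, return_tensors="pt").to(model.device)
    next_token_logits = model(**ids).logits[0, -1]
    refusal_id = tokenizer.encode(REFUSAL_OPENER, add_special_tokens=False)[0]
    return torch.log_softmax(next_token_logits, dim=-1)[refusal_id].item()


def perceived_importance(prompt: str) -> dict:
    """Leave-one-out importance per word position: drop in refusal log-prob
    when that word is removed (larger drop => more important to intent recognition)."""
    words = prompt.split()
    base = refusal_logprob(prompt)
    scores = {}
    for i in range(len(words)):
        ablated = " ".join(words[:i] + words[i + 1:])
        scores[i] = base - refusal_logprob(ablated)
    return scores

# Comparing these scores for a raw harmful prompt versus the same prompt with
# an appended adversarial suffix indicates whether the suffix pulls perceived
# importance away from the malicious-intent tokens on this (source) model.
```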
To this end, we propose the Perceived-importance Flatten (PiF) method, which uniformly disperses the model's focus across neutral-intent tokens in the original input, thus obscuring malicious-intent tokens without relying on overfitted adversarial sequences.
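The abstract does not spell out the PiF procedure, so the sketch below is only a hedged reconstruction: a greedy editing loop that rewrites neutral-intent words (using an assumed synonym table and an importance probe such as the one in the previous sketch) until per-token perceived importance is roughly uniform. The function names, the variance-based objective, and the stopping rule are hypothetical.

```python
# Hypothetical sketch of a PiF-style editing loop: greedily rewrite
# neutral-intent words so that per-token perceived importance becomes flatter,
# without touching the malicious-intent words themselves. The importance
# function, synonym source, and stopping rule are illustrative assumptions.
from typing import Callable, Dict, List, Set
import statistics


def flatten_perceived_importance(
    prompt: str,
    importance_fn: Callable[[str], Dict[int, float]],  # e.g. the leave-one-out probe above
    synonyms: Dict[str, List[str]],                     # assumed synonym table for neutral words
    malicious_words: Set[str],                          # words whose importance should be obscured
    max_edits: int = 10,
) -> str:
    words = prompt.split()
    for _ in range(max_edits):
        spread = statistics.pvariance(importance_fn(" ".join(words)).values())
        best_words, best_spread = words, spread
        # Try replacing each neutral-intent word with a synonym and keep the
        # single edit that most reduces the spread of perceived importance.
        for i, w in enumerate(words):
            if w in malicious_words:
                continue
            for cand in synonyms.get(w, []):
                trial = words[:i] + [cand] + words[i + 1:]
                trial_spread = statistics.pvariance(importance_fn(" ".join(trial)).values())
                if trial_spread < best_spread:
                    best_words, best_spread = trial, trial_spread
        if best_spread >= spread:  # no edit flattens the distribution further
            break
        words = best_words
    return " ".join(words)
```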
Extensive experiments demonstrate that PiF provides an effective and efficient red-teaming evaluation for proprietary LLMs.
Published on arXiv on: 2025-02-05T10:29:54Z