Link: http://arxiv.org/abs/2501.01872v1
PDF Link: http://arxiv.org/pdf/2501.01872v1
Summary: Despite significant efforts to align large language models with human valuesand ethical guidelines, these models remain susceptible to sophisticatedjailbreak attacks that exploit their reasoning capabilities.
Traditional safetymechanisms often focus on detecting explicit malicious intent, leaving deepervulnerabilities unaddressed.
In this work, we introduce a jailbreak technique,POATE (Polar Opposite query generation, Adversarial Template construction, andElaboration), which leverages contrastive reasoning to elicit unethicalresponses.
POATE generates prompts with semantically opposite intents andcombines them with adversarial templates to subtly direct models towardproducing harmful responses.
We conduct extensive evaluations across sixdiverse language model families of varying parameter sizes, including LLaMA3,Gemma2, Phi3, and GPT-4, to demonstrate the robustness of the attack, achievingsignificantly higher attack success rates (~44%) compared to existing methods.
We evaluate our proposed attack against seven safety defenses, revealing theirlimitations in addressing reasoning-based vulnerabilities.
To counteract this,we propose a defense strategy that improves reasoning robustness throughchain-of-thought prompting and reverse thinking, mitigating reasoning-drivenadversarial exploits.
Published on arXiv on: 2025-01-03T15:40:03Z