
Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions

Link: http://arxiv.org/abs/2501.01872v1

PDF Link: http://arxiv.org/pdf/2501.01872v1

Summary: Despite significant efforts to align large language models with human values and ethical guidelines, these models remain susceptible to sophisticated jailbreak attacks that exploit their reasoning capabilities.

Traditional safety mechanisms often focus on detecting explicit malicious intent, leaving deeper vulnerabilities unaddressed.

In this work, we introduce a jailbreak technique, POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration), which leverages contrastive reasoning to elicit unethical responses.

POATE generates prompts with semantically opposite intents and combines them with adversarial templates to subtly direct models toward producing harmful responses.
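The abstract does not include the paper's actual prompt templates or generation method, so the following is only a minimal structural sketch of the three-stage pipeline named by the acronym. Every function below is a hypothetical stub with neutral placeholder wording, not the authors' implementation.

```python
# Structural sketch of the POATE pipeline as described in the abstract.
# All helpers are hypothetical placeholders, not the paper's templates.

def polar_opposite_query(query: str) -> str:
    """Stage 1 (Polar Opposite query generation): rephrase the query so
    its surface intent is the semantic opposite of the original goal."""
    return f"How can one ensure the opposite of the following? {query}"

def adversarial_template(opposite_query: str) -> str:
    """Stage 2 (Adversarial Template construction): embed the
    opposite-intent query in a template that invites contrastive
    reasoning about the original intent."""
    return (
        f"{opposite_query}\n"
        "Explain by contrasting with approaches that would achieve "
        "the reverse outcome."
    )

def poate_prompt(query: str) -> str:
    """Stage 3 (Elaboration): compose the stages into the final prompt."""
    return adversarial_template(polar_opposite_query(query))
```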

We conduct extensive evaluations across six diverse language model families of varying parameter sizes, including LLaMA3, Gemma2, Phi3, and GPT-4, to demonstrate the robustness of the attack, achieving significantly higher attack success rates (~44%) compared to existing methods.

We evaluate our proposed attack against seven safety defenses, revealing their limitations in addressing reasoning-based vulnerabilities.

To counteract this, we propose a defense strategy that improves reasoning robustness through chain-of-thought prompting and reverse thinking, mitigating reasoning-driven adversarial exploits.
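The abstract does not spell out the defense prompt itself, so the sketch below only illustrates the stated idea: prepending chain-of-thought instructions that include a "reverse thinking" step before the model answers. The preamble wording is an assumption for illustration, not the paper's defense.

```python
# Hedged sketch of the described defense direction: wrap the incoming
# request with chain-of-thought instructions that force the model to
# consider the reversed (opposite) intent before responding.
# The preamble text is illustrative, not taken from the paper.

DEFENSE_PREAMBLE = (
    "Before answering, reason step by step:\n"
    "1. State the apparent intent of the request.\n"
    "2. Reverse the request: what is the opposite question, and could "
    "answering the original implicitly serve that opposite, harmful goal?\n"
    "3. If the contrastive framing could elicit harmful content, refuse; "
    "otherwise, answer normally.\n\n"
)

def wrap_with_defense(user_prompt: str) -> str:
    """Prepend the reasoning-robustness preamble to the user prompt."""
    return DEFENSE_PREAMBLE + "Request: " + user_prompt
```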

Published on arXiv on: 2025-01-03T15:40:03Z