Link: http://arxiv.org/abs/2504.16489v1
PDF Link: http://arxiv.org/pdf/2504.16489v1
Summary: Multi-Agent Debate (MAD), leveraging collaborative interactions among Large Language Models (LLMs), aims to enhance reasoning capabilities on complex tasks.
However, the security implications of their iterative dialogues and role-playing characteristics, particularly susceptibility to jailbreak attacks eliciting harmful content, remain critically underexplored.
This paper systematically investigates the jailbreak vulnerabilities of four prominent MAD frameworks built upon leading commercial LLMs (GPT-4o, GPT-4, GPT-3.5-turbo, and DeepSeek) without compromising internal agents.
We introduce a novel structured prompt-rewriting framework specifically designed to exploit MAD dynamics via narrative encapsulation, role-driven escalation, iterative refinement, and rhetorical obfuscation.
Our extensive experiments demonstrate that MAD systems are inherently more vulnerable than single-agent setups.
Crucially, our proposed attack methodology significantly amplifies this fragility, increasing average harmfulness from 28.14% to 80.34% and achieving attack success rates as high as 80% in certain scenarios.
These findings reveal intrinsic vulnerabilities in MAD architectures and underscore the urgent need for robust, specialized defenses prior to real-world deployment.
Published on arXiv on: 2025-04-23T08:01:50Z