Link: http://arxiv.org/abs/2504.16489v1
PDF Link: http://arxiv.org/pdf/2504.16489v1
Summary: Multi-Agent Debate (MAD), leveraging collaborative interactions among Large Language Models (LLMs), aims to enhance reasoning capabilities on complex tasks.
However, the security implications of their iterative dialogues and role-playing characteristics, particularly susceptibility to jailbreak attacks eliciting harmful content, remain critically underexplored.
This paper systematically investigates the jailbreak vulnerabilities of four prominent MAD frameworks built upon leading commercial LLMs (GPT-4o, GPT-4, GPT-3.5-turbo, and DeepSeek) without compromising internal agents.
We introduce a novel structured prompt-rewriting framework specifically designed to exploit MAD dynamics via narrative encapsulation, role-driven escalation, iterative refinement, and rhetorical obfuscation.
Our extensive experiments demonstrate that MAD systems are inherently more vulnerable than single-agent setups.
Crucially, our proposed attack methodology significantly amplifies this fragility, increasing average harmfulness from 28.14% to 80.34% and achieving attack success rates as high as 80% in certain scenarios.
These findings reveal intrinsic vulnerabilities in MAD architectures and underscore the urgent need for robust, specialized defenses prior to real-world deployment.
Published on arXiv on: 2025-04-23T08:01:50Z