Link: http://arxiv.org/abs/2506.23576v1
PDF Link: http://arxiv.org/pdf/2506.23576v1
Summary: Recent advances in large language models (LLMs) have raised concerns about jailbreaking attacks, i.e., prompts that bypass safety mechanisms. This paper investigates the use of multi-agent LLM systems as a defence against such attacks.
We evaluate three jailbreaking strategies, including the original AutoDefense attack and two from Deepleaps: BetterDan and JB. Reproducing the AutoDefense framework, we compare single-agent setups with two- and three-agent configurations. Our results show that multi-agent systems enhance resistance to jailbreaks, especially by reducing false negatives.
However, their effectiveness varies by attack type, and they introduce trade-offs such as increased false positives and computational overhead. These findings point to the limitations of current automated defences and suggest directions for improving alignment robustness in future LLM systems.
Published on arXiv on: 2025-06-30T07:29:07Z