Link: http://arxiv.org/abs/2505.13862v1
PDF Link: http://arxiv.org/pdf/2505.13862v1
Summary: Large language models (LLMs) have achieved remarkable capabilities but remainvulnerable to adversarial prompts known as jailbreaks, which can bypass safetyalignment and elicit harmful outputs.
Despite growing efforts in LLM safetyresearch, existing evaluations are often fragmented, focused on isolated attackor defense techniques, and lack systematic, reproducible analysis.
In thiswork, we introduce PandaGuard, a unified and modular framework that models LLMjailbreak safety as a multi-agent system comprising attackers, defenders, andjudges.
Our framework implements 19 attack methods and 12 defense mechanisms,along with multiple judgment strategies, all within a flexible pluginarchitecture supporting diverse LLM interfaces, multiple interaction modes, andconfiguration-driven experimentation that enhances reproducibility andpractical deployment.
Built on this framework, we develop PandaBench, acomprehensive benchmark that evaluates the interactions between theseattack/defense methods across 49 LLMs and various judgment approaches,requiring over 3 billion tokens to execute.
Our extensive evaluation revealskey insights into model vulnerabilities, defense cost-performance trade-offs,and judge consistency.
We find that no single defense is optimal across alldimensions and that judge disagreement introduces nontrivial variance in safetyassessments.
We release the code, configurations, and evaluation results tosupport transparent and reproducible research in LLM safety.
Published on arXiv on: 2025-05-20T03:14:57Z