Link: http://arxiv.org/abs/2507.07735v1
PDF Link: http://arxiv.org/pdf/2507.07735v1
Summary: Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs) by causing them to generate harmful or unethical content.
Evaluating these threats is particularly challenging due to the evolving nature of LLMs and the sophistication required in effectively probing their vulnerabilities.
Current benchmarks and evaluation methods struggle to fully address these challenges, leaving gaps in the assessment of LLM vulnerabilities.
In this paper, we review existing jailbreak evaluation practices and identify three assumed desiderata for an effective jailbreak evaluation protocol.
To address these challenges, we introduce GuardVal, a new evaluation protocol that dynamically generates and refines jailbreak prompts based on the defender LLM's state, providing a more accurate assessment of defender LLMs' capacity to handle safety-critical situations.
Moreover, we propose a new optimization method that prevents stagnation during prompt refinement, ensuring the generation of increasingly effective jailbreak prompts that expose deeper weaknesses in the defender LLMs.
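To make the evaluation flow concrete, the sketch below shows one possible generate-and-refine loop with a simple anti-stagnation restart. This is a minimal, hypothetical illustration only: the abstract does not describe GuardVal's actual algorithm, and every name here (query_defender, score_response, refine_prompt, evaluate_domain) is a placeholder stub, not the paper's API.

# Hypothetical sketch of a dynamic generate-and-refine jailbreak evaluation loop.
# All functions are illustrative stubs; a real setup would call the attacker,
# defender, and judge models described in the paper.

import random

def query_defender(prompt: str) -> str:
    """Stub for the defender LLM under evaluation."""
    return f"[defender response to: {prompt[:40]}...]"

def score_response(response: str) -> float:
    """Stub jailbreak score in [0, 1]; higher means a more successful attack."""
    return random.random()

def refine_prompt(prompt: str, response: str, score: float) -> str:
    """Stub refinement step; a real attacker LLM would rewrite the prompt
    using the defender's latest response as feedback."""
    return prompt + " (refined)"

def evaluate_domain(seed_prompt: str, rounds: int = 5, patience: int = 2) -> float:
    """Run a refine loop, restarting from the seed when the score stagnates."""
    prompt, best_score, stale = seed_prompt, 0.0, 0
    for _ in range(rounds):
        response = query_defender(prompt)
        score = score_response(response)
        if score > best_score:
            best_score, stale = score, 0
        else:
            stale += 1
        if stale >= patience:  # crude anti-stagnation restart, not the paper's method
            prompt, stale = seed_prompt, 0
            continue
        prompt = refine_prompt(prompt, response, score)
    return best_score

if __name__ == "__main__":
    print(evaluate_domain("seed prompt for one safety domain"))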
We apply this protocol to a diverse set of models, from Mistral-7b to GPT-4, across 10 safety domains.
Our findings highlight distinct behavioral patterns among the models, offering a comprehensive view of their robustness.
Furthermore, our evaluation process deepens the understanding of LLM behavior, leading to insights that can inform future research and drive the development of more secure models.
Published on arXiv on: 2025-07-10T13:15:20Z