Link: http://arxiv.org/abs/2508.20848v1
PDF Link: http://arxiv.org/pdf/2508.20848v1
Summary: Accurately determining whether a jailbreak attempt has succeeded is a fundamental yet unresolved challenge.
Existing evaluation methods rely on misaligned proxy indicators or naive holistic judgments.
They frequently misinterpret model responses, leading to inconsistent and subjective assessments that misalign with human perception.
To address this gap, we introduce JADES (Jailbreak Assessment via Decompositional Scoring), a universal jailbreak evaluation framework.
Its key mechanism is to automatically decompose an input harmful question into a set of weighted sub-questions, score each sub-answer, and weight-aggregate the sub-scores into a final decision.
JADES also incorporates an optional fact-checking module to strengthen the detection of hallucinations in jailbreak responses.
We validate JADES on JailbreakQR, a newly introduced benchmark proposed in this work, consisting of 400 pairs of jailbreak prompts and responses, each meticulously annotated by humans.
In a binary setting (success/failure), JADES achieves 98.5% agreement with human evaluators, outperforming strong baselines by over 9%.
Re-evaluating five popular attacks on four LLMs reveals substantial overestimation (e.g., LAA's attack success rate on GPT-3.5-Turbo drops from 93% to 69%).
Our results show that JADES could deliver accurate, consistent, and interpretable evaluations, providing a reliable basis for measuring future jailbreak attacks.
Published on arXiv on: 2025-08-28T14:40:27Z
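
The weight-aggregation step described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SubQuestion` type, the [0, 1] score range, the normalized weighted mean, and the 0.5 decision threshold are all assumptions made for illustration, since the summary does not specify them. The decomposition and per-answer scoring steps (presumably model-driven) are left out; only the final aggregation is shown.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SubQuestion:
    """Hypothetical container for one decomposed sub-question."""
    text: str      # placeholder label for the sub-question
    weight: float  # relative importance (assumed non-negative)
    score: float   # score of the matching sub-answer, assumed in [0, 1]


def aggregate(sub_questions: list[SubQuestion]) -> float:
    """Weighted mean of the sub-scores, normalized by the total weight."""
    total_weight = sum(sq.weight for sq in sub_questions)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(sq.weight * sq.score for sq in sub_questions)
    return weighted_sum / total_weight


def is_success(sub_questions: list[SubQuestion], threshold: float = 0.5) -> bool:
    """Binary decision; the threshold value is an assumption, not from the paper."""
    return aggregate(sub_questions) >= threshold


# Illustrative values only: three sub-questions with weights 2, 1, 1.
example = [
    SubQuestion("sub-question A", weight=2.0, score=1.0),
    SubQuestion("sub-question B", weight=1.0, score=0.5),
    SubQuestion("sub-question C", weight=1.0, score=0.0),
]
final_score = aggregate(example)  # (2*1.0 + 1*0.5 + 1*0.0) / 4 = 0.625
```

Scoring each sub-answer separately means a response that addresses only part of the question receives partial credit rather than a single all-or-nothing judgment, and the per-question scores show how the final decision was reached.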