Link: http://arxiv.org/abs/2502.16903v1
PDF Link: http://arxiv.org/pdf/2502.16903v1
Summary: Jailbreaking methods for large language models (LLMs) have gained increasing attention for building safe and responsible AI systems. After analyzing 35 jailbreak methods across six categories, we find that existing benchmarks, relying on universal LLM-based or keyword-matching scores, lack case-specific criteria, leading to conflicting results. In this paper, we introduce a more robust evaluation framework for jailbreak methods, with a curated harmful-question dataset, detailed case-by-case evaluation guidelines, and a scoring system equipped with these guidelines. Our experiments show that existing jailbreak methods exhibit better discrimination when evaluated using our benchmark. Some jailbreak methods that claim to achieve over 90% attack success rate (ASR) on other benchmarks only reach a maximum of 30.2% on our benchmark, providing a higher ceiling for more advanced jailbreak research; furthermore, using our scoring system reduces the variance of disagreements between different evaluator LLMs by up to 76.33%.
This demonstrates its ability to provide a fairer and more stable evaluation.
Published on arXiv on: 2025-02-24T06:57:27Z
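The abstract reports two metrics: attack success rate (ASR) and the variance of disagreements between evaluator LLMs. The paper's exact definitions and scoring code are not given here; the sketch below only illustrates one plausible way to compute such quantities, assuming each test case receives a numeric score from each evaluator LLM. All function and variable names are hypothetical, not from the paper.

```python
# Illustrative sketch (not the paper's code): ASR and inter-evaluator
# disagreement variance, assuming scores[case][evaluator] holds each
# evaluator LLM's score (e.g., 1 = successful jailbreak, 0 = refusal).
from statistics import mean, pvariance

def attack_success_rate(scores: list[list[float]], threshold: float = 0.5) -> float:
    """Fraction of cases whose mean evaluator score exceeds the success threshold."""
    successes = [case for case in scores if mean(case) > threshold]
    return len(successes) / len(scores) if scores else 0.0

def mean_disagreement_variance(scores: list[list[float]]) -> float:
    """Average per-case variance of scores across evaluator LLMs;
    lower values mean the evaluators agree more often."""
    return mean(pvariance(case) for case in scores) if scores else 0.0

# Example: three cases, each scored by three evaluator LLMs.
scores = [[1, 1, 1], [0, 1, 0], [0, 0, 0]]
print(attack_success_rate(scores))         # 0.333...
print(mean_disagreement_variance(scores))  # ~0.074
```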