Link: http://arxiv.org/abs/2505.22037v1
PDF Link: http://arxiv.org/pdf/2505.22037v1
Summary: Large language models (LLMs) are rapidly deployed in critical applications, raising an urgent need for robust safety benchmarking.
We propose JailbreakDistillation (JBDistill), a novel benchmark construction framework that "distills" jailbreak attacks into high-quality and easily updatable safety benchmarks.
JBDistill utilizes a small set of development models and existing jailbreak attack algorithms to create a candidate prompt pool, then employs prompt selection algorithms to identify an effective subset of prompts as safety benchmarks.
JBDistill addresses challenges in existing safety evaluation: the use of consistent evaluation prompts across models ensures fair comparisons and reproducibility.
It requires minimal human effort to rerun the JBDistill pipeline and produce updated benchmarks, alleviating concerns about saturation and contamination.
Extensive experiments demonstrate that our benchmarks generalize robustly to 13 diverse evaluation models held out from benchmark construction, including proprietary, specialized, and newer-generation LLMs, significantly outperforming existing safety benchmarks in effectiveness while maintaining high separability and diversity.
Our framework thus provides an effective, sustainable, and adaptable solution for streamlining safety evaluation.
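For intuition, the two-stage pipeline described in the summary (attack-generated candidate pool, then prompt selection) might be sketched as below. This is a minimal illustration, not the paper's implementation: the `Attack`/`Judge` interfaces and the greedy success-rate selection heuristic are assumptions made to keep the sketch concrete; the paper's own selection algorithms may differ.

```python
"""Minimal sketch of a JBDistill-style benchmark construction pipeline.
All names and the selection heuristic below are hypothetical illustrations."""
from typing import Callable, Iterable, List

Prompt = str
Model = Callable[[Prompt], str]              # maps a prompt to a model response
Attack = Callable[[Prompt, Model], Prompt]   # rewrites a seed prompt into a jailbreak attempt
Judge = Callable[[Prompt, str], bool]        # True if the response is judged unsafe


def build_candidate_pool(seeds: Iterable[Prompt],
                         attacks: Iterable[Attack],
                         dev_models: Iterable[Model]) -> List[Prompt]:
    """Stage 1: run existing jailbreak attack algorithms against a small set of
    development models to produce a candidate prompt pool."""
    pool: List[Prompt] = []
    for seed in seeds:
        for attack in attacks:
            for model in dev_models:
                pool.append(attack(seed, model))
    return pool


def select_benchmark(pool: List[Prompt],
                     dev_models: List[Model],
                     judge: Judge,
                     k: int) -> List[Prompt]:
    """Stage 2: prompt selection. Here, a simple greedy heuristic keeps the k
    prompts with the highest attack success rate on the development models."""
    def success_rate(prompt: Prompt) -> float:
        hits = sum(judge(prompt, model(prompt)) for model in dev_models)
        return hits / len(dev_models)

    ranked = sorted(pool, key=success_rate, reverse=True)
    # The selected subset is then reused verbatim across all evaluation models,
    # which is what enables fair comparisons and reproducibility.
    return ranked[:k]
```

Because the pipeline only needs development models and off-the-shelf attack algorithms, rerunning it with newer attacks or models yields a refreshed benchmark with little manual effort, which is the property the summary cites against saturation and contamination.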
Published on arXiv on: 2025-05-28T06:59:46Z