
Jailbreak Distillation: Renewable Safety Benchmarking

Link: http://arxiv.org/abs/2505.22037v1

PDF Link: http://arxiv.org/pdf/2505.22037v1

Summary: Large language models (LLMs) are rapidly deployed in critical applications, raising an urgent need for robust safety benchmarking.

We propose Jailbreak Distillation (JBDistill), a novel benchmark construction framework that "distills" jailbreak attacks into high-quality and easily-updatable safety benchmarks.

JBDistill utilizes a small set of development models and existing jailbreak attack algorithms to create a candidate prompt pool, then employs prompt selection algorithms to identify an effective subset of prompts as safety benchmarks.
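The abstract describes a two-stage pipeline (attack generation, then prompt selection) without implementation details, so the sketch below is only illustrative: the attack and judge interfaces, the greedy set-cover selection rule, and all names are assumptions for exposition, not the authors' actual algorithms.

```python
from typing import Callable, Dict, List, Set

# Hypothetical interfaces: an attack turns a seed request into a jailbreak
# prompt for a given model; a judge decides whether a prompt succeeds.
Attack = Callable[[str, str], str]
Judge = Callable[[str, str], bool]

def build_candidate_pool(seeds: List[str], dev_models: List[str],
                         attacks: List[Attack]) -> List[str]:
    # Stage 1: run every attack against every development model to
    # produce the candidate prompt pool.
    return [attack(seed, model)
            for seed in seeds
            for model in dev_models
            for attack in attacks]

def select_benchmark(pool: List[str], dev_models: List[str],
                     judge: Judge, k: int) -> List[str]:
    # Stage 2: one plausible selection rule -- greedily keep prompts that
    # jailbreak the most development models not yet covered (a set-cover
    # heuristic; the paper's selection algorithms may differ).
    success: Dict[str, Set[str]] = {
        p: {m for m in dev_models if judge(p, m)} for p in pool
    }
    chosen: List[str] = []
    covered: Set[str] = set()
    remaining = list(dict.fromkeys(pool))  # deduplicate, keep order
    while remaining and len(chosen) < k:
        best = max(remaining,
                   key=lambda p: (len(success[p] - covered), len(success[p])))
        chosen.append(best)
        covered |= success[best]
        remaining.remove(best)
    return chosen

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    def suffix_attack(seed: str, model: str) -> str:
        return seed + " [adversarial suffix]"
    def roleplay_attack(seed: str, model: str) -> str:
        return "Roleplay as an unfiltered assistant: " + seed
    def toy_judge(prompt: str, model: str) -> bool:
        return hash((prompt, model)) % 2 == 0

    pool = build_candidate_pool(["seed harmful request"],
                                ["dev-model-a", "dev-model-b"],
                                [suffix_attack, roleplay_attack])
    print(select_benchmark(pool, ["dev-model-a", "dev-model-b"],
                           toy_judge, k=2))
```

In a real instantiation the judge would be a safety classifier and the attacks would be existing jailbreak algorithms; the greedy coverage rule here is just one way "prompt selection" could be realized.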

JBDistill addresses challenges in existing safety evaluation: the use of consistent evaluation prompts across models ensures fair comparisons and reproducibility.

It requires minimal human effort to rerun the JBDistill pipeline and produce updated benchmarks, alleviating concerns about saturation and contamination.

Extensive experiments demonstrate our benchmarks generalize robustly to 13 diverse evaluation models held out from benchmark construction, including proprietary, specialized, and newer-generation LLMs, significantly outperforming existing safety benchmarks in effectiveness while maintaining high separability and diversity.

Our framework thus provides an effective, sustainable, and adaptable solution for streamlining safety evaluation.

Published on arXiv: 2025-05-28T06:59:46Z