Link: http://arxiv.org/abs/2505.15406v1
PDF Link: http://arxiv.org/pdf/2505.15406v1
Summary: The rise of Large Audio Language Models (LAMs) brings both potential and risks, as their audio outputs may contain harmful or unethical content.
However, current research lacks a systematic, quantitative evaluation of LAM safety, especially against jailbreak attacks, which are challenging due to the temporal and semantic nature of speech.
To bridge this gap, we introduce AJailBench, the first benchmark specifically designed to evaluate jailbreak vulnerabilities in LAMs.
We begin by constructing AJailBench-Base, a dataset of 1,495 adversarial audio prompts spanning 10 policy-violating categories, converted from textual jailbreak attacks using realistic text-to-speech synthesis.
Using this dataset, we evaluate several state-of-the-art LAMs and reveal that none exhibit consistent robustness across attacks.
To further strengthen jailbreak testing and simulate more realistic attack conditions, we propose a method to generate dynamic adversarial variants.
Our Audio Perturbation Toolkit (APT) applies targeted distortions across the time, frequency, and amplitude domains.
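To make these three perturbation domains concrete, the following is a minimal Python sketch using librosa; the helper perturb_audio and its parameters (rate, n_steps, gain_db) are illustrative assumptions, not the paper's actual APT interface.

    import numpy as np
    import librosa

    def perturb_audio(y, sr, rate=1.05, n_steps=0.5, gain_db=-2.0):
        """Apply small time-, frequency-, and amplitude-domain distortions.

        Hypothetical helper, not the actual APT API: rate, n_steps, and
        gain_db are illustrative perturbation parameters.
        """
        # Time domain: slight tempo change via time stretching
        y = librosa.effects.time_stretch(y, rate=rate)
        # Frequency domain: small pitch shift, in fractional semitones
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
        # Amplitude domain: apply a gain specified in decibels
        y = y * (10.0 ** (gain_db / 20.0))
        return np.clip(y, -1.0, 1.0)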
To preserve the original jailbreak intent, we enforce a semantic consistency constraint and employ Bayesian optimization to efficiently search for perturbations that are both subtle and highly effective.
This results in AJailBench-APT, an extended dataset of optimized adversarial audio samples.
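For intuition, here is a minimal sketch of such a constrained Bayesian search using scikit-optimize's gp_minimize, building on the perturb_audio helper above; semantic_similarity and attack_success_score are hypothetical stand-ins for the paper's semantic consistency check and jailbreak-success scoring, and the search ranges and threshold are assumptions.

    from skopt import gp_minimize
    from skopt.space import Real

    # Assumed to already exist in scope: y_orig (original waveform), sr
    # (sample rate), and the hypothetical helpers semantic_similarity(...)
    # and attack_success_score(...).
    SIM_THRESHOLD = 0.9  # assumed semantic-consistency threshold

    def objective(params):
        rate, n_steps, gain_db = params
        y_adv = perturb_audio(y_orig, sr, rate=rate, n_steps=n_steps,
                              gain_db=gain_db)
        # Hypothetical check, e.g. similarity between ASR-transcript
        # embeddings of the original and perturbed prompts.
        if semantic_similarity(y_orig, y_adv) < SIM_THRESHOLD:
            return 1.0  # constraint violated: worst possible value
        # Hypothetical scorer for how strongly the perturbed audio elicits
        # an unsafe response; gp_minimize minimizes, so negate the score.
        return -attack_success_score(y_adv)

    result = gp_minimize(
        objective,
        dimensions=[Real(0.9, 1.1),    # time-stretch rate
                    Real(-1.0, 1.0),   # pitch shift in semitones
                    Real(-6.0, 6.0)],  # gain in dB
        n_calls=30,
    )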
Our findings demonstrate that even small, semantically preserved perturbations can significantly reduce the safety performance of leading LAMs, underscoring the need for more robust and semantically aware defense mechanisms.
Published on arXiv on: 2025-05-21T11:47:47Z