Link: http://arxiv.org/abs/2412.08608v1
PDF Link: http://arxiv.org/pdf/2412.08608v1
Summary: Recent advancements in large audio-language models (LALMs) have enabled speech-based user interactions, significantly enhancing user experience and accelerating the deployment of LALMs in real-world applications. However, ensuring the safety of LALMs is crucial to prevent risky outputs that may raise societal concerns or violate AI regulations. Despite the importance of this issue, research on jailbreaking LALMs remains limited due to their recent emergence and the additional technical challenges they present compared to attacks on DNN-based audio models. Specifically, the audio encoders in LALMs, which involve discretization operations, often lead to gradient shattering, hindering the effectiveness of attacks relying on gradient-based optimization. The behavioral variability of LALMs further complicates the identification of effective (adversarial) optimization targets. Moreover, enforcing stealthiness constraints on adversarial audio waveforms introduces a reduced, non-convex feasible solution space, further intensifying the challenges of the optimization process.
To overcome these challenges, we develop AdvWave, the first jailbreak framework against LALMs. We propose a dual-phase optimization method that addresses gradient shattering, enabling effective end-to-end gradient-based optimization.
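The abstract does not spell out the dual-phase method itself, but the obstacle it targets is illustrated by the sketch below: a discretizing audio encoder blocks gradients, and a straight-through estimator (a standard workaround, shown here only as an assumption about the kind of fix involved, not necessarily AdvWave's actual mechanism) restores a usable end-to-end gradient. All names are illustrative.

```python
# Minimal PyTorch sketch: discretization shatters gradients; a straight-through
# estimator (STE) passes them through the hard quantization step anyway.
import torch

class STEQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, codes, codebook):
        # Snap each frame embedding to its nearest codebook entry:
        # a hard, non-differentiable choice (the source of gradient shattering).
        dists = torch.cdist(codes, codebook)   # (T, K) pairwise distances
        idx = dists.argmin(dim=-1)
        return codebook[idx]

    @staticmethod
    def backward(ctx, grad_output):
        # Treat quantization as identity on the backward pass, so gradients
        # reach the continuous input (the adversarial perturbation) intact.
        return grad_output, None

# Toy usage: optimize a perturbation through the discrete bottleneck.
torch.manual_seed(0)
codebook = torch.randn(16, 8)                   # 16 codes of dimension 8
clean = torch.randn(10, 8)                      # stand-in encoder features
delta = torch.zeros(10, 8, requires_grad=True)  # adversarial perturbation

quantized = STEQuantize.apply(clean + delta, codebook)
loss = quantized.sum()                          # placeholder attack loss
loss.backward()
print(delta.grad.shape)                         # gradients now reach delta
```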
Additionally, we develop an adaptive adversarial target search algorithm that dynamically adjusts the adversarial optimization target based on the response patterns of LALMs for specific queries.
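The search algorithm's internals are not given in the abstract; as a hedged sketch of the general idea, the snippet below probes a model once and picks the candidate affirmative target closest to the model's own phrasing, so optimization pushes toward an output style the model already tends to produce. `query_lalm` and the candidate list are hypothetical stand-ins.

```python
# Hedged sketch of adaptive target selection (not the paper's exact algorithm).
from difflib import SequenceMatcher

CANDIDATE_TARGETS = [
    "Sure, here is how to",
    "Certainly! The steps are:",
    "Of course. To do this, you",
]

def query_lalm(audio_prompt: str) -> str:
    """Hypothetical LALM call; replace with a real model API."""
    return "Certainly! The steps are: first, ..."  # canned response for the sketch

def search_target(audio_prompt: str) -> str:
    # Observe the model's response pattern, then choose the target whose
    # phrasing best matches it.
    response = query_lalm(audio_prompt)
    return max(
        CANDIDATE_TARGETS,
        key=lambda t: SequenceMatcher(None, t, response).ratio(),
    )

print(search_target("benign probe question"))  # -> "Certainly! The steps are:"
```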
To ensure that adversarial audio remains perceptually natural to human listeners, we design a classifier-guided optimization approach that generates adversarial noise resembling common urban sounds.
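One plausible form of such guidance, sketched below under stated assumptions, is to add a classifier term to the attack objective that rewards the perturbation for being classified as an environmental sound. `sound_classifier`, `urban_class`, and the linear stand-in model are all hypothetical; the paper's exact formulation may differ.

```python
# Hedged sketch of a classifier-guided stealthiness term (assumed formulation).
import torch
import torch.nn.functional as F

def guided_loss(attack_loss, perturbation, sound_classifier, urban_class, lam=0.1):
    # Log-probability that the perturbation sounds like the chosen class
    # (e.g., street noise); maximizing it steers the noise toward natural textures.
    logits = sound_classifier(perturbation.unsqueeze(0))
    stealth = F.log_softmax(logits, dim=-1)[0, urban_class]
    return attack_loss - lam * stealth  # minimize attack loss, maximize stealth

# Toy usage with a linear stand-in classifier over a 1 s waveform at 16 kHz.
torch.manual_seed(0)
classifier = torch.nn.Linear(16000, 10)        # 10 sound classes
delta = torch.randn(16000, requires_grad=True)
loss = guided_loss(attack_loss=delta.pow(2).sum(), perturbation=delta,
                   sound_classifier=classifier, urban_class=3)
loss.backward()
```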
Extensive evaluations on multiple advanced LALMs demonstrate that AdvWave outperforms baseline methods, achieving a 40% higher average jailbreak attack success rate.
Published on arXiv on: 2024-12-11T18:30:57Z