Link: http://arxiv.org/abs/2507.01513v1
PDF Link: http://arxiv.org/pdf/2507.01513v1
Summary: By incorporating visual inputs, Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning.
However, this integration also introduces new vulnerabilities, making MLLMs susceptible to multimodal jailbreak attacks and hindering their safe deployment.
Existing defense methods, including Image-to-Text Translation, Safe Prompting, and Multimodal Safety Tuning, attempt to address this by aligning multimodal inputs with LLMs' built-in safeguards.
Yet, they fall short of uncovering the root causes of multimodal vulnerabilities, particularly how harmful multimodal tokens trigger jailbreaks in MLLMs. Consequently, they remain vulnerable to text-driven multimodal jailbreaks, often exhibiting overdefensive behaviors and imposing heavy training overhead.
To bridge this gap, we present a comprehensive analysis of where, how, and which harmful multimodal tokens bypass safeguards in MLLMs.
Surprisingly, we find that fewer than 1% of tokens in the early-middle layers are responsible for inducing unsafe behaviors, suggesting that precisely removing a small subset of harmful tokens, without any safety tuning, can still effectively improve safety against jailbreaks.
Motivated by this, we propose Safe Prune-then-Restore (SafePTR), a training-free defense framework that selectively prunes harmful tokens at vulnerable layers while restoring benign features at subsequent layers.
Without incurring additional computational overhead, SafePTR significantly enhances the safety of MLLMs while preserving efficiency.
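The abstract does not spell out the mechanics, but the prune-then-restore idea can be sketched roughly as follows in PyTorch. The per-token harm scores, the 1% prune ratio, the choice of vulnerable and restore layers, and restoring benign tokens by copying cached features from an unperturbed pass are all illustrative assumptions, not the paper's actual implementation.

```python
import torch


def prune_at_vulnerable_layer(hidden, harm_scores, prune_ratio=0.01):
    """Toy sketch: prune suspected harmful tokens at an early-middle layer.

    hidden:      Tensor[seq_len, dim]  -- hidden states at the vulnerable layer
    harm_scores: Tensor[seq_len]       -- hypothetical per-token harmfulness scores
                                          (the paper's actual scoring is not shown here)
    Returns the pruned states and the indices of the pruned tokens.
    """
    seq_len = harm_scores.numel()
    k = max(1, int(prune_ratio * seq_len))            # e.g. under 1% of tokens
    harmful_idx = torch.topk(harm_scores, k).indices  # tokens flagged as harmful

    pruned = hidden.clone()
    pruned[harmful_idx] = 0.0                          # suppress harmful token features
    return pruned, harmful_idx


def restore_at_later_layer(later_hidden, cached_benign, harmful_idx):
    """Toy sketch: at a subsequent layer, copy back benign-token features cached
    from an unperturbed forward pass, so pruning does not degrade utility.
    Harmful positions stay suppressed."""
    restored = later_hidden.clone()
    benign_mask = torch.ones(later_hidden.shape[0], dtype=torch.bool)
    benign_mask[harmful_idx] = False
    restored[benign_mask] = cached_benign[benign_mask]
    return restored


# Toy usage on random features: 200 tokens with 64-dim hidden states.
hidden = torch.randn(200, 64)          # states at a hypothetical vulnerable layer
scores = torch.rand(200)               # hypothetical harmfulness scores
pruned, idx = prune_at_vulnerable_layer(hidden, scores)

later = torch.randn(200, 64)           # states at a later layer (pruned run)
cached = torch.randn(200, 64)          # same layer, cached from an unperturbed run
out = restore_at_later_layer(later, cached, idx)
```

Because both steps are simple masked tensor edits applied during a single forward pass, this kind of intervention adds essentially no computational overhead, consistent with the training-free claim above.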
Extensive evaluations across three MLLMs and five benchmarks demonstrate SafePTR's state-of-the-art performance in mitigating jailbreak risks without compromising utility.
Published on arXiv on: 2025-07-02T09:22:03Z