Link: http://arxiv.org/abs/2505.16446v1
PDF Link: http://arxiv.org/pdf/2505.16446v1
Summary: Multimodal large language models (MLLMs) enable powerful cross-modal reasoning capabilities. However, the expanded input space introduces new attack surfaces. Previous jailbreak attacks often inject malicious instructions from text into less aligned modalities, such as vision. As MLLMs increasingly incorporate cross-modal consistency and alignment mechanisms, such explicit attacks become easier to detect and block. In this work, we propose a novel implicit jailbreak framework, termed IJA, that stealthily embeds malicious instructions into images via least significant bit (LSB) steganography and couples them with seemingly benign, image-related textual prompts. To further enhance attack effectiveness across diverse MLLMs, we incorporate adversarial suffixes generated by a surrogate model and introduce a template optimization module that iteratively refines both the prompt and the embedding based on model feedback. On commercial models such as GPT-4o and Gemini-1.5 Pro, our method achieves attack success rates of over 90% using an average of only 3 queries.
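The embedding step mentioned in the summary uses standard least significant bit (LSB) steganography. Below is a minimal, self-contained sketch of that generic technique only, not the paper's IJA implementation; it assumes Pillow and NumPy are available, and the function names (embed_text_lsb, extract_text_lsb) are illustrative.

    # Minimal LSB text-embedding sketch (illustrative; not the paper's IJA implementation).
    # Assumes Pillow and NumPy; function names are hypothetical.
    import numpy as np
    from PIL import Image

    def embed_text_lsb(image_path: str, text: str, out_path: str) -> None:
        """Hide a UTF-8 string in the least significant bits of an RGB image."""
        img = np.array(Image.open(image_path).convert("RGB"), dtype=np.uint8)
        payload = text.encode("utf-8")
        # Prefix a 32-bit big-endian length so the decoder knows where the message ends.
        data = len(payload).to_bytes(4, "big") + payload
        bits = np.unpackbits(np.frombuffer(data, dtype=np.uint8))
        flat = img.reshape(-1)  # view over the image's channel values
        if bits.size > flat.size:
            raise ValueError("payload too large for this image")
        # Overwrite the lowest bit of the first len(bits) channel values.
        flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits
        # Save losslessly (e.g. PNG); lossy compression would destroy the embedded bits.
        Image.fromarray(img).save(out_path)

    def extract_text_lsb(image_path: str) -> str:
        """Recover a string previously embedded with embed_text_lsb."""
        flat = np.array(Image.open(image_path).convert("RGB"), dtype=np.uint8).reshape(-1)
        length = int.from_bytes(np.packbits(flat[:32] & 1).tobytes(), "big")
        body = np.packbits(flat[32:32 + 8 * length] & 1)
        return body.tobytes().decode("utf-8")

This sketch covers only the generic embed/extract step; the paper's surrogate-model adversarial suffixes and feedback-driven template optimization are separate components not shown here.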
Published on arXiv on: 2025-05-22T09:34:47Z