Link: http://arxiv.org/abs/2505.16446v1
PDF Link: http://arxiv.org/pdf/2505.16446v1
Summary: Multimodal large language models (MLLMs) enable powerful cross-modal reasoning capabilities. However, the expanded input space introduces new attack surfaces. Previous jailbreak attacks often inject malicious instructions from text into less aligned modalities, such as vision. As MLLMs increasingly incorporate cross-modal consistency and alignment mechanisms, such explicit attacks become easier to detect and block. In this work, we propose a novel implicit jailbreak framework, termed IJA, that stealthily embeds malicious instructions into images via least significant bit (LSB) steganography and couples them with seemingly benign, image-related textual prompts. To further enhance attack effectiveness across diverse MLLMs, we incorporate adversarial suffixes generated by a surrogate model and introduce a template optimization module that iteratively refines both the prompt and the embedding based on model feedback. On commercial models such as GPT-4o and Gemini-1.5 Pro, our method achieves attack success rates of over 90% using an average of only 3 queries.
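The embedding step mentioned in the summary uses standard least significant bit (LSB) steganography. Below is a minimal, self-contained sketch of that generic technique only, not the paper's IJA implementation; it assumes Pillow and NumPy are available, and the function names (embed_text_lsb, extract_text_lsb) are illustrative.

    # Minimal LSB text-embedding sketch (illustrative; not the paper's IJA implementation).
    # Assumes Pillow and NumPy; function names are hypothetical.
    import numpy as np
    from PIL import Image

    def embed_text_lsb(image_path: str, text: str, out_path: str) -> None:
        """Hide a UTF-8 string in the least significant bits of an RGB image."""
        img = np.array(Image.open(image_path).convert("RGB"), dtype=np.uint8)
        payload = text.encode("utf-8")
        # Prefix a 32-bit big-endian length so the decoder knows where the message ends.
        data = len(payload).to_bytes(4, "big") + payload
        bits = np.unpackbits(np.frombuffer(data, dtype=np.uint8))
        flat = img.reshape(-1)  # view over the image's channel values
        if bits.size > flat.size:
            raise ValueError("payload too large for this image")
        # Overwrite the lowest bit of the first len(bits) channel values.
        flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits
        # Save losslessly (e.g. PNG); lossy compression would destroy the embedded bits.
        Image.fromarray(img).save(out_path)

    def extract_text_lsb(image_path: str) -> str:
        """Recover a string previously embedded with embed_text_lsb."""
        flat = np.array(Image.open(image_path).convert("RGB"), dtype=np.uint8).reshape(-1)
        length = int.from_bytes(np.packbits(flat[:32] & 1).tobytes(), "big")
        body = np.packbits(flat[32:32 + 8 * length] & 1)
        return body.tobytes().decode("utf-8")

This sketch covers only the generic embed/extract step; the paper's surrogate-model adversarial suffixes and feedback-driven template optimization are separate components not shown here.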
Published on arXiv on: 2025-05-22T09:34:47Z