
Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection

Link: http://arxiv.org/abs/2507.02844v1

PDF Link: http://arxiv.org/pdf/2507.02844v1

Summary: With the emergence of strong visual-language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications.

However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments.

Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs.

However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios.

In this work, we define a novel setting: visual-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context.

Building on this setting, we propose the VisCo (Visual Contextual) Attack.

VisCo fabricates contextual dialogue using four distinct visual-focused strategies, dynamically generating auxiliary images when necessary to construct a visual-centric jailbreak scenario.

To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs.

Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which achieves a toxicity score of 2.48 and an ASR of 22.2%.

The code is available at https://github.com/Dtc7w3PQ/Visco-Attack.

Published on arXiv on: 2025-07-03T17:53:12Z