Link: http://arxiv.org/abs/2507.02844v1
PDF Link: http://arxiv.org/pdf/2507.02844v1
Summary: With the emergence of strong visual-language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications.
However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments.
Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs.
However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios.
In this work, we define a novel setting: visual-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context.
Building on this setting, we propose the VisCo (Visual Contextual) Attack.
VisCo fabricates contextual dialogue using four distinct visual-focused strategies, dynamically generating auxiliary images when necessary to construct a visual-centric jailbreak scenario.
To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs.
Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which yields a toxicity score of 2.48 and an ASR of 22.2%.
The code is available at https://github.com/Dtc7w3PQ/Visco-Attack.
Published on arXiv on: 2025-07-03T17:53:12Z