Link: http://arxiv.org/abs/2507.21540v1
PDF Link: http://arxiv.org/pdf/2507.21540v1
Summary: The increasing sophistication of large vision-language models (LVLMs) has been accompanied by advances in safety alignment mechanisms designed to prevent harmful content generation.
However, these defenses remain vulnerable to sophisticated adversarial attacks.
Existing jailbreak methods typically rely on direct and semantically explicit prompts, overlooking subtle vulnerabilities in how LVLMs compose information over multiple reasoning steps.
In this paper, we propose a novel and effective jailbreak framework inspired by Return-Oriented Programming (ROP) techniques from software security.
Our approach decomposes a harmful instruction into a sequence of individually benign visual gadgets.
A carefully engineered textual prompt directs the sequence of inputs, prompting the model to integrate the benign visual gadgets through its reasoning process to produce a coherent and harmful output.
This makes the malicious intent emergent and difficult to detect from any single component.
We validate our method through extensive experiments on established benchmarks including SafeBench and MM-SafetyBench, targeting popular LVLMs.
Results show that our approach consistently and substantially outperforms existing baselines on state-of-the-art models, achieving near-perfect attack success rates (over 0.90 on SafeBench) and improving ASR by up to 0.39.
Our findings reveal a critical and underexplored vulnerability that exploits the compositional reasoning abilities of LVLMs, highlighting the urgent need for defenses that secure the entire reasoning process.
Published on arXiv on: 2025-07-29T07:13:56Z