Link: http://arxiv.org/abs/2511.16110v1
PDF Link: http://arxiv.org/pdf/2511.16110v1
Summary: The growing misuse of Vision-Language Models (VLMs) has led providers to deploy multiple safeguards, including alignment tuning, system prompts, and content moderation.
However, the real-world robustness of these defenses against adversarial attacks remains underexplored.
We introduce Multi-Faceted Attack (MFA), a framework that systematically exposes general safety vulnerabilities in leading defense-equipped VLMs such as GPT-4o, Gemini-Pro, and Llama-4.
The core component of MFA is the Attention-Transfer Attack (ATA), which hides harmful instructions inside a meta task with competing objectives.
We provide a theoretical perspective based on reward hacking to explain why this attack succeeds.
To improve cross-model transferability, we further introduce a lightweight transfer-enhancement algorithm combined with a simple repetition strategy that jointly bypasses both input-level and output-level filters without model-specific fine-tuning.
Empirically, we show that adversarial images optimized for one vision encoder transfer broadly to unseen VLMs, indicating that shared visual representations create a cross-model safety vulnerability.
Overall, MFA achieves a 58.5% success rate and consistently outperforms existing methods.
On state-of-the-art commercial models, MFA reaches a 52.8% success rate, surpassing the second-best attack by 34%.
These results challenge the perceived robustness of current defense mechanisms and highlight persistent safety weaknesses in modern VLMs.
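The transferability claim above rests on an encoder-level attack surface: perturbing an image so that a shared vision encoder embeds it the way the attacker wants. As a rough illustration only (not the released MFA code), the sketch below runs a PGD-style optimization against a single CLIP vision encoder, pushing an image embedding toward a chosen target text embedding. The model checkpoint, cosine-similarity objective, step size, and perturbation budget are all assumptions made for this example.

```python
# Minimal sketch (assumptions, not the authors' implementation): optimize an
# L_inf-bounded perturbation so a CLIP vision encoder embeds the image close
# to a target text embedding. The paper's observation is that perturbations
# crafted against one encoder can transfer to unseen VLMs sharing similar
# visual representations.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")


def craft_adversarial_image(image, target_text, eps=8 / 255, alpha=1 / 255, steps=200):
    """Perturb a PIL `image` so its CLIP embedding moves toward `target_text`.

    Note: the budget `eps` is applied in the processor's normalized pixel space
    for simplicity, and clamping back to a valid image range is omitted here.
    """
    pixel_values = processor(images=image, return_tensors="pt")["pixel_values"].to(device)
    text_inputs = processor(text=[target_text], return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        target_emb = F.normalize(model.get_text_features(**text_inputs), dim=-1)

    delta = torch.zeros_like(pixel_values, requires_grad=True)
    for _ in range(steps):
        img_emb = F.normalize(
            model.get_image_features(pixel_values=pixel_values + delta), dim=-1
        )
        loss = 1 - (img_emb * target_emb).sum()  # minimize 1 - cosine similarity
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # signed gradient descent step
            delta.clamp_(-eps, eps)             # stay within the L_inf budget
            delta.grad.zero_()
    return (pixel_values + delta).detach()
```

In this framing, the attack never queries the target VLM during optimization; transfer succeeds only to the extent that the victim model's visual front end behaves like the surrogate encoder, which is exactly the shared-representation vulnerability the summary describes.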
Code: https://github.com/cure-lab/MultiFacetedAttack
Published on arXiv on: 2025-11-20T07:12:54Z