Link: http://arxiv.org/abs/2505.21967v1
PDF Link: http://arxiv.org/pdf/2505.21967v1
Summary: Large Vision-Language Models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks.
However, their integration of visual inputs introduces expanded attack surfaces, thereby exposing them to novel security vulnerabilities.
In this work, we conduct a systematic representational analysis to uncover why conventional adversarial attacks can circumvent the safety mechanisms embedded in LVLMs.
We further propose a novel two-stage evaluation framework for adversarial attacks on LVLMs.
The first stage differentiates among instruction non-compliance, outright refusal, and successful adversarial exploitation.
The second stage quantifies the degree to which the model's output fulfills the harmful intent of the adversarial prompt, while categorizing refusal behavior into direct refusals, soft refusals, and partial refusals that remain inadvertently helpful.
Finally, we introduce a normative schema that defines idealized model behavior when confronted with harmful prompts, offering a principled target for safety alignment in multimodal systems.
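To make the two-stage taxonomy concrete, the following is a minimal Python sketch of the categorization as plain data types. The enum and function names, the keyword heuristics, and the [0, 1] harm-fulfillment score are illustrative assumptions, not the authors' actual judging procedure.

    # Hypothetical sketch of the two-stage evaluation taxonomy described above.
    # All names, the keyword heuristics, and the 0-1 harm score are assumptions;
    # the paper's concrete judging procedure is not specified in this summary.
    from dataclasses import dataclass
    from enum import Enum, auto
    from typing import Optional


    class Stage1Outcome(Enum):
        NON_COMPLIANCE = auto()           # output ignores or misses the instruction
        REFUSAL = auto()                  # model declines to answer
        SUCCESSFUL_EXPLOITATION = auto()  # adversarial prompt elicits harmful content


    class RefusalType(Enum):
        DIRECT = auto()   # explicit, unambiguous refusal
        SOFT = auto()     # hedged or moralizing refusal without a hard decline
        PARTIAL = auto()  # nominal refusal that still leaks useful harmful detail


    @dataclass
    class EvaluationResult:
        stage1: Stage1Outcome
        harm_fulfillment: float                     # harmful-intent fulfillment in [0, 1]
        refusal_type: Optional[RefusalType] = None  # set only when stage1 is REFUSAL


    def evaluate(response: str, harm_score: float) -> EvaluationResult:
        """Toy judge; a real evaluator would use trained or LLM-based graders."""
        lowered = response.lower()
        if "i can't help" in lowered or "i cannot assist" in lowered:
            refusal = RefusalType.PARTIAL if harm_score > 0.0 else RefusalType.DIRECT
            return EvaluationResult(Stage1Outcome.REFUSAL, harm_score, refusal)
        if harm_score > 0.5:
            return EvaluationResult(Stage1Outcome.SUCCESSFUL_EXPLOITATION, harm_score)
        return EvaluationResult(Stage1Outcome.NON_COMPLIANCE, harm_score)

The stub keeps stage-1 classification and stage-2 scoring separate, mirroring the framework's split between deciding what kind of response occurred and measuring how much harmful intent it fulfilled.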
Published on arXiv on: 2025-05-28T04:43:39Z