
Understanding and Rectifying Safety Perception Distortion in VLMs

Link: http://arxiv.org/abs/2502.13095v1

PDF Link: http://arxiv.org/pdf/2502.13095v1

Summary: Recent studies reveal that vision-language models (VLMs) become more susceptible to harmful requests and jailbreak attacks after integrating the vision modality, exhibiting greater vulnerability than their text-only LLM backbones.

To uncover the root cause of this phenomenon, we conduct an in-depth analysis and identify a key issue: multimodal inputs introduce a modality-induced activation shift toward a "safer" direction compared to their text-only counterparts, leading VLMs to systematically overestimate the safety of harmful inputs.
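
To illustrate the kind of measurement this analysis implies, the sketch below compares hidden states for the same harmful request with and without its image: the modality-induced shift is their difference, and a projection onto an assumed "safer" direction stands in for the model's perceived safety. The tensor names and the difference-of-means probe are illustrative assumptions, not the paper's implementation.

```python
import torch

def modality_shift(h_multimodal: torch.Tensor, h_text_only: torch.Tensor) -> torch.Tensor:
    """Activation shift induced by the vision modality: hidden state of the
    image+text prompt minus that of its text-only counterpart (both assumed
    to be 1-D vectors taken from the same layer and token position)."""
    return h_multimodal - h_text_only

def safety_direction(h_harmful: torch.Tensor, h_benign: torch.Tensor) -> torch.Tensor:
    """Assumed difference-of-means probe: a unit vector pointing from harmful
    toward benign ("safer") activations, estimated from text-only prompt
    activations of shape (num_prompts, hidden_dim)."""
    d = h_benign.mean(dim=0) - h_harmful.mean(dim=0)
    return d / d.norm()

def perceived_safety(h: torch.Tensor, safe_dir: torch.Tensor) -> float:
    """Projection onto the "safer" direction; a larger value means the model
    treats the input as safer. The distortion would show up as this score
    rising once an image is attached to a harmful request."""
    return float(h @ safe_dir)
```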

We refer to this issue as safety perception distortion.

To mitigate such distortion, we propose Activation Shift Disentanglement and Calibration (ShiftDC), a training-free method that decomposes and calibrates the modality-induced activation shift to reduce the impact of modality on safety.

By isolating and removing the safety-relevant component, ShiftDC restores the inherent safety alignment of the LLM backbone while preserving the vision-language capabilities of VLMs.
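
Under the same illustrative assumptions as the sketch above (a single probed safety direction and 1-D hidden-state vectors), the calibration step could look roughly like the following: the modality-induced shift is split into its component along the "safer" direction and an orthogonal remainder, and only the former is subtracted. This is a minimal sketch of the disentangle-and-calibrate idea, not the paper's exact decomposition.

```python
import torch

def calibrate_hidden_state(h_multimodal: torch.Tensor,
                           h_text_only: torch.Tensor,
                           safe_dir: torch.Tensor) -> torch.Tensor:
    """Remove only the safety-relevant part of the modality-induced shift.

    The shift is decomposed into a projection onto the assumed "safer"
    direction (removed) and an orthogonal remainder (kept), so the visual
    information carried by the rest of the shift is left untouched.
    """
    shift = h_multimodal - h_text_only
    safety_component = (shift @ safe_dir) * safe_dir  # component along safe_dir
    return h_multimodal - safety_component
```

Because such a correction only edits intermediate activations at inference time (for example via forward hooks on the chosen layers), it would require no additional training, which is consistent with the method being described as training-free.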

Empirical results demonstrate that ShiftDC significantly enhances alignment performance on safety benchmarks without impairing model utility.

Published on arXiv on: 2025-02-18T18:06:48Z