
Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models

Link: http://arxiv.org/abs/2501.02029v1

PDF Link: http://arxiv.org/pdf/2501.02029v1

Summary: With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term "safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at https://github.com/Ziwei-Zheng/SAHs.
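
The sketch below is a rough illustration of the detector idea described in the abstract, not the authors' released code (see the repository above for the actual implementation). It assumes the safety heads have already been located and their first-token activations extracted; random arrays, a head count of 4, and a per-head dimension of 128 are placeholders for those LVLM-specific steps.

```python
# Hypothetical sketch: concatenate first-token activations from a few selected
# "safety heads" and fit a logistic regression probe to flag malicious prompts.
# Synthetic features stand in for activations hooked out of an actual LVLM.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_prompts = 1000       # labelled prompts (benign vs. malicious)
n_safety_heads = 4     # assumed number of located safety heads
head_dim = 128         # assumed per-head activation dimension

# Stand-in for each safety head's activation at the first generated token.
head_acts = rng.normal(size=(n_prompts, n_safety_heads, head_dim))
labels = rng.integers(0, 2, size=n_prompts)  # 1 = malicious, 0 = benign

# Concatenate the selected heads' activations into one feature vector per prompt.
features = head_acts.reshape(n_prompts, n_safety_heads * head_dim)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# A simple linear probe, matching the "logistic regression" structure in the abstract.
detector = LogisticRegression(max_iter=1000)
detector.fit(X_train, y_train)
print(f"held-out accuracy: {detector.score(X_test, y_test):.3f}")
```

Because the probe only reads activations that the model already computes for the first token, it adds essentially no extra inference cost beyond a single small matrix-vector product per prompt.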

Published on arXiv on: 2025-01-03T07:01:15Z