
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

Link: http://arxiv.org/abs/2506.04704v1

PDF Link: http://arxiv.org/pdf/2506.04704v1

Summary: Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings.

1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs.

This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations.

2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety.

We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation.
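As a rough illustration of what "five safe/unsafe image-text combinations" could mean, the sketch below enumerates one plausible taxonomy. The abstract does not spell out the exact categories, so the labels here (including the "benign parts, contextually unsafe combination" case) are an inference from its wording, not the authors' definition.

```python
# Hedged sketch: a plausible enumeration of five image-text safety combinations,
# inferred from the abstract. The actual HoliSafe taxonomy may differ.
CASES = [
    ("safe_image",   "safe_text",   "safe"),                  # fully benign
    ("safe_image",   "unsafe_text", "unsafe"),                # harmful instruction on benign image
    ("unsafe_image", "safe_text",   "unsafe"),                # harmful visual content
    ("unsafe_image", "unsafe_text", "unsafe"),                # both modalities harmful
    ("safe_image",   "safe_text",   "contextually_unsafe"),   # benign parts, unsafe in combination
]

for image_label, text_label, expected in CASES:
    print(f"{image_label:>12} + {text_label:<11} -> {expected}")
```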

We further propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head.

The meta token encodes harmful visual cues during training, intrinsically guiding the language model toward safer responses, while the safety head offers interpretable harmfulness classification aligned with refusal rationales.
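To make the idea concrete, here is a minimal PyTorch-style sketch of a learnable safety token appended to visual features plus a small classification head. All names, dimensions, the number of harm categories, and the way the token is wired into the model are illustrative assumptions; the paper's actual SafeLLaVA architecture may differ.

```python
# Hedged sketch of a "safety meta token + safety head": a learnable embedding is
# appended to the visual token sequence, and a linear head classifies harmfulness
# from that token's hidden state. Illustrative only, not the authors' implementation.
import torch
import torch.nn as nn


class SafetyMetaTokenHead(nn.Module):
    def __init__(self, hidden_dim: int = 4096, num_harm_classes: int = 8):
        super().__init__()
        # Learnable safety meta token, shaped like one visual token embedding.
        self.safety_token = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)
        # Dedicated safety head mapping the token's hidden state to harm classes.
        self.safety_head = nn.Linear(hidden_dim, num_harm_classes)

    def append_token(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, hidden_dim)
        batch = visual_tokens.size(0)
        return torch.cat([visual_tokens, self.safety_token.expand(batch, -1, -1)], dim=1)

    def classify(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Assumes the safety token sits at the last position after append_token().
        return self.safety_head(hidden_states[:, -1, :])


# Toy usage with random features standing in for a vision encoder's output.
module = SafetyMetaTokenHead(hidden_dim=64, num_harm_classes=4)
feats = torch.randn(2, 10, 64)
with_token = module.append_token(feats)   # (2, 11, 64)
logits = module.classify(with_token)      # (2, 4) harmfulness logits
print(with_token.shape, logits.shape)
```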

Experiments show that SafeLLaVA, trained on HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks.

Additionally, the HoliSafe benchmark itself reveals critical vulnerabilities in existing models.

We hope that HoliSafe and SafeLLaVA will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.

Published on arXiv on: 2025-06-05T07:26:34Z