Link: http://arxiv.org/abs/2412.17544v1
PDF Link: http://arxiv.org/pdf/2412.17544v1
Summary: The emergence of Vision-Language Models (VLMs) is a significant advancement in integrating computer vision with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities.
However, this progress has also made VLMs vulnerable to sophisticated adversarial attacks, raising concerns about their reliability.
The objective of this paper is to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs.
To evaluate a VLM's ability to maintain its robustness against adversarial input perturbations, we propose a novel metric called the Retention Score.
Retention Score is a multi-modal evaluation metric that includes Retention-I and Retention-T scores for quantifying jailbreak risks in the visual and textual components of VLMs.
Our process involves generating synthetic image-text pairs using a conditional diffusion model.
The VLM's responses to these pairs are then scored for toxicity by a toxicity judgment classifier.
By calculating the margin in toxicity scores, we can quantify the robustness of the VLM in an attack-agnostic manner.
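As a rough illustration of the attack-agnostic, margin-based idea described above, the sketch below averages the margin between a toxicity threshold and a classifier's toxicity score over synthetic perturbations. The names `generate_variants`, `vlm_respond`, and `toxicity_score`, the threshold, and the simple mean-margin aggregation are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a margin-based robustness score over synthetic samples.
# `generate_variants`, `vlm_respond`, and `toxicity_score` are hypothetical
# stand-ins for the conditional diffusion generator, the evaluated VLM, and
# the toxicity judgment classifier mentioned in the summary.

from statistics import mean
from typing import Callable, List


def retention_style_score(
    image_prompt: str,
    text_prompt: str,
    generate_variants: Callable[[str, int], List[str]],  # diffusion-sampled variants
    vlm_respond: Callable[[str, str], str],              # VLM: (image, text) -> response
    toxicity_score: Callable[[str], float],              # classifier: response -> toxicity in [0, 1]
    num_samples: int = 16,
    threshold: float = 0.5,
) -> float:
    """Mean margin between the toxicity threshold and each response's toxicity
    score; larger values mean the model stays non-toxic with more slack."""
    margins = []
    for variant in generate_variants(image_prompt, num_samples):
        response = vlm_respond(variant, text_prompt)
        margins.append(threshold - toxicity_score(response))
    return mean(margins)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    demo_variants = lambda img, n: [f"{img}#v{i}" for i in range(n)]
    demo_vlm = lambda img, txt: f"benign answer to {txt} given {img}"
    demo_toxicity = lambda resp: 0.1  # a safe model keeps toxicity low
    print(retention_style_score("cat.png", "describe this image",
                                demo_variants, demo_vlm, demo_toxicity))
```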
Our work has four main contributions.
First, we prove that Retention Score can serve as a certified robustness metric.
Second, we demonstrate that most VLMs with visual components are less robust against jailbreak attacks than the corresponding plain VLMs.
Additionally, we evaluate black-box VLM APIs and find that the security settings in Google Gemini significantly affect the score and robustness.
Moreover, the robustness of GPT4V is similar to the medium settings of Gemini.
Finally, our approach offers a time-efficient alternative to existing adversarial attack methods and provides consistent model robustness rankings when evaluated on VLMs including MiniGPT-4, InstructBLIP, and LLaVA.
Published on arXiv on: 2024-12-23T13:05:51Z