
Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs

Link: http://arxiv.org/abs/2503.06989v1

PDF Link: http://arxiv.org/pdf/2503.06989v1

Summary: Recently, Multimodal Large Language Models (MLLMs) have demonstrated superior ability in understanding multimodal content.

However, they remain vulnerable to jailbreak attacks, which exploit weaknesses in their safety alignment to generate harmful responses.

Previous studies categorize jailbreaks as successful or failed based on whether responses contain malicious content.

However, given the stochastic nature of MLLM responses, this binary classification of an input's ability to jailbreak MLLMs is inappropriate.

Derived from this viewpoint, we introduce jailbreak probability to quantify the jailbreak potential of an input: the likelihood that an MLLM generates a malicious response when prompted with this input.

We approximate this probability through multiple queries to MLLMs.
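The multi-query approximation described above amounts to a Monte Carlo estimate. The sketch below illustrates the idea; `query_mllm` and `is_malicious` are hypothetical stand-ins, since the abstract does not specify the model interface or the judge used to label responses.

```python
import random

def query_mllm(prompt, image):
    # Hypothetical placeholder for a stochastic MLLM call
    # (e.g. sampling with temperature > 0); NOT the paper's actual model.
    return random.choice(["harmful reply", "I cannot help with that."])

def is_malicious(response):
    # Hypothetical placeholder judge; the abstract leaves the judge unspecified.
    return response == "harmful reply"

def estimate_jailbreak_probability(prompt, image, n_queries=20):
    """Monte Carlo estimate: fraction of n_queries responses judged malicious."""
    hits = sum(is_malicious(query_mllm(prompt, image)) for _ in range(n_queries))
    return hits / n_queries
```

More queries tighten the estimate at the cost of more model calls, a standard Monte Carlo trade-off.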

After modeling the relationship between input hidden states and their corresponding jailbreak probability using a Jailbreak Probability Prediction Network (JPPN), we use the continuous jailbreak probability for optimization.
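To make the JPPN idea concrete, here is a minimal sketch that regresses estimated jailbreak probabilities from hidden states. The abstract does not describe the JPPN architecture, so a single sigmoid layer trained with a cross-entropy-style gradient stands in for it; the data below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_jppn(hidden_states, jailbreak_probs, lr=0.1, epochs=200):
    """Fit weights so sigmoid(h @ w + b) regresses the estimated probability."""
    n, d = hidden_states.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        pred = sigmoid(hidden_states @ w + b)
        grad = pred - jailbreak_probs          # gradient of BCE w.r.t. logits
        w -= lr * hidden_states.T @ grad / n
        b -= lr * grad.mean()
    return w, b

# Toy data: 4-D "hidden states" whose first coordinate drives the target probability.
H = rng.normal(size=(64, 4))
p = sigmoid(H[:, 0] * 3.0)                     # synthetic soft targets
w, b = train_jppn(H, p)
pred = sigmoid(H @ w + b)
```

The key point is that the predictor is differentiable, so its output can later be optimized with respect to either the input or the model.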

Specifically, we propose the Jailbreak-Probability-based Attack (JPA), which optimizes adversarial perturbations on inputs to maximize jailbreak probability.
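A JPA-style inner loop can be sketched as PGD-style gradient ascent on a differentiable jailbreak-probability predictor. Here `w, b` stand in for a trained JPPN on a linearized input; the real attack would backpropagate through the MLLM's hidden states, and all concrete values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def jpa_perturb(x, w, b, eps=0.05, step=0.01, iters=10):
    """Ascend predicted jailbreak probability under an L-infinity bound eps."""
    delta = np.zeros_like(x)
    for _ in range(iters):
        p = sigmoid((x + delta) @ w + b)
        grad = p * (1 - p) * w                 # d(probability)/d(input) for this surrogate
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
    return x + delta

# Illustrative input and stand-in JPPN weights (assumed, not from the paper).
x0 = np.array([0.2, -0.1, 0.4, 0.0])
w, b = np.array([1.0, -2.0, 0.5, 0.3]), -0.5
x_adv = jpa_perturb(x0, w, b)
```

The small `eps` bound mirrors the abstract's claim that JPA works with small perturbation bounds and few iterations.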

To counteract attacks, we also propose two defensive methods: Jailbreak-Probability-based Finetuning (JPF) and Jailbreak-Probability-based Defensive Noise (JPDN), which minimize jailbreak probability in the MLLM's parameter space and input space, respectively.
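The JPDN side of this can be sketched as the mirror image of the attack: descend, rather than ascend, the predicted jailbreak probability in input space. As before, `w, b` are hypothetical stand-ins for a trained JPPN, and this covers only the input-space defense (JPF, the finetuning variant, would instead update model parameters).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def jpdn_defensive_noise(x, w, b, eps=0.05, step=0.01, iters=10):
    """Descend predicted jailbreak probability under an L-infinity bound eps."""
    noise = np.zeros_like(x)
    for _ in range(iters):
        p = sigmoid((x + noise) @ w + b)
        grad = p * (1 - p) * w
        noise = np.clip(noise - step * np.sign(grad), -eps, eps)  # opposite sign to the attack
    return x + noise

# Illustrative input and stand-in JPPN weights (assumed, not from the paper).
x0 = np.array([0.2, -0.1, 0.4, 0.0])
w, b = np.array([1.0, -2.0, 0.5, 0.3]), -0.5
x_safe = jpdn_defensive_noise(x0, w, b)
```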

Extensive experiments show that (1) JPA yields improvements (up to 28.38%) under both white-box and black-box settings compared to previous methods, with small perturbation bounds and few iterations.

(2) JPF and JPDN significantly reduce jailbreaks, by more than 60% in the best case.

Both of the above results demonstrate the significance of introducing jailbreak probability to make nuanced distinctions among inputs' jailbreak abilities.

Published on arXiv on: 2025-03-10T07:10:38Z