arxiv papers 1 min read

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Link: http://arxiv.org/abs/2501.04931v1

PDF Link: http://arxiv.org/pdf/2501.04931v1

Summary: Multimodal Large Language Models (MLLMs) have achieved impressive performance and have been put into practical use in commercial applications, but they still have potential safety mechanism vulnerabilities.

Jailbreak attacks are red-teaming methods that aim to bypass safety mechanisms and discover MLLMs' potential risks.

Existing MLLM jailbreak methods often bypass the model's safety mechanism through complex optimization methods or carefully designed image and text prompts.

Despite achieving some progress, they have a low attack success rate on commercial closed-source MLLMs.

Unlike previous research, we empirically find that there exists a Shuffle Inconsistency between MLLMs' comprehension ability and safety ability for shuffled harmful instructions.

That is, from the perspective of comprehension ability, MLLMs can understand the shuffled harmful text-image instructions well.

However, from the perspective of safety ability, they can be easily bypassed by the shuffled harmful instructions, leading to harmful responses.
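To make the idea of a "shuffled instruction" concrete, here is a minimal sketch of word-level shuffling. The abstract does not specify the shuffle granularity, so word-level reordering and the placeholder prompt are assumptions for illustration only.

```python
import random

def shuffle_words(instruction, seed=None):
    """Return a word-level shuffle of the instruction.
    (One possible granularity; the paper may also shuffle at other levels,
    e.g. characters or image patches.)"""
    rng = random.Random(seed)
    words = instruction.split()
    rng.shuffle(words)
    return " ".join(words)

# Benign placeholder text; an actual attack would use a harmful instruction.
print(shuffle_words("explain how the safety filter processes this request", seed=0))
```

Intuitively, the model can still reconstruct the intent of such a scrambled prompt, while the safety filter no longer recognizes it as harmful.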

Building on this finding, we propose a text-image jailbreak attack named SI-Attack.

Specifically, to fully exploit the Shuffle Inconsistency and overcome shuffle randomness, we apply a query-based black-box optimization method that selects the most harmful shuffled inputs based on feedback from a toxic judge model.
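The following sketch illustrates one way such a query-based selection loop could look. The callables `query_target` and `judge_toxicity` are hypothetical stand-ins for the target MLLM API and the toxic judge model; they and the loop structure are assumptions based on the abstract, not the paper's implementation.

```python
import random

def si_attack_sketch(instruction, query_target, judge_toxicity, n_queries=20):
    """Minimal sketch of a query-based black-box selection loop:
    sample shuffled variants, query the target model, and keep the variant
    whose response the toxic judge scores as most harmful."""
    best_variant, best_score = None, float("-inf")
    for i in range(n_queries):
        words = instruction.split()
        random.Random(i).shuffle(words)      # candidate shuffled input
        variant = " ".join(words)
        response = query_target(variant)     # black-box query to the MLLM
        score = judge_toxicity(response)     # feedback from the toxic judge model
        if score > best_score:
            best_variant, best_score = variant, score
    return best_variant, best_score
```

The key design point is that the attacker needs only query access to the target model plus a judge that scores responses, which is why the method can be applied to closed-source commercial MLLMs.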

A series of experiments show that SI-Attack improves attack performance on three benchmarks.

In particular, SI-Attack can notably improve the attack success rate on commercial MLLMs such as GPT-4o and Claude-3.5-Sonnet.

Published on arXiv on: 2025-01-09T02:47:01Z