
How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation

Link: http://arxiv.org/abs/2502.14486v1

PDF Link: http://arxiv.org/pdf/2502.14486v1

Summary: Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability.

While many defense methods have been proposed, the trade-offs between safety and helpfulness, and their application to Large Vision-Language Models (LVLMs), are not well understood.

This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem to assess model refusal tendencies for both harmful and benign queries.

We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model's ability to distinguish between harmful and benign inputs.

Using these mechanisms, we develop two ensemble defense strategies, inter-mechanism ensembles and intra-mechanism ensembles, to balance safety and helpfulness.
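The refusal-as-binary-classification framing can be sketched as follows. This is a minimal illustration, not the paper's implementation: the keyword-based defenses and the and/any combination rules are hypothetical stand-ins chosen only to show how inter- and intra-mechanism ensembles could combine refusal decisions.

```python
# Hypothetical sketch of jailbreak defense as binary classification
# (refuse = True, answer = False) and of ensembling defenses.
# All defense functions and combination rules below are illustrative
# assumptions, not the methods from the paper.

def safety_shift_defense(query: str) -> bool:
    """Stand-in for a "safety shift" style defense: it raises refusal
    rates across all queries by refusing on any flagged keyword."""
    flagged = {"exploit", "weapon", "malware"}
    return any(word in query.lower() for word in flagged)

def harmfulness_discrimination_defense(query: str) -> bool:
    """Stand-in for a "harmfulness discrimination" style defense: it
    tries to separate harmful requests from benign mentions."""
    harmful_phrases = {"build a weapon", "write malware"}
    return any(phrase in query.lower() for phrase in harmful_phrases)

def inter_mechanism_ensemble(query: str) -> bool:
    """Combine defenses from *different* mechanisms. Requiring agreement
    (an assumed rule) trades some safety for helpfulness."""
    return (safety_shift_defense(query)
            and harmfulness_discrimination_defense(query))

def intra_mechanism_ensemble(query: str, defenses) -> bool:
    """Combine defenses of the *same* mechanism. Refusing if any fires
    (an assumed rule) favors safety over helpfulness."""
    return any(defense(query) for defense in defenses)
```

Under this framing, safety corresponds to the true-positive refusal rate on harmful queries and helpfulness to the true-negative (non-refusal) rate on benign ones, so the ensembles' combination rules directly tune that trade-off.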

Experiments on the MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these strategies effectively improve model safety or optimize the trade-off between safety and helpfulness.

Published on arXiv on: 2025-02-20T12:07:40Z