Link: http://arxiv.org/abs/2502.14486v1
PDF Link: http://arxiv.org/pdf/2502.14486v1
Summary: Jailbreak attacks, where harmful prompts bypass generative models' built-insafety, raise serious concerns about model vulnerability.
While many defensemethods have been proposed, the trade-offs between safety and helpfulness, andtheir application to Large Vision-Language Models (LVLMs), are not wellunderstood.
This paper systematically examines jailbreak defenses by reframingthe standard generation task as a binary classification problem to assess modelrefusal tendencies for both harmful and benign queries.
We identify two keydefense mechanisms: safety shift, which increases refusal rates across allqueries, and harmfulness discrimination, which improves the model's ability todistinguish between harmful and benign inputs.
Using these mechanisms, wedevelop two ensemble defense strategies-inter-mechanism ensembles andintra-mechanism ensembles-to balance safety and helpfulness.
Experiments on theMM-SafetyBench and MOSSBench datasets with LLaVA-1.
5 models show that thesestrategies effectively improve model safety or optimize the trade-off betweensafety and helpfulness.
Published on arXiv on: 2025-02-20T12:07:40Z