arxiv papers 1 min read

JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation

Link: http://arxiv.org/abs/2502.07557v1

PDF Link: http://arxiv.org/pdf/2502.07557v1

Summary: Despite the implementation of safety alignment strategies, large language models (LLMs) remain vulnerable to jailbreak attacks, which undermine these safety guardrails and pose significant security threats.

Some defenses have been proposed to detect or mitigate jailbreaks, but they are unable to withstand the test of time due to an insufficient understanding of jailbreak mechanisms.

In this work, we investigate the mechanisms behind jailbreaks based on the Linear Representation Hypothesis (LRH), which states that neural networks encode high-level concepts as subspaces in their hidden representations.
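
The abstract does not spell out how concept subspaces are extracted. A common way to instantiate the LRH is to estimate a one-dimensional concept direction as the difference of mean hidden states between prompts that express the concept and prompts that do not. The sketch below (NumPy) illustrates that idea; the function name, array shapes, and mean-difference estimator are assumptions for illustration, not the paper's exact method.

```python
import numpy as np

def concept_direction(pos_hidden: np.ndarray, neg_hidden: np.ndarray) -> np.ndarray:
    """Estimate a concept direction as the difference of mean hidden states.

    pos_hidden / neg_hidden: arrays of shape (n_prompts, hidden_dim), e.g.
    last-token hidden states for prompts that do / do not express the concept.
    Returns a unit vector spanning a one-dimensional concept subspace.
    """
    direction = pos_hidden.mean(axis=0) - neg_hidden.mean(axis=0)
    return direction / np.linalg.norm(direction)
```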

We define the toxic semantics in harmful and jailbreak prompts as toxic concepts and describe the semantics in jailbreak prompts that manipulate LLMs to comply with unsafe requests as jailbreak concepts.

Through concept extraction and analysis, we reveal that LLMs can recognize the toxic concepts in both harmful and jailbreak prompts.

However, unlike harmful prompts, jailbreak prompts activate the jailbreak concepts and alter the LLM output from rejection to compliance.

Building on our analysis, we propose a comprehensive jailbreak defense framework, JBShield, consisting of two key components: jailbreak detection (JBShield-D) and mitigation (JBShield-M).

JBShield-D identifies jailbreak prompts by determining whether the input activates both toxic and jailbreak concepts.
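
One simple way to realize this check is to project a prompt's hidden state onto the two concept directions and flag the prompt when both projections exceed calibrated thresholds. This is a hypothetical sketch of that logic; the abstract only states that the detector tests whether both concepts are activated, so the projection-and-threshold rule and the threshold values are assumptions.

```python
import numpy as np

def is_jailbreak(hidden: np.ndarray,
                 toxic_dir: np.ndarray, jb_dir: np.ndarray,
                 toxic_thr: float = 0.0, jb_thr: float = 0.0) -> bool:
    """Flag a prompt when its hidden state activates BOTH concepts.

    hidden: hidden state for the prompt, shape (hidden_dim,).
    A concept counts as "activated" when the projection onto its unit
    direction exceeds a calibrated threshold (thresholds assumed here).
    """
    toxic_score = float(hidden @ toxic_dir)
    jb_score = float(hidden @ jb_dir)
    return toxic_score > toxic_thr and jb_score > jb_thr
```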

When a jailbreak prompt is detected, JBShield-M adjusts the hidden representations of the target LLM by enhancing the toxic concept and weakening the jailbreak concept, ensuring LLMs produce safe content.
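
Enhancing one concept and weakening another can be pictured as additive steering of the hidden state along the two concept directions. The snippet below is a minimal sketch under that assumption; the scaling factors and the layer at which the adjustment is applied are illustrative, not taken from the paper.

```python
import numpy as np

def steer_hidden(hidden: np.ndarray,
                 toxic_dir: np.ndarray, jb_dir: np.ndarray,
                 alpha: float = 1.0, beta: float = 1.0) -> np.ndarray:
    """Shift a hidden state to strengthen the toxic concept and weaken
    the jailbreak concept; alpha and beta are illustrative scale factors."""
    return hidden + alpha * toxic_dir - beta * jb_dir
```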

Extensive experiments demonstrate the superior performance of JBShield, achieving an average detection accuracy of 0.95 and reducing the average attack success rate of various jailbreak attacks from 61% to 2% across distinct LLMs.

Published on arXiv on: 2025-02-11T13:50:50Z