arxiv papers 1 min read

Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models

Link: http://arxiv.org/abs/2412.17034v1

PDF Link: http://arxiv.org/pdf/2412.17034v1

Summary: Jailbreaking in Large Language Models (LLMs) is a major security concern, as it can deceive LLMs into generating harmful text.

Yet, there is still insufficient understanding of how jailbreaking works, which makes it hard to develop effective defense strategies.

We aim to shed more light on this issue: we conduct a detailed large-scale analysis of seven different jailbreak methods and find that these disagreements stem from insufficient observation samples.

In particular, we introduce the "safety boundary", and we find that jailbreaks shift harmful activations outside that safety boundary, where LLMs are less sensitive to harmful information.
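The summary does not spell out how the safety boundary is constructed. A minimal sketch of one plausible formalization, per-dimension bounds estimated from activations on harmful prompts the model refuses, is shown below; the interval-based construction, function names, and toy data are illustrative assumptions, not the authors' exact definition.

```python
import numpy as np

def estimate_safety_boundary(refused_activations: np.ndarray, margin: float = 0.05):
    """Estimate per-dimension activation bounds from prompts the model refuses.

    refused_activations: shape (num_prompts, hidden_dim), taken at one layer.
    Returns (low, high) bounds, widened by a small relative margin.
    This interval construction is an assumption for illustration only.
    """
    low = refused_activations.min(axis=0)
    high = refused_activations.max(axis=0)
    span = high - low
    return low - margin * span, high + margin * span

def fraction_outside_boundary(activation: np.ndarray, low: np.ndarray, high: np.ndarray) -> float:
    """Fraction of hidden dimensions whose value falls outside the estimated boundary."""
    outside = (activation < low) | (activation > high)
    return float(outside.mean())

# Toy usage: a shifted activation (as jailbreaks tend to produce) drifts outside
# the boundary estimated from refused harmful prompts.
rng = np.random.default_rng(0)
refused = rng.normal(0.0, 1.0, size=(200, 64))
jailbroken = rng.normal(1.5, 1.0, size=(64,))
low, high = estimate_safety_boundary(refused)
print(f"outside-boundary fraction: {fraction_outside_boundary(jailbroken, low, high):.2f}")
```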

We also find that the low and middle layers are critical in such shifts, while deeper layers have less impact.

Leveraging these insights, we propose a novel defense called Activation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
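One way to picture "constraining activations within the boundary" is clipping a layer's hidden states to precomputed per-dimension bounds at inference time. The PyTorch sketch below does this with a forward hook on a toy layer; the hook mechanism, bound shapes, and layer choice are assumptions for illustration, not the authors' implementation of ABD.

```python
import torch
import torch.nn as nn

def make_boundary_hook(low: torch.Tensor, high: torch.Tensor):
    """Return a forward hook that clips a layer's output into [low, high] per dimension."""
    def hook(_module, _inputs, output):
        # Real transformer blocks often return tuples; here we assume a plain tensor.
        return torch.clamp(output, min=low, max=high)
    return hook

# Toy stand-in for one transformer layer's hidden states (hidden_dim = 16).
hidden_dim = 16
layer = nn.Linear(hidden_dim, hidden_dim)

# Hypothetical per-dimension safety bounds (in practice, estimated from refused harmful prompts).
low = torch.full((hidden_dim,), -0.5)
high = torch.full((hidden_dim,), 0.5)

handle = layer.register_forward_hook(make_boundary_hook(low, high))

x = torch.randn(4, hidden_dim)
constrained = layer(x)  # activations are clipped into the boundary
print(constrained.min().item(), constrained.max().item())

handle.remove()  # detach the defense when no longer needed
```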

We further use Bayesian optimization to selectively apply the defense method to the low and middle layers.
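A sketch of how such a layer-selection search could be set up with scikit-optimize's gp_minimize is below. The search space (a hypothetical layer range plus a clipping margin), the eval_defense stub, and the trade-off score are assumptions, not the paper's actual setup.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

def eval_defense(first_layer: int, last_layer: int, margin: float) -> float:
    """Hypothetical stand-in: apply the boundary defense to layers
    [first_layer, last_layer], run jailbreak and utility benchmarks, and
    return a score to minimize, e.g. (1 - defense success rate) + utility drop."""
    # Placeholder objective so the sketch runs; replace with real evaluation.
    return abs(first_layer - 4) * 0.01 + abs(last_layer - 16) * 0.01 + abs(margin - 0.05)

space = [
    Integer(0, 15, name="first_layer"),  # restrict the search to low/middle layers
    Integer(8, 20, name="last_layer"),
    Real(0.0, 0.2, name="margin"),       # how far the per-dimension bounds are widened
]

result = gp_minimize(lambda p: eval_defense(p[0], p[1], p[2]), space,
                     n_calls=25, random_state=0)
print("best layers/margin:", result.x, "score:", result.fun)
```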

Our experiments on several benchmarks show that ABD achieves an average DSR (defense success rate) of over 98% against various forms of jailbreak attacks, with less than 2% impact on the model's general capabilities.

Published on arXiv on: 2024-12-22T14:18:39Z