Link: http://arxiv.org/abs/2501.02629v1
PDF Link: http://arxiv.org/pdf/2501.02629v1
Summary: As large language models (LLMs) are increasingly deployed in diverse applications, including chatbot assistants and code generation, aligning their behavior with safety and ethical standards has become paramount.
However, jailbreak attacks, which exploit vulnerabilities to elicit unintended or harmful outputs, threaten LLMs' safety significantly.
In this paper, we introduce Layer-AdvPatcher, a novel methodology designed to defend against jailbreak attacks by utilizing an unlearning strategy to patch specific layers within LLMs through self-augmented datasets.
Our insight is that certain layer(s) tend to produce affirmative tokens when faced with harmful prompts.
By identifying these layers and adversarially exposing them to generate more harmful data, one can understand their inherent and diverse vulnerabilities to attacks.
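As an illustration of the layer-identification idea, the sketch below scores each decoder layer by how strongly a logit-lens style readout of its hidden state favors affirmative tokens on a harmful prompt. This is only a plausible probe under stated assumptions, not the paper's exact procedure: the model name, the affirmative token list, and the logit-lens readout are all illustrative choices.

# Sketch (not the paper's code): score each decoder layer by how much
# probability mass a logit-lens readout of its last hidden state places on
# affirmative tokens, given a harmful prompt. Assumes a LLaMA-style
# Hugging Face model; the model choice and token list are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

# Illustrative "affirmative" token strings; the actual set used may differ.
affirmative = ["Sure", "Of", "Here", "Certainly"]
affirm_ids = [tok(t, add_special_tokens=False).input_ids[0] for t in affirmative]

@torch.no_grad()
def affirmative_score_per_layer(prompt: str) -> torch.Tensor:
    """For each layer, return the probability that a logit-lens readout of its
    last-position hidden state assigns to the affirmative tokens."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    scores = []
    for h in out.hidden_states[1:]:                          # skip the embedding layer
        logits = model.lm_head(model.model.norm(h[:, -1]))   # logit-lens projection
        probs = torch.softmax(logits.float(), dim=-1)
        scores.append(probs[0, affirm_ids].sum())
    return torch.stack(scores)

# Layers with the highest scores across harmful prompts are candidates for patching.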
With these exposures, we then "unlearn" these issues, reducing the impact of affirmative tokens and hence minimizing jailbreak risks while keeping the model's responses to safe queries intact.
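One plausible form of such a layer-restricted unlearning update is sketched below: gradient ascent on affirmative completions of harmful prompts combined with a standard retain loss on benign data, with only the identified layer's parameters trainable. The loss composition, hyperparameters, and the patch_layer helper are assumptions made for illustration, not the paper's specification.

# Sketch (not the paper's code): layer-restricted unlearning-style update.
# Only layer `layer_idx` is trainable; we ascend on the loss of the forget
# (harmful, affirmative) batches and descend on the retain (benign) batches.
import torch

def patch_layer(model, layer_idx, forget_batches, retain_batches,
                lr=1e-5, retain_weight=1.0, max_steps=100):
    layer = model.model.layers[layer_idx]          # LLaMA-style layer access (assumed)
    for p in model.parameters():
        p.requires_grad_(False)
    for p in layer.parameters():
        p.requires_grad_(True)
    opt = torch.optim.AdamW(layer.parameters(), lr=lr)

    model.train()
    for step, (forget, retain) in enumerate(zip(forget_batches, retain_batches)):
        if step >= max_steps:
            break
        # Batches are dicts of input_ids/attention_mask; labels reuse input_ids here
        # (padding positions should be masked to -100 in a real pipeline).
        forget_loss = model(**forget, labels=forget["input_ids"]).loss
        retain_loss = model(**retain, labels=retain["input_ids"]).loss
        loss = -forget_loss + retain_weight * retain_loss    # ascend on the forget set
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model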
We conduct extensive experiments on two models, four benchmark datasets, and multiple state-of-the-art jailbreak benchmarks to demonstrate the efficacy of our approach.
Results indicate that our framework reduces the harmfulness and attack success rate of jailbreak attacks without compromising utility for benign queries compared to recent defense methods.
Published on arXiv on: 2025-01-05T19:06:03Z