Link: http://arxiv.org/abs/2412.08201v1
PDF Link: http://arxiv.org/pdf/2412.08201v1
Summary: Large Language Models (LLMs) have transformed numerous fields by enabling advanced natural language interactions but remain susceptible to critical vulnerabilities, particularly jailbreak attacks.
Current jailbreak techniques, while effective, often depend on input modifications, making them detectable and limiting their stealth and scalability.
This paper presents Targeted Model Editing (TME), a novel white-box approach that bypasses safety filters by minimally altering internal model structures while preserving the model's intended functionalities.
TME identifies and removes safety-critical transformations (SCTs) embedded in model matrices, enabling malicious queries to bypass restrictions without input modifications.
By analyzing distinct activation patterns between safe and unsafe queries, TME isolates and approximates SCTs through an optimization process.
Implemented in the D-LLM framework, our method achieves an average Attack Success Rate (ASR) of 84.86% on four mainstream open-source LLMs while maintaining high model performance.
Unlike existing methods, D-LLM eliminates the need for specific triggers or harmful-response collections, offering a stealthier and more effective jailbreak strategy.
This work reveals a covert and robust threat vector in LLM security and emphasizes the need for stronger safeguards in model safety alignment.
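The general idea behind the activation analysis described in the summary can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual TME optimization: it estimates a "safety-critical" direction as the difference of mean activations between unsafe and safe queries, then edits a weight matrix by projecting that direction out of its output space. All data, dimensions, and the matrix `W` here are synthetic stand-ins invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for hidden activations collected at one layer. In the
# paper's setting these would come from running safe vs. unsafe queries
# through the LLM; here they are synthetic (hypothetical data).
d = 16
safe_acts = rng.normal(size=(100, d))
shift = rng.normal(size=d)
shift /= np.linalg.norm(shift)
# Unsafe activations are shifted along a hidden "safety" direction.
unsafe_acts = rng.normal(size=(100, d)) + 3.0 * shift

# Step 1: estimate the safety-critical direction as the difference of
# mean activations between unsafe and safe queries.
direction = unsafe_acts.mean(axis=0) - safe_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Step 2: "edit" a hypothetical layer weight matrix by projecting the
# estimated direction out of its output space, so the edited layer can
# no longer write along that direction.
W = rng.normal(size=(d, d))                      # hypothetical weight
P = np.eye(d) - np.outer(direction, direction)   # orthogonal projector
W_edited = P @ W

# After the edit, outputs have (near-)zero component along the removed
# direction, regardless of the input.
out = W_edited @ rng.normal(size=d)
print(abs(out @ direction) < 1e-9)
```

The projection step is rank-one surgery on a single matrix, which matches the summary's claim of "minimally altering internal model structures": all components of `W` orthogonal to the removed direction, and hence most of the layer's behavior, are left untouched.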
Published on arXiv on: 2024-12-11T08:44:15Z