DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing

Link: http://arxiv.org/abs/2502.11647v1

PDF Link: http://arxiv.org/pdf/2502.11647v1

Summary: Large Language Models (LLMs) are widely applied in decision making, but their deployment is threatened by jailbreak attacks, where adversarial users manipulate model behavior to bypass safety measures.

Existing defense mechanisms, such as safety fine-tuning and model editing, either require extensive parameter modifications or lack precision, leading to performance degradation on general tasks, which makes them unsuitable for post-deployment safety alignment.

To address these challenges, we propose DELMAN (Dynamic Editing for LLMs JAilbreak DefeNse), a novel approach leveraging direct model editing for precise, dynamic protection against jailbreak attacks.

DELMAN directly updates a minimal set of relevant parameters to neutralize harmful behaviors while preserving the model's utility.

To avoid triggering a safe response in benign contexts, we incorporate KL-divergence regularization to ensure the updated model remains consistent with the original model when processing benign queries.
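
As an illustration of the KL-divergence regularization idea, the following minimal sketch adds a KL penalty between the original and edited models' next-token distributions on a benign query to an editing loss. The abstract does not give the exact objective, so the function names (`softmax`, `kl_divergence`, `regularized_loss`), the direction of the KL term, and the `kl_weight` parameter are all assumptions made for illustration, not the authors' formulation.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def regularized_loss(edit_loss, original_logits, edited_logits, kl_weight=1.0):
    """Hypothetical objective: editing loss plus a weighted KL penalty.

    original_logits / edited_logits are next-token logits produced by the
    original and edited models for the same benign query (assumed inputs).
    """
    p = softmax(original_logits)  # original model's distribution
    q = softmax(edited_logits)    # edited model's distribution
    return edit_loss + kl_weight * kl_divergence(p, q)

# Identical logits on a benign query incur no penalty;
# diverging logits increase the total loss.
unchanged = regularized_loss(0.5, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
drifted = regularized_loss(0.5, [1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

In this sketch the penalty is zero when the edited model reproduces the original model's distribution on the benign query, and grows as the two distributions diverge, which is the behavior the regularizer is described as encouraging.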

Experimental results demonstrate that DELMAN outperforms baseline methods in mitigating jailbreak attacks while preserving the model's utility, and adapts seamlessly to new attack instances, providing a practical and efficient solution for post-deployment model protection.

Published on arXiv on: 2025-02-17T10:39:21Z