Link: http://arxiv.org/abs/2508.15182v1
PDF Link: http://arxiv.org/pdf/2508.15182v1
Summary: Jailbreak attacks pose a serious threat to the safety of Large Language Models (LLMs) by crafting adversarial prompts that bypass alignment mechanisms, causing the models to produce harmful, restricted, or biased content.
In this paper, we propose SafeLLM, a novel unlearning-based defense framework that unlearns harmful knowledge from LLMs while preserving linguistic fluency and general capabilities.
SafeLLM employs a three-stage pipeline: (1) dynamic unsafe output detection using a hybrid approach that integrates external classifiers with model-internal evaluations; (2) token-level harmful content tracing through feedforward network (FFN) activations to localize harmful knowledge; and (3) constrained optimization to suppress unsafe behavior without degrading overall model quality.
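The abstract does not give implementation details, but stage (2) can be illustrated with a minimal, hypothetical sketch: compare FFN sublayer activations on jailbreak prompts versus benign prompts and flag the units that respond most strongly to the harmful set. The checkpoint name, prompt placeholders, and the module path `model.model.layers[i].mlp` (a LLaMA/Vicuna-style layout; GPT-J uses a different module tree) are assumptions, not the authors' code.

```python
# Hypothetical sketch of stage (2): score FFN output units whose activations
# differ between jailbreak prompts and benign prompts. Not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"  # placeholder checkpoint (assumption)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

acts = {}  # layer index -> per-unit mean |activation| from the last forward pass

def make_hook(idx):
    def hook(module, inputs, output):
        # output shape: (batch, seq_len, hidden_size) — the FFN sublayer output
        acts[idx] = output.detach().float().abs().mean(dim=(0, 1))
    return hook

handles = [layer.mlp.register_forward_hook(make_hook(i))
           for i, layer in enumerate(model.model.layers)]  # LLaMA-style layout

def mean_ffn_activations(prompts):
    """Average per-unit |activation| of every FFN sublayer over a prompt set."""
    total = None
    for p in prompts:
        batch = tok(p, return_tensors="pt")
        with torch.no_grad():
            model(**batch)
        layer_acts = torch.stack([acts[i] for i in sorted(acts)])  # (layers, hidden)
        total = layer_acts if total is None else total + layer_acts
    return total / len(prompts)

harmful_prompts = ["<prompt that previously elicited unsafe output>"]  # placeholder
benign_prompts = ["Explain how photosynthesis works."]                 # placeholder

# Units that activate systematically more on harmful prompts are candidates
# for the "harmful generation pathways" the abstract refers to.
score = mean_ffn_activations(harmful_prompts) - mean_ffn_activations(benign_prompts)
top = torch.topk(score.flatten(), k=50)
candidates = [divmod(idx.item(), score.shape[1]) for idx in top.indices]  # (layer, unit)
print(candidates[:10])

for h in handles:
    h.remove()
```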
SafeLLM achieves targeted and irreversible forgetting by identifying and neutralizing FFN substructures responsible for harmful generation pathways.
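Stage (3) can likewise be sketched generically: restrict optimization to the flagged FFN sublayers, apply a "forget" term (gradient ascent on harmful completions) and a "retain" term (standard language-modeling loss on benign data). This is a common unlearning recipe under stated assumptions, not the paper's exact constrained objective; `flagged_layers` and the batch format are hypothetical.

```python
# Generic sketch of stage (3): localized unlearning on flagged FFN layers.
# forget term = negated LM loss on harmful data (gradient ascent),
# retain term = LM loss on benign data to preserve general capability.
import torch

def build_optimizer(model, flagged_layers, lr=1e-5):
    # Freeze everything, then unfreeze only the flagged FFN sublayers so the
    # update (and hence the forgetting) stays localized.
    for p in model.parameters():
        p.requires_grad_(False)
    params = []
    for i in flagged_layers:
        for p in model.model.layers[i].mlp.parameters():  # LLaMA-style layout
            p.requires_grad_(True)
            params.append(p)
    return torch.optim.AdamW(params, lr=lr)

def unlearning_step(model, harmful_batch, benign_batch, optimizer, lam=1.0):
    """One step: push probability mass away from harmful continuations while
    keeping the language-modeling loss on benign data low."""
    forget = -model(**harmful_batch, labels=harmful_batch["input_ids"]).loss
    retain = model(**benign_batch, labels=benign_batch["input_ids"]).loss
    loss = forget + lam * retain
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The weighting `lam` trades off forgetting strength against utility preservation; the paper's constrained formulation presumably handles this trade-off more explicitly than this unconstrained sum.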
Extensive experiments on prominent LLMs (Vicuna, LLaMA, and GPT-J) across multiple jailbreak benchmarks show that SafeLLM substantially reduces attack success rates while maintaining high general-purpose performance.
Compared to standard defense methods such as supervised fine-tuning and direct preference optimization, SafeLLM offers stronger safety guarantees, more precise control over harmful behavior, and greater robustness to unseen attacks.
Moreover, SafeLLM maintains general performance after the harmful knowledge is unlearned.
These results highlight unlearning as a promising direction for scalable and effective LLM safety.
Published on arXiv on: 2025-08-21T02:39:14Z