arxiv papers 1 min read

Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models

Link: http://arxiv.org/abs/2412.00357v1

PDF Link: http://arxiv.org/pdf/2412.00357v1

Summary: Fine-tuning text-to-image diffusion models is widely used for personalization and adaptation to new domains.

In this paper, we identify a critical vulnerability of fine-tuning: safety alignment methods designed to filter harmful content (e.g., nudity) can break down during fine-tuning, allowing previously suppressed content to resurface, even when using benign datasets.

While this "fine-tuning jailbreaking" issue is known in large language models,it remains largely unexplored in text-to-image diffusion models.

Our investigation reveals that standard fine-tuning can inadvertently undo safety measures, causing models to relearn harmful concepts that were previously removed and even exacerbate harmful behaviors.

To address this issue, we present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation (LoRA) modules separately from Fine-Tuning LoRA components and merging them during inference.

This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks.
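To make the merge-at-inference idea concrete, here is a minimal PyTorch sketch of combining two independently trained low-rank updates on top of a frozen base weight. All shapes, tensor names, and the `merge_lora` helper are illustrative assumptions for this summary, not the authors' implementation or an existing library API.

```python
# Minimal sketch: a "safety" LoRA and a "fine-tuning" LoRA are trained as
# separate low-rank pairs and only combined with the frozen base weight at
# inference time. Shapes and names are hypothetical.
import torch

def merge_lora(base_weight, lora_pairs, scale=1.0):
    """Return base_weight + scale * sum(B @ A) over each (A, B) low-rank pair."""
    merged = base_weight.clone()
    for A, B in lora_pairs:          # A: (r, d_in), B: (d_out, r)
        merged += scale * (B @ A)
    return merged

# Frozen base weight, e.g. one attention projection inside the diffusion UNet.
d_out, d_in, r = 320, 320, 4
W = torch.randn(d_out, d_in)

# Safety LoRA (suppresses harmful concepts) and task LoRA (new-domain
# fine-tuning) are kept modular during training; B is zero-initialized as usual.
A_safety, B_safety = torch.randn(r, d_in) * 0.01, torch.zeros(d_out, r)
A_task,   B_task   = torch.randn(r, d_in) * 0.01, torch.zeros(d_out, r)

# At inference, both adapters are applied on top of the same frozen weight,
# so fine-tuning never updates (and cannot undo) the safety adapter.
W_inference = merge_lora(W, [(A_safety, B_safety), (A_task, B_task)])
```

The key design point, as described in the abstract, is that the safety update is never exposed to the fine-tuning gradients; it is only composed with the task adapter when the model is run.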

Our experiments demonstrate that Modular LoRA outperforms traditional fine-tuning methods in maintaining safety alignment, offering a practical approach for enhancing the security of text-to-image diffusion models against potential attacks.

Published on arXiv on: 2024-11-30T04:37:38Z