Link: http://arxiv.org/abs/2412.00357v1
PDF Link: http://arxiv.org/pdf/2412.00357v1
Summary: Fine-tuning text-to-image diffusion models is widely used for personalization and adaptation to new domains. In this paper, we identify a critical vulnerability of fine-tuning: safety alignment methods designed to filter harmful content (e.g., nudity) can break down during fine-tuning, allowing previously suppressed content to resurface, even when using benign datasets. While this "fine-tuning jailbreaking" issue is known in large language models, it remains largely unexplored in text-to-image diffusion models. Our investigation reveals that standard fine-tuning can inadvertently undo safety measures, causing models to relearn harmful concepts that were previously removed and even exacerbate harmful behaviors. To address this issue, we present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation (LoRA) modules separately from Fine-Tuning LoRA components and merging them during inference. This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks. Our experiments demonstrate that Modular LoRA outperforms traditional fine-tuning methods in maintaining safety alignment, offering a practical approach for enhancing the security of text-to-image diffusion models against potential attacks.
Published on arXiv on: 2024-11-30T04:37:38Z
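
The core idea of Modular LoRA, as described in the summary, is to keep the safety adapter out of the task fine-tuning loop and only combine the two adapters at inference time. The following is a minimal sketch of that inference-time merge using the Hugging Face diffusers multi-adapter LoRA API; it is not the authors' implementation, and the model checkpoint, file paths, and adapter names are hypothetical placeholders.

```python
# Sketch: combine an independently trained safety LoRA with a task
# fine-tuning LoRA at inference time (paths/names are placeholders).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the two LoRA modules that were trained separately:
# the safety LoRA was never exposed to the task fine-tuning updates,
# so its alignment behavior is preserved.
pipe.load_lora_weights("path/to/safety_lora", adapter_name="safety")
pipe.load_lora_weights("path/to/task_lora", adapter_name="new_task")

# Activate both adapters together for generation.
pipe.set_adapters(["safety", "new_task"], adapter_weights=[1.0, 1.0])

image = pipe("a photo of a person at the beach").images[0]
image.save("output.png")
```

Keeping the safety module frozen and merging it only at inference is what distinguishes this setup from standard fine-tuning, where a single set of weights (or a single LoRA) is updated on the new data and can drift away from its safety alignment.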