
TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models

Link: http://arxiv.org/abs/2503.07389v1

PDF Link: http://arxiv.org/pdf/2503.07389v1

Summary: Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images.

To mitigate this risk, concept erasure methods are studied to enable the model to unlearn specific concepts.

However, current studies struggle to fully erase malicious concepts implicitly embedded in prompts (e.g., metaphorical expressions or adversarial prompts) while preserving the model's normal generation capability.

To address this challenge, our study proposes TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation.

First, TRCE erases the malicious semantics implicitly embedded in textual prompts.

By identifying a critical mapping objective (i.e., the [EoT] embedding), we optimize the cross-attention layers to map malicious prompts to contextually similar prompts with safe concepts.

This step prevents the model from being overly influenced by malicious semantics during the denoising process.

Following this, considering the deterministic properties of the diffusion model's sampling trajectory, TRCE further steers the early denoising prediction toward the safe direction and away from the unsafe one through contrastive learning, avoiding the generation of malicious content.
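The steering objective described above can be sketched as a triplet-style contrastive loss on the early-step noise prediction; the margin formulation and anchor names here are assumptions for illustration, since the abstract does not give TRCE's exact loss:

```python
import numpy as np

def contrastive_steering_loss(eps_pred, eps_safe, eps_unsafe, margin=1.0):
    """Hypothetical sketch of a contrastive steering objective.

    eps_pred:   the model's noise prediction at an early denoising step
    eps_safe:   prediction anchored on a safe concept (positive)
    eps_unsafe: prediction anchored on the malicious concept (negative)

    Pulls eps_pred toward the safe anchor and pushes it away from the
    unsafe one until their squared distances differ by at least `margin`.
    """
    d_pos = np.mean((eps_pred - eps_safe) ** 2)    # distance to safe anchor
    d_neg = np.mean((eps_pred - eps_unsafe) ** 2)  # distance to unsafe anchor
    return float(max(d_pos - d_neg + margin, 0.0))
```

Because early denoising steps largely fix the sampling trajectory, minimizing such a loss there biases the whole generation away from the unsafe direction.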

Finally, we conduct comprehensive evaluations of TRCE on multiple malicious concept erasure benchmarks, and the results demonstrate its effectiveness in erasing malicious concepts while better preserving the model's original generation ability.

The code is available at: http://github.com/ddgoodgood/TRCE.

CAUTION: This paper includes model-generated content that may contain offensive material.

Published on arXiv on: 2025-03-10T14:37:53Z