Link: http://arxiv.org/abs/2503.09446v1
PDF Link: http://arxiv.org/pdf/2503.09446v1
Summary: Text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images but also raise concerns about generating harmful or misleading content.
While many approaches have been proposed to erase unwanted concepts without requiring retraining from scratch, they inadvertently degrade performance on normal generation tasks.
In this work, we propose Interpret then Deactivate (ItD), a novel framework to enable precise concept removal in T2I diffusion models while preserving overall performance.
ItD first employs a sparse autoencoder (SAE) to interpret each concept as a combination of multiple features.
By permanently deactivating the specific features associated with target concepts, we repurpose the SAE as a zero-shot classifier that identifies whether the input prompt includes target concepts, allowing selective concept erasure in diffusion models.
Moreover, we demonstrate that ItD can be easily extended to erase multiple concepts without requiring further training.
Comprehensive experiments across celebrity identities, artistic styles, and explicit content demonstrate ItD's effectiveness in eliminating targeted concepts without interfering with normal concept generation.
Additionally, ItD is robust against adversarial prompts designed to circumvent content filters.
Code is available at: https://github.com/NANSirun/Interpret-then-deactivate.
Published on arXiv on: 2025-03-12T14:46:40Z
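
As a reading aid, here is a minimal PyTorch sketch of the mechanism the abstract describes: an SAE interprets a prompt embedding as sparse features, features tied to a target concept are zeroed out, and the same feature activations double as a zero-shot check for whether the prompt touches the concept. This is an illustration under assumptions, not the authors' implementation; names such as SparseAutoencoder, target_feature_ids, and threshold are hypothetical, and the real ItD pipeline (see the repository above) may differ.

```python
# Minimal sketch (not the authors' code) of SAE-based concept erasure.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Hypothetical SAE over text-encoder embeddings."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps feature activations sparse and non-negative.
        return torch.relu(self.encoder(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)


def erase_and_classify(
    sae: SparseAutoencoder,
    prompt_emb: torch.Tensor,          # (batch, d_model) text-encoder output
    target_feature_ids: torch.Tensor,  # indices of features tied to the concept
    threshold: float = 0.1,            # hypothetical detection threshold
):
    features = sae.encode(prompt_emb)
    # Zero-shot check: does the prompt activate any target-concept feature?
    concept_score = features[:, target_feature_ids].sum(dim=-1)
    contains_concept = concept_score > threshold
    # "Deactivate": zero the target features before reconstructing, so the
    # edited embedding handed to the diffusion model drops the concept.
    features[:, target_feature_ids] = 0.0
    edited_emb = sae.decode(features)
    # Apply the edit only when the concept is detected, leaving normal
    # prompts untouched.
    out = torch.where(contains_concept.unsqueeze(-1), edited_emb, prompt_emb)
    return out, contains_concept


# Hypothetical usage with random embeddings and made-up feature indices:
sae = SparseAutoencoder(d_model=768, d_features=4096)
emb = torch.randn(2, 768)
edited, flagged = erase_and_classify(sae, emb, torch.tensor([10, 42, 99]))
```

Under this reading, erasing several concepts would amount to taking the union of their feature indices, consistent with the abstract's claim that multi-concept erasure requires no further training.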