
The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning

Link: http://arxiv.org/abs/2504.21307v1

PDF Link: http://arxiv.org/pdf/2504.21307v1

Summary: Despite the remarkable generalization capabilities of diffusion models, recent studies have shown that these models can memorize and generate harmful content when prompted with specific text instructions.

Although fine-tuning approaches have been developed to mitigate this issue by unlearning harmful concepts, these methods can be easily circumvented through jailbreaking attacks.

This indicates that the harmful concept has not been fully erased from the model.

However, existing attack methods, while effective, lack interpretability regarding why unlearned models still retain the concept, thereby hindering the development of defense strategies.

In this work, we address these limitations by proposing an attack method that learns an orthogonal set of interpretable attack token embeddings.
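The paper does not spell out its training objective in this summary, but the orthogonality constraint on a set of token embeddings can be illustrated in isolation. The sketch below is a minimal, hypothetical NumPy example (the sizes, penalty form, and QR projection are assumptions, not the authors' method): it measures how far a set of embeddings is from orthonormal and shows one standard way to enforce the constraint.

```python
import numpy as np

# Hypothetical sketch: enforcing orthogonality on a set of learned
# attack token embeddings. This does NOT reproduce the paper's
# training objective; it only illustrates the orthogonality
# constraint itself.

rng = np.random.default_rng(0)
num_tokens, dim = 4, 16                  # illustrative sizes, not the paper's
E = rng.normal(size=(num_tokens, dim))   # stand-in for learned embeddings


def ortho_penalty(E):
    """|| E E^T - I ||_F^2: zero exactly when rows are orthonormal."""
    G = E @ E.T
    return float(np.sum((G - np.eye(len(E))) ** 2))


# One way to satisfy the constraint: project the embeddings onto an
# orthonormal basis of their span via a reduced QR decomposition.
Q, _ = np.linalg.qr(E.T)                 # Q: (dim, num_tokens), orthonormal cols
E_orth = Q.T                             # rows are now orthonormal

print(round(ortho_penalty(E_orth), 6))   # -> 0.0
```

In a real attack, a penalty of this form would typically be added to the adversarial loss so the learned embeddings stay mutually orthogonal while still triggering the unlearned concept.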

The attack token embeddings can be decomposed into human-interpretable textual elements, revealing that unlearned models still retain the target concept through implicit textual components.
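One common way to make a learned embedding human-interpretable is to compare it against a vocabulary embedding table and read off the nearest tokens. The toy NumPy sketch below assumes this nearest-neighbour reading (the vocabulary, dimensions, and mixing weights are all invented for illustration; the actual method would operate on the diffusion model's text-encoder vocabulary):

```python
import numpy as np

# Hypothetical sketch: reading an attack embedding as a mixture of
# vocabulary tokens via cosine similarity. Toy vocabulary and random
# embeddings; not the paper's actual decomposition procedure.

vocab = ["violence", "art", "weapon", "flower"]   # toy vocabulary
rng = np.random.default_rng(1)
vocab_emb = rng.normal(size=(len(vocab), 8))      # toy embedding table

# Construct an attack embedding that implicitly mixes two concepts.
attack = 0.7 * vocab_emb[0] + 0.3 * vocab_emb[2]


def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))


# Rank vocabulary tokens by similarity to the attack embedding.
scores = [(cosine(attack, v), w) for v, w in zip(vocab_emb, vocab)]
top = [w for _, w in sorted(scores, reverse=True)[:2]]
print(top)
```

Under this reading, the dominant component of the attack embedding surfaces as its nearest vocabulary token, which is the sense in which an "interpretable" embedding can reveal that an unlearned model still encodes the target concept implicitly.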

Furthermore, these attack token embeddings are robust and transferable across text prompts, initial noises, and unlearned models.

Finally, leveraging this diverse set of embeddings, we design a defense method applicable to both our proposed attack and existing attack methods.

Experimental results demonstrate the effectiveness of both our attack and defense strategies.

Published on arXiv on: 2025-04-30T04:33:43Z