
NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation

Link: http://arxiv.org/abs/2506.18325v1

PDF Link: http://arxiv.org/pdf/2506.18325v1

Summary: The rapid advancement of text-to-image (T2I) models, such as Stable Diffusion, has enhanced their capability to synthesize images from textual prompts.

However, this progress also raises significant risks of misuse, including the generation of harmful content (e.g., pornography, violence, discrimination), which contradicts the ethical goals of T2I technology and hinders its sustainable development.

Inspired by "jailbreak" attacks in largelanguage models, which bypass restrictions through subtle prompt modifications,this paper proposes NSFW-Classifier Guided Prompt Sanitization (PromptSan), anovel approach to detoxify harmful prompts without altering model architectureor degrading generation capability.

PromptSan includes two variants: PromptSan-Modify, which iteratively identifies and replaces harmful tokens in input prompts using text NSFW classifiers during inference, and PromptSan-Suffix, which trains an optimized suffix token sequence to neutralize harmful intent while passing both text and image NSFW classifier checks.
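As a rough illustration of the Modify variant, a greedy loop could score the prompt with a text NSFW classifier, attribute the score to individual tokens by leave-one-out, and swap out the worst offender for the replacement that most lowers the score. This is a minimal sketch, not the paper's implementation: the keyword scorer, threshold, and replacement pool below are toy placeholders.

```python
from typing import Callable, List

# Toy stand-in for a trained text NSFW classifier: returns a harmfulness
# score in [0, 1]. A real system would use a learned classifier instead.
HARMFUL = {"gore", "nude", "violent"}

def nsfw_score(tokens: List[str]) -> float:
    if not tokens:
        return 0.0
    return sum(t.lower() in HARMFUL for t in tokens) / len(tokens)

def sanitize_modify(tokens: List[str],
                    score: Callable[[List[str]], float],
                    replacements: List[str],
                    threshold: float = 0.05,
                    max_iters: int = 10) -> List[str]:
    """While the prompt scores as NSFW, find the token whose removal lowers
    the score most and swap in the best-scoring replacement."""
    tokens = list(tokens)
    for _ in range(max_iters):
        base = score(tokens)
        if base <= threshold:
            break
        # Leave-one-out attribution: which token contributes most to the score?
        drops = [base - score(tokens[:i] + tokens[i + 1:])
                 for i in range(len(tokens))]
        worst = max(range(len(tokens)), key=lambda i: drops[i])
        # Pick the candidate replacement that minimizes the classifier score.
        best = min(replacements,
                   key=lambda r: score(tokens[:worst] + [r] + tokens[worst + 1:]))
        tokens[worst] = best
    return tokens

print(sanitize_modify("a violent gore scene".split(), nsfw_score,
                      ["peaceful", "calm", "scenic"]))
```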

Extensive experiments demonstrate that PromptSan achieves state-of-the-art performance in reducing harmful content generation across multiple metrics, effectively balancing safety and usability.
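The Suffix variant can be sketched similarly, under strong assumptions: a frozen embedding table and a frozen text NSFW classifier head (both toy stand-ins below), with a short sequence of continuous suffix embeddings optimized so the concatenated prompt scores as safe. The actual PromptSan-Suffix trains discrete suffix tokens and also checks an image NSFW classifier; this continuous relaxation only shows the shape of the training loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim, suffix_len = 1000, 32, 4

# Frozen stand-ins for the text encoder's embeddings and a text NSFW head.
embed = nn.Embedding(vocab, dim)
nsfw_head = nn.Linear(dim, 1)  # logit > 0 read as "NSFW"
for p in list(embed.parameters()) + list(nsfw_head.parameters()):
    p.requires_grad_(False)

# Learnable suffix embeddings, optimized so prompt + suffix scores as safe.
suffix = nn.Parameter(torch.randn(suffix_len, dim) * 0.01)
opt = torch.optim.Adam([suffix], lr=1e-2)

prompt_ids = torch.randint(0, vocab, (8,))  # placeholder prompt tokens
prompt_emb = embed(prompt_ids)              # (8, dim), frozen

for step in range(200):
    seq = torch.cat([prompt_emb, suffix], dim=0)      # append suffix
    logit = nsfw_head(seq.mean(dim=0))                # pooled NSFW logit
    loss = torch.nn.functional.softplus(logit).sum()  # push logit negative
    opt.zero_grad()
    loss.backward()
    opt.step()
```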

Published on arXiv on: 2025-06-23T06:17:30Z