Link: http://arxiv.org/abs/2412.18123v1
PDF Link: http://arxiv.org/pdf/2412.18123v1
Summary: As text-to-image (T2I) models continue to advance and gain widespread adoption, their associated safety issues are becoming increasingly prominent.
Malicious users often exploit these models to generate Not-Safe-for-Work (NSFW) images using harmful or adversarial prompts, highlighting the critical need for robust safeguards to ensure the integrity and compliance of model outputs.
Current internal safeguards frequently degrade image quality, while external detection methods often suffer from low accuracy and inefficiency.
In this paper, we introduce AEIOU, a defense framework that is Adaptable, Efficient, Interpretable, Optimizable, and Unified against NSFW prompts in T2I models.
AEIOU extracts NSFW features from the hidden states of the model's text encoder, utilizing the separable nature of these features to detect NSFW prompts.
The detection process is efficient, requiring minimal inference time.
AEIOU also offers real-time interpretation of results and supports optimization through data augmentation techniques.
The framework is versatile, accommodating various T2I architectures.
Our extensive experiments show that AEIOU significantly outperforms both commercial and open-source moderation tools, achieving over 95% accuracy across all datasets and improving efficiency by at least tenfold.
It effectively counters adaptive attacks and excels in few-shot and multi-label scenarios.
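
The snippet below is a minimal sketch of the general idea the summary describes: training a lightweight probe on text-encoder hidden states to separate NSFW prompts from benign ones before they reach the diffusion backbone. The choice of encoder (CLIP ViT-L/14, the text encoder used by Stable Diffusion), the mean-pooling step, the tiny example dataset, and the logistic-regression classifier are all illustrative assumptions, not the paper's actual AEIOU implementation.

# Sketch: linear probe on text-encoder hidden states to flag NSFW prompts.
# Encoder choice, pooling, and classifier are assumptions for illustration.
import torch
from transformers import CLIPTokenizer, CLIPTextModel
from sklearn.linear_model import LogisticRegression

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

@torch.no_grad()
def prompt_features(prompts):
    """Mean-pool the encoder's last hidden state over non-padding tokens."""
    inputs = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**inputs, output_hidden_states=True).hidden_states[-1]
    mask = inputs.attention_mask.unsqueeze(-1)  # zero out padding positions
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Hypothetical labeled prompts (1 = NSFW, 0 = benign) stand in for a real dataset.
train_prompts = ["a sunny beach landscape", "an explicit violent scene"]
train_labels = [0, 1]

probe = LogisticRegression(max_iter=1000)
probe.fit(prompt_features(train_prompts), train_labels)

# At inference time, score a prompt before running the T2I model.
print(probe.predict_proba(prompt_features(["a cute corgi puppy"]))[:, 1])

Because the probe operates only on hidden states already computed by the text encoder, this style of check adds little inference overhead, which matches the efficiency claim in the summary.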
Published on arXiv on: 2024-12-24T03:17:45Z