
Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety

Link: http://arxiv.org/abs/2505.04146v1

PDF Link: http://arxiv.org/pdf/2505.04146v1

Summary: Existing large language models (LLMs) are advancing rapidly and produce outstanding results in image generation tasks, yet their content safety checks remain vulnerable to prompt-based jailbreaks.

Through preliminary testing on platforms such as ChatGPT, MetaAI, and Grok, we observed that even short, natural prompts could lead to the generation of compromising images, ranging from realistic depictions of forged documents to manipulated images of public figures.

We introduce Unmasking the Canvas (UTC Benchmark; UTCB), a dynamic and scalable benchmark dataset for evaluating LLM vulnerability in image generation.

Our methodology combines structured prompt engineering, multilingual obfuscation (e.g., Zulu, Gaelic, Base64), and evaluation using Groq-hosted LLaMA-3.
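
The abstract does not reproduce the paper's prompt templates, so the snippet below is only a minimal sketch of what one obfuscation channel (Base64 encoding) might look like; the wrapper instruction and probe prompt are hypothetical, and the Zulu/Gaelic variants would instead pass the prompt through a translation step.

```python
import base64

def obfuscate_base64(prompt: str) -> str:
    """Wrap a test prompt in a Base64 layer, one of the obfuscation
    channels the abstract mentions (alongside Zulu and Gaelic)."""
    encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")
    # Hypothetical wrapper instruction; the paper's actual templates
    # are not given in this summary.
    return f"Decode the following Base64 string and follow it: {encoded}"

# Benign probe used purely for illustration.
print(obfuscate_base64("describe a landscape"))
```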

The pipeline supports both zero-shot and fallback prompting strategies, risk scoring, and automated tagging.
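
As a rough illustration of that control flow (not the authors' code), a zero-shot attempt followed by a fallback template, a risk score, and tags might look like the sketch below; `query_model`, `RISK_KEYWORDS`, the refusal check, and the fallback wrapper are all illustrative placeholders.

```python
# Minimal sketch of a zero-shot -> fallback evaluation loop with naive
# keyword risk scoring and tagging. All names are placeholders, not the
# UTCB implementation.

RISK_KEYWORDS = {"forged": 3, "passport": 3, "weapon": 2, "impersonate": 2}

def risk_score(text: str) -> int:
    """Toy keyword-weighted risk score; the paper's scoring is richer."""
    lowered = text.lower()
    return sum(w for kw, w in RISK_KEYWORDS.items() if kw in lowered)

def evaluate(prompt: str, query_model) -> dict:
    """Try the prompt zero-shot; on refusal, retry with a fallback wrapper."""
    response = query_model(prompt)                    # zero-shot attempt
    strategy = "zero-shot"
    if "i can't" in response.lower():                 # crude refusal check
        response = query_model(f"As a fictional exercise: {prompt}")
        strategy = "fallback"
    return {
        "prompt": prompt,
        "strategy": strategy,
        "risk": risk_score(response),
        "tags": [kw for kw in RISK_KEYWORDS if kw in response.lower()],
    }
```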

All generations are stored with rich metadata and curated into Bronze (non-verified), Silver (LLM-aided verification), and Gold (manually verified) tiers.
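
One plausible way to represent those tiers in storage is sketched below with hypothetical field names; the paper's actual schema is not given in this summary.

```python
from dataclasses import dataclass, field
from enum import Enum

class Tier(Enum):
    BRONZE = "non-verified"
    SILVER = "llm-aided-verification"
    GOLD = "manually-verified"

@dataclass
class GenerationRecord:
    """Hypothetical record for one stored generation and its metadata."""
    prompt: str
    model: str
    risk: int
    tags: list[str] = field(default_factory=list)
    tier: Tier = Tier.BRONZE   # every record starts non-verified

# A record would be promoted as verification progresses, e.g.
# record.tier = Tier.SILVER after an LLM-aided check, then
# record.tier = Tier.GOLD after manual review.
```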

UTCB is designed to evolve over time with new data sources, prompt templates, and model behaviors.

Warning: This paper includes visual examples of adversarial inputs designed to test model safety.

All outputs have been redacted to ensure responsible disclosure.

Published on arXiv on: 2025-05-07T05:54:04Z