Link: http://arxiv.org/abs/2507.21820v1
PDF Link: http://arxiv.org/pdf/2507.21820v1
Summary: Despite significant advancements in alignment and content moderation, large language models (LLMs) and text-to-image (T2I) systems remain vulnerable to prompt-based attacks known as jailbreaks.
Unlike traditional adversarial examples, which require expert knowledge, many of today's jailbreaks are low-effort and high-impact, crafted by everyday users with nothing more than cleverly worded prompts.
This paper presents a systems-style investigation into how non-experts reliably circumvent safety mechanisms through techniques such as multi-turn narrative escalation, lexical camouflage, implication chaining, fictional impersonation, and subtle semantic edits.
We propose a unified taxonomy of prompt-level jailbreak strategies spanning both text-output and T2I models, grounded in empirical case studies across popular APIs.
Our analysis reveals that every stage of the moderation pipeline, from input filtering to output validation, can be bypassed with accessible strategies.
We conclude by highlighting the urgent need for context-aware defenses that reflect the ease with which these jailbreaks can be reproduced in real-world settings.
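The abstract refers to a moderation pipeline with distinct input-filtering and output-validation stages. The minimal sketch below illustrates that two-stage structure only; it is not the paper's implementation or any specific API, and all names (BLOCKLIST, input_filter, output_validator, generate, moderated_generate) are hypothetical placeholders.

# Illustrative two-stage moderation pipeline: a lexical input filter before
# generation and an output validator after it. Hypothetical sketch, not the
# paper's system.

BLOCKLIST = {"blocked-term"}  # stand-in for a keyword-based input filter


def input_filter(prompt: str) -> bool:
    """Return True if the prompt passes a naive keyword check."""
    words = set(prompt.lower().split())
    return BLOCKLIST.isdisjoint(words)


def output_validator(text: str) -> bool:
    """Stand-in for an output-side check; here it reuses the same keyword test."""
    return input_filter(text)


def generate(prompt: str) -> str:
    """Placeholder for the underlying LLM or T2I call."""
    return f"[model output for: {prompt}]"


def moderated_generate(prompt: str) -> str:
    """Run the request through input filtering, generation, then output validation."""
    if not input_filter(prompt):
        return "[request refused by input filter]"
    output = generate(prompt)
    if not output_validator(output):
        return "[output withheld by output validator]"
    return output


if __name__ == "__main__":
    print(moderated_generate("an ordinary, harmless prompt"))

A keyword-level pipeline like this is exactly the kind of surface check the abstract argues is insufficient: lexical camouflage and subtle semantic edits preserve intent while avoiding blocked terms, which is why the paper calls for context-aware defenses.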
Published on arXiv on: 2025-07-29T13:55:23Z