T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models

Link: http://arxiv.org/abs/2504.15512v1

PDF Link: http://arxiv.org/pdf/2504.15512v1

Summary: The rapid development of generative artificial intelligence has made text-to-video models essential for building future multimodal world simulators.

However, these models remain vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content.

Such vulnerabilities undermine the reliability and security of simulation-based applications.

In this paper, we propose T2VShield, a comprehensive and model-agnostic defense framework designed to protect text-to-video models from jailbreak threats.

Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses, including semantic ambiguities in prompts, difficulties in detecting malicious content in dynamic video outputs, and inflexible model-centric mitigation strategies.

T2VShield introduces a prompt rewriting mechanism based on reasoning and multimodal retrieval to sanitize malicious inputs, along with a multi-scope detection module that captures local and global inconsistencies across time and modalities.

The framework does not require access to internal model parameters and works with both open- and closed-source systems.
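
To illustrate the black-box arrangement described above, here is a minimal sketch of how an input-stage rewriter and an output-stage detector might wrap an opaque generator. It is not the paper's implementation: every name in it (`guarded_generate`, `GuardedResult`, the `toy_*` helpers, `BLOCKLIST`) is hypothetical, the keyword filter merely stands in for the reasoning- and retrieval-based rewriting, and the frame check stands in for the multi-scope detection module.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stand-in: a "video" is just a list of per-frame descriptions.
Video = List[str]


@dataclass
class GuardedResult:
    prompt_used: str          # the prompt actually sent to the generator
    video: Optional[Video]    # None when the output stage blocks the result
    blocked: bool


def guarded_generate(
    prompt: str,
    rewrite: Callable[[str], str],
    generate: Callable[[str], Video],
    is_unsafe: Callable[[Video], bool],
) -> GuardedResult:
    """Wrap a generator that is only reachable as an opaque function call."""
    safe_prompt = rewrite(prompt)    # input stage: sanitize the prompt
    video = generate(safe_prompt)    # model stage: no access to parameters
    if is_unsafe(video):             # output stage: inspect the generated video
        return GuardedResult(safe_prompt, None, True)
    return GuardedResult(safe_prompt, video, False)


# Toy components, assumptions for illustration only.
BLOCKLIST = {"weapon"}


def toy_rewrite(prompt: str) -> str:
    # Drops blocklisted words; a real rewriter would preserve benign intent.
    return " ".join(w for w in prompt.split() if w.lower() not in BLOCKLIST)


def toy_generate(prompt: str) -> Video:
    # Pretends to render three frames from the prompt.
    return [f"frame {i}: {prompt}" for i in range(3)]


def toy_is_unsafe(video: Video) -> bool:
    # Flags the video if any frame description contains a blocklisted term.
    return any(term in frame.lower() for frame in video for term in BLOCKLIST)
```

Because the generator is passed in as a plain callable, the same wrapper would apply whether `generate` calls a local open-source model or a remote closed-source API, which is the property the summary emphasizes.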

Extensive experiments on five platforms show that T2VShield can reduce jailbreak success rates by up to 35 percent compared to strong baselines.

We further develop a human-centered audiovisual evaluation protocol to assess perceptual safety, emphasizing the importance of visual-level defense in enhancing the trustworthiness of next-generation multimodal simulators.

Published on arXiv on: 2025-04-22T01:18:42Z