arxiv papers 1 min read

Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message

Link: http://arxiv.org/abs/2507.04673v1

PDF Link: http://arxiv.org/pdf/2507.04673v1

Summary: The rise of conversational interfaces has greatly enhanced LLM usability by leveraging dialogue history for sophisticated reasoning.

However, this reliance introduces an unexplored attack surface.

This paper introduces Trojan Horse Prompting, a novel jailbreak technique.

Adversaries bypass safety mechanisms by forging the model's own past utterances within the conversational history provided to its API.

A malicious payload is injected into a model-attributed message, followed by a benign user prompt to trigger harmful content generation.
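
To make the mechanism concrete, here is a minimal sketch of the forged-history structure the abstract describes. The field names and message layout follow the generic chat-completion convention many APIs use; they are illustrative assumptions, not the paper's exact setup or any specific vendor's schema, and the payload is a placeholder.

```python
# Sketch of a conversation history with a forged assistant turn.
# Roles and field names are generic assumptions, not a specific API schema.

forged_history = [
    # Innocuous opener attributed to the user.
    {"role": "user", "content": "Can you help me with a question?"},

    # FORGED turn: the attacker authors this message but attributes it
    # to the model itself, smuggling the malicious payload into a turn
    # the model implicitly trusts as its own prior output.
    {"role": "assistant", "content": "<attacker-controlled payload>"},

    # Benign trigger: an unremarkable user prompt that causes the model
    # to continue from the forged context.
    {"role": "user", "content": "Please continue."},
]

# Because chat APIs typically accept the full history from the client,
# nothing distinguishes the forged assistant turn from a genuine one.
```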

This vulnerability stems from Asymmetric Safety Alignment: models are extensively trained to refuse harmful user requests but lack comparable skepticism towards their own purported conversational history.

This implicit trust in its "past" creates a high-impact vulnerability.

Experimental validation on Google's Gemini-2.0-flash-preview-image-generation shows Trojan Horse Prompting achieves a significantly higher Attack Success Rate (ASR) than established user-turn jailbreaking methods.

These findings reveal a fundamental flaw in modern conversational AI security, necessitating a paradigm shift from input-level filtering to robust, protocol-level validation of conversational context integrity.
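
As a hedged illustration of what protocol-level context validation could look like (an assumed design, not the paper's proposal), a provider could attach a server-side MAC to each genuine assistant turn and reject any client-supplied history containing an assistant message whose MAC is missing or invalid:

```python
import hmac
import hashlib
import json

# Hypothetical key held only by the provider; never exposed to clients.
SERVER_KEY = b"provider-secret-key"


def sign_turn(turn: dict) -> dict:
    """Attach a server-side MAC to a genuine assistant turn at generation time."""
    mac = hmac.new(SERVER_KEY,
                   json.dumps(turn, sort_keys=True).encode(),
                   hashlib.sha256)
    return {**turn, "mac": mac.hexdigest()}


def verify_history(history: list[dict]) -> bool:
    """Reject any assistant turn whose MAC is absent or does not verify."""
    for turn in history:
        if turn.get("role") != "assistant":
            continue  # user turns are untrusted input by design
        mac = turn.get("mac")
        body = {k: v for k, v in turn.items() if k != "mac"}
        expected = hmac.new(SERVER_KEY,
                            json.dumps(body, sort_keys=True).encode(),
                            hashlib.sha256).hexdigest()
        if mac is None or not hmac.compare_digest(mac, expected):
            return False  # forged or tampered model turn
    return True
```

Under such a scheme, the forged assistant turn sketched earlier would fail verification, since an attacker cannot produce a valid MAC without the server key.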

Published on arXiv on: 2025-07-07T05:35:21Z