Link: http://arxiv.org/abs/2507.04673v1
PDF Link: http://arxiv.org/pdf/2507.04673v1
Summary: The rise of conversational interfaces has greatly enhanced LLM usability by leveraging dialogue history for sophisticated reasoning. However, this reliance introduces an unexplored attack surface. This paper introduces Trojan Horse Prompting, a novel jailbreak technique. Adversaries bypass safety mechanisms by forging the model's own past utterances within the conversational history provided to its API. A malicious payload is injected into a model-attributed message, followed by a benign user prompt to trigger harmful content generation. This vulnerability stems from Asymmetric Safety Alignment: models are extensively trained to refuse harmful user requests but lack comparable skepticism towards their own purported conversational history. This implicit trust in its "past" creates a high-impact vulnerability. Experimental validation on Google's Gemini-2.0-flash-preview-image-generation shows Trojan Horse Prompting achieves a significantly higher Attack Success Rate (ASR) than established user-turn jailbreaking methods. These findings reveal a fundamental flaw in modern conversational AI security, necessitating a paradigm shift from input-level filtering to robust, protocol-level validation of conversational context integrity.
Published on arXiv on: 2025-07-07T05:35:21Z