Link: http://arxiv.org/abs/2503.08195v1
PDF Link: http://arxiv.org/pdf/2503.08195v1
Summary: Large language models (LLMs) have demonstrated significant utility in a wide range of applications; however, their deployment is plagued by security vulnerabilities, notably jailbreak attacks.
These attacks manipulate LLMs to generate harmful or unethical content by crafting adversarial prompts.
Much of the current research on jailbreak attacks has focused on single-turn interactions, largely overlooking the impact of historical dialogue on model behavior.
In this paper, we introduce a novel jailbreak paradigm, Dialogue Injection Attack (DIA), which leverages dialogue history to enhance the success rates of such attacks.
DIA operates in a black-box setting, requiring only access to the chat API or knowledge of the LLM's chat template.
We propose two methods for constructing adversarial historical dialogues: one adapts gray-box prefilling attacks, and the other exploits deferred responses.
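To make the black-box threat model concrete, here is a minimal structural sketch (not the authors' implementation; the model name and placeholder turns are illustrative assumptions). It shows the one capability the abstract says DIA needs: serializing a fabricated dialogue history through a model's chat template, so injected "assistant" turns are indistinguishable from genuine ones.

```python
# Minimal sketch of dialogue-history injection, assuming only chat-template
# knowledge. Placeholder contents stand in for the adversarial turns the
# paper constructs; they are not reproduced here.
from transformers import AutoTokenizer

# Any chat model with a template works; Llama-3.1 is a gated repo on the Hub.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Fabricated history: the "assistant" turn below was never produced by the
# model, but a black-box chat endpoint cannot verify that.
messages = [
    {"role": "user", "content": "<earlier turn crafted by the attacker>"},
    {"role": "assistant", "content": "<fabricated, compliant-looking reply>"},
    {"role": "user", "content": "<final request the attacker wants answered>"},
]

# The chat template renders the injected history into the exact prompt
# string the model consumes for its next completion.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```

The design point is that nothing here exceeds ordinary chat-API access: the attacker controls the entire messages list, which is why the abstract frames DIA as black-box.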
Our experiments show that DIA achieves state-of-the-art attack success rates on recent LLMs, including Llama-3.1 and GPT-4o.
Additionally, we demonstrate that DIA can bypass five different defense mechanisms, highlighting its robustness and effectiveness.
Published on arXiv on: 2025-03-11T09:00:45Z