
Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models

Link: http://arxiv.org/abs/2507.05248v1

PDF Link: http://arxiv.org/pdf/2507.05248v1

Summary: Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs).

We uncover a contextual priming vulnerability in which the previous response in the dialogue can steer the model's subsequent behavior toward policy-violating content.

Building on this insight, we propose Response Attack (RA), which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query.

The paraphrased query and response are then formatted into the dialogue and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content.

Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates.

To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities.
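
The paper does not specify the dataset schema in the abstract; as a hypothetical illustration, a context-aware safety fine-tuning record might pair a primed multi-turn dialogue with a safe refusal target, for example in a chat-style JSONL format (field names, placeholders, and the file name below are assumptions, not the released format):

```python
import json

# Hypothetical sketch of a context-aware safety fine-tuning record.
# The released dataset's actual schema may differ; this only illustrates
# the idea stated in the abstract: the training input reproduces the
# priming context, and the training target is a refusal rather than a
# continuation of the primed content.
example = {
    "messages": [
        {"role": "user", "content": "<paraphrased query>"},             # placeholder
        {"role": "assistant", "content": "<mildly harmful response>"},  # priming turn, placeholder
        {"role": "user", "content": "<succinct trigger prompt>"},       # placeholder
    ],
    # Target completion: the model learns to refuse despite the primed context.
    "target": "I can't help with that request.",
}

# One record per line (JSONL), a common format for chat fine-tuning data.
with open("context_aware_safety.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```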

The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.

Published on arXiv: 2025-07-07T17:56:05Z