Link: http://arxiv.org/abs/2507.05248v1
PDF Link: http://arxiv.org/pdf/2507.05248v1
Summary: Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs).
We uncover a contextual priming vulnerability in which a previous response in the dialogue can steer the model's subsequent behavior toward policy-violating content.
Building on this insight, we propose Response Attack (RA), which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. This fabricated exchange is then formatted into the dialogue and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content.
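A minimal sketch of the dialogue structure this describes, assuming an OpenAI-style chat message format; the function name, variable names, and placeholder strings are illustrative and not the authors' released implementation.

```python
def build_priming_dialogue(paraphrased_query: str,
                           fabricated_response: str,
                           trigger_prompt: str) -> list[dict]:
    """Assemble the multi-turn context described in the abstract:
    a paraphrased query, a fabricated assistant turn produced by an
    auxiliary LLM, and a succinct trigger prompt sent to the target model.
    (Hypothetical helper for illustration only.)"""
    return [
        {"role": "user", "content": paraphrased_query},
        {"role": "assistant", "content": fabricated_response},
        {"role": "user", "content": trigger_prompt},
    ]
```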
Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities.
The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.
Published on arXiv on: 2025-07-07T17:56:05Z