Link: http://arxiv.org/abs/2509.07617v1
PDF Link: http://arxiv.org/pdf/2509.07617v1
Summary: Direct Prompt Injection (DPI) attacks pose a critical security threat to Large Language Models (LLMs) due to their low barrier to execution and high potential for damage.
To address the impracticality of existing white-box/gray-box methods and the poor transferability of black-box methods, we propose an activations-guided prompt injection attack framework.
We first construct an Energy-based Model (EBM) using activations from a surrogate model to evaluate the quality of adversarial prompts.
Guided by the trained EBM, we employ token-level Markov Chain Monte Carlo (MCMC) sampling to adaptively optimize adversarial prompts, thereby enabling gradient-free black-box attacks.
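The abstract does not include implementation details, so the sketch below is only a rough illustration of the two components it names: a hypothetical MLP energy head over pooled surrogate-model hidden states, paired with a simple Metropolis-Hastings loop that proposes single-token substitutions and accepts them based on the change in energy. The class and function names, the mean-pooling choice, the proposal distribution, and the temperature are assumptions for illustration, not details from the paper.

```python
# Minimal sketch (not the authors' released code): an activation-based energy
# model over a surrogate LLM, plus EBM-guided token-level Metropolis-Hastings
# sampling. Architecture, pooling, proposal, and temperature are assumptions.
import math
import random

import torch
import torch.nn as nn


class ActivationEnergyModel(nn.Module):
    """Scores a prompt's pooled surrogate activations; lower energy = better."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, pooled_activations: torch.Tensor) -> torch.Tensor:
        # pooled_activations: (batch, hidden_dim) -> (batch,) scalar energies
        return self.net(pooled_activations).squeeze(-1)


def pooled_activations(surrogate, prompt_ids):
    """Mean-pool the surrogate's last-layer hidden states for a token sequence.

    surrogate is assumed to be a Hugging Face causal LM that returns hidden
    states when called with output_hidden_states=True.
    """
    input_ids = torch.tensor([prompt_ids])
    with torch.no_grad():
        outputs = surrogate(input_ids=input_ids, output_hidden_states=True)
    return outputs.hidden_states[-1].mean(dim=1)  # (1, hidden_dim)


def mcmc_optimize(prompt_ids, energy_fn, vocab_size, steps=500, temperature=1.0):
    """Gradient-free Metropolis-Hastings refinement of an adversarial prompt.

    energy_fn maps a list of token ids to a float energy (lower = better).
    """
    current = list(prompt_ids)
    current_energy = energy_fn(current)
    for _ in range(steps):
        # Propose a single-token substitution at a random position.
        proposal = list(current)
        proposal[random.randrange(len(proposal))] = random.randrange(vocab_size)
        proposal_energy = energy_fn(proposal)
        # Accept lower-energy proposals always; otherwise accept with
        # probability exp(-delta / T).
        delta = proposal_energy - current_energy
        if delta <= 0 or random.random() < math.exp(-delta / temperature):
            current, current_energy = proposal, proposal_energy
    return current
```

In this sketch, `energy_fn` would wrap `pooled_activations` and a trained `ActivationEnergyModel`, so the sampler only needs black-box energy scores rather than gradients from the target model.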
Experimental results demonstrate our superior cross-model transferability, achieving a 49.6% attack success rate (ASR) across five mainstream LLMs and a 34.6% improvement over human-crafted prompts, and maintaining a 36.6% ASR on unseen task scenarios.
Interpretability analysis reveals a correlation between activations and attack effectiveness, highlighting the critical role of semantic patterns in transferable vulnerability exploitation.
Published on arXiv on: 2025-09-09T11:42:06Z