Link: http://arxiv.org/abs/2509.07617v1
PDF Link: http://arxiv.org/pdf/2509.07617v1
Summary: Direct Prompt Injection (DPI) attacks pose a critical security threat to Large Language Models (LLMs) due to their low barrier to execution and high potential for damage.
To address the impracticality of existing white-box/gray-box methods and the poor transferability of black-box methods, we propose an activations-guided prompt injection attack framework.
We first construct an Energy-based Model (EBM) using activations from a surrogate model to evaluate the quality of adversarial prompts.
Guided by the trained EBM, we employ token-level Markov Chain Monte Carlo (MCMC) sampling to adaptively optimize adversarial prompts, thereby enabling gradient-free black-box attacks.
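The abstract does not include implementation details, so the sketch below is only a rough illustration of the two components it names: a hypothetical MLP energy head over pooled surrogate-model hidden states, paired with a simple Metropolis-Hastings loop that proposes single-token substitutions and accepts them based on the change in energy. The class and function names, the mean-pooling choice, the proposal distribution, and the temperature are assumptions for illustration, not details from the paper.

```python
# Minimal sketch (not the authors' released code): an activation-based energy
# model over a surrogate LLM, plus EBM-guided token-level Metropolis-Hastings
# sampling. Architecture, pooling, proposal, and temperature are assumptions.
import math
import random

import torch
import torch.nn as nn


class ActivationEnergyModel(nn.Module):
    """Scores a prompt's pooled surrogate activations; lower energy = better."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, pooled_activations: torch.Tensor) -> torch.Tensor:
        # pooled_activations: (batch, hidden_dim) -> (batch,) scalar energies
        return self.net(pooled_activations).squeeze(-1)


def pooled_activations(surrogate, prompt_ids):
    """Mean-pool the surrogate's last-layer hidden states for a token sequence.

    surrogate is assumed to be a Hugging Face causal LM that returns hidden
    states when called with output_hidden_states=True.
    """
    input_ids = torch.tensor([prompt_ids])
    with torch.no_grad():
        outputs = surrogate(input_ids=input_ids, output_hidden_states=True)
    return outputs.hidden_states[-1].mean(dim=1)  # (1, hidden_dim)


def mcmc_optimize(prompt_ids, energy_fn, vocab_size, steps=500, temperature=1.0):
    """Gradient-free Metropolis-Hastings refinement of an adversarial prompt.

    energy_fn maps a list of token ids to a float energy (lower = better).
    """
    current = list(prompt_ids)
    current_energy = energy_fn(current)
    for _ in range(steps):
        # Propose a single-token substitution at a random position.
        proposal = list(current)
        proposal[random.randrange(len(proposal))] = random.randrange(vocab_size)
        proposal_energy = energy_fn(proposal)
        # Accept lower-energy proposals always; otherwise accept with
        # probability exp(-delta / T).
        delta = proposal_energy - current_energy
        if delta <= 0 or random.random() < math.exp(-delta / temperature):
            current, current_energy = proposal, proposal_energy
    return current
```

In this sketch, `energy_fn` would wrap `pooled_activations` and a trained `ActivationEnergyModel`, so the sampler only needs black-box energy scores rather than gradients from the target model.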
Experimental results demonstrate our superior cross-model transferability, achieving a 49.6% attack success rate (ASR) across five mainstream LLMs and a 34.6% improvement over human-crafted prompts, and maintaining a 36.6% ASR on unseen task scenarios.
Interpretability analysis reveals a correlation between activations and attack effectiveness, highlighting the critical role of semantic patterns in transferable vulnerability exploitation.
Published on arXiv on: 2025-09-09T11:42:06Z