Link: http://arxiv.org/abs/2505.20841v1
PDF Link: http://arxiv.org/pdf/2505.20841v1
Summary: As large language models (LLMs) grow more capable, concerns about their safe deployment have also grown.
Although alignment mechanisms have been introduced to deter misuse, they remain vulnerable to carefully designed adversarial prompts.
In this work, we present a scalable attack strategy: intent-hiding adversarial prompting, which conceals malicious intent through the composition of skills.
We develop a game-theoretic framework to model the interaction between such attacks and defense systems that apply both prompt and response filtering.
Our analysis identifies equilibrium points and reveals structural advantages for the attacker.
To counter these threats, we propose and analyze a defense mechanism tailored to intent-hiding attacks.
Empirically, we validate the attack's effectiveness on multiple real-world LLMs across a range of malicious behaviors, demonstrating clear advantages over existing adversarial prompting techniques.
Published on arXiv on: 2025-05-27T07:59:56Z