Link: http://arxiv.org/abs/2505.20841v1
PDF Link: http://arxiv.org/pdf/2505.20841v1
Summary: As large language models (LLMs) grow more capable, concerns about their safe deployment have also grown.
Although alignment mechanisms have been introduced to deter misuse, they remain vulnerable to carefully designed adversarial prompts.
In this work, we present a scalable attack strategy: intent-hiding adversarial prompting, which conceals malicious intent through the composition of skills.
We develop a game-theoretic framework to model the interaction between such attacks and defense systems that apply both prompt and response filtering.
Our analysis identifies equilibrium points and reveals structural advantages for the attacker.
To counter these threats, we propose and analyze a defense mechanism tailored to intent-hiding attacks.
Empirically, we validate the attack's effectiveness on multiple real-world LLMs across a range of malicious behaviors, demonstrating clear advantages over existing adversarial prompting techniques.
Published on arXiv on: 2025-05-27T07:59:56Z