Link: http://arxiv.org/abs/2503.24191v1
PDF Link: http://arxiv.org/pdf/2503.24191v1
Summary: Content Warning: This paper may contain unsafe or harmful content generated by LLMs that may be offensive to readers.
Large Language Models (LLMs) are extensively used as tooling platforms through structured output APIs that ensure syntax compliance, enabling robust integration with existing software such as agent systems.
However, the very feature that enables grammar-guided structured output presents significant security vulnerabilities.
In this work, we reveal a critical control-plane attack surface orthogonal to traditional data-plane vulnerabilities.
We introduce Constrained Decoding Attack (CDA), a novel jailbreak class that weaponizes structured output constraints to bypass safety mechanisms.
Unlike prior attacks focused on input prompts, CDA operates by embedding malicious intent in schema-level grammar rules (control-plane) while maintaining benign surface prompts (data-plane).
We instantiate this with a proof-of-concept Chain Enum Attack, which achieves a 96.2% attack success rate across proprietary and open-weight LLMs, including GPT-4o and Gemini-2.0-flash, on five safety benchmarks with a single query.
Our findings identify a critical security blind spot in current LLM architectures and urge a paradigm shift in LLM safety to address control-plane vulnerabilities, as current mechanisms focused solely on data-plane threats leave critical systems exposed.
Published on arXiv on: 2025-03-31T15:08:06Z
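
To make the control-plane vs. data-plane distinction concrete, below is a minimal sketch of how a typical structured-output request carries the user prompt and the output schema as two separate fields. The field names follow OpenAI-style structured outputs and are illustrative assumptions, not the paper's attack code; the benign enum here simply shows where grammar constraints enter the request.

```python
# Minimal sketch of grammar-guided structured output: the prompt (data-plane)
# and the schema (control-plane) travel as separate fields of the same request.
# Endpoint and parameter names follow OpenAI's structured-output style and are
# illustrative; other providers expose the same idea under different names.
import json

# Data-plane: the benign surface prompt that safety filters typically inspect.
prompt = "Classify the sentiment of this review: 'The battery life is great.'"

# Control-plane: a JSON Schema whose enum restricts the tokens the model may
# emit for the "sentiment" field during constrained decoding.
schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]}
    },
    "required": ["sentiment"],
    "additionalProperties": False,
}

request_body = {
    "model": "gpt-4o",  # any schema-capable model
    "messages": [{"role": "user", "content": prompt}],
    "response_format": {  # provider-specific field name
        "type": "json_schema",
        "json_schema": {"name": "sentiment_label", "schema": schema, "strict": True},
    },
}

print(json.dumps(request_body, indent=2))
```

The attack surface the paper identifies is precisely this split: safety mechanisms scrutinize the `messages` field, while the schema field also steers every generated token.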