Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking

Link: http://arxiv.org/abs/2502.13527v1

PDF Link: http://arxiv.org/pdf/2502.13527v1

Summary: The rise of Large Language Models (LLMs) has led to significant applications but also introduced serious security threats, particularly from jailbreak attacks that manipulate output generation.

These attacks utilize prompt engineering and logit manipulation to steer models toward harmful content, prompting LLM providers to implement filtering and safety alignment strategies.

We investigate LLMs' safety mechanisms and their recent applications, revealing a new threat model targeting structured output interfaces, which enable attackers to manipulate the inner logit during LLM generation, requiring only API access permissions.

To demonstrate this threat model, we introduce a black-box attack framework called AttackPrefixTree (APT).

APT exploits structured output interfaces to dynamically construct attack patterns.

By leveraging prefixes of models' safety refusal response and latent harmful outputs, APT effectively bypasses safety measures.

Experiments on benchmark datasets indicate that this approach achieves a higher attack success rate than existing methods.

This work highlights the urgent need for LLM providers to enhance security protocols to address vulnerabilities arising from the interaction between safety patterns and structured outputs.

Published on arXiv on: 2025-02-19T08:29:36Z