CurvaLID: Geometrically-guided Adversarial Prompt Detection

Link: http://arxiv.org/abs/2503.03502v1

PDF Link: http://arxiv.org/pdf/2503.03502v1

Summary: Adversarial prompts capable of jailbreaking large language models (LLMs) and inducing undesirable behaviours pose a significant obstacle to their safe deployment.

Current mitigation strategies rely on activating built-in defence mechanisms or fine-tuning the LLMs, but the fundamental distinctions between adversarial and benign prompts are yet to be understood.

In this work, we introduce CurvaLID, a novel defense framework that efficiently detects adversarial prompts by leveraging their geometric properties.

It is agnostic to the type of LLM, offering a unified detection framework across diverse adversarial prompts and LLM architectures.

CurvaLID builds on the geometric analysis of text prompts to uncover their underlying differences.

We theoretically extend the concept of curvature via the Whewell equation into an $n$-dimensional word embedding space, enabling us to quantify local geometric properties, including semantic shifts and curvature in the underlying manifolds.
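
To make the curvature idea concrete: the Whewell equation describes a curve through its tangential angle $\varphi$ as a function of arc length $s$, and curvature is the rate $d\varphi/ds$. The sketch below is one possible discrete analogue for a prompt treated as a sequence of word embeddings. It is an illustrative assumption, not necessarily the definition used in the paper; the function name `turning_curvature` and the choice of averaging adjacent step lengths to approximate $ds$ are hypothetical.

```python
import numpy as np

def turning_curvature(embeddings):
    """Discrete curvature along a sequence of word embeddings (one row per word).

    Illustrative assumption: curvature ~ turning angle between consecutive
    displacement vectors, divided by the local arc length.
    """
    pts = np.asarray(embeddings, dtype=float)
    steps = np.diff(pts, axis=0)             # displacement between consecutive words
    lengths = np.linalg.norm(steps, axis=1)  # arc-length increments
    # Cosine of the angle between each pair of consecutive displacements.
    cos = np.sum(steps[:-1] * steps[1:], axis=1) / (lengths[:-1] * lengths[1:])
    angles = np.arccos(np.clip(cos, -1.0, 1.0))  # change in tangential angle
    ds = 0.5 * (lengths[:-1] + lengths[1:])      # local arc length (assumed averaging)
    return angles / ds

# Toy 2-D "embeddings": a straight path has zero curvature, a right-angle turn does not.
straight = turning_curvature([[0, 0], [1, 0], [2, 0]])
corner = turning_curvature([[0, 0], [1, 0], [1, 1]])
```

The function returns one value per interior word, so a prompt of $m$ words yields $m - 2$ curvature values. Repeated consecutive embeddings give zero-length steps and would need to be filtered out before calling it.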

Additionally, we employ Local Intrinsic Dimensionality (LID) to capture geometric features of text prompts within adversarial subspaces.
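
As a rough illustration of what an LID value is, the sketch below implements the widely used maximum-likelihood estimator based on the distances from a point to its $k$ nearest neighbours. How the paper computes LID over prompts is not stated here, so the function name `lid_mle`, the Euclidean metric, and the choice of reference set are assumptions made for the example.

```python
import numpy as np

def lid_mle(query, reference, k=5):
    """Maximum-likelihood LID estimate of `query` relative to `reference` points.

    Assumes Euclidean distance and at least two distinct neighbour distances.
    """
    query = np.asarray(query, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Distances from the query to every reference point.
    dists = np.linalg.norm(reference - query, axis=1)
    # Keep the k smallest strictly positive distances (drops the query itself if present).
    dists = np.sort(dists[dists > 0])[:k]
    # Estimator: -1 / mean(log(r_i / r_k)), where r_k is the largest kept distance.
    return -1.0 / np.mean(np.log(dists / dists[-1]))

# Toy check: neighbours at distances 1, 2 and 4 give an estimate of 1 / ln(2).
estimate = lid_mle([0.0], [[1.0], [2.0], [4.0]], k=3)
```

Larger values indicate that neighbour distances grow slowly relative to the farthest neighbour, which is how a higher local dimensionality shows up in this estimator. If all kept distances are equal the mean log-ratio is zero and the estimate is undefined.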

Our findings reveal that adversarial prompts differ fundamentally from benign prompts in terms of their geometric characteristics.

Our results demonstrate that CurvaLID delivers superior detection and rejection of adversarial queries, paving the way for safer LLM deployment.

The source code can be found at https://github.com/Cancanxxx/CurvaLID

Published on arXiv on: 2025-03-05T13:47:53Z