Skip to content
arxiv papers 1 min read

Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs

Link: http://arxiv.org/abs/2502.19041v1

PDF Link: http://arxiv.org/pdf/2502.19041v1

Summary: Although Aligned Large Language Models (LLMs) are trained to refuse harmfulrequests, they remain vulnerable to jailbreak attacks.

Unfortunately, existingmethods often focus on surface-level patterns, overlooking the deeper attackessences.

As a result, defenses fail when attack prompts change, even thoughthe underlying "attack essence" remains the same.

To address this issue, weintroduce EDDF, an \textbf{E}ssence-\textbf{D}riven \textbf{D}efense\textbf{F}ramework Against Jailbreak Attacks in LLMs.

EDDF is a plug-and-playinput-filtering method and operates in two stages: 1) offline essence databaseconstruction, and 2) online adversarial query detection.

The key idea behindEDDF is to extract the "attack essence" from a diverse set of known attackinstances and store it in an offline vector database.

Experimental resultsdemonstrate that EDDF significantly outperforms existing methods by reducingthe Attack Success Rate by at least 20\%, underscoring its superior robustnessagainst jailbreak attacks.

Published on arXiv on: 2025-02-26T10:53:58Z