Link: http://arxiv.org/abs/2505.14316v1
PDF Link: http://arxiv.org/pdf/2505.14316v1
Summary: Although large language models (LLMs) have achieved remarkable advancements, their security remains a pressing concern.
One major threat is jailbreak attacks, where adversarial prompts bypass model safeguards to generate harmful or objectionable content.
Researchers study jailbreak attacks to understand the security and robustness of LLMs.
However, existing jailbreak attack methods face two main challenges: (1) an excessive number of iterative queries, and (2) poor generalization across models.
In addition, recent jailbreak evaluation datasets focus primarily on question-answering scenarios and pay little attention to text-generation tasks that require accurate regeneration of toxic content.
To tackle these challenges, we propose two contributions: (1) ICE, a novel black-box jailbreak method that employs Intent Concealment and divErsion to effectively circumvent security constraints.
ICE achieves high attack success rates (ASR) with a single query, significantly improving efficiency and transferability across different models.
(2) BiSceneEval, a comprehensive dataset designed for assessing LLM robustness in question-answering and text-generation tasks.
Experimental results demonstrate that ICE outperforms existing jailbreak techniques, revealing critical vulnerabilities in current defense mechanisms.
Our findings underscore the necessity of a hybrid security strategy that integrates predefined security mechanisms with real-time semantic decomposition to enhance the security of LLMs.
Published on arXiv on: 2025-05-20T13:03:15Z