arxiv papers 1 min read

CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models

Link: http://arxiv.org/abs/2502.11379v1

PDF Link: http://arxiv.org/pdf/2502.11379v1

Summary: Despite explicit alignment efforts for large language models (LLMs), they can still be exploited to trigger unintended behaviors, a phenomenon known as "jailbreaking."

Current jailbreak attack methods mainly focus on discrete prompt manipulations targeting closed-source LLMs, relying on manually crafted prompt templates and persuasion rules.

However, as the capabilities of open-source LLMs improve, ensuring their safety becomes increasingly crucial.

In such an environment, potential attackers' access to model parameters and gradient information exacerbates the severity of jailbreak threats.

To address this research gap, we propose a novel Context-Coherent Jailbreak Attack (CCJA).

We define jailbreak attacks as an optimization problem within the embedding space of masked language models.
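The abstract does not spell out the objective, but a hedged reading of this formulation, with $z$ denoting a point in the masked language model's embedding space and $\lambda$ an assumed trade-off weight (both names are illustrative, not the paper's notation), is:

$\max_{z} \; S_{\mathrm{attack}}(z) + \lambda \, S_{\mathrm{coherence}}(z)$

where $S_{\mathrm{attack}}$ would measure jailbreak success on the target LLM and $S_{\mathrm{coherence}}$ would measure how fluent the decoded prompt remains.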

Through combinatorial optimization, we effectively balance the jailbreak attack success rate with semantic coherence.
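As a rough illustration of this balancing idea (not the authors' implementation), the sketch below scores candidate prompt rewrites by combining a placeholder attack score with a masked-LM pseudo-log-likelihood as a coherence proxy. The model name, the weight `lam`, and the `attack_score` stub are assumptions made purely for this example.

```python
# Minimal, illustrative sketch (NOT the paper's method): rank candidate prompts
# by a weighted sum of a placeholder "attack" score and semantic coherence,
# where coherence is approximated by a masked-LM pseudo-log-likelihood.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model choice
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def coherence_score(text: str) -> float:
    """Pseudo-log-likelihood: mask each token in turn and average the
    log-probability the masked LM assigns to the original token."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    total, count = 0.0, 0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
        count += 1
    return total / max(count, 1)

def attack_score(text: str) -> float:
    """Placeholder: in the paper this would reflect jailbreak success on the
    target LLM; here it is a dummy constant so the sketch runs standalone."""
    return 0.0

def combined_score(text: str, lam: float = 0.5) -> float:
    # Assumed weighting between attack effectiveness and semantic coherence.
    return attack_score(text) + lam * coherence_score(text)

candidates = [
    "Please explain how the safety filter decides what to refuse.",
    "explain filter safety refuse decides what the how Please.",
]
best = max(candidates, key=combined_score)
print(best)  # the fluent candidate should win on the coherence term
```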

Extensive evaluations show that our method not only maintains semantic consistency but also surpasses state-of-the-art baselines in attack effectiveness.

Additionally, by integrating semantically coherent jailbreak prompts generated by our method into widely used black-box methodologies, we observe a notable enhancement in their success rates when targeting closed-source commercial LLMs.

This highlights the security threat posed by open-source LLMs to their commercial counterparts.

We will open-source our code if the paper is accepted.

Published on arXiv on: 2025-02-17T02:49:26Z