
DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak

Link: http://arxiv.org/abs/2412.17522v1

PDF Link: http://arxiv.org/pdf/2412.17522v1

Summary: Large Language Models (LLMs) are susceptible to generating harmful content when prompted with carefully crafted inputs, a vulnerability known as LLM jailbreaking.

As LLMs become more powerful, studying jailbreak methods is critical to enhancing security and aligning models with human values.

Traditionally, jailbreak techniques have relied on suffix addition or prompt templates, but these methods suffer from limited attack diversity.

This paper introduces DiffusionAttacker, an end-to-end generative approach for jailbreak rewriting inspired by diffusion models.

Our method employs a sequence-to-sequence (seq2seq) text diffusion model as a generator, conditioning on the original prompt and guiding the denoising process with a novel attack loss.
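As a rough illustration (not the authors' released code), the sketch below shows how an attack loss can steer each reverse-diffusion step of a conditional generator: the gradient of the loss with respect to the noisy latent nudges the denoised prediction. ToyDenoiser, toy_attack_loss, and guidance_scale are hypothetical placeholders for the paper's seq2seq diffusion generator and attack objective.

```python
# Minimal sketch of loss-guided conditional denoising (illustrative only).
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Placeholder seq2seq denoiser: predicts a cleaner latent from (x_t, cond, t)."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2 + 1, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x_t, cond, t):
        t_feat = torch.full_like(x_t[..., :1], float(t))      # crude timestep feature
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

def toy_attack_loss(latent):
    """Stand-in for the attack loss: in the paper it scores how likely the
    victim model is to comply; here it is just a dummy differentiable scalar."""
    return latent.pow(2).mean()

def guided_denoise(denoiser, cond, steps=10, guidance_scale=0.5, dim=32):
    """Run reverse diffusion, nudging each step with the attack-loss gradient."""
    x_t = torch.randn(1, dim)                                  # start from noise
    for t in reversed(range(steps)):
        x_t = x_t.detach().requires_grad_(True)
        x0_pred = denoiser(x_t, cond, t)                       # conditional denoising step
        loss = toy_attack_loss(x0_pred)                        # attack objective on the prediction
        grad = torch.autograd.grad(loss, x_t)[0]               # gradient w.r.t. the noisy latent
        x_t = (x0_pred - guidance_scale * grad).detach()       # guided update
    return x_t

if __name__ == "__main__":
    denoiser = ToyDenoiser()
    cond = torch.randn(1, 32)                                  # encoding of the original prompt
    print(guided_denoise(denoiser, cond).shape)
```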

Unlike previous approaches that use autoregressive LLMs to generate jailbreak prompts, which limit the modification of already generated tokens and restrict the rewriting space, DiffusionAttacker utilizes a seq2seq diffusion model, allowing more flexible token modifications.

This approach preserves the semantic content of the original prompt while still inducing the target model to produce harmful content.

Additionally, we leverage the Gumbel-Softmax technique to make the sampling process from the diffusion model's output distribution differentiable, eliminating the need for iterative token search.
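For intuition, here is a minimal PyTorch sketch (independent of the paper's code) of how Gumbel-Softmax keeps a gradient path through token sampling; the vocabulary size, embedding table, and dummy loss are illustrative assumptions, not details from the paper.

```python
# Differentiable token sampling via Gumbel-Softmax (illustrative only).
import torch
import torch.nn.functional as F

vocab_size, seq_len, embed_dim = 1000, 8, 32
embedding = torch.nn.Embedding(vocab_size, embed_dim)

# Pretend these are the per-position token logits produced by the diffusion decoder.
logits = torch.randn(1, seq_len, vocab_size, requires_grad=True)

# hard=True gives one-hot samples in the forward pass but uses the soft
# distribution for the backward pass (straight-through estimator).
one_hot = F.gumbel_softmax(logits, tau=0.7, hard=True, dim=-1)

# Differentiable "embedding lookup": multiply one-hot samples by the embedding table.
token_embeds = one_hot @ embedding.weight          # (1, seq_len, embed_dim)

# Any downstream attack loss on token_embeds now has a gradient path back to logits.
dummy_loss = token_embeds.mean()
dummy_loss.backward()
print(logits.grad.shape)
```

Because the forward pass yields (approximately) discrete tokens while the backward pass uses the relaxed distribution, the attack objective can backpropagate into the generator directly, which is what removes the need for iterative discrete token search.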

Extensive experiments on AdvBench and HarmBench demonstrate that DiffusionAttacker outperforms previous methods across various evaluation metrics, including attack success rate (ASR), fluency, and diversity.

Published on arXiv on: 2024-12-23T12:44:54Z