Link: http://arxiv.org/abs/2507.11097v1
PDF Link: http://arxiv.org/pdf/2507.11097v1
Summary: Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling.
However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities.
To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs.
Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding.
Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when those outputs are harmful, while parallel decoding limits the model's ability to dynamically filter or apply rejection sampling to unsafe content. As a result, standard alignment mechanisms fail, enabling harmful completions in alignment-tuned dLLMs even when harmful behaviors or unsafe instructions are directly exposed in the prompt.
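To make this concrete, the sketch below illustrates the general shape of an interleaved mask-text prompt for a diffusion LLM, using a deliberately benign instruction. The mask token string, template layout, and helper function are illustrative assumptions for this summary, not the actual prompts or code released with DIJA.

# Minimal, benign sketch of an interleaved mask-text prompt for a dLLM.
# The mask token, template, and helper below are assumptions for illustration;
# they are not the templates used by DIJA.

MASK = "[MASK]"  # placeholder; real dLLMs may use a different special token

def build_interleaved_prompt(instruction: str, num_steps: int = 3,
                             masks_per_step: int = 8) -> str:
    """Interleave fixed text (step headers) with masked spans that the
    diffusion LLM infills in parallel, conditioned on the surrounding text."""
    parts = [instruction, "\n"]
    for step in range(1, num_steps + 1):
        parts.append(f"Step {step}: " + " ".join([MASK] * masks_per_step) + "\n")
    return "".join(parts)

# Benign illustration: the model sees the fixed scaffold and fills every masked
# span so the completion stays contextually consistent with the surrounding text.
print(build_interleaved_prompt("Describe how to brew a cup of green tea."))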
Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures.
Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt.
Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models.
Code is available at https://github.com/ZichenWen1/DIJA.
Published on arXiv on: 2025-07-15T08:44:46Z