Link: http://arxiv.org/abs/2508.20038v1
PDF Link: http://arxiv.org/pdf/2508.20038v1
Summary: Despite advances in training large language models (LLMs) to refuse malicious instructions, widely used LLMs remain vulnerable to jailbreak attacks in which attackers generate instructions whose distributions differ from those of safety alignment corpora.
New attacks expose LLMs' inability to recognize unseen malicious instructions, highlighting a critical distributional mismatch between training data and real-world attacks that forces developers into reactive patching cycles.
To tackle this challenge, we propose IMAGINE, a synthesis framework that leverages embedding space distribution analysis to generate jailbreak-like instructions.
This approach effectively fills the distributional gap between authentic jailbreak patterns and safety alignment corpora.
IMAGINE follows an iterative optimization process that dynamically evolves text generation distributions across iterations, thereby augmenting the coverage of safety alignment data distributions through synthesized data examples.
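The gap-filling idea can be sketched in toy form: embed both corpora, measure the distributional gap in embedding space, and iteratively add synthetic points that shrink it. Everything below is an illustrative assumption (random vectors stand in for text embeddings, a centroid distance stands in for the gap metric, and greedy selection stands in for the paper's generation loop); it is not the paper's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings: in practice these would come from a text encoder
# applied to real instructions; dimensions and distributions are assumptions.
aligned = rng.normal(0.0, 1.0, size=(200, 8))    # safety-alignment corpus
jailbreak = rng.normal(2.0, 1.0, size=(50, 8))   # authentic jailbreak samples

def gap(synth):
    """Distance between the jailbreak centroid and the centroid of the
    alignment corpus augmented with the synthesized examples."""
    augmented = np.vstack([aligned, synth])
    return float(np.linalg.norm(jailbreak.mean(axis=0) - augmented.mean(axis=0)))

synth = np.empty((0, 8))
gap_before = gap(synth)
for _ in range(5):
    # Propose candidates near the jailbreak distribution and greedily keep
    # the one that most shrinks the gap -- a toy stand-in for iteratively
    # evolving the text-generation distribution.
    candidates = jailbreak.mean(axis=0) + rng.normal(0.0, 1.0, size=(100, 8))
    scores = [gap(np.vstack([synth, c[None, :]])) for c in candidates]
    synth = np.vstack([synth, candidates[int(np.argmin(scores))][None, :]])
gap_after = gap(synth)
print(f"gap before: {gap_before:.3f}, after: {gap_after:.3f}")
```

Each added point pulls the augmented centroid toward the jailbreak region, so the measured gap shrinks across iterations; safety training on the augmented corpus would then see examples closer to real attack distributions.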
Based on the safety-aligned corpus enhanced through IMAGINE, our framework demonstrates significant decreases in attack success rate on Qwen2.5, Llama3.1, and Llama3.2 without compromising their utility.
Published on arXiv on: 2025-08-27T16:44:03Z