
CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations

Link: http://arxiv.org/abs/2507.06043v1

PDF Link: http://arxiv.org/pdf/2507.06043v1

Summary: Security alignment enables the Large Language Model (LLM) to gain protection against malicious queries, but various jailbreak attack methods reveal the vulnerability of this security mechanism.

Previous studies have treated LLM jailbreak attacks and defenses in isolation.

We analyze the security protection mechanism of the LLM and propose a framework that combines attack and defense.

Our method builds on the linearly separable property of LLM intermediate-layer embeddings and on the essence of jailbreak attacks, which is to embed harmful queries and transfer them into the safe region.
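
To make the linear-separability claim concrete, here is a minimal sketch (not the authors' code) of a linear probe fit on intermediate-layer embeddings to separate harmful from benign queries. The embeddings are synthetic stand-ins, and the hidden size, layer choice, and data are assumptions purely for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical setup: harmful_emb and benign_emb would be intermediate-layer
# hidden states (one vector per query) extracted from an LLM at some layer.
# Synthetic clusters stand in for them so the sketch runs on its own.
hidden_dim = 4096
harmful_emb = torch.randn(256, hidden_dim) + 1.0   # pretend "harmful" cluster
benign_emb = torch.randn(256, hidden_dim) - 1.0    # pretend "benign" cluster

X = torch.cat([harmful_emb, benign_emb])
y = torch.cat([torch.ones(256), torch.zeros(256)])

# A single linear layer acts as the probe: if it separates the two classes
# well, the embeddings are (approximately) linearly separable at this layer.
probe = nn.Linear(hidden_dim, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    logits = probe(X).squeeze(-1)
    loss = loss_fn(logits, y)
    loss.backward()
    opt.step()

acc = ((probe(X).squeeze(-1) > 0).float() == y).float().mean()
print(f"probe accuracy: {acc:.3f}")
```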

We use a generative adversarial network (GAN) to learn the security judgment boundary inside the LLM, enabling both efficient jailbreak attacks and defense.
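
A rough sketch of this GAN idea follows, under the assumption that a discriminator approximates the model's internal safety boundary over embeddings and a generator learns a perturbation that moves a harmful embedding across it; the architectures, shapes, and training loop are illustrative guesses, not the paper's implementation.

```python
import torch
import torch.nn as nn

hidden_dim = 4096

# Discriminator: approximates the LLM's internal safety judgment boundary,
# scoring how "safe" an intermediate-layer embedding looks.
disc = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 1))

# Generator: produces an additive perturbation meant to push a harmful
# embedding to the safe side of the boundary (attack); the same learned
# boundary can be used in reverse to detect such shifts (defense).
gen = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, hidden_dim))

d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
g_opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

# Synthetic stand-ins for embeddings of safe and harmful queries.
safe_emb = torch.randn(128, hidden_dim) - 1.0
harmful_emb = torch.randn(128, hidden_dim) + 1.0

for step in range(200):
    # 1) Train the discriminator to separate safe vs. harmful embeddings.
    d_opt.zero_grad()
    d_loss = bce(disc(safe_emb), torch.ones(128, 1)) + \
             bce(disc(harmful_emb), torch.zeros(128, 1))
    d_loss.backward()
    d_opt.step()

    # 2) Train the generator so perturbed harmful embeddings are judged "safe".
    g_opt.zero_grad()
    perturbed = harmful_emb + gen(harmful_emb)
    g_loss = bce(disc(perturbed), torch.ones(128, 1))
    g_loss.backward()
    g_opt.step()
```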

The experimental results indicate that our method achieves an average jailbreak success rate of 88.85% across three popular LLMs, while the defense success rate on the state-of-the-art jailbreak dataset reaches an average of 84.17%.

This not only validates the effectiveness of our approach but also sheds light on the internal security mechanisms of LLMs, offering new insights for enhancing model security. The code and data are available at https://github.com/NLPGM/CAVGAN.

Published on arXiv on: 2025-07-08T14:45:21Z