Link: http://arxiv.org/abs/2504.21700v1
PDF Link: http://arxiv.org/pdf/2504.21700v1
Summary: Large Language Models are fundamental actors in the modern IT landscape dominated by AI solutions.
However, security threats associated with them might prevent their reliable adoption in critical application scenarios such as government organizations and medical institutions.
For this reason, commercial LLMs typically undergo a sophisticated censoring mechanism to eliminate any harmful output they could possibly produce.
In response to this, LLM Jailbreaking is a significant threat to such protections, and many previous approaches have already demonstrated its effectiveness across diverse domains.
Existing jailbreak proposals mostly adopt a generate-and-test strategy to craft malicious input.
To improve the comprehension of censoring mechanisms and design a targeted jailbreak attack, we propose an Explainable-AI solution that comparatively analyzes the behavior of censored and uncensored models to derive unique exploitable alignment patterns.
Then, we propose XBreaking, a novel jailbreak attack that exploits these unique patterns to break the security constraints of LLMs by targeted noise injection.
Our thorough experimental campaign returns important insights about the censoring mechanisms and demonstrates the effectiveness and performance of our attack.
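To make the idea of targeted noise injection concrete, below is a minimal sketch, not the paper's implementation: it assumes the comparative censored-vs-uncensored analysis has already flagged a small set of transformer layers, and it perturbs only those layers' weights with Gaussian noise. The model name, layer indices, and noise scale are illustrative assumptions.

```python
# Hypothetical sketch of targeted noise injection into selected transformer layers.
# Assumptions: the layer indices were identified by a prior comparative (censored
# vs. uncensored) analysis; model name, indices, and noise scale are placeholders.
import torch
from transformers import AutoModelForCausalLM

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder aligned model
target_layers = [10, 11, 12]   # assumed layer indices from the XAI comparison
noise_scale = 0.01             # assumed perturbation magnitude

model = AutoModelForCausalLM.from_pretrained(model_name)

with torch.no_grad():
    for idx in target_layers:
        # Layer access path is architecture-specific (Llama-style shown here).
        layer = model.model.layers[idx]
        for param in layer.parameters():
            # Perturb only the selected layers; the rest of the model stays intact.
            param.add_(noise_scale * torch.randn_like(param))
```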
Published on arXiv on: 2025-04-30T14:44:24Z