Link: http://arxiv.org/abs/2504.04976v1
PDF Link: http://arxiv.org/pdf/2504.04976v1
Summary: The study of large language models (LLMs) is a key area in open-world machine learning.
Although LLMs demonstrate remarkable natural language processing capabilities, they also face several challenges, including consistency issues, hallucinations, and jailbreak vulnerabilities.
Jailbreaking refers to the crafting of prompts that bypass alignment safeguards, leading to unsafe outputs that compromise the integrity of LLMs.
This work focuses specifically on the challenge of jailbreak vulnerabilities and introduces a novel taxonomy of jailbreak attacks grounded in the training domains of LLMs.
It characterizes alignment failures in terms of generalization, objective, and robustness gaps.
Our primary contribution is a perspective on jailbreaking framed through the different linguistic domains that emerge during LLM training and alignment.
This viewpoint highlights the limitations of existing approaches and enables us to classify jailbreak attacks on the basis of the underlying model deficiencies they exploit.
Unlike conventional classifications that categorize attacks based on prompt construction methods (e.g., prompt templating), our approach provides a deeper understanding of LLM behavior.
We introduce a taxonomy with four categories (mismatched generalization, competing objectives, adversarial robustness, and mixed attacks), offering insights into the fundamental nature of jailbreak vulnerabilities.
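As a rough illustration (not taken from the paper itself), the four-category taxonomy could be encoded as a simple data structure for labeling attacks by the model deficiency they exploit rather than by prompt construction; the class names, field names, and example attack below are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum, auto


class JailbreakCategory(Enum):
    """The four taxonomy categories described in the abstract."""
    MISMATCHED_GENERALIZATION = auto()  # exploits gaps between pretraining and alignment domains
    COMPETING_OBJECTIVES = auto()       # pits helpfulness objectives against safety objectives
    ADVERSARIAL_ROBUSTNESS = auto()     # perturbs inputs to evade safety behavior
    MIXED = auto()                      # combines several of the above deficiencies


@dataclass
class JailbreakAttack:
    """A labeled jailbreak attack instance (hypothetical record layout)."""
    name: str
    prompt_excerpt: str
    category: JailbreakCategory


# Hypothetical usage: label an attack by the underlying deficiency it exploits,
# not by how its prompt is constructed (e.g., templating).
attack = JailbreakAttack(
    name="low-resource-language transfer",
    prompt_excerpt="<unsafe request rephrased into a low-resource language>",
    category=JailbreakCategory.MISMATCHED_GENERALIZATION,
)
print(attack.category.name)
```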
Finally, we present key lessons derived from this taxonomic study.
Published on arXiv on: 2025-04-07T12:05:16Z