Link: http://arxiv.org/abs/2503.08990v1
PDF Link: http://arxiv.org/pdf/2503.08990v1
Summary: Large language models (LLMs) have shown great promise as languageunderstanding and decision making tools, and they have permeated variousaspects of our everyday life.
However, their widespread availability also comeswith novel risks, such as generating harmful, unethical, or offensive content,via an attack called jailbreaking.
Despite extensive efforts from LLMdevelopers to align LLMs using human feedback, they are still susceptible tojailbreak attacks.
To tackle this issue, researchers often employ red-teamingto understand and investigate jailbreak prompts.
However, existing red-teamingapproaches lack effectiveness, scalability, or both.
To address these issues,we propose JBFuzz, a novel effective, automated, and scalable red-teamingtechnique for jailbreaking LLMs.
JBFuzz is inspired by the success of fuzzing for detectingbugs/vulnerabilities in software.
We overcome three challenges related toeffectiveness and scalability by devising novel seed prompts, a lightweightmutation engine, and a lightweight and accurate evaluator for guiding thefuzzer.
Assimilating all three solutions results in a potent fuzzer that onlyrequires black-box access to the target LLM.
We perform extensive experimentalevaluation of JBFuzz using nine popular and widely-used LLMs.
We find thatJBFuzz successfully jailbreaks all LLMs for various harmful/unethicalquestions, with an average attack success rate of 99%.
We also find that JBFuzzis extremely efficient as it jailbreaks a given LLM for a given question in 60seconds on average.
Our work highlights the susceptibility of thestate-of-the-art LLMs to jailbreak attacks even after safety alignment, andserves as a valuable red-teaming tool for LLM developers.
Published on arXiv on: 2025-03-12T01:52:17Z