Link: http://arxiv.org/abs/2412.03556v1
PDF Link: http://arxiv.org/pdf/2412.03556v1
Summary: We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities.
BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited.
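The sampling loop can be sketched as follows (a minimal illustration, not the paper's implementation; query_model, is_harmful, and the specific augmentation functions are assumed placeholders):

    import random

    def shuffle_chars(text: str) -> str:
        # Example augmentation: randomly shuffle the characters within each word.
        words = []
        for w in text.split():
            chars = list(w)
            random.shuffle(chars)
            words.append("".join(chars))
        return " ".join(words)

    def random_capitalize(text: str) -> str:
        # Example augmentation: randomly flip the case of individual characters.
        return "".join(c.upper() if random.random() < 0.5 else c.lower() for c in text)

    def bon_jailbreak(prompt, query_model, is_harmful, n_samples=10_000):
        # Repeatedly sample augmented variants of the prompt until the model's
        # response is judged harmful or the sampling budget is exhausted.
        # query_model and is_harmful are caller-supplied callables (assumptions).
        for _ in range(n_samples):
            augmented = random_capitalize(shuffle_chars(prompt))
            response = query_model(augmented)
            if is_harmful(response):
                return augmented, response
        return None, None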
We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts.
Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers.
BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations.
BoN reliably improves when we sample more augmented prompts.
Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude.
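As an illustration of how such scaling can be checked, one can fit a straight line in log-log space; the functional form and the ASR numbers below are assumptions for demonstration, not results from the paper:

    import numpy as np

    # Hypothetical (N, ASR) measurements; real values would come from running the attack.
    n_samples = np.array([10, 100, 1000, 10000])
    asr = np.array([0.12, 0.31, 0.58, 0.89])

    # Assumed functional form: -log(ASR) ~ a * N^(-b), which is a straight line
    # in log-log space, so fit log(-log(ASR)) against log(N).
    slope, intercept = np.polyfit(np.log(n_samples), np.log(-np.log(asr)), 1)
    a, b = np.exp(intercept), -slope
    print(f"fitted power law: -log(ASR) ~ {a:.2f} * N^(-{b:.2f})")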
BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks - combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR.
Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.
Published on arXiv on: 2024-12-04T18:51:32Z