
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling

Link: http://arxiv.org/abs/2502.01925v1

PDF Link: http://arxiv.org/pdf/2502.01925v1

Summary: Many-shot jailbreaking circumvents the safety alignment of large language models by exploiting their ability to process long input sequences.

To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational turns between the user and the model.

These fabricated exchanges are randomly sampled from a pool of malicious questions and responses, making it appear as though the model has already complied with harmful instructions.
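In other words, the attack is essentially a prompt-assembly step. Below is a minimal structural sketch, assuming a chat-style message format; the function name, pool contents, and placeholder strings are illustrative and are not the authors' code:

```python
import random

def build_many_shot_prefix(pool, target_prompt, num_shots=256, seed=0):
    """Assemble a many-shot prompt: fabricated user/assistant turns sampled
    from a pool of (question, response) pairs, followed by the actual target
    prompt as the final user turn."""
    rng = random.Random(seed)
    shots = rng.choices(pool, k=num_shots)  # random sampling with replacement
    messages = []
    for question, response in shots:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": response})
    messages.append({"role": "user", "content": target_prompt})
    return messages

# Illustrative usage with benign placeholder strings.
demo_pool = [("placeholder question", "placeholder compliant response")]
prefix = build_many_shot_prefix(demo_pool, "placeholder target prompt", num_shots=4)
```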

In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with positive affirmations, negative demonstrations, and an optimized adaptive sampling method tailored to the target prompt's topic.
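The abstract only names these three ingredients, so the following is a loose structural sketch rather than the paper's algorithm: topic-weighted sampling stands in for the adaptive sampling method, and the affirmation text, negative-demonstration turns, and insertion schedule are invented placeholders.

```python
import random

def build_pandas_style_prefix(pool_by_topic, topic_weights, target_prompt,
                              affirmation, negative_demo, num_shots=256, seed=0):
    """Hedged sketch of the PANDAS ideas described above: fabricated turns are
    drawn with topic-weighted (adaptive) sampling, some responses carry a
    positive affirmation, and a negative demonstration is inserted before the
    target prompt. Placement and wording here are assumptions, not the paper's."""
    rng = random.Random(seed)
    topics = list(pool_by_topic)
    weights = [topic_weights.get(t, 1.0) for t in topics]
    messages = []
    for i in range(num_shots):
        topic = rng.choices(topics, weights=weights, k=1)[0]
        question, response = rng.choice(pool_by_topic[topic])
        if i % 8 == 0:  # periodically reinforce compliance (assumed schedule)
            response = f"{response} {affirmation}"
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": response})
    messages.extend(negative_demo)  # e.g. a refusal turn followed by a correction
    messages.append({"role": "user", "content": target_prompt})
    return messages
```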

Extensive experiments on AdvBench and HarmBench, using state-of-the-art LLMs, demonstrate that PANDAS significantly outperforms baseline methods in long-context scenarios.

Through an attention analysis, we provide insights on how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.

Published on arXiv: 2025-02-04T01:51:31Z