Link: http://arxiv.org/abs/2501.07959v1
PDF Link: http://arxiv.org/pdf/2501.07959v1
Summary: Several recent works have studied jailbreaking Large Language Models (LLMs) with few-shot malicious demos. In particular, Zheng et al. (2024) focuses on improving the efficiency of Few-Shot Jailbreaking (FSJ) by injecting special tokens into the demos and employing demo-level random search.
Nevertheless, this method lacks generality since it specifies the instruction-response structure. Moreover, the reason why inserting special tokens is effective in inducing harmful behaviors is discussed only empirically.
In this paper, we take a deeper look into the mechanism of special token injection and propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ), facilitated by demo-level greedy search. This framework decomposes the FSJ attack into pattern learning and behavior learning to exploit the model's vulnerabilities in a more generalized and efficient way. We conduct extensive experiments to evaluate our method on common open-source models and compare it with baseline algorithms.
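As a rough illustration of the search component, here is a minimal Python sketch of a generic demo-level greedy search, assuming a hypothetical candidate pool of demos and a hypothetical scoring function score(demos) standing in for the attack objective (neither is specified in this abstract, so treat the names and structure as assumptions, not the paper's actual implementation):

    # Minimal sketch of demo-level greedy search (hypothetical names;
    # the paper's real objective and demo pool are not given here).
    def greedy_demo_search(candidates, k, score):
        # Build a k-demo prompt one slot at a time, at each step
        # keeping the candidate that most improves the objective.
        selected = []
        remaining = list(candidates)
        for _ in range(k):
            best = max(remaining, key=lambda d: score(selected + [d]))
            selected.append(best)
            remaining.remove(best)
        return selected

Greedy selection evaluates each remaining candidate per slot, unlike demo-level random search, which samples replacements at random; the trade-off is more score evaluations per step in exchange for a more directed search.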
Our code is available at https://github.com/iphosi/Self-Instruct-FSJ.
Published on arXiv on: 2025-01-14T09:23:30Z