Link: http://arxiv.org/abs/2501.07959v1
PDF Link: http://arxiv.org/pdf/2501.07959v1
Summary: Several recent works have studied jailbreaking Large Language Models (LLMs) with few-shot malicious demos. In particular, Zheng et al. (2024) focuses on improving the efficiency of Few-Shot Jailbreaking (FSJ) by injecting special tokens into the demos and employing demo-level random search.
Nevertheless, this method lacks generality since it specifies the instruction-response structure. Moreover, the reason why inserting special tokens is effective in inducing harmful behaviors is discussed only empirically.
In this paper, we take a deeper look into the mechanism of special token injection and propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ), facilitated by demo-level greedy search. This framework decomposes the FSJ attack into pattern learning and behavior learning to exploit the model's vulnerabilities in a more generalized and efficient way. We conduct extensive experiments to evaluate our method on common open-source models and compare it with baseline algorithms.
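As a rough illustration of the search component, here is a minimal Python sketch of a generic demo-level greedy search, assuming a hypothetical candidate pool of demos and a hypothetical scoring function score(demos) standing in for the attack objective (neither is specified in this abstract, so treat the names and structure as assumptions, not the paper's actual implementation):

    # Minimal sketch of demo-level greedy search (hypothetical names;
    # the paper's real objective and demo pool are not given here).
    def greedy_demo_search(candidates, k, score):
        # Build a k-demo prompt one slot at a time, at each step
        # keeping the candidate that most improves the objective.
        selected = []
        remaining = list(candidates)
        for _ in range(k):
            best = max(remaining, key=lambda d: score(selected + [d]))
            selected.append(best)
            remaining.remove(best)
        return selected

Greedy selection evaluates each remaining candidate per slot, unlike demo-level random search, which samples replacements at random; the trade-off is more score evaluations per step in exchange for a more directed search.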
Our code is available at https://github.com/iphosi/Self-Instruct-FSJ.
Published on arXiv on: 2025-01-14T09:23:30Z