Link: http://arxiv.org/abs/2501.00055v1
PDF Link: http://arxiv.org/pdf/2501.00055v1
Summary: While safety-aligned large language models (LLMs) are increasingly used as the cornerstone of powerful systems such as multi-agent frameworks to solve complex real-world problems, they still suffer from potential adversarial queries, such as jailbreak attacks, which attempt to induce harmful content.
Researching attack methods allows us to better understand the limitations of LLMs and make trade-offs between helpfulness and safety.
However, existing jailbreak attacks are primarily based on opaque optimization techniques (e.g., token-level gradient descent) and heuristic search methods like LLM refinement, which fall short in terms of transparency, transferability, and computational cost.
In light of these limitations, we draw inspiration from the evolution and infection processes of biological viruses and propose LLM-Virus, a jailbreak attack method based on evolutionary algorithms, termed evolutionary jailbreak.
LLM-Virus treats jailbreak attacks as both an evolutionary and transfer learning problem, utilizing LLMs as heuristic evolutionary operators to ensure high attack efficiency, transferability, and low time cost.
Our experimental results on multiple safety benchmarks show that LLM-Virus achieves competitive or even superior performance compared to existing attack methods.
Published on arXiv on: 2024-12-28T07:48:57Z