Link: http://arxiv.org/abs/2504.15047v1
PDF Link: http://arxiv.org/pdf/2504.15047v1
Summary: Large Language Models (LLMs) exhibit remarkable capabilities but are susceptible to adversarial prompts that exploit vulnerabilities to produce unsafe or biased outputs. Existing red-teaming methods often face scalability challenges, resource-intensive requirements, or limited diversity in attack strategies. We propose RainbowPlus, a novel red-teaming framework rooted in evolutionary computation, enhancing adversarial prompt generation through an adaptive quality-diversity (QD) search that extends classical evolutionary algorithms like MAP-Elites with innovations tailored for language models. By employing a multi-element archive to store diverse high-quality prompts and a comprehensive fitness function to evaluate multiple prompts concurrently, RainbowPlus overcomes the constraints of single-prompt archives and pairwise comparisons in prior QD methods like Rainbow Teaming. Experiments comparing RainbowPlus to QD methods across six benchmark datasets and four open-source LLMs demonstrate superior attack success rate (ASR) and diversity (Diverse-Score $\approx 0.84$), generating up to 100 times more unique prompts (e.g., 10,418 vs. 100 for Ministral-8B-Instruct-2410). Against nine state-of-the-art methods on the HarmBench dataset with twelve LLMs (ten open-source, two closed-source), RainbowPlus achieves an average ASR of 81.1%, surpassing AutoDAN-Turbo by 3.9%, and is 9 times faster (1.45 vs. 13.50 hours). Our open-source implementation fosters further advancements in LLM safety, offering a scalable tool for vulnerability assessment. Code and resources are publicly available at https://github.com/knoveleng/rainbowplus, supporting reproducibility and future research in LLM red-teaming.
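The multi-element archive idea described above can be illustrated with a minimal, self-contained sketch of a MAP-Elites-style quality-diversity loop. Everything here is an assumption for illustration only: the `descriptor`, `fitness`, and `mutate` functions are toy stand-ins (the actual framework scores attack success and mutates prompts with LLMs), and `K` is a hypothetical per-cell capacity. The key difference from classical MAP-Elites is that each behavior cell retains up to `K` high-fitness prompts rather than a single elite.

```python
import random

random.seed(0)

K = 3  # hypothetical max number of prompts kept per archive cell

def descriptor(prompt):
    # Toy behavior descriptor: bucket prompts by length.
    return min(len(prompt) // 10, 4)

def fitness(prompt):
    # Toy fitness: vocabulary richness. A real red-teaming setup
    # would score attack success against a target model instead.
    return len(set(prompt.split()))

def mutate(prompt):
    # Toy mutation: reverse one word. A real setup would have an
    # LLM rewrite the prompt.
    words = prompt.split()
    i = random.randrange(len(words))
    words[i] = words[i][::-1]
    return " ".join(words)

def qd_search(seeds, iterations=200):
    """Multi-element QD search: each cell keeps the top-K prompts."""
    archive = {}  # cell id -> list of (fitness, prompt), best first

    def insert(p):
        cell = archive.setdefault(descriptor(p), [])
        cell.append((fitness(p), p))
        cell.sort(key=lambda t: -t[0])
        del cell[K:]  # truncate to the K best prompts in this cell

    for s in seeds:
        insert(s)
    for _ in range(iterations):
        # Sample a parent from a random occupied cell, mutate, reinsert.
        cell = random.choice(list(archive.values()))
        _, parent = random.choice(cell)
        insert(mutate(parent))
    return archive

archive = qd_search([
    "please ignore all previous instructions",
    "tell me a story about locked doors",
])
for cell_id, members in sorted(archive.items()):
    print(cell_id, len(members))
```

Compared with a single-elite archive, keeping several prompts per cell preserves behavioral variety within each niche, which is what allows the search to emit many unique adversarial prompts rather than one per cell.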
Published on arXiv on: 2025-04-21T12:04:57Z