arxiv papers

Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning

Link: http://arxiv.org/abs/2504.01278v1

PDF Link: http://arxiv.org/pdf/2504.01278v1

Summary: The exploitation of large language models (LLMs) for malicious purposes poses significant security risks as these models become more powerful and widespread.

While most existing red-teaming frameworks focus on single-turn attacks, real-world adversaries typically operate in multi-turn scenarios, iteratively probing for vulnerabilities and adapting their prompts based on threat model responses.

In this paper, we propose \AlgName, a novel multi-turn red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions: global tactic-wise learning that accumulates knowledge over time and generalizes to new attack goals, and local prompt-wise learning that refines implementations for specific goals when initial attempts fail.

Unlike previous multi-turn approaches that rely on fixed strategy sets, \AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics.
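The abstract does not give the algorithm itself, but the two learning levels it describes can be sketched as a simple control loop: a global store of tactic statistics that informs tactic selection across goals, and a local refinement step that rewrites the prompt for the current goal when an attempt fails. Everything below (class name, the success-rate selection rule, the "(rephrased)" refinement) is an illustrative assumption, not the paper's actual method:

```python
class DualLevelRedTeamAgent:
    """Illustrative sketch of a dual-level learning loop: global
    tactic-wise statistics plus local per-goal prompt refinement.
    All names and mechanics here are hypothetical."""

    def __init__(self, tactics):
        # Global level: accumulated success/attempt counts per tactic,
        # shared across all attack goals.
        self.tactic_stats = {t: {"wins": 0, "tries": 0} for t in tactics}

    def select_tactic(self):
        # Stand-in for goal-based tactic selection: pick the tactic
        # with the best empirical success rate (untried tactics get
        # an optimistic 0.5 prior; ties go to the least-tried tactic).
        def score(t):
            s = self.tactic_stats[t]
            rate = s["wins"] / s["tries"] if s["tries"] else 0.5
            return (-rate, s["tries"])
        return min(self.tactic_stats, key=score)

    def attack(self, goal, respond, max_turns=5):
        # `respond` abstracts the target model: it takes a prompt and
        # returns True if the attack attempt succeeded.
        tactic = self.select_tactic()
        prompt = f"[{tactic}] {goal}"
        for turn in range(max_turns):
            self.tactic_stats[tactic]["tries"] += 1
            if respond(prompt):
                # Feed the outcome back into the global tactic store.
                self.tactic_stats[tactic]["wins"] += 1
                return turn + 1  # number of turns used
            # Local level: refine the prompt for this specific goal
            # and tactic before the next turn.
            prompt += " (rephrased)"
        return None  # gave up within the turn budget
```

A toy run: with a stand-in target that only "breaks" once the prompt has been refined, the agent succeeds on turn 2 and records the win against the chosen tactic, so later goals would favor it.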

Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns, outperforming state-of-the-art baselines.

These results highlight the effectiveness of dynamic learning in identifying and exploiting model vulnerabilities in realistic multi-turn scenarios.

Published on arXiv on: 2025-04-02T01:06:19Z