Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models

Link: http://arxiv.org/abs/2501.01830v1

PDF Link: http://arxiv.org/pdf/2501.01830v1

Summary: Automated red-teaming has become a crucial approach for uncovering vulnerabilities in large language models (LLMs).

However, most existing methods focus on isolated safety flaws, limiting their ability to adapt to dynamic defenses and uncover complex vulnerabilities efficiently.

To address this challenge, we propose Auto-RT, a reinforcement learning framework that automatically explores and optimizes complex attack strategies to effectively uncover security vulnerabilities through malicious queries.

Specifically, we introduce two key mechanisms to reduce exploration complexity and improve strategy optimization: 1) Early-terminated Exploration, which accelerates exploration by focusing on high-potential attack strategies; and 2) a Progressive Reward Tracking algorithm with intermediate downgrade models, which dynamically refines the search trajectory toward successful vulnerability exploitation.

Extensive experiments across diverse LLMs demonstrate that, by significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving a faster detection speed and 16.63% higher success rates compared to existing methods.

Published on arXiv on: 2025-01-03T14:30:14Z