An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks

Link: http://arxiv.org/abs/2511.02356v1

PDF Link: http://arxiv.org/pdf/2511.02356v1

Summary: The widespread deployment of Large Language Models (LLMs) as public-facing web services and APIs has made their security a core concern for the web ecosystem.

Jailbreak attacks, one of the significant threats to LLMs, have recently attracted extensive research.

In this paper, we reveal a jailbreak strategy that can effectively evade current defense strategies.

It can extract valuable information from failed or partially successful attack attempts and evolves itself through attack interactions, resulting in sufficient strategy diversity and adaptability.

Inspired by continuous learning and modular design principles, we propose ASTRA, a jailbreak framework that autonomously discovers, retrieves, and evolves attack strategies to achieve more efficient and adaptive attacks.

To enable this autonomous evolution, we design a closed-loop "attack-evaluate-distill-reuse" core mechanism that not only generates attack prompts but also automatically distills and generalizes reusable attack strategies from every interaction.

To systematically accumulate and apply this attack knowledge, we introduce a three-tier strategy library that categorizes strategies into Effective, Promising, and Ineffective based on their performance scores.

The strategy library not only provides precise guidance for attack generation but also possesses exceptional extensibility and transferability.

We conduct extensive experiments under a black-box setting, and the results show that ASTRA achieves an average Attack Success Rate (ASR) of 82.7%, significantly outperforming baselines.

Published on arXiv on: 2025-11-04T08:24:22Z