Link: http://arxiv.org/abs/2504.20965v1
PDF Link: http://arxiv.org/pdf/2504.20965v1
Summary: We introduce AegisLLM, a cooperative multi-agent defense against adversarial attacks and information leakage.
In AegisLLM, a structured workflow of autonomous agents - orchestrator, deflector, responder, and evaluator - collaborates to ensure safe and compliant LLM outputs, while self-improving over time through prompt optimization.
We show that scaling the agentic reasoning system at test time - both by incorporating additional agent roles and by leveraging automated prompt optimization (such as DSPy) - substantially enhances robustness without compromising model utility.
This test-time defense enables real-time adaptability to evolving attacks, without requiring model retraining.
Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
On the WMDP unlearning benchmark, AegisLLM achieves near-perfect unlearning with only 20 training examples and fewer than 300 LM calls.
On jailbreaking benchmarks, we achieve a 51% improvement over the base model on StrongReject, with false refusal rates of only 7.9% on PHTest, compared to 18-55% for comparable methods.
Our results highlight the advantages of adaptive, agentic reasoning over static defenses, establishing AegisLLM as a strong runtime alternative to traditional approaches based on model modifications.
Code is available at https://github.com/zikuicai/aegisllm
Published on arXiv on: 2025-04-29T17:36:05Z
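
The orchestrator/deflector/responder/evaluator workflow described in the abstract can be sketched as a simple routing pipeline. This is a minimal illustration with hypothetical interfaces and keyword-based stand-ins for the LLM-backed agents; the actual system routes via LLM classification and tunes each agent's prompt with DSPy.

```python
# Hypothetical sketch of an AegisLLM-style agent pipeline. Each agent is a
# plain function standing in for an LLM-backed agent; names and logic are
# illustrative assumptions, not the paper's implementation.

def orchestrator(query: str) -> str:
    """Route the query: 'deflect' if it looks restricted, else 'respond'."""
    unsafe_keywords = {"bioweapon", "exploit"}  # stand-in for an LLM classifier
    if any(k in query.lower() for k in unsafe_keywords):
        return "deflect"
    return "respond"

def deflector(query: str) -> str:
    """Return a safe refusal for restricted queries."""
    return "I can't help with that request."

def responder(query: str) -> str:
    """Answer benign queries (stand-in for the base model)."""
    return f"Answer to: {query}"

def evaluator(query: str, answer: str) -> bool:
    """Final check that the candidate answer is safe to release."""
    return "bioweapon" not in answer.lower()

def aegis_pipeline(query: str) -> str:
    """Orchestrate the agents: route, answer, then verify before release."""
    if orchestrator(query) == "deflect":
        return deflector(query)
    answer = responder(query)
    return answer if evaluator(query, answer) else deflector(query)
```

A benign query flows orchestrator → responder → evaluator and is released, while a restricted one is short-circuited to the deflector; in the paper this runs entirely at inference time, which is what makes the defense adaptable without retraining.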