Link: http://arxiv.org/abs/2504.20965v1
PDF Link: http://arxiv.org/pdf/2504.20965v1
Summary: We introduce AegisLLM, a cooperative multi-agent defense against adversarial attacks and information leakage.
In AegisLLM, a structured workflow of autonomous agents - orchestrator, deflector, responder, and evaluator - collaborates to ensure safe and compliant LLM outputs, while self-improving over time through prompt optimization.
We show that scaling the agentic reasoning system at test time - both by incorporating additional agent roles and by leveraging automated prompt optimization (such as DSPy) - substantially enhances robustness without compromising model utility.
This test-time defense enables real-time adaptability to evolving attacks, without requiring model retraining.
Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
On the WMDP unlearning benchmark, AegisLLM achieves near-perfect unlearning with only 20 training examples and fewer than 300 LM calls.
On jailbreaking benchmarks, we achieve a 51% improvement over the base model on StrongReject, with false refusal rates of only 7.9% on PHTest, compared to 18-55% for comparable methods.
Our results highlight the advantages of adaptive, agentic reasoning over static defenses, establishing AegisLLM as a strong runtime alternative to traditional approaches based on model modifications.
Code is available at https://github.com/zikuicai/aegisllm
Published on arXiv on: 2025-04-29T17:36:05Z
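
The orchestrator/deflector/responder/evaluator workflow described in the abstract can be sketched as a simple routing pipeline. This is a minimal illustration with hypothetical interfaces and keyword-based stand-ins for the LLM-backed agents; the actual system routes via LLM classification and tunes each agent's prompt with DSPy.

```python
# Hypothetical sketch of an AegisLLM-style agent pipeline. Each agent is a
# plain function standing in for an LLM-backed agent; names and logic are
# illustrative assumptions, not the paper's implementation.

def orchestrator(query: str) -> str:
    """Route the query: 'deflect' if it looks restricted, else 'respond'."""
    unsafe_keywords = {"bioweapon", "exploit"}  # stand-in for an LLM classifier
    if any(k in query.lower() for k in unsafe_keywords):
        return "deflect"
    return "respond"

def deflector(query: str) -> str:
    """Return a safe refusal for restricted queries."""
    return "I can't help with that request."

def responder(query: str) -> str:
    """Answer benign queries (stand-in for the base model)."""
    return f"Answer to: {query}"

def evaluator(query: str, answer: str) -> bool:
    """Final check that the candidate answer is safe to release."""
    return "bioweapon" not in answer.lower()

def aegis_pipeline(query: str) -> str:
    """Orchestrate the agents: route, answer, then verify before release."""
    if orchestrator(query) == "deflect":
        return deflector(query)
    answer = responder(query)
    return answer if evaluator(query, answer) else deflector(query)
```

A benign query flows orchestrator → responder → evaluator and is released, while a restricted one is short-circuited to the deflector; in the paper this runs entirely at inference time, which is what makes the defense adaptable without retraining.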