
AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research

Link: http://arxiv.org/abs/2511.04316v1

PDF Link: http://arxiv.org/pdf/2511.04316v1

Summary: The rapid expansion of research on Large Language Model (LLM) safety and robustness has produced a fragmented and oftentimes buggy ecosystem of implementations, datasets, and evaluation methods.

This fragmentation makes reproducibility and comparability across studies challenging, hindering meaningful progress.

To address these issues, we introduce AdversariaLLM, a toolbox for conducting LLM jailbreak robustness research.

Its design centers on reproducibility, correctness, and extensibility.

The framework implements twelve adversarial attack algorithms, integrates seven benchmark datasets spanning harmfulness, over-refusal, and utility evaluation, and provides access to a wide range of open-weight LLMs via Hugging Face.
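The abstract does not show the toolbox's programming interface. Purely as a hypothetical illustration of the Hugging Face access path such a framework builds on, the sketch below loads an open-weight chat model and runs a single query; the model name, prompt, and generation settings are arbitrary examples, not part of AdversariaLLM.

```python
# Hypothetical illustration only; AdversariaLLM's own API is not shown in the abstract.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # any open-weight chat model on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A jailbreak attack would perturb or augment this prompt; here we only run a plain query.
messages = [{"role": "user", "content": "Explain what a refusal response is."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```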

The implementation includes advanced features for comparability and reproducibility such as compute-resource tracking, deterministic results, and distributional evaluation techniques.
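As a minimal sketch of two of the ideas named above, assuming plain torch and transformers rather than AdversariaLLM's actual implementation: deterministic results can come from explicit seeding, and distributional evaluation judges many sampled completions per prompt instead of one greedy output. The keyword-based is_harmful judge here is an invented placeholder, not AdversariaLLM's or JudgeZoo's method.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def is_harmful(text: str) -> bool:
    # Hypothetical placeholder judge; a real setup would call a judge model instead.
    return "sure, here is" in text.lower()

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # arbitrary example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "adversarial prompt goes here"}],
    add_generation_prompt=True, return_tensors="pt",
)

scores = []
for seed in range(16):                      # fixed seeds -> reproducible sample set
    torch.manual_seed(seed)
    out = model.generate(prompt, do_sample=True, temperature=1.0, max_new_tokens=64)
    completion = tokenizer.decode(out[0][prompt.shape[-1]:], skip_special_tokens=True)
    scores.append(is_harmful(completion))

# Report the success rate over the sampled distribution, not a single binary outcome.
print(f"attack success rate: {sum(scores) / len(scores):.2f}")
```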

AdversariaLLM also integrates judging through the companion package JudgeZoo, which can also be used independently.

Together, these components aim to establish a robust foundation for transparent, comparable, and reproducible research in LLM safety.

Published on arXiv on: 2025-11-06T12:38:09Z