Link: http://arxiv.org/abs/2505.09602v1
PDF Link: http://arxiv.org/pdf/2505.09602v1
Summary: Large Language Models (LLMs) are increasingly embedded in autonomous systemsand public-facing environments, yet they remain susceptible to jailbreakvulnerabilities that may undermine their security and trustworthiness.
Adversarial suffixes are considered to be the current state-of-the-artjailbreak, consistently outperforming simpler methods and frequently succeedingeven in black-box settings.
Existing defenses rely on access to the internalarchitecture of models limiting diverse deployment, increase memory andcomputation footprints dramatically, or can be bypassed with simple promptengineering methods.
We introduce $\textbf{Adversarial Suffix Filtering}$(ASF), a lightweight novel model-agnostic defensive pipeline designed toprotect LLMs against adversarial suffix attacks.
ASF functions as an inputpreprocessor and sanitizer that detects and filters adversarially craftedsuffixes in prompts, effectively neutralizing malicious injections.
Wedemonstrate that ASF provides comprehensive defense capabilities across bothblack-box and white-box attack settings, reducing the attack efficacy ofstate-of-the-art adversarial suffix generation methods to below 4%, while onlyminimally affecting the target model's capabilities in non-adversarialscenarios.
Published on arXiv on: 2025-05-14T17:52:10Z