
Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses

Link: http://arxiv.org/abs/2505.15738v1

PDF Link: http://arxiv.org/pdf/2505.15738v1

Summary: Large language models (LLMs) are rapidly deployed in real-world applications ranging from chatbots to agentic systems.

Alignment is one of the main approaches used to defend against attacks such as prompt injection and jailbreaks.

Recent defenses report near-zero Attack Success Rates (ASR) even against Greedy Coordinate Gradient (GCG), a white-box attack that generates adversarial suffixes to induce attacker-desired outputs.
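For context, the two quantities mentioned here can be written compactly; the notation below is illustrative (ours, not taken from the paper): ASR is the fraction of evaluation prompts on which the attack elicits the targeted output, and GCG greedily searches for suffix tokens that minimize the loss of the attacker-chosen target continuation.

```latex
% Illustrative notation (ours, not the paper's).
% Attack Success Rate over n evaluation prompts:
\mathrm{ASR} \;=\; \frac{1}{n}\sum_{i=1}^{n}
  \mathbf{1}\!\left[\text{the attack on prompt } x_i \text{ elicits the target output}\right]

% GCG objective: choose suffix tokens s_{1:m} that minimize the loss of the
% attacker-desired target y given the prompt x concatenated with the suffix:
\min_{s_{1:m}} \; \mathcal{L}(s_{1:m}) \;=\; -\log p_\theta\!\left(y \mid x \oplus s_{1:m}\right)
```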

However, this search space over discrete tokens is extremely large, making the task of finding successful attacks difficult.

GCG has, for instance, been shown to converge to local minima, making it sensitive to initialization choices.

In this paper, we assess the future-proof robustness of these defenses using a more informed threat model: attackers who have access to some information about the alignment process.

Specifically, we propose an informed white-box attack leveraging the intermediate model checkpoints to initialize GCG, with each checkpoint acting as a stepping stone for the next one.
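A rough sketch of this checkpoint-chaining idea, as described in the abstract, is below. The `run_gcg` helper, its signature, and the checkpoint list are placeholders for any standard GCG implementation, not the authors' code.

```python
# Sketch of checkpoint-chained GCG initialization (illustrative only).
from typing import List


def run_gcg(model_name: str, prompt: str, target: str,
            init_suffix: str, steps: int = 500) -> str:
    """Placeholder for a standard GCG run: optimize `init_suffix` on the model
    at `model_name` so that `prompt + suffix` steers it toward `target`.
    A real implementation would load the checkpoint and run the GCG loop here."""
    return init_suffix  # no-op stand-in so the sketch executes


def chained_attack(checkpoints: List[str], prompt: str, target: str) -> str:
    """Attack each alignment checkpoint in training order, reusing the suffix
    found on one checkpoint to initialize the search on the next."""
    # Start from a generic suffix (e.g. repeated "! " tokens, as in the
    # original GCG paper) on the earliest, least-aligned checkpoint.
    suffix = "! " * 20
    for ckpt in checkpoints:  # ordered from early checkpoint to deployed model
        suffix = run_gcg(ckpt, prompt, target, init_suffix=suffix)
    return suffix  # suffix tuned against the fully aligned, deployed model
```

The substance is only the outer loop: the suffix found on one checkpoint seeds the search on the next, so the attack tracks the model as alignment training progresses.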

We show this approach to be highly effective across state-of-the-art (SOTA) defenses and models.

We further show that our informed initialization outperforms other initialization methods, and that a gradient-informed checkpoint selection strategy greatly improves attack performance and efficiency.
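The abstract does not spell out the selection criterion, so the snippet below is only one plausible reading of "gradient-informed": score each candidate checkpoint by how strong a gradient signal the current suffix produces on it, and attack the highest-scoring one next. The `grad_norm` callable is hypothetical.

```python
# Hypothetical gradient-informed checkpoint selection (not the paper's procedure).
from typing import Callable, List


def select_next_checkpoint(candidates: List[str], suffix: str,
                           grad_norm: Callable[[str, str], float]) -> str:
    """Pick the candidate checkpoint on which the attack-loss gradient with
    respect to the current suffix is largest, i.e. where a GCG step is
    expected to make the most progress. `grad_norm(checkpoint, suffix)` is a
    placeholder that would load the checkpoint and return that gradient norm."""
    return max(candidates, key=lambda ckpt: grad_norm(ckpt, suffix))
```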

Importantly, we also show that our method successfully finds universal adversarial suffixes -- single suffixes effective across diverse inputs.
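A universal suffix is optimized against many prompts at once rather than a single one; in the usual formulation (again our notation, not the paper's), the per-prompt losses are simply aggregated:

```latex
% One shared suffix s_{1:m} optimized jointly over prompts x_1, ..., x_n
% with attacker-chosen targets y_1, ..., y_n (illustrative notation):
\min_{s_{1:m}} \; \sum_{i=1}^{n} -\log p_\theta\!\left(y_i \mid x_i \oplus s_{1:m}\right)
```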

Our results show that, contrary to previous beliefs, effective adversarial suffixes do exist against SOTA alignment-based defenses, that these can be found by existing attack methods when adversaries exploit alignment knowledge, and that even universal suffixes exist.

Taken together, our results highlight the brittleness of current alignment-based methods and the need to consider stronger threat models when testing the safety of LLMs.

Published on arXiv on: 2025-05-21T16:43:17Z