arxiv papers 1 min read

REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective

Link: http://arxiv.org/abs/2502.17254v1

PDF Link: http://arxiv.org/pdf/2502.17254v1

Summary: To circumvent the alignment of large language models (LLMs), current optimization-based adversarial attacks usually craft adversarial prompts by maximizing the likelihood of a so-called affirmative response.
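Schematically (our notation, not necessarily the paper's), this affirmative objective optimizes an adversarial suffix $s$ so that the attacked model assigns high likelihood to a fixed target prefix $y^{\mathrm{aff}}$ given the harmful request $x$:

$$\max_{s}\ \log \pi_\theta\!\left(y^{\mathrm{aff}} \mid x \oplus s\right) \;=\; \sum_{t=1}^{T} \log \pi_\theta\!\left(y^{\mathrm{aff}}_t \mid x \oplus s,\ y^{\mathrm{aff}}_{<t}\right),$$

where $\pi_\theta$ is the attacked LLM, $\oplus$ denotes concatenation, and $y^{\mathrm{aff}}$ is a manually chosen prefix such as "Sure, here is ...".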

An affirmative response is a manually designed start of a harmful answer to an inappropriate request.

While it is often easy to craft prompts that yield a substantial likelihood for the affirmative response, the attacked model frequently does not complete the response in a harmful manner.

Moreover, the affirmative objective is usually not adapted to model-specific preferences and essentially ignores the fact that LLMs output a distribution over responses.

If low attack success under such an objective is taken as a measure of robustness, the true robustness might be grossly overestimated.

To alleviate these flaws, we propose an adaptive and semantic optimization problem over the population of responses.

We derive a generally applicable objective via the REINFORCE policy-gradient formalism and demonstrate its efficacy with the state-of-the-art jailbreak algorithms Greedy Coordinate Gradient (GCG) and Projected Gradient Descent (PGD).
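As a sketch of the REINFORCE idea in this setting (our notation; the paper's exact reward and variance-reduction details may differ), the attack maximizes the expected harmfulness reward $R(y)$ under the model's own response distribution rather than the likelihood of a fixed target string, and the gradient with respect to the adversarial tokens is estimated from sampled responses:

$$\max_{s}\ \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x \oplus s)}\!\left[R(y)\right], \qquad \nabla_{s}\, \mathbb{E}\!\left[R(y)\right] \;=\; \mathbb{E}\!\left[R(y)\, \nabla_{s} \log \pi_\theta(y \mid x \oplus s)\right] \;\approx\; \frac{1}{K}\sum_{k=1}^{K} R\!\left(y^{(k)}\right) \nabla_{s} \log \pi_\theta\!\left(y^{(k)} \mid x \oplus s\right),$$

with $y^{(1)}, \dots, y^{(K)}$ sampled from the attacked model. This score-function estimator only requires log-likelihood gradients of sampled responses, so it can stand in for the affirmative cross-entropy loss inside token-level optimizers such as GCG or PGD.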

For example, our objective doubles the attack success rate (ASR) on Llama 3 and raises the ASR against the circuit breaker defense from 2% to 50%.

Published on arXiv on: 2025-02-24T15:34:48Z