arxiv papers 1 min read

Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent

Link: http://arxiv.org/abs/2508.14853v1

PDF Link: http://arxiv.org/pdf/2508.14853v1

Summary: As large language models (LLMs) are increasingly deployed in critical applications, ensuring their robustness and safety alignment remains a major challenge.

Despite the overall success of alignment techniques such as reinforcement learning from human feedback (RLHF) on typical prompts, LLMs remain vulnerable to jailbreak attacks enabled by crafted adversarial triggers appended to user prompts.

Most existing jailbreak methods rely either on inefficient searches over discrete token spaces or on direct optimization of continuous embeddings.

While continuous embeddings can be given directly to selected open-source models as input, doing so is not feasible for proprietary models.

On the other hand, projecting these embeddings back into valid discrete tokens introduces additional complexity and often reduces attack effectiveness.

We propose an intrinsic optimization method that directly optimizes relaxed one-hot encodings of the adversarial suffix tokens using exponentiated gradient descent coupled with Bregman projection, ensuring that the optimized one-hot encoding of each token always remains within the probability simplex.
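
To make the update concrete, here is a minimal sketch of one exponentiated gradient step on a relaxed one-hot suffix encoding; the multiplicative update followed by row-wise renormalization acts as the KL/Bregman projection that keeps each token's distribution on the probability simplex. The framework (PyTorch), the function name `egd_step`, and the toy dimensions are illustrative assumptions, not the paper's released code.

```python
# Sketch: one exponentiated gradient descent step on a relaxed one-hot suffix.
import torch

def egd_step(X: torch.Tensor, grad: torch.Tensor, lr: float = 0.1) -> torch.Tensor:
    """X: (suffix_len, vocab_size), each row on the simplex; grad: dLoss/dX."""
    X_new = X * torch.exp(-lr * grad)                    # multiplicative (exponentiated) update
    return X_new / X_new.sum(dim=-1, keepdim=True)       # KL/Bregman projection back onto the simplex

# Example: start from a uniform relaxation over a toy vocabulary.
suffix_len, vocab_size = 20, 32000
X = torch.full((suffix_len, vocab_size), 1.0 / vocab_size)
grad = torch.randn(suffix_len, vocab_size)               # placeholder for the model's loss gradient
X = egd_step(X, grad)
assert torch.allclose(X.sum(dim=-1), torch.ones(suffix_len))
```

Because the update is multiplicative and then renormalized, every row of `X` stays strictly positive and sums to one, which is exactly the simplex constraint the abstract emphasizes.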

We provide a theoretical proof of convergence for our proposed method and implement an efficient algorithm that effectively jailbreaks several widely used LLMs.

Our method achieves higher success rates and faster convergence compared to three state-of-the-art baselines, evaluated on five open-source LLMs and four adversarial behavior datasets curated for evaluating jailbreak methods.

In addition to individual prompt attacks, we also generate universal adversarial suffixes effective across multiple prompts and demonstrate transferability of the optimized suffixes to different LLMs.
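
The abstract does not say how the universal suffix is produced; a common recipe (assumed here, not confirmed by the paper) is to average the jailbreak loss gradient over several prompts before applying one shared update to a single relaxed suffix. `loss_fn` below is a hypothetical callable that scores one prompt against the current suffix; the rest reuses the exponentiated update from the previous sketch.

```python
# Hedged sketch of a universal-suffix step: one shared relaxed suffix, gradient
# averaged over multiple prompts, then an exponentiated gradient update with
# renormalization onto the simplex.
import torch

def universal_step(X, prompts, loss_fn, lr=0.1):
    """X: shared (suffix_len, vocab_size) relaxed suffix; loss_fn(prompt, X) -> scalar loss."""
    grad = torch.zeros_like(X)
    for p in prompts:
        X_req = X.clone().requires_grad_(True)
        loss_fn(p, X_req).backward()                      # per-prompt jailbreak loss
        grad += X_req.grad
    grad /= len(prompts)                                  # average gradient across prompts
    X_new = X * torch.exp(-lr * grad)                     # exponentiated gradient update
    return X_new / X_new.sum(dim=-1, keepdim=True)        # project back onto the simplex
```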

Published on arXiv on: 2025-08-20T17:03:32Z