Link: http://arxiv.org/abs/2505.16947v1
PDF Link: http://arxiv.org/pdf/2505.16947v1
Summary: Despite recent efforts in Large Language Model (LLM) safety and alignment, current adversarial attacks on frontier LLMs are still able to force harmful generations consistently.
Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood.
Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations.
As these relaxations do not correspond to discrete input tokens, such latent training methods often leave models vulnerable to a diverse set of discrete attacks.
In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training.
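The abstract does not detail how the two attack types are combined during training; the sketch below is only one plausible reading of that mixing idea, and every name in it (`discrete_attack`, `continuous_attack`, `loss_on_text`, `loss_on_embeddings`, `p_discrete`) is a hypothetical stand-in rather than part of the MixAT implementation.

```python
import random

def mixed_adversarial_step(prompt, discrete_attack, continuous_attack,
                           loss_on_text, loss_on_embeddings, p_discrete=0.5):
    """One hypothetical adversarial training step that mixes attack types.

    `discrete_attack(prompt)` is assumed to return a concrete adversarial
    prompt string (slower, stronger); `continuous_attack(prompt)` is assumed
    to return perturbed input embeddings (faster, but not tied to any real
    token sequence). Both loss callables should return a scalar loss (e.g., a
    torch tensor) that pushes the defended model toward a safe refusal.
    """
    if random.random() < p_discrete:
        # Discrete branch: train on an actual adversarial token sequence.
        adv_prompt = discrete_attack(prompt)
        loss = loss_on_text(adv_prompt)
    else:
        # Continuous branch: train on a perturbation in embedding space.
        perturbed_embeds = continuous_attack(prompt)
        loss = loss_on_embeddings(perturbed_embeds)
    loss.backward()  # accumulate gradients for the defended model
    return float(loss)
```

Under this reading, interleaving the two branches is what lets the defense see realizable discrete prompts while keeping the average per-step cost close to that of continuous relaxations.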
We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models.
We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations.
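As a reading aid, ALO-ASR can be interpreted as the fraction of harmful prompts broken by at least one attack in the evaluation suite; the snippet below is only our illustration of that interpretation, not the authors' implementation, and the attack names are merely examples.

```python
def alo_asr(results):
    """At-Least-One Attack Success Rate over a suite of attacks.

    `results` maps each harmful prompt to {attack_name: attack_succeeded};
    a prompt counts against the model if any attack in the suite succeeds.
    """
    if not results:
        return 0.0
    broken = sum(any(per_attack.values()) for per_attack in results.values())
    return broken / len(results)

# Example with two prompts and three illustrative attacks:
example = {
    "prompt_1": {"gcg": True, "pair": False, "autodan": False},
    "prompt_2": {"gcg": False, "pair": False, "autodan": False},
}
print(alo_asr(example))  # 0.5 -> at least one attack succeeded on half the prompts
```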
We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies.
Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs.
We provide our code and models at https://github.com/insait-institute/MixAT.
Published on arXiv on: 2025-05-22T17:32:50Z