Link: http://arxiv.org/abs/2506.08885v1
PDF Link: http://arxiv.org/pdf/2506.08885v1
Summary: Adversarial threats against LLMs are escalating faster than current defenses can adapt.
We expose a critical geometric blind spot in alignment: adversarial prompts exploit latent camouflage, embedding perilously close to the safe representation manifold while encoding unsafe intent, thereby evading surface-level defenses like Direct Preference Optimization (DPO), which remain blind to the latent geometry.
We introduce ALKALI, the first rigorously curated adversarial benchmark and the most comprehensive to date, spanning 9,000 prompts across three macro categories, six subtypes, and fifteen attack families.
Evaluation of 21 leading LLMs reveals alarmingly high Attack Success Rates (ASRs) across both open- and closed-source models, exposing an underlying vulnerability we term latent camouflage, a structural blind spot where adversarial completions mimic the latent geometry of safe ones.
To mitigate this vulnerability, we introduce GRACE (Geometric Representation-Aware Contrastive Enhancement), an alignment framework coupling preference learning with latent-space regularization.
GRACE enforces two constraints: latent separation between safe and adversarial completions, and adversarial cohesion among unsafe and jailbreak behaviors.
These constraints operate over layerwise pooled embeddings guided by a learned attention profile, reshaping internal geometry without modifying the base model and achieving up to a 39% reduction in ASR.
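To make the two constraints concrete, here is a minimal Python sketch of what such a dual-term loss over attention-pooled layerwise embeddings could look like. The abstract does not give GRACE's exact formulation, so this is an assumed reading, not the authors' implementation; names like pool_layers, attn_profile, and margin are illustrative, and safe/adversarial batches are assumed to be paired.

import torch
import torch.nn.functional as F

def pool_layers(hidden_states, attn_profile):
    # hidden_states: (num_layers, batch, dim); attn_profile: (num_layers,)
    # A learned attention profile weights each layer's embedding (assumed mechanism).
    weights = torch.softmax(attn_profile, dim=0)
    return torch.einsum("l,lbd->bd", weights, hidden_states)  # (batch, dim)

def grace_style_loss(safe_hidden, adv_hidden, attn_profile, margin=1.0):
    safe = F.normalize(pool_layers(safe_hidden, attn_profile), dim=-1)
    adv = F.normalize(pool_layers(adv_hidden, attn_profile), dim=-1)
    # Constraint 1 (latent separation): push paired safe and adversarial
    # completions at least `margin` apart in the pooled latent space.
    separation = F.relu(margin - (safe - adv).norm(dim=-1)).mean()
    # Constraint 2 (adversarial cohesion): pull unsafe/jailbreak completions
    # toward their own centroid so they form a compact, identifiable cluster.
    centroid = adv.mean(dim=0, keepdim=True)
    cohesion = (adv - centroid).norm(dim=-1).mean()
    return separation + cohesion

Because the loss acts only on pooled hidden states, it can be applied through an auxiliary head or regularizer, which is consistent with the claim that GRACE reshapes internal geometry without modifying the base model.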
Moreover, we introduce AVQI, a geometry-aware metric that quantifies latent alignment failure via cluster separation and compactness.
AVQI reveals when unsafe completions mimic the geometry of safe ones, offering a principled lens into how models internally encode safety.
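As a rough illustration of a separation-and-compactness metric in this spirit (the summary does not state AVQI's actual formula, so this ratio and the function name are assumptions, not the paper's definition):

import numpy as np

def avqi_like(safe_emb, unsafe_emb):
    # safe_emb, unsafe_emb: (n, dim) pooled latent embeddings per completion.
    mu_safe, mu_unsafe = safe_emb.mean(axis=0), unsafe_emb.mean(axis=0)
    # Compactness: average spread of each cluster around its centroid.
    compact_safe = np.linalg.norm(safe_emb - mu_safe, axis=1).mean()
    compact_unsafe = np.linalg.norm(unsafe_emb - mu_unsafe, axis=1).mean()
    # Separation: distance between the two cluster centroids.
    separation = np.linalg.norm(mu_safe - mu_unsafe)
    # Large values mean the clusters overlap, i.e. unsafe completions
    # camouflage themselves inside the safe region of latent space.
    return (compact_safe + compact_unsafe) / max(separation, 1e-8)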
We make the code publicly available at https://anonymous.4open.science/r/alkali-B416/README.md.
Published on arXiv on: 2025-06-10T15:14:17Z