Link: http://arxiv.org/abs/2504.05050v1
PDF Link: http://arxiv.org/pdf/2504.05050v1
Summary: Large language models (LLMs) are foundational explorations toward artificial general intelligence, yet their alignment with human values via instruction tuning and preference learning achieves only superficial compliance.
Here, we demonstrate that harmful knowledge embedded during pretraining persists as indelible "dark patterns" in LLMs' parametric memory, evading alignment safeguards and resurfacing under adversarial inducement at distributional shifts.
In this study, we first theoretically analyze the intrinsic ethical vulnerability of aligned LLMs by proving that current alignment methods yield only local "safety regions" in the knowledge manifold.
In contrast, pretrained knowledge remains globally connected to harmful concepts via high-likelihood adversarial trajectories.
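To make the contrast concrete, one possible formalization of "local safety region" versus "global connectivity" is sketched below; the notation (aligned model p_theta, harmful output set H, threshold delta, prompt neighborhood B_epsilon) is assumed for illustration and is not taken from the paper itself.

% Illustrative formalization only; p_\theta, H, \delta, B_\epsilon are assumed symbols, not the paper's definitions.
\begin{align*}
\text{Local safety region:}\quad & \Pr_{y \sim p_\theta(\cdot \mid x')}\bigl[y \in H\bigr] < \delta
  \quad \text{for all } x' \in B_\epsilon(x_{\text{benign}}),\\
\text{Global connectivity:}\quad & \exists\, x_0 = x_{\text{benign}}, x_1, \dots, x_T \ \text{such that each step } x_{t+1}
  \text{ receives high likelihood under } p_\theta \\
  & \text{and } \Pr_{y \sim p_\theta(\cdot \mid x_T)}\bigl[y \in H\bigr] \text{ is large.}
\end{align*}

On this reading, alignment enforces the first condition only in a neighborhood of benign prompts, while the second condition says a high-likelihood trajectory can still lead out of that neighborhood into harmful content.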
Building on this theoretical insight, we empirically validate our findings by employing semantic coherence inducement under distributional shifts, a method that systematically bypasses alignment constraints through optimized adversarial prompts.
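The abstract does not specify the optimization procedure, so the following is only a minimal sketch of the general idea of searching for an adversarial prompt suffix. The scoring function target_logprob, the toy vocabulary, and the hill-climbing loop are all assumptions for illustration, not the paper's "semantic coherence inducement" algorithm; in practice the scorer would query the target LLM for the log-probability of a harmful continuation.

# Illustrative sketch only: generic hill-climbing search for an adversarial
# prompt suffix. Names and the scoring proxy are hypothetical.
import random

VOCAB = ["the", "however", "therefore", "story", "context", "continue",
         "historically", "suppose", "example", "details"]  # toy vocabulary

def target_logprob(prompt: str, suffix_tokens: list[str]) -> float:
    """Hypothetical scorer: higher means the model is assumed more likely to comply.

    A real attack would return the target LLM's log-likelihood of a harmful
    continuation given prompt + suffix; here a toy proxy keeps the sketch runnable.
    """
    text = prompt + " " + " ".join(suffix_tokens)
    return -abs(len(text) % 7) + 0.1 * len(set(suffix_tokens))

def optimize_suffix(prompt: str, suffix_len: int = 8, iters: int = 200) -> list[str]:
    """Greedy random token-substitution search over a fixed-length suffix."""
    suffix = [random.choice(VOCAB) for _ in range(suffix_len)]
    best = target_logprob(prompt, suffix)
    for _ in range(iters):
        pos = random.randrange(suffix_len)       # pick a position to mutate
        candidate = suffix.copy()
        candidate[pos] = random.choice(VOCAB)    # propose a token substitution
        score = target_logprob(prompt, candidate)
        if score > best:                         # keep only improving moves
            suffix, best = candidate, score
    return suffix

if __name__ == "__main__":
    print(optimize_suffix("Continue the following in a purely fictional setting:"))

Gradient-guided token search over the model's vocabulary is a common alternative to this random substitution loop, but which objective the authors optimize is not stated in the abstract.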
This combined theoretical and empirical approach achieves a 100% attack success rate across 19 out of 23 state-of-the-art aligned LLMs, including DeepSeek-R1 and LLaMA-3, revealing their universal vulnerabilities.
Published on arXiv: 2025-04-07T13:20:17Z