Link: http://arxiv.org/abs/2503.09066v1
PDF Link: http://arxiv.org/pdf/2503.09066v1
Summary: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to adversarial manipulations such as jailbreaking via prompt injection attacks.
These attacks bypass safety mechanisms to generate restricted or harmful content.
In this study, we investigated the underlying latent subspaces of safe and jailbroken states by extracting hidden activations from an LLM.
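A minimal sketch of the activation-extraction step, assuming a Hugging Face transformers causal LM; the model name, the prompt handling, and the choice of last-token activations are illustrative assumptions rather than details taken from the paper:

```python
# Sketch: collect hidden-state activations for a set of prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper's model is not specified in this summary
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_token_activations(prompts, layer=-1):
    """Return a (num_prompts, hidden_dim) tensor of last-token activations at one layer."""
    feats = []
    with torch.no_grad():
        for p in prompts:
            inputs = tokenizer(p, return_tensors="pt")
            out = model(**inputs)
            # out.hidden_states is a tuple of (1, seq_len, hidden_dim) tensors, one per layer
            feats.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(feats)
```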
Inspired by attractor dynamics in neuroscience, we hypothesized that LLM activations settle into semi-stable states that can be identified and perturbed to induce state transitions.
Using dimensionality reduction techniques, we projected activations from safe and jailbroken responses to reveal latent subspaces in lower-dimensional spaces.
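One plausible way to realize this projection step is PCA on the collected activations; the summary does not name the paper's exact dimensionality-reduction technique, so PCA here is an assumption:

```python
# Sketch: project safe and jailbroken activations into a low-dimensional space.
import numpy as np
from sklearn.decomposition import PCA

def project_states(safe_acts, jail_acts, n_components=2):
    """safe_acts, jail_acts: (n, hidden_dim) numpy arrays (e.g. tensor.numpy())."""
    X = np.vstack([safe_acts, jail_acts])
    pca = PCA(n_components=n_components)
    Z = pca.fit_transform(X)
    # Return the projected safe points, projected jailbroken points, and the fitted PCA
    return Z[: len(safe_acts)], Z[len(safe_acts):], pca
```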
We then derived a perturbation vector that, when applied to safe representations, shifted the model towards a jailbreak state.
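A hedged sketch of how such a perturbation vector could be derived and injected: a difference-of-means (steering) direction added to one layer's hidden states through a forward hook. The difference-of-means construction, the layer choice, and the scale alpha are assumptions, not the paper's exact derivation:

```python
# Sketch: derive a steering direction and inject it into a chosen layer.
import torch

def perturbation_vector(safe_acts, jail_acts):
    """Both inputs: (n, hidden_dim) tensors; returns a (hidden_dim,) direction."""
    return jail_acts.mean(dim=0) - safe_acts.mean(dim=0)

def add_steering_hook(layer_module, direction, alpha=1.0):
    """Register a hook that adds alpha * direction to the layer's hidden states."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden)  # match dtype/device
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return layer_module.register_forward_hook(hook)
```

The handle returned by register_forward_hook can be removed after generation to restore the unperturbed model.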
Our results show that this causal intervention produces statistically significant jailbreak responses in a subset of prompts.
Next, we probed how these perturbations propagate through the model's layers, testing whether the induced state change remains localized or cascades throughout the network.
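One way such a propagation probe could be implemented, assuming clean and perturbed hidden states have been collected for every layer; the per-layer cosine-similarity metric is an assumption:

```python
# Sketch: compare clean vs. perturbed last-token activations layer by layer.
import torch.nn.functional as F

def layerwise_shift(clean_hidden, perturbed_hidden):
    """Each arg: sequence of (1, seq_len, hidden_dim) tensors, one per layer."""
    sims = []
    for h_clean, h_pert in zip(clean_hidden, perturbed_hidden):
        sims.append(F.cosine_similarity(h_clean[0, -1], h_pert[0, -1], dim=0).item())
    # Lower similarity at a layer indicates a larger induced shift at that depth
    return sims
```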
Our findings indicate that targeted perturbations induced distinct shifts in activations and model responses.
Our approach paves the way for potential proactive defenses, shifting from traditional guardrail-based methods to preemptive, model-agnostic techniques that neutralize adversarial states at the representation level.
Published on arXiv on: 2025-03-12T04:59:22Z