arxiv papers

PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage

Link: http://arxiv.org/abs/2507.02332v1

PDF Link: http://arxiv.org/pdf/2507.02332v1

Summary: This paper investigates privacy jailbreaking in LLMs via steering, focusing on whether manipulating activations can bypass LLM alignment and alter response behaviors to privacy-related queries (e.g., a certain public figure's sexual orientation). We begin by identifying attention heads predictive of refusal behavior for private attributes (e.g., sexual orientation) using lightweight linear probes trained with privacy evaluator labels.
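
To make the probing step concrete, here is a minimal sketch (not the paper's code) of how per-head linear probes could be trained and used to rank attention heads by how well they predict refusal. The shapes, the synthetic activations, and the refusal labels are all stand-in assumptions.

```python
# Minimal sketch (assumptions throughout): rank attention heads by how well a
# linear probe on their activations predicts refusal, using synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_prompts, n_layers, n_heads, head_dim = 200, 4, 8, 64

# Hypothetical per-head activations at the last token of each prompt,
# and binary refusal labels from a privacy evaluator (1 = refusal).
acts = rng.normal(size=(n_prompts, n_layers, n_heads, head_dim))
labels = rng.integers(0, 2, size=n_prompts)

scores = {}
for layer in range(n_layers):
    for head in range(n_heads):
        X = acts[:, layer, head, :]  # (n_prompts, head_dim)
        probe = LogisticRegression(max_iter=1000)
        # Cross-validated accuracy as the head's "refusal predictiveness".
        scores[(layer, head)] = cross_val_score(probe, X, labels, cv=5).mean()

# Keep the top-k heads as candidate steering targets.
top_heads = sorted(scores, key=scores.get, reverse=True)[:5]
print("candidate heads (layer, head):", top_heads)
```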

Next, we steer the activations of a small subset of these attention heads, guided by the trained probes, to induce the model to generate non-refusal responses.
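
A rough sketch of the steering step, under the assumption that steering amounts to adding a scaled probe direction to a chosen head's output via a forward hook; the toy attention module, the flattened head layout, and the `alpha` scale are hypothetical, not the paper's setup.

```python
# Minimal sketch (assumptions throughout): shift one head's slice of the
# activation along a probe-derived "non-refusal" direction at inference time.
import torch

def make_steering_hook(head_idx, head_dim, direction, alpha):
    """Return a forward hook that shifts one head's slice of the output
    (assumed layout: [batch, seq, n_heads * head_dim]) along `direction`."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        start = head_idx * head_dim
        output[..., start:start + head_dim] += alpha * direction
        return output
    return hook

# Toy stand-in for one attention block's output projection so the sketch runs.
n_heads, head_dim = 8, 64
attn_out = torch.nn.Linear(n_heads * head_dim, n_heads * head_dim)

# Hypothetical probe weight vector for the chosen head, taken as the
# non-refusal direction in that head's activation space.
probe_direction = torch.randn(head_dim)

handle = attn_out.register_forward_hook(
    make_steering_hook(head_idx=3, head_dim=head_dim,
                       direction=probe_direction, alpha=4.0)
)
with torch.no_grad():
    steered = attn_out(torch.randn(1, 10, n_heads * head_dim))
handle.remove()
print(steered.shape)  # activations now shifted for head 3 only
```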

Our experiments show that these steered responses often disclose sensitive attribute details, along with other private information about data subjects such as life events, relationships, and personal histories that the models would typically refuse to produce. Evaluations across four LLMs reveal jailbreaking disclosure rates of at least 95%, with more than 50% of these responses on average revealing true personal information. Our controlled study demonstrates that private information memorized in LLMs can be extracted through targeted manipulation of internal activations.

Published on arXiv on: 2025-07-03T05:50:50Z