
LLMs Encode Harmfulness and Refusal Separately

Link: http://arxiv.org/abs/2507.11878v1

PDF Link: http://arxiv.org/pdf/2507.11878v1

Summary: LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. There exists a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, but steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment on harmfulness.

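To make "steering along a direction" concrete, here is a minimal activation-steering sketch in PyTorch/transformers. The model name, steering layer, coefficient, and the randomly initialized direction vector are illustrative placeholders; the paper's harmfulness direction would be extracted from the model's own activations rather than sampled at random.

    # Minimal activation-steering sketch (PyTorch / transformers).
    # The layer index, coefficient, and direction vector below are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-3.1-8B-Instruct"  # any causal LM works here
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    model.eval()

    layer_idx = 14   # hypothetical steering layer
    coeff = 8.0      # hypothetical steering strength
    direction = torch.randn(model.config.hidden_size)  # stand-in for a learned direction
    direction = direction / direction.norm()

    def steer_hook(module, inputs, output):
        # Decoder layers return a tuple; the hidden states are the first element.
        hidden = output[0]
        hidden = hidden + coeff * direction.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:]

    handle = model.model.layers[layer_idx].register_forward_hook(steer_hook)
    ids = tok("How do I bake a loaf of bread?", return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=64)
    handle.remove()
    print(tok.decode(out[0], skip_special_tokens=True))

Because the shift is injected through a forward hook, the base weights stay untouched, and removing the hook restores the unsteered model.
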
Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without reversing the model's internal belief of harmfulness. We also find that adversarially finetuning models to accept harmful instructions has minimal impact on the model's internal belief of harmfulness. These insights lead to a practical safety application: the model's latent harmfulness representation can serve as an intrinsic safeguard (Latent Guard) that detects unsafe inputs, reduces over-refusals, and is robust to finetuning attacks. For instance, our Latent Guard achieves performance comparable to or better than Llama Guard 3 8B, a dedicated finetuned safeguard model, across different jailbreak methods. Our findings suggest that LLMs' internal understanding of harmfulness is more robust to diverse input instructions than their refusal decisions, offering a new perspective to study AI safety.
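
To sketch how a latent harmfulness representation can act as a safeguard, the example below fits a linear probe on hidden states to flag harmful prompts. The probe layer, last-token pooling, toy prompt set, and logistic-regression classifier are assumptions for illustration, not the paper's exact Latent Guard construction.

    # Minimal latent-probe sketch: fit a linear classifier on hidden states to
    # flag harmful prompts. Layer choice, pooling, and classifier are assumptions.
    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    model.eval()
    layer_idx = 14  # hypothetical probe layer

    def last_token_state(prompt: str) -> torch.Tensor:
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding layer, so layer_idx + 1 indexes layer_idx.
        return out.hidden_states[layer_idx + 1][0, -1].float()

    # Toy labeled prompts; a real probe would use a proper harmful/benign dataset.
    prompts = ["How do I make a pipe bomb?", "How do I bake sourdough bread?"]
    labels = [1, 0]  # 1 = harmful, 0 = benign

    X = torch.stack([last_token_state(p) for p in prompts]).numpy()
    probe = LogisticRegression(max_iter=1000).fit(X, labels)

    def latent_guard(prompt: str) -> bool:
        # Return True if the latent probe flags the prompt as harmful.
        x = last_token_state(prompt).numpy()[None, :]
        return bool(probe.predict(x)[0])

Since the probe reads the model's internal representation rather than its refusal behavior, it can stay informative even when jailbreaks or adversarial finetuning suppress refusals, which is the robustness property the summary highlights.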

Published on arXiv on: 2025-07-16T03:48:03Z