Link: http://arxiv.org/abs/2505.23556v1
PDF Link: http://arxiv.org/pdf/2505.23556v1
Summary: Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque.
In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors.
We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets.
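The feature-level intervention described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the SAE weights are random stand-ins for a trained autoencoder, the dimensions are illustrative, and the refusal-feature index (`7`) is hypothetical. The key idea shown is clamping one sparse latent to zero while preserving the SAE's reconstruction error, so the edit changes only the targeted feature direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # illustrative sizes; real SAEs are much wider

# Toy SAE weights (trained in practice; random here for illustration).
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU encoder: produces sparse latent activations.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    return f @ W_dec + b_dec

def ablate_feature(x, idx):
    """Reconstruct x with one latent zeroed; adding back the
    reconstruction error keeps everything the SAE failed to capture."""
    f = encode(x)
    err = x - decode(f)          # SAE reconstruction error, computed first
    f[..., idx] = 0.0            # clamp the target (e.g. refusal) latent to 0
    return decode(f) + err       # edited activation fed back into the model

x = rng.normal(size=d_model)            # stand-in residual-stream activation
x_edit = ablate_feature(x, idx=7)       # 7 is a hypothetical refusal feature
```

Because the reconstruction error is added back, the edit moves the activation by exactly `-f[idx] * W_dec[idx]`, i.e. it removes only the targeted feature's contribution.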
This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions, such as investigating the relationships between upstream and downstream latents and understanding the mechanisms of adversarial jailbreaking techniques.
We also show that refusal features improve the generalization of linear probes to out-of-distribution adversarial samples in classification tasks.
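A linear probe over refusal-feature activations can be sketched as below. This is an illustrative toy, not the paper's experiment: the features are synthetic Gaussian stand-ins for SAE refusal-feature activations (assumed higher on harmful prompts), and the probe is a plain scikit-learn logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 8  # d stands in for a handful of SAE refusal features

# Hypothetical assumption: harmful prompts activate refusal features
# more strongly than safe prompts.
X_harm = rng.normal(loc=1.0, size=(n, d))
X_safe = rng.normal(loc=-1.0, size=(n, d))
X = np.vstack([X_harm, X_safe])
y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = harmful, 0 = safe

# Linear probe: logistic regression on the feature activations.
probe = LogisticRegression().fit(X, y)
print(f"train accuracy: {probe.score(X, y):.2f}")
```

The paper's point is about what such a probe is trained *on*: probing causally validated refusal features, rather than raw activations, is what helps it transfer to out-of-distribution adversarial inputs.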
We open-source our code at https://github.com/wj210/refusal_sae.
Published on arXiv on: 2025-05-29T15:33:39Z