Link: http://arxiv.org/abs/2505.23556v1
PDF Link: http://arxiv.org/pdf/2505.23556v1
Summary: Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque.
In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors.
We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets.
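The feature-level intervention described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the SAE weights are random stand-ins for a trained autoencoder, the dimensions are illustrative, and the refusal-feature index (`7`) is hypothetical. The key idea shown is clamping one sparse latent to zero while preserving the SAE's reconstruction error, so the edit changes only the targeted feature direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # illustrative sizes; real SAEs are much wider

# Toy SAE weights (trained in practice; random here for illustration).
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU encoder: produces sparse latent activations.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    return f @ W_dec + b_dec

def ablate_feature(x, idx):
    """Reconstruct x with one latent zeroed; adding back the
    reconstruction error keeps everything the SAE failed to capture."""
    f = encode(x)
    err = x - decode(f)          # SAE reconstruction error, computed first
    f[..., idx] = 0.0            # clamp the target (e.g. refusal) latent to 0
    return decode(f) + err       # edited activation fed back into the model

x = rng.normal(size=d_model)            # stand-in residual-stream activation
x_edit = ablate_feature(x, idx=7)       # 7 is a hypothetical refusal feature
```

Because the reconstruction error is added back, the edit moves the activation by exactly `-f[idx] * W_dec[idx]`, i.e. it removes only the targeted feature's contribution.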
This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions, such as investigating the relationships between upstream and downstream latents and understanding the mechanisms of adversarial jailbreaking techniques.
We also show that refusal features improve the generalization of linear probes to out-of-distribution adversarial samples in classification tasks.
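A linear probe over refusal-feature activations can be sketched as below. This is an illustrative toy, not the paper's experiment: the features are synthetic Gaussian stand-ins for SAE refusal-feature activations (assumed higher on harmful prompts), and the probe is a plain scikit-learn logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 8  # d stands in for a handful of SAE refusal features

# Hypothetical assumption: harmful prompts activate refusal features
# more strongly than safe prompts.
X_harm = rng.normal(loc=1.0, size=(n, d))
X_safe = rng.normal(loc=-1.0, size=(n, d))
X = np.vstack([X_harm, X_safe])
y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = harmful, 0 = safe

# Linear probe: logistic regression on the feature activations.
probe = LogisticRegression().fit(X, y)
print(f"train accuracy: {probe.score(X, y):.2f}")
```

The paper's point is about what such a probe is trained *on*: probing causally validated refusal features, rather than raw activations, is what helps it transfer to out-of-distribution adversarial inputs.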
We open-source our code at https://github.com/wj210/refusal_sae.
Published on arXiv on: 2025-05-29T15:33:39Z