Link: http://arxiv.org/abs/2505.14536v1
PDF Link: http://arxiv.org/pdf/2505.14536v1
Summary: Large language models (LLMs) are now ubiquitous in user-facing applications, yet they still generate undesirable toxic outputs, including profanity, vulgarity, and derogatory remarks.
Although numerous detoxification methods exist, most apply broad, surface-level fixes and can therefore easily be circumvented by jailbreak attacks.
In this paper we leverage sparse autoencoders (SAEs) to identify toxicity-related directions in the residual stream of models and perform targeted activation steering using the corresponding decoder vectors.
We introduce three tiers of steering aggressiveness and evaluate them on GPT-2 Small and Gemma-2-2B, revealing trade-offs between toxicity reduction and language fluency.
At stronger steering strengths, these causal interventions surpass competitive baselines in reducing toxicity by up to 20%, though fluency can degrade noticeably on GPT-2 Small depending on the aggressiveness.
Crucially, standard NLP benchmark scores upon steering remain stable, indicating that the model's knowledge and general abilities are preserved.
We further show that feature-splitting in wider SAEs hampers safety interventions, underscoring the importance of disentangled feature learning.
Our findings highlight both the promise and the current limitations of SAE-based causal interventions for LLM detoxification, further suggesting practical guidelines for safer language-model deployment.
Published on arXiv on: 2025-05-20T15:55:31Z
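
The abstract describes steering the residual stream with an SAE decoder vector tied to a toxicity feature. Below is a minimal sketch of that operation, assuming a trained SAE is available; the SAE weights, `toxic_feature_idx`, `steering_strength`, and `steer_residual` are illustrative placeholders rather than the paper's actual implementation, and the dimensions only loosely mimic GPT-2 Small.

```python
# Hedged sketch: SAE-based activation steering to suppress a toxicity feature.
# All weights and indices are synthetic placeholders; a real setup would load a
# trained SAE and apply this to residual-stream activations of GPT-2 Small or
# Gemma-2-2B via a forward hook.
import torch

torch.manual_seed(0)

d_model, d_sae = 768, 6144                    # residual width, SAE dictionary size (illustrative)
W_enc = torch.randn(d_model, d_sae) * 0.02    # SAE encoder weights (placeholder)
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) * 0.02    # SAE decoder; rows are feature directions
W_dec = W_dec / W_dec.norm(dim=-1, keepdim=True)  # unit-normalize decoder directions

toxic_feature_idx = 123                       # hypothetical toxicity-related feature
steering_strength = 4.0                       # one "tier of aggressiveness" (value illustrative)


def steer_residual(resid: torch.Tensor) -> torch.Tensor:
    """Remove the toxic feature's decoder direction from residual activations.

    resid: (batch, seq, d_model) activations at the hooked layer.
    The feature's current activation determines how much of its contribution
    to subtract; an extra fixed-strength term pushes further anti-toxic.
    """
    acts = torch.relu(resid @ W_enc + b_enc)                        # SAE feature activations
    toxic_act = acts[..., toxic_feature_idx:toxic_feature_idx + 1]  # (batch, seq, 1)
    direction = W_dec[toxic_feature_idx]                            # (d_model,) decoder vector
    return resid - toxic_act * direction - steering_strength * direction


# Toy usage on fake residual-stream activations.
resid = torch.randn(2, 16, d_model)
steered = steer_residual(resid)
print("max |delta|:", (steered - resid).abs().max().item())
```

In practice a function like this would be registered as a forward hook on the chosen transformer block, so the intervention is applied at generation time without retraining the model.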