Link: http://arxiv.org/abs/2505.14536v1
PDF Link: http://arxiv.org/pdf/2505.14536v1
Summary: Large language models (LLMs) are now ubiquitous in user-facing applications, yet they still generate undesirable toxic outputs, including profanity, vulgarity, and derogatory remarks.
Although numerous detoxification methods exist, most apply broad, surface-level fixes and can therefore easily be circumvented by jailbreak attacks.
In this paper we leverage sparse autoencoders (SAEs) to identify toxicity-related directions in the residual stream of models and perform targeted activation steering using the corresponding decoder vectors.
We introduce three tiers of steering aggressiveness and evaluate them on GPT-2 Small and Gemma-2-2B, revealing trade-offs between toxicity reduction and language fluency.
At stronger steering strengths, these causal interventions surpass competitive baselines in reducing toxicity by up to 20%, though fluency can degrade noticeably on GPT-2 Small depending on the aggressiveness.
Crucially, standard NLP benchmark scores upon steering remain stable, indicating that the model's knowledge and general abilities are preserved.
We further show that feature-splitting in wider SAEs hampers safety interventions, underscoring the importance of disentangled feature learning.
Our findings highlight both the promise and the current limitations of SAE-based causal interventions for LLM detoxification, further suggesting practical guidelines for safer language-model deployment.
Published on arXiv on: 2025-05-20T15:55:31Z
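
The abstract describes steering the residual stream with an SAE decoder vector tied to a toxicity feature. Below is a minimal sketch of that operation, assuming a trained SAE is available; the SAE weights, `toxic_feature_idx`, `steering_strength`, and `steer_residual` are illustrative placeholders rather than the paper's actual implementation, and the dimensions only loosely mimic GPT-2 Small.

```python
# Hedged sketch: SAE-based activation steering to suppress a toxicity feature.
# All weights and indices are synthetic placeholders; a real setup would load a
# trained SAE and apply this to residual-stream activations of GPT-2 Small or
# Gemma-2-2B via a forward hook.
import torch

torch.manual_seed(0)

d_model, d_sae = 768, 6144                    # residual width, SAE dictionary size (illustrative)
W_enc = torch.randn(d_model, d_sae) * 0.02    # SAE encoder weights (placeholder)
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) * 0.02    # SAE decoder; rows are feature directions
W_dec = W_dec / W_dec.norm(dim=-1, keepdim=True)  # unit-normalize decoder directions

toxic_feature_idx = 123                       # hypothetical toxicity-related feature
steering_strength = 4.0                       # one "tier of aggressiveness" (value illustrative)


def steer_residual(resid: torch.Tensor) -> torch.Tensor:
    """Remove the toxic feature's decoder direction from residual activations.

    resid: (batch, seq, d_model) activations at the hooked layer.
    The feature's current activation determines how much of its contribution
    to subtract; an extra fixed-strength term pushes further anti-toxic.
    """
    acts = torch.relu(resid @ W_enc + b_enc)                        # SAE feature activations
    toxic_act = acts[..., toxic_feature_idx:toxic_feature_idx + 1]  # (batch, seq, 1)
    direction = W_dec[toxic_feature_idx]                            # (d_model,) decoder vector
    return resid - toxic_act * direction - steering_strength * direction


# Toy usage on fake residual-stream activations.
resid = torch.randn(2, 16, d_model)
steered = steer_residual(resid)
print("max |delta|:", (steered - resid).abs().max().item())
```

In practice a function like this would be registered as a forward hook on the chosen transformer block, so the intervention is applied at generation time without retraining the model.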