CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection

Link: http://arxiv.org/abs/2508.12535v1

PDF Link: http://arxiv.org/pdf/2508.12535v1

Summary: Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision.

However, their effectiveness in downstream steering tasks is limited by the requirement for contrastive datasets or large activation storage.

To address these limitations, we propose CorrSteer, which selects features by correlating sample correctness with SAE activations from generated tokens at inference time.

This approach uses only inference-time activations to extract more relevant features, thereby avoiding spurious correlations.

It also obtains steering coefficients from average activations, automating the entire pipeline.
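
The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each sample has already been reduced to one pooled activation value per SAE feature, uses Pearson correlation against a 0/1 correctness label, and takes the steering coefficient to be the plain mean activation of each selected feature. The names `pearson` and `select_features` and the toy data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(activations, correct, top_k=1):
    """Rank SAE features by correlation with correctness.

    activations: one row per sample, one pooled activation per feature
                 (the pooling over generated tokens is an assumption here).
    correct:     0/1 label per sample.
    Returns a list of (feature_index, correlation, mean_activation) tuples,
    where mean_activation stands in for the steering coefficient.
    """
    n_features = len(activations[0])
    columns = [[row[j] for row in activations] for j in range(n_features)]
    corrs = [pearson(col, correct) for col in columns]
    top = sorted(range(n_features), key=lambda j: corrs[j], reverse=True)[:top_k]
    return [(j, corrs[j], sum(columns[j]) / len(columns[j])) for j in top]


# Toy data: feature 0 fires only on correct samples, feature 1 only on
# incorrect ones, and feature 2 is unrelated to correctness.
toy_activations = [
    [1.0, 0.0, 5.0],
    [1.0, 0.0, 4.0],
    [0.0, 1.0, 5.0],
    [0.0, 1.0, 4.0],
]
toy_correct = [1, 1, 0, 0]
selected = select_features(toy_activations, toy_correct, top_k=1)
```

In this toy case feature 0 is selected, since it is the only feature positively correlated with correctness. How activations are pooled and how coefficients are averaged in the actual method should be taken from the paper itself.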

Our method shows improved task performance on QA, bias mitigation, jailbreaking prevention, and reasoning benchmarks on Gemma 2 2B and LLaMA 3.1 8B, notably achieving a +4.1% improvement in MMLU performance and a +22.9% improvement in HarmBench with only 4000 samples.

Selected features demonstrate semantically meaningful patterns aligned with each task's requirements, revealing the underlying capabilities that drive performance.

Our work establishes correlation-based selection as an effective and scalable approach for automated SAE steering across language model applications.

Published on arXiv on: 2025-08-18T00:01:42Z