Link: http://arxiv.org/abs/2509.09660v1
PDF Link: http://arxiv.org/pdf/2509.09660v1
Summary: Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFNs), known as experts.
We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-linked experts.
Our detection method identifies experts with distinct activation patterns across paired inputs exhibiting contrasting behaviors.
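A minimal sketch of this detection step, assuming access to the router's per-token expert selections; the expert count, function names, and top-k cutoff below are illustrative assumptions, not the paper's code:

    import numpy as np

    NUM_EXPERTS = 64  # assumed expert count for one MoE layer

    def activation_rates(expert_hits):
        """expert_hits[i] = expert indices the router selected for token i."""
        counts = np.zeros(NUM_EXPERTS)
        for hits in expert_hits:
            for e in hits:
                counts[e] += 1
        return counts / max(len(expert_hits), 1)

    def behavior_linked_experts(pos_hits, neg_hits, top_k=8):
        """Rank experts by the activation-rate gap between paired inputs
        exhibiting contrasting behaviors (e.g., safe vs. unsafe)."""
        gap = activation_rates(pos_hits) - activation_rates(neg_hits)
        return np.argsort(-np.abs(gap))[:top_k], gap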
By selectively (de)activating such experts during inference, we control behaviors like faithfulness and safety without retraining or modifying weights.
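One way such inference-time (de)activation could be wired in is by masking the router's logits before top-k expert selection; this PyTorch sketch is an assumed mechanism, not necessarily the authors' implementation:

    import torch

    def steer_router_logits(router_logits, deactivate=(), activate=()):
        """Edit router logits ahead of top-k selection: suppressed experts
        can never be chosen; boosted experts are (almost) always chosen."""
        logits = router_logits.clone()
        if deactivate:
            logits[..., list(deactivate)] = float("-inf")
        if activate:
            logits[..., list(activate)] += 1e4
        return logits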
Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by +27%.
In adversarial attack mode, it drops safety by -41% alone, and by -100% when combined with existing jailbreak methods, bypassing all safety guardrails and exposing a new dimension of alignment faking hidden within experts.
Published on arXiv on: 2025-09-11T17:55:09Z