Link: http://arxiv.org/abs/2505.20809v1
PDF Link: http://arxiv.org/pdf/2505.20809v1
Summary: Steering methods for language models (LMs) seek to provide fine-grained and interpretable control over model generations by variously changing model inputs, weights, or representations to adjust behavior.
Recent work has shown that adjusting weights or representations is often less effective than steering by prompting, for instance when introducing or suppressing a particular concept.
We demonstrate how to improve representation steering via our new Reference-free Preference Steering (RePS), a bidirectional preference-optimization objective that jointly performs concept steering and suppression.
We train three parameterizations of RePS and evaluate them on AxBench, a large-scale model steering benchmark.
On Gemma models with sizes ranging from 2B to 27B, RePS outperforms all existing steering methods trained with a language modeling objective and substantially narrows the gap with prompting, while promoting interpretability and minimizing parameter count.
In suppression, RePS matches the language-modeling objective on Gemma-2 and outperforms it on the larger Gemma-3 variants while remaining resilient to prompt-based jailbreaking attacks that defeat prompting.
Overall, our results suggest that RePS provides an interpretable and robust alternative to prompting for both steering and suppression.
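The abstract does not spell out RePS itself, but the representation-steering setup it builds on is commonly implemented by adding a learned concept direction to a layer's hidden states, with the sign of the scaling controlling steering versus suppression. The sketch below is a minimal illustration of that general setup under assumed shapes and names (`steer`, `alpha`, the toy dimensions); it is not the paper's method or API.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add a scaled, unit-norm steering direction to hidden activations.

    hidden:    (seq_len, d_model) activations from one layer
    direction: (d_model,) concept vector (hypothetical; in practice learned)
    alpha:     positive to promote the concept, negative to suppress it
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

# Toy example: 4 token positions, 8-dimensional hidden states.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
v = rng.normal(size=8)

promoted = steer(h, v, alpha=2.0)     # push activations toward the concept
suppressed = steer(h, v, alpha=-2.0)  # push activations away from it

# The projection onto the concept direction shifts by exactly alpha.
unit = v / np.linalg.norm(v)
print(np.allclose(promoted @ unit, h @ unit + 2.0))    # True
print(np.allclose(suppressed @ unit, h @ unit - 2.0))  # True
```

In practice the direction would be learned per concept (as RePS's preference objective does for its parameterizations) and applied inside the model via a forward hook rather than on a standalone array.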
Published on arXiv on: 2025-05-27T07:16:40Z