Link: http://arxiv.org/abs/2512.02807v1
PDF Link: http://arxiv.org/pdf/2512.02807v1
Summary: Aligning Large Language Models (LLMs) with human preferences typically relies on external supervision, which faces critical limitations: human annotations are scarce and subjective, reward models are vulnerable to reward hacking, and self-evaluation methods suffer from prompt sensitivity and biases.
In this work, we propose stable rank, an intrinsic, annotation-free quality signal derived from model representations.
Stable rank measures the effective dimensionality of hidden states by computing the ratio of total variance to dominant-direction variance, capturing quality through how information distributes across representation dimensions.
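A minimal sketch of how such a stable-rank score might be computed from a response's hidden states, assuming the states are stacked into a token-by-dimension matrix and that stable rank is taken as the squared Frobenius norm over the squared spectral norm (total variance over dominant-direction variance); the function name and the mean-centering step are illustrative assumptions, not necessarily the authors' exact procedure:

```python
import numpy as np

def stable_rank(hidden_states: np.ndarray) -> float:
    """Stable rank of a (num_tokens, hidden_dim) hidden-state matrix.

    Computed as sum(sigma_i^2) / max(sigma_i)^2, i.e. total variance
    divided by the variance along the dominant direction. Centering is
    an illustrative choice, not necessarily the paper's exact recipe.
    """
    H = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    s = np.linalg.svd(H, compute_uv=False)   # singular values, descending
    return float((s ** 2).sum() / (s[0] ** 2 + 1e-12))

# Toy usage: a higher score means variance is spread across more directions.
rng = np.random.default_rng(0)
H = rng.standard_normal((128, 64))           # stand-in for token hidden states
print(stable_rank(H))
```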
Empirically, stable rank achieves 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling.
Leveraging this insight, we introduce Stable Rank Group Relative Policy Optimization (SR-GRPO), which uses stable rank as a reward signal for reinforcement learning.
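A hedged sketch of how stable rank could serve as the per-response reward in a GRPO-style update: several responses are sampled per prompt, each is scored by stable rank, and the scores are normalized within the group to form relative advantages. The helper below and its normalization constant are illustrative assumptions, not the paper's exact training loop:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize per-response rewards within one prompt's sample group.

    Mirrors the standard group-relative baseline used in GRPO:
    advantage_i = (r_i - mean(r)) / (std(r) + eps).
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy usage: stable-rank scores for 4 sampled responses to one prompt.
stable_rank_scores = np.array([11.2, 14.7, 9.8, 13.1])
print(group_relative_advantages(stable_rank_scores))
```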
Without external supervision, SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming both learned reward models and self-evaluation baselines.
Our findings demonstrate that quality signals can be extracted from internal model geometry, offering a path toward scalable alignment without external supervision.
Published on arXiv on: 2025-12-02T14:21:29Z