Link: http://arxiv.org/abs/2505.15710v1
PDF Link: http://arxiv.org/pdf/2505.15710v1
Summary: The rapid advancement of large language models (LLMs) has demonstrated milestone success in a variety of tasks, yet their potential for generating harmful content has raised significant safety concerns.
Existing safety evaluation approaches typically operate directly on textual responses, overlooking the rich information embedded in the model's internal representations.
In this paper, we propose Safety Representation Ranking (SRR), a listwise ranking framework that selects safe responses using hidden states from the LLM itself.
SRR encodes both instructions and candidate completions using intermediate transformer representations and ranks candidates via a lightweight similarity-based scorer.
Our approach directly leverages internal model states and supervision at the list level to capture subtle safety signals.
Experiments across multiple benchmarks show that SRR significantly improves robustness to adversarial prompts.
Our code will be available upon publication.
Published on arXiv on: 2025-05-21T16:21:29Z
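
The paper's code is not yet released, so the following is only a minimal sketch of the listwise, similarity-based ranking idea described in the summary: pooled intermediate-layer hidden states for the instruction and each candidate response are projected by a lightweight scorer, candidates are scored by similarity to the instruction, and training uses a listwise objective over the candidate list. All class and function names, dimensions, the mean-pooling choice, and the softmax cross-entropy loss are assumptions for illustration, not the authors' implementation.

# Sketch (assumed, not from the paper): listwise similarity ranking
# over intermediate transformer hidden states.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityScorer(nn.Module):
    """Lightweight scorer: projects pooled hidden states of the instruction
    and of each candidate completion, then scores candidates by similarity."""
    def __init__(self, hidden_dim: int, proj_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, proj_dim)

    def forward(self, instr_hidden: torch.Tensor, cand_hidden: torch.Tensor) -> torch.Tensor:
        # instr_hidden: (batch, hidden_dim), e.g. mean-pooled intermediate-layer
        #               states of the instruction tokens.
        # cand_hidden:  (batch, n_candidates, hidden_dim), pooled states of each candidate.
        q = F.normalize(self.proj(instr_hidden), dim=-1)   # (batch, proj_dim)
        k = F.normalize(self.proj(cand_hidden), dim=-1)    # (batch, n_cand, proj_dim)
        # Cosine-style similarity between instruction and every candidate.
        return torch.einsum("bd,bnd->bn", q, k)            # (batch, n_cand)

def listwise_loss(scores: torch.Tensor, safety_labels: torch.Tensor) -> torch.Tensor:
    # Listwise supervision: softmax cross-entropy toward the normalized
    # distribution of (non-negative) safety labels over the candidate list.
    target = safety_labels / safety_labels.sum(dim=-1, keepdim=True)
    return -(target * F.log_softmax(scores, dim=-1)).sum(dim=-1).mean()

if __name__ == "__main__":
    # Toy shapes; in practice hidden states would come from an intermediate
    # layer of the LLM (e.g. output_hidden_states=True), pooled over tokens.
    batch, n_cand, hidden = 2, 4, 4096
    scorer = SimilarityScorer(hidden)
    instr = torch.randn(batch, hidden)
    cands = torch.randn(batch, n_cand, hidden)
    scores = scorer(instr, cands)
    labels = torch.tensor([[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0]])
    print(listwise_loss(scores, labels).item())
    # At inference, the highest-scoring candidate is selected as the safe response.
    print(scores.argmax(dim=-1))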