Link: http://arxiv.org/abs/2502.02384v1
PDF Link: http://arxiv.org/pdf/2502.02384v1
Summary: Ensuring the safety and harmlessness of Large Language Models (LLMs) has become as critical as their performance in applications.
However, existing safety alignment methods typically suffer from safety-performance trade-offs and susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries.
In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Introspective Reasoning.
We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness.
STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS).
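To make the step-level search concrete, here is a minimal sketch of how a safety-informed MCTS might select among candidate reasoning steps with a UCT rule. The Node fields, the uct() rule, and the linear combination of safety and helpfulness scores in combined_reward() are illustrative assumptions for exposition only, not the exact SI-MCTS reward defined in the paper.

# Illustrative sketch: UCT selection over reasoning steps with a
# safety-and-helpfulness reward. All names and the reward combination
# below are assumptions, not the paper's exact formulation.
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    step_text: str                      # partial chain-of-thought ending at this step
    visits: int = 0
    value_sum: float = 0.0              # accumulated safety-informed reward
    children: list = field(default_factory=list)

def combined_reward(safety: float, helpfulness: float, alpha: float = 0.5) -> float:
    # Toy combination of safety and helpfulness scores, each assumed in [0, 1].
    return alpha * safety + (1 - alpha) * helpfulness

def select_child(parent: Node, c_explore: float = 1.4) -> Node:
    # Pick the child maximizing a standard UCT score over the combined reward.
    def uct(child: Node) -> float:
        if child.visits == 0:
            return float("inf")         # explore unvisited steps first
        exploit = child.value_sum / child.visits
        explore = c_explore * math.sqrt(math.log(parent.visits + 1) / child.visits)
        return exploit + explore
    return max(parent.children, key=uct)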
We further train a process reward model on this data to guide test-time searches for improved responses.
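A minimal sketch of one way such a process reward model could guide test-time search is best-of-N reranking over sampled reasoning chains. The helpers generate_candidates and prm_score_step are hypothetical stand-ins for the policy model and the trained reward model, and aggregating step scores by their minimum is one common convention rather than necessarily the procedure used in STAIR.

# Illustrative best-of-N selection guided by a process reward model (PRM).
# The callables and the min-over-steps aggregation are assumptions.
from typing import Callable, List

def best_of_n(prompt: str,
              generate_candidates: Callable[[str, int], List[List[str]]],
              prm_score_step: Callable[[str, List[str]], float],
              n: int = 8) -> List[str]:
    # Each candidate is a list of reasoning steps produced by the policy model.
    candidates = generate_candidates(prompt, n)
    def score(steps: List[str]) -> float:
        # Score every prefix of the chain and keep the weakest step's score.
        return min(prm_score_step(prompt, steps[:i + 1]) for i in range(len(steps)))
    # Return the candidate whose weakest step the PRM rates highest.
    return max(candidates, key=score)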
Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies.
With test-time scaling, STAIR achieves safety performance comparable to Claude-3.5 against popular jailbreak attacks.
Relevant resources in this work are available at https://github.com/thu-ml/STAIR.
Published on arXiv on: 2025-02-04T15:02:55Z