arxiv papers 1 min read

STAIR: Improving Safety Alignment with Introspective Reasoning

Link: http://arxiv.org/abs/2502.02384v1

PDF Link: http://arxiv.org/pdf/2502.02384v1

Summary: Ensuring the safety and harmlessness of Large Language Models (LLMs) has become as critical as their performance in applications.

However, existing safety alignment methods typically suffer from safety-performance trade-offs and the susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries.

In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Introspective Reasoning.

We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness.

STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS).
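The abstract does not spell out how SI-MCTS scores reasoning steps or how step-level preference pairs are formed, but a minimal sketch helps fix the idea. Everything below (the Node structure, the safety/helpfulness weighting, and the margin used to pair sibling steps) is an illustrative assumption, not the paper's actual SI-MCTS definition.

```python
# Illustrative sketch only: how step-level preference pairs might be read off
# a safety-informed search tree. Node, combined_reward, and the 0.5 weighting
# are assumptions for exposition, not the paper's SI-MCTS algorithm.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One reasoning step; children are alternative next steps."""
    step_text: str
    value: float = 0.0   # running mean of rollout rewards through this step
    visits: int = 0
    children: List["Node"] = field(default_factory=list)


def combined_reward(safety: float, helpfulness: float, alpha: float = 0.5) -> float:
    # Assumed scalarization of the two objectives the abstract mentions.
    return alpha * safety + (1.0 - alpha) * helpfulness


def backpropagate(path: List[Node], reward: float) -> None:
    # Standard MCTS backup: update each node's running mean along the search path.
    for node in path:
        node.visits += 1
        node.value += (reward - node.value) / node.visits


def step_preference_pairs(root: Node, margin: float = 0.1) -> List[Tuple[str, str]]:
    """Sibling steps whose values differ by at least `margin` become (chosen, rejected) pairs."""
    pairs, stack = [], [root]
    while stack:
        node = stack.pop()
        kids = sorted(node.children, key=lambda n: n.value, reverse=True)
        for worse in kids[1:]:
            if kids[0].value - worse.value >= margin:
                pairs.append((kids[0].step_text, worse.step_text))
        stack.extend(node.children)
    return pairs
```

Pairs extracted this way could then feed the iterative preference optimization the abstract describes; the pairing rule above is only one plausible way to turn tree statistics into step-level training data.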

We further train a process reward model on this data to guide test-time searches for improved responses.
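A process reward model (PRM) used to guide test-time search is commonly applied as a reranker over sampled reasoning chains. The sketch below shows one plausible best-of-N variant; the sampler interface, the PRM signature, and the min-over-steps aggregation are assumptions, not the paper's search procedure.

```python
# Illustrative sketch only: PRM-guided best-of-N selection at test time.
# `sample_chain` and `prm_step_score` are hypothetical interfaces, and the
# min-over-steps aggregation is an assumption, not the paper's method.
from typing import Callable, List


def score_chain(prm_step_score: Callable[[str, List[str]], float],
                prompt: str, chain: List[str]) -> float:
    """Aggregate per-step PRM scores; taking the minimum penalizes any single unsafe step."""
    return min(prm_step_score(prompt, chain[: i + 1]) for i in range(len(chain)))


def best_of_n(prompt: str,
              sample_chain: Callable[[str], List[str]],
              prm_step_score: Callable[[str, List[str]], float],
              n: int = 8) -> List[str]:
    """Sample n candidate reasoning chains and keep the one the PRM scores highest."""
    candidates = [sample_chain(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score_chain(prm_step_score, prompt, c))
```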

Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies.

With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks.

Relevant resources in this work are available at https://github.com/thu-ml/STAIR.

Published on arXiv on: 2025-02-04T15:02:55Z