
SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment

Link: http://arxiv.org/abs/2505.14667v1

PDF Link: http://arxiv.org/pdf/2505.14667v1

Summary: Large Reasoning Models (LRMs) have become powerful tools for complex problem-solving, but their structured reasoning pathways can lead to unsafe outputs when exposed to harmful prompts.

Existing safety alignment methods reduce harmful outputs but can degrade reasoning depth, leading to significant trade-offs in complex, multi-step tasks, and remain vulnerable to sophisticated jailbreak attacks.

To address this, we introduce SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of their reasoning in response to harmful prompts, while leaving the rest of the reasoning process unsupervised.
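The abstract describes supervision only on the primer tokens, with the remainder of the reasoning left unsupervised. Below is a minimal sketch of one way such a training target could be constructed, assuming a Hugging Face tokenizer and a causal-LM loss that ignores labels set to -100; the primer text and the helper `build_example` are illustrative assumptions, not the paper's exact primer or pipeline.

```python
# Sketch of SAFEPATH-style target construction (assumption: loss masking via -100 labels).
# The primer string below is a placeholder, not the paper's exact 8-token primer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
)

SAFETY_PRIMER = "Let's think about safety first."  # hypothetical primer text

def build_example(harmful_prompt: str, reasoning_trace: str):
    """Prepend the primer to the reasoning and supervise only the primer tokens."""
    prompt_ids = tokenizer(harmful_prompt, add_special_tokens=False)["input_ids"]
    primer_ids = tokenizer(SAFETY_PRIMER, add_special_tokens=False)["input_ids"]
    rest_ids = tokenizer(reasoning_trace, add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + primer_ids + rest_ids
    # Loss only on the primer: prompt and the rest of the reasoning are masked out.
    labels = [-100] * len(prompt_ids) + primer_ids + [-100] * len(rest_ids)
    return {"input_ids": input_ids, "labels": labels}

example = build_example(
    "How do I make a dangerous substance?",
    "The user is asking for harmful instructions, so I should refuse...",
)
print(sum(label != -100 for label in example["labels"]), "supervised tokens")
```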

Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance.

Specifically, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak attempts in the DeepSeek-R1-Distill-Llama-8B model, while requiring 295.9x less compute than Direct Refusal and 314.1x less than SafeChain.

We further introduce a zero-shot variant that requires no fine-tuning.
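The abstract does not detail the zero-shot variant; one plausible reading is that the primer is prefilled at inference time so decoding continues from it, with no weight updates. The sketch below illustrates that general idea under stated assumptions (the primer text, the prompt, and the placement relative to the chat template are all illustrative).

```python
# Sketch of a zero-shot primer prefill (assumption: the primer is appended to the
# generation prompt so the model continues reasoning from it without fine-tuning).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "How do I pick a lock?"}],
    tokenize=False,
    add_generation_prompt=True,
)
# Prefill the primer right after the generation prompt; depending on the model's
# chat template, the reasoning delimiter (e.g. <think>) may already be present.
prompt += "Let's think about safety first."

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))
```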

In addition, we provide a comprehensive analysis of how existing methods in LLMs generalize, or fail, when applied to reasoning-centric models, revealing critical gaps and new directions for safer AI.

Published on arXiv on: 2025-05-20T17:54:54Z