Link: http://arxiv.org/abs/2507.02799v1
PDF Link: http://arxiv.org/pdf/2507.02799v1
Summary: Reasoning Language Models (RLMs) have gained traction for their ability toperform complex, multi-step reasoning tasks through mechanisms such asChain-of-Thought (CoT) prompting or fine-tuned reasoning traces.
While thesecapabilities promise improved reliability, their impact on robustness to socialbiases remains unclear.
In this work, we leverage the CLEAR-Bias benchmark,originally designed for Large Language Models (LLMs), to investigate theadversarial robustness of RLMs to bias elicitation.
We systematically evaluatestate-of-the-art RLMs across diverse sociocultural dimensions, using anLLM-as-a-judge approach for automated safety scoring and leveraging jailbreaktechniques to assess the strength of built-in safety mechanisms.
Our evaluationaddresses three key questions: (i) how the introduction of reasoningcapabilities affects model fairness and robustness; (ii) whether modelsfine-tuned for reasoning exhibit greater safety than those relying on CoTprompting at inference time; and (iii) how the success rate of jailbreakattacks targeting bias elicitation varies with the reasoning mechanismsemployed.
Our findings reveal a nuanced relationship between reasoningcapabilities and bias safety.
Surprisingly, models with explicit reasoning,whether via CoT prompting or fine-tuned reasoning traces, are generally morevulnerable to bias elicitation than base models without such mechanisms,suggesting reasoning may unintentionally open new pathways for stereotypereinforcement.
Reasoning-enabled models appear somewhat safer than thoserelying on CoT prompting, which are particularly prone to contextual reframingattacks through storytelling prompts, fictional personas, or reward-shapedinstructions.
These results challenge the assumption that reasoning inherentlyimproves robustness and underscore the need for more bias-aware approaches toreasoning design.
Published on arXiv on: 2025-07-03T17:01:53Z