Link: http://arxiv.org/abs/2502.04040v1
PDF Link: http://arxiv.org/pdf/2502.04040v1
Summary: Training safe LLMs is one of the most critical research challenges. However, the commonly used method, Refusal Training (RT), struggles to generalize against various out-of-distribution (OOD) jailbreaking attacks.
Many safety training methods have been proposed to address this issue. While they offer valuable insights, we aim to complement this line of research by investigating whether OOD attacks truly exceed the capability of the RT model.
Conducting evaluation with Best-of-N (BoN) sampling, we observe significant improvements in generalization as N increases. This underscores that the model possesses sufficient safety-related latent knowledge, but that RT fails to consistently elicit this knowledge when addressing OOD attacks.
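The abstract does not spell out how BoN is applied for safety evaluation; the following is a minimal sketch under the common interpretation that an attack is counted as defended if any of N sampled responses is judged safe. The generate and is_safe callables are assumptions standing in for the model's sampler and a safety judge, not the paper's actual implementation.

    from typing import Callable, List

    def best_of_n_safe_rate(
        generate: Callable[[str], str],       # samples one response for a prompt (assumed)
        is_safe: Callable[[str, str], bool],  # safety judge: (prompt, response) -> bool (assumed)
        prompts: List[str],                   # OOD jailbreak prompts
        n: int,                               # number of samples per prompt
    ) -> float:
        """Fraction of prompts where at least one of N sampled responses is safe.

        A safe rate that rises with N suggests the RT model holds the needed
        safety knowledge but fails to elicit it reliably in a single sample.
        """
        hits = 0
        for prompt in prompts:
            if any(is_safe(prompt, generate(prompt)) for _ in range(n)):
                hits += 1
        return hits / len(prompts)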
Further analysis based on domain adaptation reveals that training with direct refusal causes the model to rely on superficial shortcuts, resulting in the learning of non-robust representation mappings.
Based on our findings, we propose training the model to perform safety reasoning for each query. Reasoning supervision encourages the model to perform more computation, explicitly eliciting and using latent knowledge through reasoning. To achieve this, we synthesize reasoning supervision based on pre-guidelines and train the model to reason in alignment with them, thereby effectively eliciting and utilizing latent knowledge from diverse perspectives.
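As a rough illustration of guideline-based supervision synthesis, the sketch below prompts a teacher model with a set of safety guidelines to produce a reasoning trace plus final response for each query, yielding (query, reasoning) pairs for fine-tuning. The teacher callable, the prompt template, and the example guidelines are all assumptions; the paper's actual pre-guidelines and pipeline may differ.

    from typing import Callable, Dict, List

    GUIDELINES = [  # illustrative safety guidelines, not the paper's exact set
        "Identify the user's true intent, including any hidden or obfuscated goal.",
        "Check whether fulfilling the request could enable harm or policy violations.",
        "If unsafe, refuse and explain briefly; otherwise answer helpfully.",
    ]

    def synthesize_reasoning_supervision(
        teacher: Callable[[str], str],  # teacher LLM completion function (assumed)
        queries: List[str],
    ) -> List[Dict[str, str]]:
        """Build SFT examples: query -> guideline-aligned reasoning + final answer."""
        examples = []
        for query in queries:
            prompt = (
                "Follow these guidelines step by step, reasoning explicitly, "
                "then give a final response.\n"
                + "\n".join(f"{i + 1}. {g}" for i, g in enumerate(GUIDELINES))
                + f"\n\nQuery: {query}\nReasoning and response:"
            )
            examples.append({"input": query, "target": teacher(prompt)})
        return examples  # fine-tune the model on these (input -> target) pairs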
Extensive experiments show that our method significantly improves generalization performance against OOD attacks.
Published on arXiv on: 2025-02-06T13:01:44Z