Link: http://arxiv.org/abs/2412.16720v1
PDF Link: http://arxiv.org/pdf/2412.16720v1
Summary: The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought.
These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models.
In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment.
This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks.
Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence.
Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols.
This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
Published on arXiv on: 2024-12-21T18:04:31Z