OpenAI o1 System Card
This report outlines the safety work carried out prior to releasing OpenAI o1-preview and o1-mini, including external red teaming and frontier risk evaluations according to our Preparedness Framework.
OpenAI o1 Scorecard
Key Areas of Evaluation
- Disallowed content
- Training data regurgitation
- Hallucinations
- Bias
Preparedness Scorecard
- CBRN: Medium
- Model Autonomy: Low
- Cybersecurity: Low
- Persuasion: Medium
Scorecard ratings
- Low
- Medium
- High
- Critical
Improved or comparable performance relative to the previous flagship OpenAI model
Introduction
We thoroughly evaluate new models for potential risks and build in appropriate safeguards before deploying them in ChatGPT or the API. We’re publishing the OpenAI o1 System Card together with the Preparedness Framework scorecard to provide a rigorous safety assessment of o1, including what we’ve done to address current safety challenges and frontier risks.
Building on the safety evaluations and mitigations we developed for past models, we’ve focused additional efforts on o1’s advanced reasoning capabilities. We used both public and internal evaluations to measure risks such as disallowed content, demographic fairness, hallucination tendency, and dangerous capabilities. Based on these evaluations, we’ve implemented safeguards at both the model and system levels, such as blocklists and safety classifiers, to mitigate these risks effectively.
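As a rough illustration of how system-level mitigations like these can be layered around a model, the sketch below combines a simple blocklist check with a classifier score that gates a response before it is returned. It is a minimal sketch only: the blocklist contents, the classify_safety helper, and the risk threshold are hypothetical placeholders for illustration, not a description of OpenAI’s production safeguards.

```python
# Illustrative sketch of layered system-level safeguards.
# All names and values below (BLOCKLIST, classify_safety, risk_threshold)
# are hypothetical placeholders, not OpenAI's actual implementation.

from dataclasses import dataclass

# Hypothetical blocklist of phrases that should never appear in output.
BLOCKLIST = {"example banned phrase", "another banned phrase"}


@dataclass
class SafetyVerdict:
    allowed: bool
    reason: str


def classify_safety(text: str) -> float:
    """Hypothetical safety classifier returning a risk score in [0, 1].

    A production system would call a trained moderation model here;
    this placeholder uses a trivial heuristic purely for illustration.
    """
    return 0.9 if any(term in text.lower() for term in BLOCKLIST) else 0.1


def apply_safeguards(model_output: str, risk_threshold: float = 0.5) -> SafetyVerdict:
    """Gate a model response with a blocklist check, then a classifier score."""
    if any(term in model_output.lower() for term in BLOCKLIST):
        return SafetyVerdict(allowed=False, reason="blocklist match")
    if classify_safety(model_output) >= risk_threshold:
        return SafetyVerdict(allowed=False, reason="classifier risk above threshold")
    return SafetyVerdict(allowed=True, reason="passed checks")


if __name__ == "__main__":
    print(apply_safeguards("A benign answer about cooking pasta."))
```

In a deployed system, checks of this kind typically run alongside model-level mitigations rather than replacing them, so a response must pass both the model’s own refusal behavior and the surrounding system filters.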
Our findings indicate that o1’s advanced reasoning improves safety: because the model can reason about our safety rules in context and apply them more effectively, it is more resilient to generating harmful content. Under our Preparedness Framework, o1 is rated “medium” risk overall and is safe to deploy because it doesn’t enable anything beyond what’s possible with existing resources, with a “low” risk level in Cybersecurity and Model Autonomy and a “medium” risk level in CBRN and Persuasion.
OpenAI’s Safety Advisory Group, the Safety & Security Committee, and the OpenAI Board reviewed the safety and security protocols applied to o1 as well as the in-depth Preparedness evaluations, and approved o1’s release.
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1-preview and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
We invite you to read the details of this work in the report below.