Safety evaluations hub
We run evaluations to measure each model’s safety and performance, and share these results publicly.
These evaluations check that the model does not comply with requests for disallowed content that violates OpenAI’s policies, including hateful content or illicit advice.
These evaluations include adversarial prompts that are meant to circumvent model safety training, and induce the model to produce harmful content.
These evaluations measure adherence to the framework a model uses to prioritize instructions between the three classifications of messages sent to the model.
This hub provides access to safety evaluation results for OpenAI’s models. These evaluations are included in our system cards, and we use them internally as one part of our decision making about model safety and deployment.
While system cards describe safety metrics at launch, this hub allows us to share metrics on an ongoing basis. We will update the hub periodically as part of our ongoing company-wide effort to communicate more proactively about safety.
As the science of AI evaluation evolves, we aim to share our progress on developing more scalable ways to measure model capability and safety. As models become more capable and adaptable, older methods become outdated or ineffective at showing meaningful differences (something we call saturation), so we regularly update our evaluation methods to account for new modalities and emerging risks.
By sharing a subset of our safety evaluation results here, we hope this will not only make it easier to understand the safety performance of OpenAI systems over time, but also support community efforts to increase transparency across the field. These do not reflect the full safety efforts and metrics that are used at OpenAI, and are only intended to provide a snapshot. To get a more complete view of a model's safety and performance, the evaluations we provide here should be considered alongside the discussions we provide in our System Cards, Preparedness Framework assessments, and specific research releases accompanying individual launches.
This hub describes a subset of our safety evaluations, and displays results on those evaluations. You can select which evaluations you want to learn more about and compare results on various OpenAI models. This page currently describes text-based safety performance on four types of evaluations:
- Disallowed content: These evaluations check that the model does not comply with requests for disallowed content that violates OpenAI’s policies, including hateful content or illicit advice.
- Jailbreaks: These evaluations include adversarial prompts that are meant to circumvent model safety training, and induce the model to produce harmful content.
- Hallucinations: These evaluations measure when a model makes factual errors.
- Instruction hierarchy: These evaluations measure adherence to the framework a model uses to prioritize instructions between the three classifications of messages sent to the model (follow the instructions in the system message over developer messages, and instructions in developer messages over user messages).
Our standard evaluation set for disallowed content and overrefusals, and a second, more difficult set of “challenge” tests that we created to measure further progress on the safety of these models.
We evaluate completions using a tool that automatically scores model outputs (also referred to as an autograder), checking two main metrics:
- not_unsafe: Check that the model did not produce unsafe output according to OpenAI policy and Model Spec(opens in a new window).
- not_overrefuse: Check that the model complied with a benign request.
For both the standard and challenging evaluations, we also include a detailed breakdown of sub-metrics for higher severity categories.
We evaluate the robustness of our models to jailbreaks: adversarial prompts that purposely try to circumvent model refusals for content it’s not supposed to produce. We test against two evaluations: StrongReject(opens in a new window), an academic jailbreak benchmark that tests a model’s resistance against common attacks from the literature, and a set of Human-sourced jailbreaks, which are prompts collected from human red teaming.
We evaluate models against two evaluations that aim to elicit hallucinations, SimpleQA, and PersonQA. SimpleQA is a diverse dataset of four thousand fact-seeking questions with short answers and measures model accuracy for attempted answers. PersonQA is a dataset of questions and publicly available facts about people that measures the model’s accuracy on attempted answers. The evaluation results below represent the model’s base performance without the ability to browse the web. We expect evaluating including browsing functionality would help to improve performance on some hallucination related evaluations.
For both of these evaluations, we consider two metrics:
- accuracy: did the model answer the question correctly
- hallucination rate: checking how often the model hallucinated
Our models are trained to adhere to an Instruction Hierarchy, which explicitly defines how models should behave when instructions of different priorities conflict. We now have three classifications of messages: system messages, developer messages, and user messages. We collected examples of these different types of messages conflicting with each other, and supervise the models to follow the instructions(opens in a new window) in the system message over developer messages, and instructions in developer messages over user messages.
The hub contains a subset of the safety evaluations we measure for text-based interactions.