Skip to main content
OpenAI

Last updated: May 14, 2025

Safety evaluations hub

We run evaluations to measure each model’s safety and performance, and share these results publicly.

Disallowed content

These evaluations check that the model does not comply with requests for disallowed content that violates OpenAI’s policies, including hateful content or illicit advice.

Jailbreaks

These evaluations include adversarial prompts that are meant to circumvent model safety training, and induce the model to produce harmful content.

Hallucinations

These evaluations measure when a model makes factual errors.

Instruction hierarchy

These evaluations measure adherence to the framework a model uses to prioritize instructions between the three classifications of messages sent to the model.

This hub provides access to safety evaluation results for OpenAI’s models. These evaluations are included in our system cards, and we use them internally as one part of our decision making about model safety and deployment.

While system cards describe safety metrics at launch, this hub allows us to share metrics on an ongoing basis. We will update the hub periodically as part of our ongoing company-wide effort to communicate more proactively about safety.

As the science of AI evaluation evolves, we aim to share our progress on developing more scalable ways to measure model capability and safety. As models become more capable and adaptable, older methods become outdated or ineffective at showing meaningful differences (something we call saturation), so we regularly update our evaluation methods to account for new modalities and emerging risks.

By sharing a subset of our safety evaluation results here, we hope this will not only make it easier to understand the safety performance of OpenAI systems over time, but also support community efforts to increase transparency across the field. These do not reflect the full safety efforts and metrics that are used at OpenAI, and are only intended to provide a snapshot. To get a more complete view of a model's safety and performance, the evaluations we provide here should be considered alongside the discussions we provide in our System Cards, Preparedness Framework assessments, and specific research releases accompanying individual launches.

How to use this page

This hub describes a subset of our safety evaluations, and displays results on those evaluations. You can select which evaluations you want to learn more about and compare results on various OpenAI models. This page currently describes text-based safety performance on four types of evaluations:

  • Disallowed content: These evaluations check that the model does not comply with requests for disallowed content that violates OpenAI’s policies, including hateful content or illicit advice.
  • Jailbreaks: These evaluations include adversarial prompts that are meant to circumvent model safety training, and induce the model to produce harmful content.
  • Hallucinations: These evaluations measure when a model makes factual errors.
  • Instruction hierarchy: These evaluations measure adherence to the framework a model uses to prioritize instructions between the three classifications of messages sent to the model (follow the instructions in the system message over developer messages, and instructions in developer messages over user messages).

Disallowed content

Our standard evaluation set for disallowed content and overrefusals, and a second, more difficult set of “challenge” tests that we created to measure further progress on the safety of these models.

We evaluate completions using a tool that automatically scores model outputs (also referred to as an autograder), checking two main metrics:

  • not_unsafe: Check that the model did not produce unsafe output according to OpenAI policy and Model Spec⁠(opens in a new window).
  • not_overrefuse: Check that the model complied with a benign request.

For both the standard and challenging evaluations, we also include a detailed breakdown of sub-metrics for higher severity categories.

Jailbreak evaluations

We evaluate the robustness of our models to jailbreaks: adversarial prompts that purposely try to circumvent model refusals for content it’s not supposed to produce. We test against two evaluations: StrongReject(opens in a new window), an academic jailbreak benchmark that tests a model’s resistance against common attacks from the literature, and a set of Human-sourced jailbreaks, which are prompts collected from human red teaming.

An academic jailbreak benchmark that tests a model’s resistance against common attacks from the literature. We calculate goodness@0.1, which is the safety of the model when evaluated against the top 10% of jailbreak techniques per prompt.

Hallucination evaluations

We evaluate models against two evaluations that aim to elicit hallucinations, SimpleQA, and PersonQA. SimpleQA is a diverse dataset of four thousand fact-seeking questions with short answers and measures model accuracy for attempted answers. PersonQA is a dataset of questions and publicly available facts about people that measures the model’s accuracy on attempted answers. The evaluation results below represent the model’s base performance without the ability to browse the web. We expect evaluating including browsing functionality would help to improve performance on some hallucination related evaluations.

For both of these evaluations, we consider two metrics:

  • accuracy: did the model answer the question correctly
  • hallucination rate: checking how often the model hallucinated
A diverse dataset of four thousand fact-seeking questions with short answers and measures model accuracy for attempted answers.

Instruction hierarchy

Our models are trained to adhere to an Instruction Hierarchy, which explicitly defines how models should behave when instructions of different priorities conflict. We now have three classifications of messages: system messages, developer messages, and user messages. We collected examples of these different types of messages conflicting with each other, and supervise the models to follow the instructions(opens in a new window) in the system message over developer messages, and instructions in developer messages over user messages.

To pass this eval, the model must choose to follow the instructions in the highest priority message.

FAQ

The hub contains a subset of the safety evaluations we measure for text-based interactions.