Introducing SimpleQA
A factuality benchmark called SimpleQA that measures the ability of language models to answer short, fact-seeking questions.
An open problem in artificial intelligence is how to train models that produce responses that are factually correct. Current language models sometimes produce false outputs or answers unsubstantiated by evidence, a problem known as “hallucinations”. Language models that generate more accurate responses with fewer hallucinations are more trustworthy and can be used in a broader range of applications. To measure the factuality of language models, we are open-sourcing a new benchmark called SimpleQA.
About the SimpleQA benchmark
Factuality is a complicated topic because it is hard to measure—evaluating the factuality of any given arbitrary claim is challenging, and language models can generate long completions that contain dozens of factual claims. In SimpleQA, we focus on short, fact-seeking queries, which reduces the scope of the benchmark but makes measuring factuality much more tractable.
With SimpleQA, our goal was to create a dataset with the following properties:
High correctness. Reference answers to questions are supported by sources from two independent AI trainers, and questions were written in such a way that the predicted answers are easy to grade.
Diversity. SimpleQA covers a wide range of topics, from science and technology to TV shows and video games.
Challenging for frontier models. Compared to older benchmarks such as TriviaQA (2017) or NQ (2019), which have become saturated, SimpleQA was created to be a greater challenge for frontier models (e.g., GPT-4o scores less than 40%).
Good researcher UX. SimpleQA is intended to be fast and simple to run due to its concise questions and answers. Grading is also efficient whether through the OpenAI API or another frontier model API. Additionally, with 4,326 questions, SimpleQA should have relatively low variance as an evaluation benchmark.
We hired AI trainers to browse the web and create short, fact-seeking questions and corresponding answers. To be included in the dataset, each question had to meet a strict set of criteria: it must have a single, indisputable answer for easy grading; the answer to the question should not change over time; and most questions had to induce hallucinations from either GPT-4o or GPT-3.5. To further improve the quality of the dataset, a second, independent AI trainer answered each question without seeing the original response. Only questions where both AI trainers’ answers agreed were included.
As a final verification of quality, we had a third AI trainer answer a random sample of 1,000 questions from the dataset. We found that the third AI trainer’s answer matched the original agreed answers 94.4% of the time, with a 5.6% disagreement rate. We then manually inspected these examples and found that 2.8 percentage points of the 5.6% disagreement rate were due to grader false negatives or human errors from the third trainer (e.g., incomplete answers or misinterpreting sources), and the remaining 2.8 percentage points were due to real issues with the question (e.g., ambiguous questions, or different websites giving conflicting answers). Hence, we estimate the inherent error rate of this dataset to be approximately 3%.
Question diversity in SimpleQA
The pie chart below shows the diversity of topics in the SimpleQA benchmark, with example questions for each topic.
Using SimpleQA to compare language models
To grade questions, we use a prompted ChatGPT classifier that sees both the predicted answer from the model and the ground-truth answer, and then grades the predicted answer as either “correct”, “incorrect”, or “not attempted”.
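To make the grading step concrete, here is a minimal sketch of such a prompted classifier written against the openai Python client. The grader prompt, model name, and one-word output format below are illustrative assumptions for the sketch, not the exact prompt used to grade SimpleQA.

```python
# Minimal sketch of a prompted grader, assuming the `openai` Python client.
# The prompt wording and model name are illustrative, not the exact SimpleQA grader.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GRADER_TEMPLATE = """You are grading an answer to a fact-seeking question.

Question: {question}
Reference answer: {target}
Predicted answer: {predicted}

Reply with exactly one word:
CORRECT - the predicted answer fully contains the reference answer without contradicting it
INCORRECT - the predicted answer contradicts the reference answer, even if hedged
NOT_ATTEMPTED - the reference answer is not given and nothing contradicts it
"""


def grade(question: str, target: str, predicted: str, model: str = "gpt-4o") -> str:
    """Return 'CORRECT', 'INCORRECT', or 'NOT_ATTEMPTED' for one prediction."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": GRADER_TEMPLATE.format(
                question=question, target=target, predicted=predicted),
        }],
    )
    return response.choices[0].message.content.strip()
```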
A definition and corresponding examples for each grade are shown in the table below.
| Grade | Definition | Examples for the question “Which Dutch player scored an open-play goal in the 2022 Netherlands vs Argentina game in the men’s FIFA World Cup?” (Answer: Wout Weghorst) |
| --- | --- | --- |
| “Correct” | The predicted answer fully contains the ground-truth answer without contradicting the reference answer. | “Wout Weghorst”; “Wout Weghorst scored in that game.” |
| “Incorrect” | The predicted answer contradicts the ground-truth answer in any way, even if the contradiction is hedged. | “Virgil van Dijk”; “I think it was Wout Weghorst, but it may have been Virgil van Dijk.” |
| “Not attempted” | The ground-truth answer is not fully given in the response, and the response does not contradict the reference answer. | “I don’t know the answer to that question.”; “A Dutch player scored a goal in that game.” |
A model will ideally answer as many questions as possible (highest number of correct), while minimizing the number of incorrect answers.
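As a rough illustration, a run can then be summarized by counting the share of each grade; the helper below is hypothetical and simply turns a list of per-question grades into dataset-level fractions.

```python
# Hypothetical helper: summarize a run as fractions of the dataset,
# using the three grade labels from the grading sketch above.
from collections import Counter

def summarize(grades: list[str]) -> dict[str, float]:
    counts = Counter(grades)
    total = len(grades)
    return {g: counts.get(g, 0) / total
            for g in ("CORRECT", "INCORRECT", "NOT_ATTEMPTED")}

# Example with four graded questions:
print(summarize(["CORRECT", "CORRECT", "INCORRECT", "NOT_ATTEMPTED"]))
# {'CORRECT': 0.5, 'INCORRECT': 0.25, 'NOT_ATTEMPTED': 0.25}
```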
Using this classification, we can then measure the performance of several OpenAI models without browsing, including gpt-4o-mini, o1-mini, gpt-4o, and o1-preview. As expected, gpt-4o-mini and o1-mini answer fewer questions correctly compared to gpt-4o and o1-preview, likely because smaller models typically have less world knowledge. We also see that o1-mini and o1-preview, which are designed to spend more time thinking, choose to "not attempt" questions more often than gpt-4o-mini and gpt-4o. This may be because they can use their reasoning capacity to recognize when they don’t know the answer to a question, instead of hallucinating.
Using SimpleQA to measure the calibration of large language models
A factuality benchmark like SimpleQA also allows us to measure the scientific phenomenon known as calibration, or whether language models “know what they know.” One way to measure calibration is to directly ask the language model to state its confidence in its answer using the prompt: “Please give your best guess, along with your confidence as a percentage that that is the correct answer.” Then we can plot the relationship between the model’s stated confidence and how accurate the model actually is. A perfectly calibrated model would have actual accuracy equal to its stated confidence: for instance, on all prompts where the model stated a confidence of 75%, the accuracy would be 75%.
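As a sketch of how such a calibration plot could be computed, assuming we have recorded each question’s stated confidence together with whether the answer was graded correct, we can bin questions by stated confidence and compare each bin’s average confidence with its accuracy:

```python
# Sketch: bin questions by stated confidence and compare each bin's
# average stated confidence with its actual accuracy. A perfectly
# calibrated model lies on the y = x line.
def calibration_curve(records: list[tuple[float, bool]], bin_width: float = 10.0):
    """records: (stated confidence in percent, answer graded correct) pairs."""
    bins: dict[int, list[tuple[float, bool]]] = {}
    for confidence, is_correct in records:
        index = min(int(confidence // bin_width), int(100 // bin_width) - 1)
        bins.setdefault(index, []).append((confidence, is_correct))
    curve = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_confidence = sum(c for c, _ in pairs) / len(pairs)
        accuracy = 100.0 * sum(ok for _, ok in pairs) / len(pairs)
        curve.append((mean_confidence, accuracy))
    return curve
```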
This result is shown in the figure below. The positive correlation between stated confidence and accuracy is a reassuring sign that models have some notion of confidence. We see that o1-preview is more calibrated than o1-mini, and gpt-4o is more calibrated than gpt-4o-mini, which is consistent with prior work showing that larger models are more calibrated. However, the fact that accuracy lies well below the line y = x means that models consistently overstate their confidence. Hence, there is a lot of room to improve the calibration of large language models in terms of stated confidence.
Another way to measure calibration is to ask the language model the same question 100 times. Since language models may produce different answers upon repeated attempts, we can assess whether the frequency of a particular answer corresponds to its correctness. Higher frequency typically indicates that the model is more confident in its answers, as the model is giving the same answer repeatedly. A well-calibrated model would have the same actual accuracy as frequency.
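A minimal sketch of this frequency-based measure, assuming we have already collected the repeated samples for a question, might group answers by exact string match and report the most frequent answer together with its share of the samples:

```python
# Sketch: sample the same question many times, group answers by exact
# string match, and use the most frequent answer's share as a confidence.
from collections import Counter

def frequency_confidence(answers: list[str]) -> tuple[str, float]:
    counts = Counter(a.strip().lower() for a in answers)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(answers)

# Illustration with 5 samples instead of 100:
samples = ["Wout Weghorst", "Wout Weghorst", "Virgil van Dijk",
           "Wout Weghorst", "wout weghorst"]
print(frequency_confidence(samples))  # ('wout weghorst', 0.8)
```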
In the plot below, we show the calibration of language models as measured by the frequency of their responses. Here, we simply use string match to group together different answers from the language model. We see across all models that accuracy increases with frequency, and that o1-preview has the highest level of calibration, with the frequency of a response roughly matching its accuracy. As with the stated-confidence calibration plot above, we again see that o1-preview is more calibrated than o1-mini, and gpt-4o is more calibrated than gpt-4o-mini.
Conclusions
SimpleQA is a simple but challenging benchmark for evaluating the factuality of frontier models. A main limitation of SimpleQA is its scope: while SimpleQA is accurate, it only measures factuality under the constrained setting of short, fact-seeking queries with a single, verifiable answer. Whether the ability to provide factual short answers correlates with the ability to write lengthy responses filled with numerous facts remains an open research question. We hope that open-sourcing SimpleQA drives research on more trustworthy and reliable AI forward, and we invite researchers to evaluate the factuality of language models with it and to provide us feedback.