Publication
Introducing the SWE-Lancer benchmark

Can frontier LLMs earn $1 million from real-world freelance software engineering?

Publication
OpenAI o3-mini System Card

This report outlines the safety work carried out for the OpenAI o3-mini model, including safety evaluations, external red teaming, and Preparedness Framework evaluations.

Publication
Trading inference-time compute for adversarial robustness

Publication
Sora System Card

Sora is OpenAI’s video generation model, designed to take text, image, and video inputs and generate a new video as an output. Sora builds on learnings from DALL-E and GPT models, and is designed to give people expanded tools for storytelling and creative expression.

Publication
OpenAI o1 System Card

This report outlines the safety work carried out prior to releasing OpenAI o1 and o1-mini, including external red teaming and frontier risk evaluations according to our Preparedness Framework.

Publication
Advancing red teaming with people and AI

Publication
Introducing SimpleQA

We introduce SimpleQA, a factuality benchmark that measures the ability of language models to answer short, fact-seeking questions.

Publication
Evaluating fairness in ChatGPT

We've analyzed how ChatGPT responds to users based on their name, using AI research assistants to protect privacy.

Publication
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering.