Trading Inference-Time Compute for Adversarial Robustness
This report outlines the safety work carried out prior to releasing OpenAI o1 and o1-mini, including external red teaming and frontier risk evaluations conducted in accordance with our Preparedness Framework.
Advancing red teaming with people and AI
SimpleQA is a factuality benchmark that measures the ability of language models to answer short, fact-seeking questions.
We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering.
We are introducing OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers: it can produce a long internal chain of thought before responding to the user.
Advancing cost-efficient reasoning