January 31, 2025

Release

OpenAI o3-mini

Pushing the frontier of cost-effective reasoning.

We’re releasing OpenAI o3-mini, the newest, most cost-efficient model in our reasoning series, available in both ChatGPT and the API today. Previewed in December 2024, this powerful and fast model advances the boundaries of what small models can achieve, delivering exceptional STEM capabilities—with particular strength in science, math, and coding—all while maintaining the low cost and reduced latency of OpenAI o1-mini.

OpenAI o3-mini is our first small reasoning model that supports highly requested developer features including function calling, Structured Outputs, and developer messages, making it production-ready out of the gate. Like OpenAI o1-mini and OpenAI o1-preview, o3-mini will support streaming. Also, developers can choose between three reasoning effort options—low, medium, and high—to optimize for their specific use cases. This flexibility allows o3-mini to “think harder” when tackling complex challenges or prioritize speed when latency is a concern. o3-mini does not support vision capabilities, so developers should continue using OpenAI o1 for visual reasoning tasks. o3-mini is rolling out in the Chat Completions API, Assistants API, and Batch API starting today to select developers in API usage tiers 3-5.
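
The sketch below shows how these options might be combined in a single Chat Completions request, assuming the official OpenAI Python SDK; the developer message, prompt, and JSON schema are illustrative and not taken from this post.

```python
# Minimal sketch: calling o3-mini with a developer message, an explicit
# reasoning effort, and a Structured Outputs JSON schema.
# Assumes the official OpenAI Python SDK and OPENAI_API_KEY in the environment;
# the prompt and schema below are illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="medium",  # "low", "medium", or "high"
    messages=[
        {"role": "developer", "content": "You are a concise math assistant."},
        {"role": "user", "content": "Factor x^2 - 5x + 6 and explain briefly."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "factorization",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "factors": {"type": "array", "items": {"type": "string"}},
                    "explanation": {"type": "string"},
                },
                "required": ["factors", "explanation"],
                "additionalProperties": False,
            },
        },
    },
)

print(response.choices[0].message.content)
```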

ChatGPT Plus, Team, and Pro users can access OpenAI o3-mini starting today, with Enterprise access coming in February. o3-mini will replace OpenAI o1-mini in the model picker, offering higher rate limits and lower latency, making it a compelling choice for coding, STEM, and logical problem-solving tasks. As part of this upgrade, we’re tripling the rate limit for Plus and Team users from 50 messages per day with o1-mini to 150 messages per day with o3-mini. Additionally, o3-mini now works with search to find up-to-date answers with links to relevant web sources. This is an early prototype as we work to integrate search across our reasoning models.

Starting today, free plan users can also try OpenAI o3-mini by selecting ‘Reason’ in the message composer or by regenerating a response. This marks the first time a reasoning model has been made available to free users in ChatGPT.

While OpenAI o1 remains our broader general knowledge reasoning model, OpenAI o3-mini provides a specialized alternative for technical domains requiring precision and speed. In ChatGPT, o3-mini uses medium reasoning effort to provide a balanced trade-off between speed and accuracy. All paid users will also have the option of selecting o3-mini-high in the model picker for a higher-intelligence version that takes a little longer to generate responses. Pro users will have unlimited access to both o3-mini and o3-mini-high.

Fast, powerful, and optimized for STEM reasoning

Similar to its OpenAI o1 predecessor, OpenAI o3-mini has been optimized for STEM reasoning. o3-mini with medium reasoning effort matches o1’s performance in math, coding, and science, while delivering faster responses. Evaluations by expert testers showed that o3-mini produces more accurate and clearer answers, with stronger reasoning abilities, than OpenAI o1-mini. Testers preferred o3-mini's responses to o1-mini 56% of the time and observed a 39% reduction in major errors on difficult real-world questions. With medium reasoning effort, o3-mini matches the performance of o1 on some of the most challenging reasoning and intelligence evaluations including AIME and GPQA.

Competition Math (AIME 2024)

Bar chart: accuracy on AIME 2024 competition math across models; o3-mini (high) is highest at 83.6%.

Mathematics: With low reasoning effort, OpenAI o3-mini achieves performance comparable to OpenAI o1-mini, and with medium effort it achieves performance comparable to o1. With high reasoning effort, o3-mini outperforms both OpenAI o1-mini and OpenAI o1; the gray shaded regions in the chart show the performance of majority vote (consensus) with 64 samples.
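
The consensus baseline is straightforward to reproduce in outline: sample the model many times on the same problem and keep the most common final answer. The sketch below assumes a hypothetical solve_once() callable that returns one final answer string per call; it is not the evaluation harness used here.

```python
# Sketch of majority-vote (consensus) answering over multiple samples.
# solve_once() is a hypothetical helper that queries the model once on the
# problem and returns its final answer as a string.
from collections import Counter

def majority_vote(problem: str, solve_once, num_samples: int = 64) -> str:
    """Sample the model num_samples times and return the most common answer."""
    answers = [solve_once(problem) for _ in range(num_samples)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```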

PhD-level Science Questions (GPQA Diamond)

Bar chart: accuracy on PhD-level science questions (GPQA Diamond) across models; o3-mini (high) reaches 77.0%.

PhD-level science: On PhD-level biology, chemistry, and physics questions, OpenAI o3-mini with low reasoning effort outperforms OpenAI o1-mini. With high effort, it achieves performance comparable to o1.

FrontierMath

Chart: FrontierMath performance across models (without tools).

Research-level mathematics: OpenAI o3-mini with high reasoning effort performs better than its predecessor on FrontierMath. When prompted to use a Python tool, o3-mini with high reasoning effort solves over 32% of problems on the first attempt, including more than 28% of the challenging (T3) problems. These numbers are provisional, and the chart above shows performance without tools or a calculator.
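
For reference, one way a Python tool can be exposed to the model is through function calling; the sketch below is illustrative only and is not the FrontierMath evaluation setup. The run_python tool name and the prompt are assumptions, and the returned tool call is printed rather than executed.

```python
# Illustrative sketch of offering o3-mini a Python tool via function calling.
# run_python is a hypothetical, sandboxed interpreter the application would
# implement itself; this is not the harness used for FrontierMath.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "run_python",
        "description": "Execute Python code and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}]

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",
    messages=[{"role": "user", "content": "Compute the 20th Fibonacci number."}],
    tools=tools,
)

# If the model chose to call the tool, the arguments arrive as a JSON string.
for call in response.choices[0].message.tool_calls or []:
    args = json.loads(call.function.arguments)
    print(call.function.name, args["code"])
```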

Competition Code (Codeforces)

Bar chart: Codeforces Elo ratings across models; o3-mini (high) reaches 2073.

Competition coding: On Codeforces competitive programming, OpenAI o3-mini achieves progressively higher Elo scores with increased reasoning effort, all outperforming o1-mini. With medium reasoning effort, it matches o1’s performance.

Software Engineering (SWE-bench Verified)

Bar chart: accuracy on SWE-bench Verified across models; o3-mini (high) is highest at 48.9%.

Software engineering: o3-mini is our highest-performing released model on SWE-bench Verified. For additional data points on SWE-bench Verified results with high reasoning effort, including with the open-source Agentless scaffold (39%) and an internal tools scaffold (61%), see our system card.

LiveBench Coding

Table: LiveBench coding results across models and reasoning effort levels.

LiveBench coding: OpenAI o3-mini surpasses o1-high even at medium reasoning effort, highlighting its efficiency in coding tasks. At high reasoning effort, o3-mini further extends its lead, achieving significantly stronger performance across key metrics.

General knowledge

Table: “Category Evals” comparing models across general knowledge evaluation categories.

General knowledge: o3-mini outperforms o1-mini on evaluations across general knowledge domains.

Human Preference Evaluation

Charts: win rate on STEM and non-STEM prompts, win rate under time constraints, and major error rate; the o3-mini checkpoint (o3_mini_v43_s960_j128, yellow) outperforms the o1-mini baseline (o1_mini_chatgpt, red) on each measure, with the largest gain on STEM tasks.

Human preference evaluation: Evaluations by external expert testers also show that OpenAI o3-mini produces more accurate and clearer answers, with stronger reasoning abilities than OpenAI o1-mini, especially for STEM. Testers preferred o3-mini's responses to o1-mini 56% of the time and observed a 39% reduction in major errors on difficult real-world questions.

Model speed and performance

With intelligence comparable to OpenAI o1, OpenAI o3-mini delivers faster performance and improved efficiency. Beyond the STEM evaluations highlighted above, o3-mini demonstrates superior results in additional math and factuality evaluations with medium reasoning effort. In A/B testing, o3-mini delivered responses 24% faster than o1-mini, with an average response time of 7.7 seconds compared to 10.16 seconds.

Latency comparison between o1-mini and o3-mini (medium)

Bar chart: average latency for o1-mini vs. o3-mini (medium); o3-mini responds faster.

Latency: on average, o3-mini’s time to first token is 2,500 ms faster than o1-mini’s.
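
Time to first token can be measured directly with the streaming interface. A minimal sketch, assuming the official OpenAI Python SDK and an illustrative prompt:

```python
# Minimal sketch for measuring time to first token (TTFT) with streaming.
# Assumes the official OpenAI Python SDK; the prompt is illustrative.
import time
from openai import OpenAI

client = OpenAI()

start = time.monotonic()
stream = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="medium",
    messages=[{"role": "user", "content": "Summarize the quadratic formula."}],
    stream=True,
)

for chunk in stream:
    # The first chunk that carries content marks the time to first token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"Time to first token: {time.monotonic() - start:.2f}s")
        break
```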

Safety

One of the key techniques we used to teach OpenAI o3-mini to respond safely is deliberative alignment, where we trained the model to reason about human-written safety specifications before answering user prompts. Similar to OpenAI o1, we find that o3-mini significantly surpasses GPT-4o on challenging safety and jailbreak evaluations. Before deployment, we carefully assessed the safety risks of o3-mini using the same approach to preparedness, external red-teaming, and safety evaluations as o1. We thank the safety testers who applied to test o3-mini in early access. Details of the evaluations below, along with a comprehensive explanation of potential risks and the effectiveness of our mitigations, are available in the o3-mini system card.

Disallowed content evaluations

Table: disallowed content evaluation results across models and risk categories.

Jailbreak Evaluations

Table: jailbreak evaluation results across models and risk categories.

What's next

The release of OpenAI o3-mini marks another step in OpenAI’s mission to push the boundaries of cost-effective intelligence. By optimizing reasoning for STEM domains while keeping costs low, we’re making high-quality AI even more accessible. This model continues our track record of driving down the cost of intelligence—reducing per-token pricing by 95% since launching GPT-4—while maintaining top-tier reasoning capabilities. As AI adoption expands, we remain committed to leading at the frontier, building models that balance intelligence, efficiency, and safety at scale.

Authors

OpenAI

Training

Brian Zhang, Eric Mitchell, Hongyu Ren, Kevin Lu, Max Schwarzer, Michelle Pokrass, Shengjia Zhao, Ted Sanders

Eval

Adam Kalai, Alex Tachard Passos, Ben Sokolowsky, Elaine Ya Le, Erik Ritter, Hao Sheng, Hanson Wang, Ilya Kostrikov, James Lee, Johannes Ferstad, Michael Lampe, Prashanth Radhakrishnan, Sean Fitzgerald, Sebastien Bubeck, Yann Dubois, Yu Bai

Frontier Evals & Preparedness

Andy Applebaum, Elizabeth Proehl, Evan Mays, Joel Parish, Kevin Liu, Leon Maksin, Leyton Ho, Miles Wang, Michele Wang, Olivia Watkins, Patrick Chao, Samuel Miserendino, Tejal Patwardhan

Engineering

Adam Walker, Akshay Nathan, Alyssa Huang, Andy Wang, Ankit Gohel, Ben Eggers, Brian Yu, Bryan Ashley, Chengdu Huang, Christian Hoareau, Davin Bogan, Emily Sokolova, Eric Horacek, Eric Jiang, Felipe Petroski Such, Jonah Cohen, Josh Gross, Justin Becker, Kan Wu, Kevin Whinnery, Larry Lv, Lee Byron, Manoli Liodakis, Max Johnson, Mike Trpcic, Murat Yesildal, Rasmus Rygaard, RJ Marsan, Rohit Ramchandani, Rohan Kshirsagar, Roman Huet, Sara Conlon, Shuaiqi (Tony) Xia, Siyuan Fu, Srinivas Narayanan, Sulman Choudhry, Tomer Kaftan, Trevor Creech

Search

Adam Fry, Adam Perelman, Brandon Wang, Cristina Scheau, Philip Pronin, Sundeep Tirumalareddy, Will Ellsworth, Zewei Chu

Product

Antonia Woodford, Beth Hoover, Jake Brill, Kelly Stirman, Minnia Feng, Neel Ajjarapu, Nick Turley, Nikunj Handa, Olivier Godement

Safety

Andrea Vallone, Andrew Duberstein, Enis Sert, Eric Wallace, Grace Zhao, Irina Kofman, Jieqi Yu, Joaquin Quinonero Candela, Madelaine Boyd, Mehmet Yatbaz, Mike McClay, Mingxuan Wang, Saachi Jain, Sandhini Agarwal, Sam Toizer, Santiago Hernández, Steve Mostovoy, Young Cha, Tao Li, Yunyun Wang

External Redteaming

Lama Ahmad, Troy Peterson

Research Program Managers

Carpus Chang, Kristen Ying

Leadership

Aidan Clark, Dane Stuckey, Jerry Tworek, Jakub Pachocki, Johannes Heidecke, Kevin Weil, Liam Fedus, Mark Chen, Sam Altman, Wojciech Zaremba

+ all the contributors behind o1.