2025年1月31日

OpenAI o3‑mini

推动高性价比推理的前沿发展。

正在加载…

我们正在发布 OpenAI o3‑mini，这是我们推理系列中最新、最具成本效益的模型，现在可在 ChatGPT 和 API 中使用。这款功能强大的快速模型于 2024 年 12 月进行了预览发布⁠，突破了小型模型的能力极限，提供卓越的 STEM 能力，尤其在科学、数学和编码方面具有优势，同时保持了 OpenAI o1‑mini 的低成本和低延迟性能。

OpenAI o3‑mini 是我们首款小型推理模型，支持开发人员高度期待的功能，包括函数调用⁠（在新窗口中打开）、结构化输出⁠（在新窗口中打开）及开发人员消息⁠（在新窗口中打开），使其从一开始就具备生产就绪状态。与 OpenAI o1‑mini 和 OpenAI o1‑preview 一样，o3‑mini 将支持流式传输⁠（在新窗口中打开）。此外，开发人员可以在三种推理强度⁠（在新窗口中打开）选项（低、中、高）中进行选择，以针对其特定用例进行优化。这种灵活性使 o3‑mini 在应对复杂挑战时能够“深度思考”，或在担心延迟问题时优先考虑速度。o3‑mini 暂不支持视觉功能，开发人员处理视觉推理任务时仍需使用 OpenAI o1。即日起，o3‑mini 将在聊天完成 API、助手 API 和批处理 API 中逐步推出，面向 API 使用第 3 至第 5 级⁠（在新窗口中打开）的部分开发人员开放。

ChatGPT Plus、Team 和 Pro 版用户即日起可访问 OpenAI o3‑mini， Enterprise 版访问权限将于二月推出。o3‑mini 将取代模型选择器中的 OpenAI o1‑mini，提供更高的消息限额和更低的延迟，使其成为编码、STEM 和逻辑问题解决任务的理想选择。作为此次升级的一部分，我们将 Plus 和 Team 用户的消息限额提高了三倍，从 o1‑mini 的每天 50 条消息增加到 o3‑mini 的每天 150 条消息。此外，o3‑mini 现在可以通过搜索功能查找带有相关网络资源链接的最新答案。这是我们致力于将搜索整合到推理模型中的早期原型。

即日起，Free 套餐用户还可以通过在消息编写器中选择“推理”或重新生成回复来试用 OpenAI o3‑mini。这标志着 ChatGPT 首次向免费用户提供推理模型。

虽然 OpenAI o1 仍然是我们更广泛的通用知识推理模型，但 OpenAI o3‑mini 为需要精确度和速度的技术领域提供了专门的替代方案。在 ChatGPT 中，o3‑mini 使用中等推理强度，在速度和准确性之间实现了平衡。所有付费用户还可以在模型选择器中选择 o3‑mini‑high，以获得生成回复时间稍长的高智能版本。Pro 用户可无限制访问 o3‑mini 和 o3‑mini‑high。

快速、强大且针对 STEM 推理进行了优化

与其前身 OpenAI o1 类似，OpenAI o3‑mini 针对 STEM 推理进行了优化。中等推理强度的 o3‑mini 在数学、编码和科学方面的表现与 o1 相当，同时回复速度更快。经专家测试人员的评估，o3‑mini 的答案比 OpenAI o1‑mini 更准确、更清晰，推理能力更强。56% 的测试人员更喜欢 o3‑mini 而不是 o1‑mini 的答案，并且还观察到在解决现实世界难题时主要错误减少了 39%。在中等推理强度下，o3‑mini 在一些最具挑战性的推理和智能评估（包括 AIME 和 GPQA）中的表现与 o1 相当。

Competition Math (AIME 2024)

The bar chart compares accuracy on AIME 2024 competition math questions across AI models. Older models (gray) score lower, while newer ones (yellow) improve. "o3-mini (high)" reaches the highest accuracy at 83.6%, showing significant progress.

Mathematics: With low reasoning effort, OpenAI o3‑mini achieves comparable performance with OpenAI o1‑mini, while with medium effort, o3‑mini achieves comparable performance with o1. Meanwhile, with high reasoning effort, o3‑mini outperforms both OpenAI o1‑mini and OpenAI o1, where the gray shaded regions show the performance of majority vote (consensus) with 64 samples.

PhD-level Science Questions (GPQA Diamond)

PhD-level science: On PhD-level biology, chemistry, and physics questions, with low reasoning effort, OpenAI o3‑mini achieves performance above OpenAI o1‑mini. With high effort, o3‑mini achieves comparable performance with o1.

FrontierMath

A black grid with multiple rows and columns, separated by thin white lines, creating a structured and organized layout.

Research-level mathematics: OpenAI o3‑mini with high reasoning performs better than its predecessor on FrontierMath. On FrontierMath, when prompted to use a Python tool, o3‑mini with high reasoning effort solves over 32% of problems on the first attempt, including more than 28% of the challenging (T3) problems. These numbers are provisional, and the chart above shows performance without tools or a calculator.

Competition Code (Codeforces)

The bar chart compares Elo ratings on Codeforces competition coding tasks across AI models. Older models (gray) score lower, while newer ones (yellow) improve. "o3-mini (high)" reaches 2073 Elo, showing significant progress over previous versions.

Competition coding: On Codeforces competitive programming, OpenAI o3‑mini achieves progressively higher Elo scores with increased reasoning effort, all outperforming o1‑mini. With medium reasoning effort, it matches o1’s performance.

Software Engineering (SWE-bench Verified (n=477))

The bar chart compares accuracy on SWE-bench Verified software engineering tasks across AI models. Older models (gray) perform lower, while "o3-mini (high)" (yellow) achieves the highest accuracy at 48.9%, showing improvement over previous versions.

Software engineering: o3‑mini is our highest performing released model on SWEbench-verified. For additional datapoints on SWE-bench Verified results with high reasoning effort, including with the open-source Agentless scaffold (39%) and an internal tools scaffold representing maximum capability elicitation (61%), see our system card⁠⁠ as the source of truth. All SWE-bench evaluation runs use a fixed subset of n=477 verified tasks which have been validated on our internal infrastructure.

LiveBench Coding

The table compares AI models on coding tasks, showing performance metrics and evaluation scores. It highlights differences in accuracy and efficiency, with some models outperforming others in specific benchmarks.

LiveBench coding: OpenAI o3‑mini surpasses o1‑high even at medium reasoning effort, highlighting its efficiency in coding tasks. At high reasoning effort, o3‑mini further extends its lead, achieving significantly stronger performance across key metrics.

通用知识

The table titled "Category Evals" compares AI models across different evaluation categories, showing performance metrics. It highlights differences in accuracy, efficiency, and effectiveness, with some models outperforming others in specific tasks.

General knowledge: o3‑mini outperforms o1‑mini in knowledge evaluations across general knowledge domains.

人类偏好评估

The chart compares win rates for STEM and non-STEM tasks across AI models. "o3_mini_v43_s960_j128" (yellow) outperforms "o1_mini_chatgpt" (red baseline) in both categories, with a higher win rate for STEM tasks.

The chart compares win rates under time constraints and major error rates across AI models. "o3_mini_v43_s960_j128" (yellow) outperforms "o1_mini_chatgpt" (red baseline) in win rate and significantly reduces major errors.

Human preference evaluation: Evaluations by external expert testers also show that OpenAI o3‑mini produces more accurate and clearer answers, with stronger reasoning abilities than OpenAI o1‑mini, especially for STEM. Testers preferred o3‑mini's responses to o1‑mini 56% of the time and observed a 39% reduction in major errors on difficult real-world questions.

模型速度与性能

OpenAI o3‑mini 拥有与 OpenAI o1 相当的智能，性能更快，效率更高。除了上述重点呈现的 STEM 领域评估外，o3‑mini 在中等推理强度下，在其他数学能力与事实性评估中也展现出更优的性能表现。在 A/B 测试中，o3‑mini 的回复速度比 o1‑mini 快 24%，平均回复时间为 7.7 秒，而 o1‑mini 为 10.16 秒。

Latency comparison between o1-mini and o3-mini (medium)

The bar chart compares latency between "o1-mini" and "o3-mini (medium)" models. "o3-mini" (lighter yellow) has lower latency, indicating faster response times, while "o1-mini" (darker yellow) takes longer on average.

Latency: o3‑mini has an avg 2500ms faster time to first token than o1‑mini.

安全

我们用于教导 OpenAI o3‑mini 安全回复的核心技术之一是审慎对齐，即通过训练模型在回答用户指令前，先对人类编写的安全规范进行推理。与 OpenAI o1 类似，我们发现 o3‑mini 在具有挑战性的安全和越狱评估方面明显优于 GPT‑4o。在部署之前，我们采用与 o1 相同的防范准备、外部红队测试和安全评估，仔细评估了 o3‑mini 的安全风险。我们衷心感谢申请参与 o3‑mini 早期测试的安全测评人员。有关以下评估的详细信息，以及对潜在风险和我们缓解措施有效性的全面解释，请参阅 o3‑mini 系统卡。

Disallowed content evaluations

The table compares AI models on safety metrics, evaluating performance across different risk categories. It highlights variations in safety compliance, with some models performing better at reducing potential risks.

Jailbreak Evaluations

The table compares AI models on safety metrics across multiple risk categories, showing performance variations. It highlights differences in risk mitigation, with some models demonstrating stronger compliance and safer responses.

下一步计划

OpenAI o3‑mini 的发布标志着 OpenAI 在突破高性价比智能技术发展方面又迈出了坚实的一步。通过优化 STEM 领域的推理，同时保持较低的成本，我们让更多人可以获得高质量的人工智能。这款模型延续了我们降低智能成本的一贯做法，自推出 GPT‑4 以来，每个令牌的定价降低了 95%，同时保持了顶级的推理能力。随着人工智能应用的不断扩大，我们将继续致力于引领前沿研究，打造兼顾智能、效率和安全的大规模模型。

作者

OpenAI

训练

Brian Zhang、Eric Mitchell、Hongyu Ren、Kevin Lu、Max Schwarzer、Michelle Pokrass、Shengjia Zhao、Ted Sanders

评估

Adam Kalai、Alex Tachard Passos、Ben Sokolowsky、Elaine Ya Le、Erik Ritter、Hao Sheng、Hanson Wang、Ilya Kostrikov、James Lee、Johannes Ferstad、Michael Lampe、Prashanth Radhakrishnan、Sean Fitzgerald、Sebastien Bubeck、Yann Dubois、Yu Bai

前沿评估与准备

Andy Applebaum、Elizabeth Proehl、Evan Mays、Joel Parish、Kevin Liu、Leon Maksin、Leyton Ho、Miles Wang、Michele Wang、Olivia Watkins、Patrick Chao、Samuel Miserendino、Tejal Patwardhan

工程

Adam Walker、Akshay Nathan、Alyssa Huang、Andy Wang、Ankit Gohel、Ben Eggers、Brian Yu、Bryan Ashley、Callie Riggins Zetino、Chengdu Huang、Christian Hoareau、Davin Bogan、Emily Sokolova、Eric Horacek、Eric Jiang、Felipe Petroski Such、Jonah Cohen、Josh Gross、Justin Becker、Kan Wu、Kevin Whinnery、Larry Lv、Lee Byron、Lien Mamitsuka、Manoli Liodakis、Max Johnson、Mike Trpcic、Murat Yesildal、Rasmus Rygaard、RJ Marsan、Rohit Ramchandani、Rohan Kshirsagar、Roman Huet、Sara Conlon、Shuaiqi (Tony) Xia、Siyuan Fu、Srinivas Narayanan、Sulman Choudhry、Surya Mamidyala、Tomer Kaftan、Trevor Creech

搜索

Adam Fry、Adam Perelman、Brandon Wang、Cristina Scheau、Philip Pronin、Sundeep Tirumalareddy、Will Ellsworth、Zewei Chu

产品

Antonia Woodford、Beth Hoover、Jake Brill、Kelly Stirman、Minnia Feng、Neel Ajjarapu、Nick Turley、Nikunj Handa、Olivier Godement

安全性

Alex Beutel、Andrea Vallone、Andrew Duberstein、Enis Sert、Eric Wallace、Grace Zhao、Irina Kofman、Jieqi Yu、Joaquin Quinonero Candela、Madelaine Boyd、Matt Jones、Mehmet Yatbaz、Mike McClay、Mingxuan Wang、Saachi Jain、Sandhini Agarwal、Sam Toizer、Santiago Hernández、Steve Mostovoy、Young Cha、Tao Li、Yunyun Wang

外部红队测试

Lama Ahmad、Michael Lampe、Troy Peterson

研究项目经理

Carpus Chang、Kristen Ying

领导团队

Aidan Clark、Dane Stuckey、Jerry Tworek、Jakub Pachocki、Johannes Heidecke、Kevin Weil、Liam Fedus、Mark Chen、Sam Altman、Wojciech Zaremba

+ o1 背后的所有贡献者⁠。