跳至主要內容
OpenAI

2025年1月31日

發布

OpenAI o3‑mini

拓展高成本效益的推理技術極限

載入中…

我們正式推出 OpenAI o3‑mini:這是最新且最具成本效益的推理系列模型,現已在 ChatGPT 和 API 中上線。這款功能強大且運算快速的模型已於 2024 年 12 月推出預覽版,不僅將小型模型的表現推向新高,還具備優異的 STEM 處理能力,特別擅於科學、數學和程式設計;在維持低成本的同時,也改善了 OpenAI o1‑mini 的延遲表現。

OpenAI o3‑mini 是我們最小型的推理模型,支援函式呼叫(在新視窗中開啟)結構化輸出(在新視窗中開啟)開發人員訊息(在新視窗中開啟)等備受開發人員期盼的功能,開箱即用,能夠立即投入實際應用。和 OpenAI o1‑mini 和 OpenAI o1 預覽版一樣,o3‑mini 將支援串流式輸出(在新視窗中開啟)功能;此外,還提供低、中、高三種推理等級(在新視窗中開啟)供開發人員選擇,可針對特定使用案例採用最合適的等級。這樣的靈活性讓 o3‑mini 在面對複雜挑戰時能夠「深入思考」,而在有延遲限制的情境中,也能優先追求執行速度。由於 o3‑mini 不支援視覺功能,建議開發人員在處理視覺推理任務時,仍使用 OpenAI o1。自今日起,o3‑mini 將逐步開放給具備 API 使用等級 3 至 5(在新視窗中開啟) 的特定開發人員,並可透過 Chat Completions API、Assistants API 與 Batch API 使用。

ChatGPT Plus、Team 和 Pro 用戶自今日起即可使用 OpenAI o3‑mini;Enterprise 用戶則將於 2 月開放使用。在模型選擇器中,o3‑mini 將取代 OpenAI o1‑mini,不僅提升流量上限,也顯著降低延遲。因此,o3‑mini 成為處理程式編寫、STEM 任務與邏輯解題的理想之選。本次升級後,Plus 和 Team 用戶的流量限額將提升三倍,從 o1‑mini 的每日 50 則訊息,提高至 o3‑mini 的每日 150 則。 此外,o3‑mini 現在還能搭配搜尋功能,尋找最新答案並附上相關網路來源的連結。這還只是早期原型,我們正努力將搜尋功能整合到推理模型中。

從今天起,Free 方案用戶也能試用 OpenAI o3‑mini,只要在訊息撰寫工具中選擇「推理」模式或重新生成回應即可。這是第一次開放推理模型給 ChatGPT 免費版用戶使用。

OpenAI o1 依然是我們應用最廣泛的通識推理模型,而 OpenAI o3‑mini 則是針對講求精準度與速度的技術領域,提供專用的替代方案。在 ChatGPT 中,o3‑mini 採用中推理等級,達到速度與準確度的平衡。所有付費用戶也能在模型挑選器中選取 o3‑mini‑high,使用推理等級更高的版本來生成回應,不過處理時間會略長。Pro 用戶可不限次數使用 o3‑minio3‑mini‑high

快速,強大且擅於處理 STEM 推理任務

類似於上一代的 OpenAI o1,OpenAI o3‑mini 也針對 STEM 推理進行最佳化。在處理數學、程式設計和科學相關任務時,採用中推理等級的 o3‑mini 表現可媲美 o1,且回應速度更快。根據專家測試人員的評估,相較於 OpenAI o1‑mini,o3‑mini 生成的答案更準確清楚,且推理能力更強。測試人員在比較 o3‑mini 與 o1‑mini 的回應時,有 56% 偏好前者;在回答困難的真實世界問題時,o3‑mini 出現重大錯誤的比例降低了 39%。處理 AIME 和 GPQA 這類難度最高的推理和智力測驗時,採用中推理等級的 o3‑mini 表現與 o1 不相上下。

Competition Math (AIME 2024)

The bar chart compares accuracy on AIME 2024 competition math questions across AI models. Older models (gray) score lower, while newer ones (yellow) improve. "o3-mini (high)" reaches the highest accuracy at 83.6%, showing significant progress.

Mathematics: With low reasoning effort, OpenAI o3‑mini achieves comparable performance with OpenAI o1‑mini, while with medium effort, o3‑mini achieves comparable performance with o1. Meanwhile, with high reasoning effort, o3‑mini outperforms both OpenAI o1‑mini and OpenAI o1, where the gray shaded regions show the performance of majority vote (consensus) with 64 samples.

PhD-level Science Questions (GPQA Diamond)

The bar chart compares accuracy on PhD-level science questions (GPQA Diamond) across AI models. Older models (gray) perform lower, while newer ones (yellow) improve. "o3-mini (high)" reaches 77.0% accuracy, showing notable progress over earlier versions.

PhD-level science: On PhD-level biology, chemistry, and physics questions, with low reasoning effort, OpenAI o3‑mini achieves performance above OpenAI o1‑mini. With high effort, o3‑mini achieves comparable performance with o1.

FrontierMath

A black grid with multiple rows and columns, separated by thin white lines, creating a structured and organized layout.

Research-level mathematics: OpenAI o3‑mini with high reasoning performs better than its predecessor on FrontierMath. On FrontierMath, when prompted to use a Python tool, o3‑mini with high reasoning effort solves over 32% of problems on the first attempt, including more than 28% of the challenging (T3) problems. These numbers are provisional, and the chart above shows performance without tools or a calculator.

Competition Code (Codeforces)

The bar chart compares Elo ratings on Codeforces competition coding tasks across AI models. Older models (gray) score lower, while newer ones (yellow) improve. "o3-mini (high)" reaches 2073 Elo, showing significant progress over previous versions.

Competition coding: On Codeforces competitive programming, OpenAI o3‑mini achieves progressively higher Elo scores with increased reasoning effort, all outperforming o1‑mini. With medium reasoning effort, it matches o1’s performance.

Software Engineering (SWE-bench Verified (n=477))

The bar chart compares accuracy on SWE-bench Verified software engineering tasks across AI models. Older models (gray) perform lower, while "o3-mini (high)" (yellow) achieves the highest accuracy at 48.9%, showing improvement over previous versions.

Software engineering: o3‑mini is our highest performing released model on SWEbench-verified. For additional datapoints on SWE-bench Verified results with high reasoning effort, including with the open-source Agentless scaffold (39%) and an internal tools scaffold representing maximum capability elicitation (61%), see our system card⁠ as the source of truth. All SWE-bench evaluation runs use a fixed subset of n=477 verified tasks which have been validated on our internal infrastructure.

LiveBench Coding

The table compares AI models on coding tasks, showing performance metrics and evaluation scores. It highlights differences in accuracy and efficiency, with some models outperforming others in specific benchmarks.

LiveBench coding: OpenAI o3‑mini surpasses o1‑high even at medium reasoning effort, highlighting its efficiency in coding tasks. At high reasoning effort, o3‑mini further extends its lead, achieving significantly stronger performance across key metrics.

通用知識

The table titled "Category Evals" compares AI models across different evaluation categories, showing performance metrics. It highlights differences in accuracy, efficiency, and effectiveness, with some models outperforming others in specific tasks.

General knowledge: o3‑mini outperforms o1‑mini in knowledge evaluations across general knowledge domains.

人類偏好評估

The chart compares win rates for STEM and non-STEM tasks across AI models. "o3_mini_v43_s960_j128" (yellow) outperforms "o1_mini_chatgpt" (red baseline) in both categories, with a higher win rate for STEM tasks.
The chart compares win rates under time constraints and major error rates across AI models. "o3_mini_v43_s960_j128" (yellow) outperforms "o1_mini_chatgpt" (red baseline) in win rate and significantly reduces major errors.

Human preference evaluation: Evaluations by external expert testers also show that OpenAI o3‑mini produces more accurate and clearer answers, with stronger reasoning abilities than OpenAI o1‑mini, especially for STEM. Testers preferred o3‑mini's responses to o1‑mini 56% of the time and observed a 39% reduction in major errors on difficult real-world questions.

模型速度和表現

OpenAI o3‑mini 具備媲美 OpenAI o1 的智慧,且速度更快、效率更高。除了上述重點 STEM 評估之外,採用中推理等級的 o3‑mini 在其他數學和事實評估的表現也相當亮眼。在 A/B 測試中,o3‑mini 提供回應的速度比 o1‑mini 快 24%;兩者平均回應時間分別為 7.7 秒和 10.16 秒。

Latency comparison between o1-mini and o3-mini (medium)

The bar chart compares latency between "o1-mini" and "o3-mini (medium)" models. "o3-mini" (lighter yellow) has lower latency, indicating faster response times, while "o1-mini" (darker yellow) takes longer on average.

Latency: o3‑mini has an avg 2500ms faster time to first token than o1‑mini.

安全性

我們指導 OpenAI o3‑mini 安全回應的關鍵技術之一是協商共識 (deliberative alignment),該技術訓練模型在回答使用者提示前,先對人類編寫的安全規範進行推理。與 OpenAI o1 相似,我們發現在高難度的安全和越獄評估測試中,o3‑mini 的表現明顯優於 GPT‑4o。部署之前,我們透過與 o1 相同的風險準備與應對規劃、外部紅隊測試和安全評估程序,審慎評估了 o3‑mini 的安全風險。我們也感謝申請 o3‑mini 搶先體驗並協助測試的安全測試人員。o3‑mini 系統說明卡提供以下評估詳細資訊,包含潛在風險的完整說明和緩解措施的有效性。

Disallowed content evaluations

The table compares AI models on safety metrics, evaluating performance across different risk categories. It highlights variations in safety compliance, with some models performing better at reducing potential risks.

Jailbreak Evaluations

The table compares AI models on safety metrics across multiple risk categories, showing performance variations. It highlights differences in risk mitigation, with some models demonstrating stronger compliance and safer responses.

後續更新

OpenAI o3‑mini 的問世,為 OpenAI 宏圖再添一塊拼圖,更進一步拓展了高成本效益的人工智慧極限。我們在維持低廉價格的同時,成功最佳化 STEM 領域的推理能力,讓高品質 AI 更加平易近人。這款模型延續了我們持續降低 AI 成本的成果,自 GPT‑4 推出以來,每 token 的價格已下調 95%,同時仍維持頂尖的推理能力。隨著 AI 採用規模擴大,我們持續致力扮演先驅角色,打造在大量使用者同時使用時,兼具智慧、效率與安全的模型。

作者

OpenAI

訓練

Brian Zhang、Eric Mitchell、Hongyu Ren、Kevin Lu、Max Schwarzer、Michelle Pokrass、Shengjia Zhao、Ted Sanders

評估

Adam Kalai、Alex Tachard Passos、Ben Sokolowsky、Elaine Ya Le、Erik Ritter、Hao Sheng、Hanson Wang、Ilya Kostrikov、James Lee、Johannes Ferstad、Michael Lampe、Prashanth Radhakrishnan、Sean Fitzgerald、Sebastien Bubeck、Yann Dubois、Yu Bai

前沿評估與應變準備

Andy Applebaum、Elizabeth Proehl、Evan Mays、Joel Parish、Kevin Liu、Leon Maksin、Leyton Ho、Miles Wang、Michele Wang、Olivia Watkins、Patrick Chao、Samuel Miserendino、Tejal Patwardhan

工程

Adam Walker、Akshay Nathan、Alyssa Huang、Andy Wang、Ankit Gohel、Ben Eggers、Brian Yu、Bryan Ashley、Callie Riggins Zetino、Chengdu Huang、Christian Hoareau、Davin Bogan、Emily Sokolova、Eric Horacek、Eric Jiang、Felipe Petroski Such、Jonah Cohen、Josh Gross、Justin Becker、Kan Wu、Kevin Whinnery、Larry Lv、Lee Byron、Lien Mamitsuka、Manoli Liodakis、Max Johnson、Mike Trpcic、Murat Yesildal、Rasmus Rygaard、RJ Marsan、Rohit Ramchandani、Rohan Kshirsagar、Roman Huet、Sara Conlon、Shuaiqi (Tony) Xia、Siyuan Fu、Srinivas Narayanan、Sulman Choudhry、Surya Mamidala、Tomer Kaftan、Trevor Creech

搜尋

Adam Fry、Adam Perelman、Brandon Wang、Cristina Scheau、Philip Pronin、Sundeep Tirumalareddy、Will Ellsworth、Zewei Chu

產品

Antonia Woodford、Beth Hoover、Jake Brill、Kelly Stirman、Minnia Feng、Neel Ajjarapu、Nick Turley、Nikunj Handa、Olivier Godement

安全機制

Alex Beutel、Andrea Vallone、Andrew Duberstein、Enis Sert、Eric Wallace、Grace Zhao、Irina Kofman、Jieqi Yu、Joaquin Quinonero Candela、Madelaine Boyd、Matt Jones、Mehmet Yatbaz、Mike McClay、Mingxuan Wang、Saachi Jain、Sandhini Agarwal、Sam Toizer、Santiago Hernández、Steve Mostovoy、Young Cha、Tao Li、Yunyun Wang

外部紅隊測試

Lama Ahmad、Michael Lampe、Troy Peterson

研究計畫經理

Carpus Chang、Kristen Ying

領導團隊

Aidan Clark、Dane Stuckey、Jerry Tworek、Jakub Pachocki、Johannes Heidecke、Kevin Weil、Liam Fedus、Mark Chen、Sam Altman、Wojciech Zaremba

+ 所有參與 o1 的貢獻者