2025年1月31日

OpenAI o3‑mini

推動成本效益推理的先驅。

正在載入...

我們正在發行的 OpenAI o3‑mini 是推理系列中最新、最具成本效益的模型，目前可在 ChatGPT 和 API 中使用。這款強大而快速的模型於 2024 年 12 月提供預覽⁠，突破了小型模型能實現的界限，提供卓越的 STEM 功能（尤其在科學、數學和編碼方面具有優勢），同時保持 OpenAI o1‑mini 的低成本和低延遲。

OpenAI o3‑mini 是我們的第一款小型推理模型，支援開發人員高度要求的功能，包括函數呼叫⁠（在新視窗中開啟）、結構化輸出⁠（在新視窗中開啟）和開發人員訊息⁠（在新視窗中開啟），使它可以立即投入生產。與 OpenAI o1‑mini 和 OpenAI o1‑preview 一樣，o3‑mini 將支援串流⁠（在新視窗中開啟）。此外，開發人員可以選擇三種推理强度⁠（在新視窗中開啟）選項（低、中、高），以針對其特定用例最佳化演算法。這種靈活性使得 o3‑mini 能在應對複雜挑戰時「努力思考」，或者在延遲成為問題時優先考慮速度。o3‑mini 不支援視覺功能，因此開發人員應該繼續使用 OpenAI o1 完成視覺推理任務。從今天開始，我們將在聊天完成 API、助理 API 和批次 API 中選取 API 使用層級為 3-5 級⁠（在新視窗中開啟）的開發人員向他們推出 o3‑mini。

ChatGPT Plus、Team 和 Pro 用戶從今天開始可以存取 OpenAI o3‑mini，Enterprise 用戶將於 2 月獲得存取權。o3‑mini 將在模型挑選器中取代 OpenAI o1‑mini，提供更高的速率限制和更低的延遲，使它成為編碼、STEM 和邏輯問題解決任務的有力選擇。為了順利完成這次升級，我們將 Plus 和團隊版用戶的訊息限額提高三倍，從 o1‑mini 的每天 50 則訊息增加到 o3‑mini 的每天 150 則訊息。此外，o3‑mini 現在可以透過搜尋功能尋找具有相關網頁資源連結的最新答案。這是我們致力於將搜尋功能整合到推理模型中的早期原型。

從今天開始，免費計劃用戶還可以在訊息編輯器中選取「推理」或重新產生回覆來嘗試 OpenAI o3‑mini。這標誌著 ChatGPT 首次為免費用戶提供推理模型。

雖然 OpenAI o1 仍然是我們更廣泛的一般常識推理模型，OpenAI o3‑mini 為需要精準度和速度的技術領域提供了一種專門的替代方案。在 ChatGPT 中，o3‑mini 使用中等推理强度設定來在速度和準確度之間取得最佳平衡。所有付費用戶還可以選擇在模型挑選器中選取 o3‑mini‑high，以獲得智慧程度更高的版本，但要花更多時間才會產生回覆。Pro 用戶可以無限制存取 o3‑mini 和 o3‑mini‑high。

快速、強大並最佳化 STEM 功能

與其前身 OpenAI o1 類似，OpenAI o3‑mini 最佳化 STEM 推理功能。具有中等推理强度設定的 o3‑mini 在數學、編碼和科學方面的效能與 o1 相符，同時回覆速度更快。經過專家測試人員的評估，o3‑mini 的答案比 OpenAI o1‑mini 更準確、更清晰，推理能力更強。56% 的測試人員喜歡 o3‑mini 而不是 o1‑mini 的答案，並且發現在解決現實世界難題時嚴重錯誤減少了 39%。使用中等推理强度設定的 o3‑mini 在一些最具挑戰性的推理和智慧評估（包括 AIME 和 GPQA）上的效能與 o1 相符。

Competition Math (AIME 2024)

The bar chart compares accuracy on AIME 2024 competition math questions across AI models. Older models (gray) score lower, while newer ones (yellow) improve. "o3-mini (high)" reaches the highest accuracy at 83.6%, showing significant progress.

Mathematics: With low reasoning effort, OpenAI o3‑mini achieves comparable performance with OpenAI o1‑mini, while with medium effort, o3‑mini achieves comparable performance with o1. Meanwhile, with high reasoning effort, o3‑mini outperforms both OpenAI o1‑mini and OpenAI o1, where the gray shaded regions show the performance of majority vote (consensus) with 64 samples.

PhD-level Science Questions (GPQA Diamond)

PhD-level science: On PhD-level biology, chemistry, and physics questions, with low reasoning effort, OpenAI o3‑mini achieves performance above OpenAI o1‑mini. With high effort, o3‑mini achieves comparable performance with o1.

FrontierMath

A black grid with multiple rows and columns, separated by thin white lines, creating a structured and organized layout.

Research-level mathematics: OpenAI o3‑mini with high reasoning performs better than its predecessor on FrontierMath. On FrontierMath, when prompted to use a Python tool, o3‑mini with high reasoning effort solves over 32% of problems on the first attempt, including more than 28% of the challenging (T3) problems. These numbers are provisional, and the chart above shows performance without tools or a calculator.

Competition Code (Codeforces)

The bar chart compares Elo ratings on Codeforces competition coding tasks across AI models. Older models (gray) score lower, while newer ones (yellow) improve. "o3-mini (high)" reaches 2073 Elo, showing significant progress over previous versions.

Competition coding: On Codeforces competitive programming, OpenAI o3‑mini achieves progressively higher Elo scores with increased reasoning effort, all outperforming o1‑mini. With medium reasoning effort, it matches o1’s performance.

Software Engineering (SWE-bench Verified (n=477))

The bar chart compares accuracy on SWE-bench Verified software engineering tasks across AI models. Older models (gray) perform lower, while "o3-mini (high)" (yellow) achieves the highest accuracy at 48.9%, showing improvement over previous versions.

Software engineering: o3‑mini is our highest performing released model on SWEbench-verified. For additional datapoints on SWE-bench Verified results with high reasoning effort, including with the open-source Agentless scaffold (39%) and an internal tools scaffold representing maximum capability elicitation (61%), see our system card⁠⁠ as the source of truth. All SWE-bench evaluation runs use a fixed subset of n=477 verified tasks which have been validated on our internal infrastructure.

LiveBench Coding

The table compares AI models on coding tasks, showing performance metrics and evaluation scores. It highlights differences in accuracy and efficiency, with some models outperforming others in specific benchmarks.

LiveBench coding: OpenAI o3‑mini surpasses o1‑high even at medium reasoning effort, highlighting its efficiency in coding tasks. At high reasoning effort, o3‑mini further extends its lead, achieving significantly stronger performance across key metrics.

一般常識

The table titled "Category Evals" compares AI models across different evaluation categories, showing performance metrics. It highlights differences in accuracy, efficiency, and effectiveness, with some models outperforming others in specific tasks.

General knowledge: o3‑mini outperforms o1‑mini in knowledge evaluations across general knowledge domains.

人類喜好設定評估

The chart compares win rates for STEM and non-STEM tasks across AI models. "o3_mini_v43_s960_j128" (yellow) outperforms "o1_mini_chatgpt" (red baseline) in both categories, with a higher win rate for STEM tasks.

The chart compares win rates under time constraints and major error rates across AI models. "o3_mini_v43_s960_j128" (yellow) outperforms "o1_mini_chatgpt" (red baseline) in win rate and significantly reduces major errors.

Human preference evaluation: Evaluations by external expert testers also show that OpenAI o3‑mini produces more accurate and clearer answers, with stronger reasoning abilities than OpenAI o1‑mini, especially for STEM. Testers preferred o3‑mini's responses to o1‑mini 56% of the time and observed a 39% reduction in major errors on difficult real-world questions.

模型速度和效能

OpenAI o3‑mini 擁有與 OpenAI o1 相符的智慧，效能更快，效率更高。除了上述強調的 STEM 評估之外，使用中等推理强度設定的 o3‑mini 還在數學和事實性評估中表現出優異的成績。在 A/B 測試中，o3‑mini 的回覆速度比 o1‑mini 快 24%，平均回覆時間為 7.7 秒，而 o1‑mini 為 10.16 秒。

Latency comparison between o1-mini and o3-mini (medium)

The bar chart compares latency between "o1-mini" and "o3-mini (medium)" models. "o3-mini" (lighter yellow) has lower latency, indicating faster response times, while "o1-mini" (darker yellow) takes longer on average.

Latency: o3‑mini has an avg 2500ms faster time to first token than o1‑mini.

安全

我們用來教導 OpenAI o3‑mini 安全回覆的其中一個關鍵技術是審慎對齊，我們訓練模型在回答用戶提示之前推理人類編寫的安全規範。與 OpenAI o1 類似，我們發現 o3‑mini 在挑戰安全性和越獄評估方面明顯超越 GPT‑4o。在部署之前，我們採用與 o1 相同的準備、外部紅隊和安全評估方法來仔細評估 o3‑mini 的安全風險。感謝申請早期測試 o3‑mini 的安全測試人員。請參閱 o3‑mini 系統卡中有關以下評估的詳細資料以及對潛在風險和緩解措施有效性的全面解釋。

Disallowed content evaluations

The table compares AI models on safety metrics, evaluating performance across different risk categories. It highlights variations in safety compliance, with some models performing better at reducing potential risks.

Jailbreak Evaluations

The table compares AI models on safety metrics across multiple risk categories, showing performance variations. It highlights differences in risk mitigation, with some models demonstrating stronger compliance and safer responses.

下一步是什麽

OpenAI o3‑mini 的發行標誌著 OpenAI 在突破高性價比智慧界限的使命上又邁出了一步。我們透過最佳化 STEM 領域的推理同時保持成本效益，讓人們更容易獲得高品質的人工智慧。這款模型繼續保持著我們降低智慧成本的記錄（自推出 GPT‑4 以來，每 Token 價格降低了 95%），同時保持著頂級推理能力。隨著人工智慧不斷普及，我們將繼續致力於引領前沿，建立平衡智慧、效率和安全的大規模模型。

作者

OpenAI

訓練

Brian Zhang、Eric Mitchell、Hongyu Ren、Kevin Lu、Max Schwarzer、Michelle Pokrass、Shengjia Zhao、Ted Sanders

評估

Adam Kalai、Alex Tachard Passos、Ben Sokolowsky、Elaine Ya Le、Erik Ritter、Hao Sheng、Hanson Wang、Ilya Kostrikov、James Lee、Johannes Ferstad、Michael Lampe、Prashanth Radhakrishnan、Sean Fitzgerald、Sebastien Bubeck、Yann Dubois、Yu Bai

最前線評估與備援準備

Andy Applebaum、Elizabeth Proehl、Evan Mays、Joel Parish、Kevin Liu、Leon Maksin、Leyton Ho、Miles Wang、Michele Wang、Olivia Watkins、Patrick Chao、Samuel Miserendino、Tejal Patwardhan

工程

Adam Walker、Akshay Nathan、Alyssa Huang、Andy Wang、Ankit Gohel、Ben Eggers、Brian Yu、Bryan Ashley、Callie Riggins Zetino、Chengdu Huang、Christian Hoareau、Davin Bogan、Emily Sokolova、Eric Horacek、Eric Jiang、Felipe Petroski Such、Jonah Cohen、Josh Gross、Justin Becker、Kan Wu、Kevin Whinnery、Larry Lv、Lee Byron、Lien Mamitsuka、Manoli Liodakis、Max Johnson、Mike Trpcic、Murat Yesildal、Rasmus Rygaard、RJ Marsan、Rohit Ramchandani、Rohan Kshirsagar、Roman Huet、Sara Conlon、Shuaiqi (Tony) Xia、Siyuan Fu、Srinivas Narayanan、Sulman Choudhry、Surya Mamidyala、Tomer Kaftan、Trevor Creech

搜尋

Adam Fry、Adam Perelman、Brandon Wang、Cristina Scheau、Philip Pronin、Sundeep Tirumalareddy、Will Ellsworth、Zewei Chu

產品

Antonia Woodford、Beth Hoover、Jake Brill、Kelly Stirman、Minnia Feng、Neel Ajjarapu、Nick Turley、Nikunj Handa、Olivier Godement

安全機制

Alex Beutel、Andrea Vallone、Andrew Duberstein、Enis Sert、Eric Wallace、Grace Zhao、Irina Kofman、Jieqi Yu、Joaquin Quinonero Candela、Madelaine Boyd、Matt Jones、Mehmet Yatbaz、Mike McClay、Mingxuan Wang、Saachi Jain、Sandhini Agarwal、Sam Toizer、Santiago Hernández、Steve Mostovoy、Young Cha、Tao Li、Yunyun Wang

外部紅隊測試

Lama Ahmad、Michael Lampe、Troy Peterson

研究計畫經理

Carpus Chang、Kristen Ying

領導層

Aidan Clark、Dane Stuckey、Jerry Tworek、Jakub Pachocki、Johannes Heidecke、Kevin Weil、Liam Fedus、Mark Chen、Sam Altman、Wojciech Zaremba

+ o1 背後的所有貢獻者⁠。