2025年1月31日

OpenAI o3‑mini

コスト効率の高い推論の限界を拡げる

読み込んでいます...

当社のリーズニングシリーズの中で最新、かつコスト効率が最も高いモデルである OpenAI o3‑mini を本日リリースいたします。ChatGPT と API の両方で今すぐご利用ください。2024 年 12 月にプレビューされた⁠この強力かつ高速なモデルは、OpenAI o1‑mini の低コスト・低レイテンシを維持しながら、科学、数学、コーディングの点で圧倒的に優れている STEM 機能を提供し、小型モデルが達成できる限界を押し拡げます。

OpenAI o3‑mini は、Function Calling⁠（新しいウィンドウで開く）、Structured Outputs⁠（新しいウィンドウで開く）、開発者メッセージ⁠（新しいウィンドウで開く）など、開発者から特に要望の多かった機能をサポートし、すぐに本番環境でご利用いただける、当社初の小型リーズニングモデルです。OpenAI o1‑mini や OpenAI o1‑preview と同様に、o3‑mini はストリーミング⁠（新しいウィンドウで開く）をサポートします。また開発者は、低、中、高の 3 つのレベルの推論努力⁠（新しいウィンドウで開く）オプションから選んで、特定のユースケースに合わせて最適化できます。この柔軟性により、o3‑mini は複雑な課題に取り組むときに「より深く思考する」ことができ、レイテンシが懸念される場合には速度を優先できます。o3‑mini は視覚機能をサポートしていないため、開発者は視覚的な推論タスクには引き続き OpenAI o1 を使用する必要があります。o3‑mini は、本日より API 使用ティア 3～5⁠（新しいウィンドウで開く）の一部の開発者を対象に、Chat Completions API、Assistants API、Batch API で公開されます。

ChatGPT Plus、Team、および Pro ユーザーは、本日から OpenAI o3‑mini にアクセスできます。Enterprise のアクセスは 2 月から利用可能です。o3‑mini はモデルピッカーで OpenAI o1‑mini に取って代わり、より高いレート制限、そしてより低いレイテンシを提供し、コーディング、STEM、および論理的問題解決タスクにとって魅力的な選択肢となります。このアップグレードの一環として、Plus と Team ユーザーのレート制限を、o1‑mini の 1 日あたり 50 メッセージから、o3‑mini では 1 日あたり 150 メッセージへ、3 倍に引き上げます。さらに、o3‑mini には検索機能が搭載され、関連するウェブソースへのリンクを含む最新の回答を検索できるようになりました。これは、リーズニングモデル全体にわたって検索を統合する作業の初期プロトタイプです。

本日より Free プランのユーザーも、メッセージコンポーザーで「推論」を選択するか、回答を再生成することで、OpenAI o3‑mini をお試しいただけます。ChatGPT で、リーズニングモデルが Free ユーザーに提供されるのは今回が初めてです。

OpenAI o1 は現時点でも、より広範な一般常識リーズニングモデルであり続けますが、OpenAI o3‑mini は精度と速度が求められる技術分野に特化した代替手段を提供します。ChatGPT において、o3‑mini は中程度の推論努力を使用して、速度と精度のバランスの取れた処理を提供します。すべての有料ユーザーは、モデルピッカーで o3‑mini‑high を選択することで、回答の生成には少し時間がかかるものの、より高度なインテリジェンス版を選択することもできます。Pro ユーザーは、o3‑mini と o3‑mini‑high の両方に無制限にアクセス可能です。

高速、強力、そして STEM 推論に最適

OpenAI o1の前身と同様、OpenAI o3‑mini は STEM 推論に対して最適化されています。中程度の推論努力を備えた o3‑mini は、数学、コーディング、科学における o1 のパフォーマンスに匹敵し、しかもより高速に回答を提供します。専門テスターの評価によると、o3‑mini は OpenAI o1‑mini よりも正確で明確な回答を生成し、推論能力も優れていることが示されました。テスターは o3‑mini の回答を o1‑mini より 56% 多く好み、現実世界の難しい質問に関する重大エラーが 39% 減少したことを観察しています。o3‑mini は中程度の推論努力で、AIME や GPQA などの最も難しい推論や知能評価のいくつかにおいて、o1 のパフォーマンスに匹敵します。

Competition Math (AIME 2024)

The bar chart compares accuracy on AIME 2024 competition math questions across AI models. Older models (gray) score lower, while newer ones (yellow) improve. "o3-mini (high)" reaches the highest accuracy at 83.6%, showing significant progress.

Mathematics: With low reasoning effort, OpenAI o3‑mini achieves comparable performance with OpenAI o1‑mini, while with medium effort, o3‑mini achieves comparable performance with o1. Meanwhile, with high reasoning effort, o3‑mini outperforms both OpenAI o1‑mini and OpenAI o1, where the gray shaded regions show the performance of majority vote (consensus) with 64 samples.

PhD-level Science Questions (GPQA Diamond)

PhD-level science: On PhD-level biology, chemistry, and physics questions, with low reasoning effort, OpenAI o3‑mini achieves performance above OpenAI o1‑mini. With high effort, o3‑mini achieves comparable performance with o1.

FrontierMath

A black grid with multiple rows and columns, separated by thin white lines, creating a structured and organized layout.

Research-level mathematics: OpenAI o3‑mini with high reasoning performs better than its predecessor on FrontierMath. On FrontierMath, when prompted to use a Python tool, o3‑mini with high reasoning effort solves over 32% of problems on the first attempt, including more than 28% of the challenging (T3) problems. These numbers are provisional, and the chart above shows performance without tools or a calculator.

Competition Code (Codeforces)

The bar chart compares Elo ratings on Codeforces competition coding tasks across AI models. Older models (gray) score lower, while newer ones (yellow) improve. "o3-mini (high)" reaches 2073 Elo, showing significant progress over previous versions.

Competition coding: On Codeforces competitive programming, OpenAI o3‑mini achieves progressively higher Elo scores with increased reasoning effort, all outperforming o1‑mini. With medium reasoning effort, it matches o1’s performance.

Software Engineering (SWE-bench Verified (n=477))

The bar chart compares accuracy on SWE-bench Verified software engineering tasks across AI models. Older models (gray) perform lower, while "o3-mini (high)" (yellow) achieves the highest accuracy at 48.9%, showing improvement over previous versions.

Software engineering: o3‑mini is our highest performing released model on SWEbench-verified. For additional datapoints on SWE-bench Verified results with high reasoning effort, including with the open-source Agentless scaffold (39%) and an internal tools scaffold representing maximum capability elicitation (61%), see our system card⁠⁠ as the source of truth. All SWE-bench evaluation runs use a fixed subset of n=477 verified tasks which have been validated on our internal infrastructure.

LiveBench Coding

The table compares AI models on coding tasks, showing performance metrics and evaluation scores. It highlights differences in accuracy and efficiency, with some models outperforming others in specific benchmarks.

LiveBench coding: OpenAI o3‑mini surpasses o1‑high even at medium reasoning effort, highlighting its efficiency in coding tasks. At high reasoning effort, o3‑mini further extends its lead, achieving significantly stronger performance across key metrics.

一般常識

The table titled "Category Evals" compares AI models across different evaluation categories, showing performance metrics. It highlights differences in accuracy, efficiency, and effectiveness, with some models outperforming others in specific tasks.

General knowledge: o3‑mini outperforms o1‑mini in knowledge evaluations across general knowledge domains.

人間の選好評価

The chart compares win rates for STEM and non-STEM tasks across AI models. "o3_mini_v43_s960_j128" (yellow) outperforms "o1_mini_chatgpt" (red baseline) in both categories, with a higher win rate for STEM tasks.

The chart compares win rates under time constraints and major error rates across AI models. "o3_mini_v43_s960_j128" (yellow) outperforms "o1_mini_chatgpt" (red baseline) in win rate and significantly reduces major errors.

Human preference evaluation: Evaluations by external expert testers also show that OpenAI o3‑mini produces more accurate and clearer answers, with stronger reasoning abilities than OpenAI o1‑mini, especially for STEM. Testers preferred o3‑mini's responses to o1‑mini 56% of the time and observed a 39% reduction in major errors on difficult real-world questions.

モデルスピードとパフォーマンス

OpenAI o1 に匹敵するインテリジェンスを備えた OpenAI o3‑mini は、より高速なパフォーマンスを実現し、効率性を高めます。上記で強調した STEM 評価に加えて、o3‑mini は中程度の推論努力による追加の数学および事実評価でも優れた結果を示しています。A/B テストでは、o3‑mini は o1‑mini よりも 24% 速く回答し、平均回答時間は 10.16 秒に対して 7.7 秒でした。

Latency comparison between o1-mini and o3-mini (medium)

The bar chart compares latency between "o1-mini" and "o3-mini (medium)" models. "o3-mini" (lighter yellow) has lower latency, indicating faster response times, while "o1-mini" (darker yellow) takes longer on average.

Latency: o3‑mini has an avg 2500ms faster time to first token than o1‑mini.

安全性

安全に回答できるよう OpenAI o3‑mini に教えるために使用した重要な手法の 1 つは、熟慮的アライメントです。この手法では、ユーザーのプロンプトに答える前に、人間が作成した安全性の仕様について推論するようにモデルに学習させています。OpenAI o1 と同様に、o3‑mini が安全評価とジェイルブレイク評価において GPT‑4o を大幅に上回っていることがわかりました。当社では導入前に、Preparedness、外部レッドチーム、安全性評価に関する o1 と同じアプローチを使用して、o3‑mini の安全性リスクを慎重に評価しました。早期アクセスで o3‑mini のテストに応募してくださった安全性テスターの皆さまに感謝いたします。以下の評価の詳細、および潜在的なリスクと軽減策の有効性に関する包括的な説明は、o3‑mini System Card に記載されています。

Disallowed content evaluations

The table compares AI models on safety metrics, evaluating performance across different risk categories. It highlights variations in safety compliance, with some models performing better at reducing potential risks.

Jailbreak Evaluations

The table compares AI models on safety metrics across multiple risk categories, showing performance variations. It highlights differences in risk mitigation, with some models demonstrating stronger compliance and safer responses.

今後の展望

OpenAI o3‑mini のリリースは、コスト効率の高いインテリジェンスの限界を押し拡げるという OpenAI の使命への新たな一歩となります。コストを抑えつつ STEM ドメインの推論を最適化することで、高品質の AI をさらに利用しやすくしていきます。このモデルは、トップティアの推論機能を維持しながら、インテリジェンスのコストを削減し、GPT‑4 のリリース以来トークンあたりの料金を 95% 削減するという当社の実績を継続していきます。AI の導入が拡大する中、当社は最先端の技術をリードし、インテリジェンス、効率性、安全性のバランスを取ったモデル構築に取り組んでいます。

著者

OpenAI

学習

Brian Zhang, Eric Mitchell, Hongyu Ren, Kevin Lu, Max Schwarzer, Michelle Pokrass, Shengjia Zhao, Ted Sanders

評価

Adam Kalai, Alex Tachard Passos, Ben Sokolowsky, Elaine Ya Le, Erik Ritter, Hao Sheng, Hanson Wang, Ilya Kostrikov, James Lee, Johannes Ferstad, Michael Lampe, Prashanth Radhakrishnan, Sean Fitzgerald, Sebastien Bubeck, Yann Dubois, Yu Bai

フロンティア評価と準備

Andy Applebaum, Elizabeth Proehl, Evan Mays, Joel Parish, Kevin Liu, Leon Maksin, Leyton Ho, Miles Wang, Michele Wang, Olivia Watkins, Patrick Chao, Samuel Miserendino, Tejal Patwardhan

エンジニアリング

Adam Walker, Akshay Nathan, Alyssa Huang, Andy Wang, Ankit Gohel, Ben Eggers, Brian Yu, Bryan Ashley, Callie Riggins Zetino, Chengdu Huang, Christian Hoareau, Davin Bogan, Emily Sokolova, Eric Horacek, Eric Jiang, Felipe Petroski Such, Jonah Cohen, Josh Gross, Justin Becker, Kan Wu, Kevin Whinnery, Larry Lv, Lee Byron, Lien Mamitsuka, Manoli Liodakis, Max Johnson, Mike Trpcic, Murat Yesildal, Rasmus Rygaard, RJ Marsan, Rohit Ramchandani, Rohan Kshirsagar, Roman Huet, Sara Conlon, Shuaiqi (Tony) Xia, Siyuan Fu, Srinivas Narayanan, Sulman Choudhry, Surya Mamidyala, Tomer Kaftan, Trevor Creech

検索

Adam Fry, Adam Perelman, Brandon Wang, Cristina Scheau, Philip Pronin, Sundeep Tirumalareddy, Will Ellsworth, Zewei Chu

製品

Antonia Woodford, Beth Hoover, Jake Brill, Kelly Stirman, Minnia Feng, Neel Ajjarapu, Nick Turley, Nikunj Handa, Olivier Godement

安全性

Alex Beutel, Andrea Vallone, Andrew Duberstein, Enis Sert, Eric Wallace, Grace Zhao, Irina Kofman, Jieqi Yu, Joaquin Quinonero Candela, Madelaine Boyd, Matt Jones, Mehmet Yatbaz, Mike McClay, Mingxuan Wang, Saachi Jain, Sandhini Agarwal, Sam Toizer, Santiago Hernández, Steve Mostovoy, Young Cha, Tao Li, Yunyun Wang

外部レッドチーム

Lama Ahmad 氏、Michael Lampe 氏、Troy Peterson 氏

研究プログラムマネージャー

Carpus Chang, Kristen Ying

リーダーシップ

Aidan Clark, Dane Stuckey, Jerry Tworek, Jakub Pachocki, Johannes Heidecke, Kevin Weil, Liam Fedus, Mark Chen, Sam Altman, Wojciech Zaremba

加えて o1 実現の貢献者全員⁠。