OpenAI

December 11, 2025


Advancing science and math with GPT‑5.2

GPT‑5.2 is OpenAI's strongest model yet for math and science work.


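We hope that more capable AI can accelerate scientific research for the benefit of everyone: helping researchers explore more ideas, test hypotheses faster, and turn discoveries into real-world impact.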

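Over the past year, we have worked closely with researchers in mathematics, physics, biology, and computer science to understand where AI can help and where it falls short. Last month, we published a paper compiling early case studies across mathematics, physics, biology, computer science, astronomy, and materials science, showing that GPT‑5 has begun contributing to real scientific work. With GPT‑5.2, we are seeing these capabilities become more consistent and reliable.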

Stronger performance where precision matters

GPT‑5.2 Pro and GPT‑5.2 Thinking are our strongest models to date for scientific and mathematical work.

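Strong mathematical reasoning is the foundation of reliable scientific and technical work. It lets a model follow multi-step logic, keep quantities consistent, and avoid subtle errors that can compound into serious ones. These abilities carry over to analytical work such as simulation, statistics, forecasting, and modeling. The gains shown on benchmarks such as FrontierMath are not the improvement of a single narrow skill; they reflect stronger reasoning and abstraction, which apply directly to research workflows such as coding, data analysis, and experimental design.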

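These capabilities are also closely tied to progress toward general intelligence. A system that can reliably reason abstractly, stay consistent over long chains of thought, and generalize across domains shows core traits of AGI: not a trick for a single task, but broad reasoning ability that spans science, engineering, and real-world decision-making.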

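We believe GPT‑5.2 Pro and GPT‑5.2 Thinking are currently the world's most effective models for assisting and accelerating scientific research. On GPQA Diamond, a graduate-level science question-answering benchmark designed so that answers cannot simply be found through a Google search, GPT‑5.2 Pro scores 93.2% and GPT‑5.2 Thinking scores 92.4%.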

In GPQA Diamond, models answer multiple-choice questions about physics, chemistry, and biology. No tools were enabled, and reasoning effort was set to maximum.

On FrontierMath (Tier 1–3), an evaluation of expert-level mathematics, GPT‑5.2 Thinking also set a new high, solving 40.3% of problems.

In FrontierMath, models solve expert-level mathematics problems. A Python tool was enabled, and reasoning effort was set to maximum.

Case study

GPT‑5.2 is not only strong at graduate-level science problems. We now regularly see our frontier models contributing solutions to previously unsolved—and increasingly subtle—questions in mathematics and the sciences.

In this case study, we describe how GPT‑5.2 Pro helped resolve an open research problem in statistical learning theory, documented in a new paper, On Learning-Curve Monotonicity for Maximum Likelihood Estimators.

The question (“If you collect more data, do your results reliably get better?”) shows up any time you fit a model from data. You can draw a learning curve that tracks average error as you add more examples. In the best case, the curve is monotone. More data means less error, every step of the way. That is the behavior people hope for, and often assume.
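One way to state the property in symbols is sketched below. The notation is an assumption introduced here for illustration (it is not taken from the paper): \hat{\theta}_n is the model fitted on n samples, \ell is whatever error measure is being tracked, and R(n) is the resulting learning curve.

```latex
% Notation assumed for illustration only:
%   \hat{\theta}_n : model fitted on n i.i.d. samples
%   \ell           : chosen error measure
%   R(n)           : learning curve (expected error at sample size n)
R(n) = \mathbb{E}\!\left[\ell\big(\hat{\theta}_n\big)\right],
\qquad
\text{the learning curve is monotone} \iff R(n+1) \le R(n) \quad \text{for all } n \ge 1 .
```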

But over the last few years, researchers have learned that this intuition can fail. A line of work kicked off by an open problem posed at the Conference on Learning Theory (COLT) in 2019 by Viering, Mey, and Loog showed that the answer is often no. Even very simple, well-behaved toy setups can have non-monotonic learning curves, where adding data increases expected error. That surprise triggered a wave of follow-up papers. They expanded the list of settings where these reversals happen and proposed increasingly elaborate methods designed to restore monotone behavior.

Still, one of the most basic cases remained unresolved. What happens in the cleanest textbook situation, where the statistical model is actually correct and the data follow the familiar bell curve pattern, with a known mean but unknown standard deviation? Researchers already knew that small changes to this setup could break monotonic behavior. But the answer remained unknown in this core case.
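To make the setting concrete, here is a minimal Python sketch of a learning curve for a Gaussian with known mean and unknown standard deviation, fitted by maximum likelihood. Several choices are assumptions made for illustration, not details from the paper: the error measure (KL divergence from the fitted distribution to the true one), the function names, and the values of `sigma`, `trials`, and `seed`. Under that choice of error the expected value has a closed form, (1/2)(log(n/2) - psi(n/2)) with psi the digamma function, which the code evaluates exactly and cross-checks by simulation.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def digamma_half(n):
    """Return psi(n/2) for a positive integer n, using psi(x + 1) = psi(x) + 1/x."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if n % 2 == 0:
        x, value = 1.0, -EULER_GAMMA                        # psi(1)
    else:
        x, value = 0.5, -EULER_GAMMA - 2.0 * math.log(2.0)  # psi(1/2)
    for _ in range((n - 1) // 2):
        value += 1.0 / x
        x += 1.0
    return value


def expected_kl(n):
    """Expected KL(fitted || true) for the variance MLE on n samples.

    Assumed error measure. With r = var_hat / sigma^2, the KL divergence is
    0.5 * (r - 1 - log r); n * r is chi-square with n degrees of freedom,
    so E[r] = 1 and E[log r] = psi(n/2) - log(n/2).
    """
    return 0.5 * (math.log(n / 2.0) - digamma_half(n))


def simulated_kl(n, sigma=2.0, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity (illustrative parameters)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # MLE of the variance when the mean is known (here, mean 0).
        var_hat = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n)) / n
        r = var_hat / sigma ** 2
        total += 0.5 * (r - 1.0 - math.log(r))
    return total / trials


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 10, 50):
        print(n, round(expected_kl(n), 4), round(simulated_kl(n), 4))
```

Printing the two columns side by side shows the curve falling as n grows. This is only a numerical check at a handful of sample sizes under one assumed error measure; it does not reproduce or replace the proof described in the paper.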

Our new paper demonstrates that in this clean setting, intuition prevails: learning is predictably improved by more data, rather than behaving in surprising or unstable ways. What makes this paper unusual is how the proof was obtained. The authors did not work out a strategy and then ask the model to fill in steps. They did not provide intermediate arguments or a proof outline. Instead, they asked GPT‑5.2 Pro to solve the open problem directly, and then carefully verified the proof, including review and validation by external subject-matter experts.

The authors then asked simple follow-up questions to see how far the idea could go. GPT‑5.2 Pro extended the result beyond the original problem to higher dimensional settings and other common statistical models. Throughout, the human role stayed focused on verification and clear writing, rather than supplying mathematical scaffolding.

Looking ahead

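This result points to an important direction for how AI can support scientific research, especially in fields with an axiomatic theoretical foundation, such as mathematics and theoretical computer science. In these fields, frontier models can help researchers explore proofs, test hypotheses, and find connections that might otherwise take people a great deal of time to uncover.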

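These systems are not independent researchers, however. Expert judgment, verification, and domain knowledge remain essential. Even highly capable models can make mistakes or rely on unstated assumptions, but they can also produce structured reasoning and arguments that merit further study and refinement. Making steady, reliable use of AI in research requires workflows that keep verification, transparency, and collaboration in the loop.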

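As a case study, this result illustrates an emerging mode of research: models like GPT‑5.2 can serve as tools that support mathematical reasoning and speed up early-stage exploration, while final responsibility for correctness, interpretation, and context stays with the researchers. Used carefully, these systems can streamline many parts of theoretical research without displacing the central role of human judgment in science.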