December 11, 2025

Advancing science and math with GPT‑5.2

GPT‑5.2 is OpenAI's strongest model yet for math and science work.

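We hope that more capable AI will accelerate scientific research to everyone's benefit, helping researchers explore more ideas, test hypotheses faster, and turn discoveries into real-world impact.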

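Over the past year, we have worked closely with researchers in math, physics, biology, and computer science to understand where AI can help and where it falls short. Last month, we published a paper compiling early case studies across math, physics, biology, computer science, astronomy, and materials science, showing that GPT‑5 has begun contributing to real scientific work. With GPT‑5.2, we see these capabilities becoming more consistent and reliable.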

Stronger performance where precision matters

GPT‑5.2 Pro and GPT‑5.2 Thinking are our strongest models to date for science and math work.

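Strong mathematical reasoning is the foundation of reliable scientific and technical work. It lets a model follow multi-step logic, keep quantities consistent, and avoid subtle errors that can compound into serious ones. These abilities carry over to analytical work such as simulation, statistics, forecasting, and modeling. Gains on benchmarks like FrontierMath do not reflect improvement in a single narrow skill; they reflect stronger general reasoning and abstraction, which apply directly to research workflows such as coding, data analysis, and experimental design.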

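These capabilities are also closely tied to progress toward general intelligence. A system that can reliably reason abstractly, stay consistent across long chains of thought, and generalize across domains is showing core traits of AGI: not tricks for a single task, but broad reasoning that carries across science, engineering, and real-world decision-making.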

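We believe GPT‑5.2 Pro and GPT‑5.2 Thinking are currently the world's best models for assisting and accelerating scientific research. On GPQA Diamond, a graduate-level, Google-proof science Q&A benchmark, GPT‑5.2 Pro scores 93.2% and GPT‑5.2 Thinking scores 92.4%.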

In GPQA Diamond, models answer multiple-choice questions about physics, chemistry, and biology. No tools were enabled, and reasoning effort was set to maximum.

On FrontierMath (Tiers 1–3), an expert-level math evaluation, GPT‑5.2 Thinking also set a new high, solving 40.3% of problems.

In FrontierMath, models solve expert-level math problems. A Python tool was enabled, and reasoning effort was set to maximum.

Case study

GPT‑5.2 is not only strong at graduate-level science problems. We now regularly see our frontier models contributing solutions to previously unsolved, and increasingly subtle, questions in mathematics and the sciences.

In this case study, we describe how GPT‑5.2 Pro helped resolve an open research problem in statistical learning theory, documented in a new paper, On Learning-Curve Monotonicity for Maximum Likelihood Estimators.

The question (“If you collect more data, do your results reliably get better?”) shows up any time you fit a model from data. You can draw a learning curve that tracks average error as you add more examples. In the best case, the curve is monotone. More data means less error, every step of the way. That is the behavior people hope for, and often assume.
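As a rough illustration of what a learning curve is, the sketch below estimates average squared error by simulation for one simple case: estimating the mean of a unit-variance Gaussian with the sample mean. The estimator, the error measure, and the names `learning_curve` and `n_trials` are assumptions chosen for illustration; none of them come from the paper.

```python
import random


def learning_curve(max_n, n_trials=10000, seed=0):
    """Monte Carlo estimate of E[(sample mean - true mean)^2] for n = 1..max_n."""
    rng = random.Random(seed)
    true_mean = 0.0  # hypothetical ground truth for the simulation
    curve = []
    for n in range(1, max_n + 1):
        total = 0.0
        for _ in range(n_trials):
            # Draw n examples and fit the "model" (here, just the sample mean).
            sample_mean = sum(rng.gauss(true_mean, 1.0) for _ in range(n)) / n
            total += (sample_mean - true_mean) ** 2
        # Average error over many repeated datasets of size n.
        curve.append(total / n_trials)
    return curve


if __name__ == "__main__":
    for n, err in enumerate(learning_curve(5), start=1):
        print(f"n={n}: average squared error ~ {err:.3f}")
```

In this particular toy case the expected squared error is exactly 1/n, so the curve is monotone. The open question discussed here is whether the same holds for other estimators and error measures.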

But over the last few years, researchers have learned that this intuition can fail. A line of work kicked off by an open problem posed at the Conference on Learning Theory (COLT) in 2019 by Viering, Mey, and Loog showed that the answer is often no. Even very simple, well-behaved toy setups can have non-monotonic learning curves, where adding data increases expected error. That surprise triggered a wave of follow-up papers. They expanded the list of settings where these reversals happen and proposed increasingly elaborate methods designed to restore monotone behavior.

Still, one of the most basic cases remained unresolved. What happens in the cleanest textbook situation, where the statistical model is actually correct and the data follow the familiar bell curve pattern, with a known mean but unknown standard deviation? Researchers already knew that small changes to this setup could break monotonic behavior. But the answer remained unknown in this core case.
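To make this setting concrete, here is a minimal sketch of one possible learning curve for a Gaussian with known mean and unknown variance. It assumes the error is measured as the expected KL divergence KL(fitted || true). That choice is an assumption made because it has a simple closed form. It is not taken from the paper, and the sketch does not reproduce the paper's proof. With the maximum likelihood estimate, the ratio of fitted to true variance is a chi-squared variable with n degrees of freedom divided by n, which gives an expected divergence of 0.5 * (log(n/2) - psi(n/2)), where psi is the digamma function. The helper names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329


def digamma_half(n):
    """Digamma psi(n/2) for a positive integer n, using psi(x + 1) = psi(x) + 1/x."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if n % 2 == 0:
        start, value = 1.0, -EULER_GAMMA  # psi(1) = -gamma
    else:
        start, value = 0.5, -EULER_GAMMA - 2.0 * math.log(2.0)  # psi(1/2)
    # Step from the base point up to n/2 in unit increments.
    for k in range((n - 1) // 2):
        value += 1.0 / (start + k)
    return value


def expected_reverse_kl(n):
    """Expected KL(fitted || true) for the variance MLE with known mean, n samples.

    Assumed error measure, chosen for illustration only.
    """
    return 0.5 * (math.log(n / 2.0) - digamma_half(n))


if __name__ == "__main__":
    for n in range(1, 11):
        print(f"n={n}: expected KL ~ {expected_reverse_kl(n):.4f}")
```

Under this assumed error measure the values shrink as n grows, which matches the monotone behavior described in this section. Other error measures need their own analysis.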

Our new paper demonstrates that in this clean setting, intuition prevails: learning is predictably improved by more data, rather than behaving in surprising or unstable ways. What makes this paper unusual is how the proof was obtained. The authors did not work out a strategy and then ask the model to fill in steps. They did not provide intermediate arguments or a proof outline. Instead, they asked GPT‑5.2 Pro to solve the open problem directly, and then carefully verified the proof, including review and validation by external subject-matter experts.

The authors then asked simple follow-up questions to see how far the idea could go. GPT‑5.2 Pro extended the result beyond the original problem to higher dimensional settings and other common statistical models. Throughout, the human role stayed focused on verification and clear writing, rather than supplying mathematical scaffolding.