December 11, 2025

Advancing science and math with GPT‑5.2

GPT‑5.2 is our strongest model yet for math and science.


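One of our hopes for strong AI is that it will accelerate scientific research for everyone's benefit, helping researchers explore more ideas, test them faster, and turn discoveries into real-world impact.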

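Over the past year, we have worked closely with scientists in mathematics, physics, biology, and computer science to understand where AI can help and where it still falls short. Last month, we published a paper collecting early case studies across mathematics, physics, biology, computer science, astronomy, and materials science, showing that GPT‑5 has begun contributing to real scientific work. With GPT‑5.2, those results are becoming more consistent and reliable.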

Stronger performance where precision matters

GPT‑5.2 Pro and GPT‑5.2 Thinking are our strongest models to date for science and math.

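The credibility of scientific and technical research rests on sound mathematical reasoning. That reasoning lets a model follow multi-step logic, keep quantities consistent, and avoid subtle errors that compound in real analyses such as simulation, statistics, forecasting, and modeling. Gains on benchmarks like FrontierMath reflect not a single narrow skill but stronger general reasoning and abstraction, which carry over directly to scientific workflows such as coding, data analysis, and experimental design.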

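These capabilities are also closely tied to progress toward artificial general intelligence. A system that reasons reliably about abstractions, stays consistent over long chains of thought, and generalizes across domains shows core traits of AGI. These are not tricks that help only on individual tasks; they are broadly applicable reasoning abilities that span science, engineering, and real-world decision-making.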

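We believe GPT‑5.2 Pro and GPT‑5.2 Thinking are currently the world's most effective models for assisting and accelerating scientific research. On GPQA Diamond, a graduate-level, Google-proof science Q&A benchmark, GPT‑5.2 Pro scores 93.2% and GPT‑5.2 Thinking scores 92.4%.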

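In GPQA Diamond, models answer graduate-level multiple-choice questions in physics, chemistry, and biology. No tools were enabled, and reasoning effort was set to maximum.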

On FrontierMath (Tier 1–3), an evaluation of expert-level mathematics, GPT‑5.2 Thinking also set a new record, solving 40.3% of the problems.

In FrontierMath, models solve expert-level mathematics problems. A Python tool was enabled, and reasoning effort was set to maximum.

Case study

GPT‑5.2 is not only strong at graduate-level science problems. We now regularly see our frontier models contributing solutions to previously unsolved, and increasingly subtle, questions in mathematics and the sciences.

In this case study, we describe how GPT‑5.2 Pro helped resolve an open research problem in statistical learning theory, documented in a new paper, On Learning-Curve Monotonicity for Maximum Likelihood Estimators.

The question (“If you collect more data, do your results reliably get better?”) shows up any time you fit a model from data. You can draw a learning curve that tracks average error as you add more examples. In the best case, the curve is monotone. More data means less error, every step of the way. That is the behavior people hope for, and often assume.

But over the last few years, researchers have learned that this intuition can fail. A line of work kicked off by an open problem posed at the Conference on Learning Theory (COLT) in 2019 by Viering, Mey, and Loog showed that the answer is often no. Even very simple, well-behaved toy setups can have non-monotonic learning curves, where adding data increases expected error. That surprise triggered a wave of follow-up papers. They expanded the list of settings where these reversals happen and proposed increasingly elaborate methods designed to restore monotone behavior.

Still, one of the most basic cases remained unresolved. What happens in the cleanest textbook situation, where the statistical model is actually correct and the data follow the familiar bell curve pattern, with a known mean but unknown standard deviation? Researchers already knew that small changes to this setup could break monotonic behavior. But the answer remained unknown in this core case.
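To make the setting concrete, here is a minimal sketch of what a learning curve looks like for a normal model with known mean and unknown standard deviation. It is an illustration under stated assumptions, not the paper's argument: the error measure is taken to be the expected Kullback-Leibler divergence between the true and fitted distributions (in both directions), which may differ from the measure the paper analyzes, and all function names are hypothetical. With the mean known, the maximum likelihood variance estimate divided by the true variance is distributed as a chi-squared variable with n degrees of freedom divided by n, which gives closed forms in terms of the digamma function.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def digamma_half(n: int) -> float:
    """Return psi(n / 2) for a positive integer n via psi(x + 1) = psi(x) + 1 / x."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if n % 2 == 0:
        x, value = 1.0, -EULER_GAMMA  # psi(1) = -gamma
    else:
        x, value = 0.5, -EULER_GAMMA - 2.0 * math.log(2.0)  # psi(1/2)
    while x < n / 2:
        value += 1.0 / x
        x += 1.0
    return value


def expected_forward_kl(n: int) -> float:
    """E[KL(true || fitted)] after n samples (assumed error measure).

    Infinite for n <= 2 because E[1 / chi2_n] diverges there.
    """
    if n <= 2:
        return math.inf
    return 0.5 * (digamma_half(n) - math.log(n / 2) + 2.0 / (n - 2))


def expected_reverse_kl(n: int) -> float:
    """E[KL(fitted || true)] after n samples; finite for every n >= 1."""
    return 0.5 * (math.log(n / 2) - digamma_half(n))


def is_strictly_decreasing(values) -> bool:
    """True when each value is strictly smaller than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))


# Learning curves: expected error as a function of the sample size n.
forward_curve = [expected_forward_kl(n) for n in range(3, 201)]
reverse_curve = [expected_reverse_kl(n) for n in range(1, 201)]

print("forward KL curve decreasing:", is_strictly_decreasing(forward_curve))
print("reverse KL curve decreasing:", is_strictly_decreasing(reverse_curve))
```

A check like this only inspects a finite range of sample sizes under one choice of error measure, so it is a sanity check on the intuition rather than a proof of monotonicity.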

Our new paper demonstrates that in this clean setting, intuition prevails: learning is predictably improved by more data, rather than behaving in surprising or unstable ways. What makes this paper unusual is how the proof was obtained. The authors did not work out a strategy and then ask the model to fill in steps. They did not provide intermediate arguments or a proof outline. Instead, they asked GPT‑5.2 Pro to solve the open problem directly, and then carefully verified the proof, including review and validation by external subject-matter experts.

The authors then asked simple follow-up questions to see how far the idea could go. GPT‑5.2 Pro extended the result beyond the original problem to higher dimensional settings and other common statistical models. Throughout, the human role stayed focused on verification and clear writing, rather than supplying mathematical scaffolding.