OpenAI

December 16, 2025

Research release

Evaluating AI's ability to perform scientific research tasks

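We are introducing FrontierScience, a new benchmark that evaluates AI capabilities for expert-level scientific reasoning across physics, chemistry, and biology.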


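Reasoning is at the core of scientific research. Beyond recalling facts, scientists generate hypotheses, test and refine them, and synthesize ideas across fields. As models grow more capable, the key question is whether AI can reason deeply enough to make a real contribution to scientific research.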

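Over the past year, our models have reached several major milestones, including gold-medal-level performance at the International Mathematical Olympiad and the International Olympiad in Informatics. At the same time, we are starting to see the most capable models, such as GPT‑5, accelerate real scientific workflows. Researchers have used these systems for literature search across disciplines and languages and for working through complex mathematical proofs; in many cases, work that would have taken days or weeks is finished within hours. This progress is documented in our November 2025 paper, "Early science acceleration experiments with GPT‑5," which presents early evidence that GPT‑5 can measurably accelerate scientific workflows.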

Introducing FrontierScience

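Accelerating scientific progress is one of the most promising ways AI can benefit humanity. To that end, we are improving our models on difficult math and science tasks and building tools that help scientists use these models more effectively.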

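When GPQA, a "Google-proof" science benchmark written by PhD-level experts, was released in November 2023, GPT‑4 scored 39%, below the expert baseline of 70%. Two years later, GPT‑5.2 scores 92%. As models' reasoning and knowledge capabilities continue to scale, more difficult benchmarks will be important for measuring and forecasting whether models can accelerate scientific research. Existing science benchmarks, however, are mostly multiple-choice, already saturated, or not centered on scientific reasoning.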

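To bridge this gap, we are introducing FrontierScience: a new benchmark built to measure expert-level scientific capabilities. FrontierScience is written and verified by experts in physics, chemistry, and biology, and consists of hundreds of questions designed to be difficult, original, and meaningful. The benchmark has two tracks: Olympiad, which measures Olympiad-style scientific reasoning, and Research, which measures real-world scientific research abilities. A more detailed picture of models' scientific capabilities helps us track progress and advance AI-accelerated science.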

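In our initial evaluations, GPT‑5.2 is the top-performing model on both FrontierScience-Olympiad (77%) and FrontierScience-Research (25%), ahead of other frontier models. The results show real progress on solving expert-level problems, with clear room for improvement on open-ended research-style tasks. For scientists, this means current models can already support the parts of research that involve structured reasoning, while substantial work remains to improve their open-ended thinking. These results match how scientists use the models today: to speed up research workflows, while relying on human judgment for problem framing and validation. Models are also increasingly used to explore ideas and connections that would otherwise take much longer to uncover and, in some cases, contribute new insights that experts then evaluate and test.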

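Ultimately, the most important measure of AI's scientific capabilities is the novel discoveries it helps produce; those are what deliver real value to science and society. FrontierScience lays groundwork toward that goal. It provides a north star for expert-level scientific reasoning, letting us test models on a standardized set of questions, see where they succeed and where they fall short, and identify what to improve. FrontierScience is limited in important ways, for example in its focus on constrained, expert-written problems, and it does not capture everything scientists do in their everyday work. But the field needs more difficult, original, and meaningful science benchmarks, and FrontierScience is a step in that direction.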

What FrontierScience measures and how it was built

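The full FrontierScience evaluation spans over 700 textual questions (160 of them in the gold set) covering subfields across physics, chemistry, and biology. The benchmark is composed of two tracks, Olympiad and Research. FrontierScience-Olympiad contains 100 questions designed by international olympiad medalists to assess scientific reasoning in a constrained, short-answer format. The questions are mostly theoretical and were designed to be at least as difficult as problems at international olympiad competitions. FrontierScience-Research consists of 60 original research subtasks designed by PhD-level scientists (doctoral candidates, professors, or postdoctoral researchers) and graded with a 10-point rubric. Each is a self-contained, multi-step task at the difficulty a PhD scientist might encounter during actual research.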

Sample problem

B1 reacts with aqueous bromine (Br2) to form B2. B2 reacts with potassium nitrite (KNO2) to form B3. B3 is nitrated in nitric acid (HNO3) and sulfuric acid (H2SO4) to form B4.

  • B1 contains a monosubstituted aromatic 5-membered heterocycle and has a molar mass of 96.08 g/mol. It may be produced by dehydrating 5-carbon sugars (e.g. xylose) in an acid catalyst.
  • B2 has the molecular formula C4H2Br2O3 and contains a tetrasubstituted alkene with 2 substituents being bromines cis to each other.
  • B3 is a dipotassium salt with a molar mass of 269.27 g/mol. It contains 1 hydrogen.
  • B4 is an achiral pseudohalogen dimer with 2 carbons, no hydrogens and a molar mass of 300. g/mol.

When B4 decomposes in solution, it forms an intermediate B5 and 1 equivalent of dinitrogen tetroxide (N2O4) as a side product. Intermediate B5 can be trapped and detected as a Diels-Alder adduct.

Provide the structures of B1, B2, B3, B4, and B5 in the following format, "B1: X; B2: X; B3: X; B4: X; B5: X".
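The molar masses stated in the problem act as checkable constraints on any proposed structure. As a minimal sketch of that kind of check (it is not part of the benchmark or its grading), the Python helper below computes the molar mass of a flat molecular formula and compares it with a target value. The names `molar_mass` and `fits`, the rounded atomic weights, and the 0.5 g/mol tolerance are illustrative assumptions.

```python
import re

# Rounded standard atomic weights in g/mol (assumed values for this sketch),
# limited to the elements that appear in the sample problem.
ATOMIC_WEIGHTS = {
    "H": 1.008, "C": 12.011, "N": 14.007,
    "O": 15.999, "K": 39.098, "Br": 79.904,
}


def molar_mass(formula: str) -> float:
    """Molar mass of a flat formula such as 'C4H2Br2O3' (no parentheses or charges)."""
    total = 0.0
    consumed = 0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
        consumed += len(symbol) + len(count)
    # Reject input that the simple pattern could not fully parse.
    if consumed != len(formula):
        raise ValueError(f"unparseable formula: {formula!r}")
    return total


def fits(formula: str, target: float, tol: float = 0.5) -> bool:
    """True if the formula's molar mass is within `tol` g/mol of `target`."""
    return abs(molar_mass(formula) - target) <= tol
```

For the formula the problem gives for B2, `molar_mass("C4H2Br2O3")` evaluates to about 257.9 g/mol, and the N2O4 side product comes to about 92.0 g/mol. A solver could screen candidate formulas for B1, B3, or B4 by calling `fits` with the masses stated in the problem, for example `fits(candidate, 96.08)`.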

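Every FrontierScience task is written and verified by a domain expert in physics, chemistry, or biology. For the Olympiad set, every expert has won a medal in at least one international olympiad competition, and many in more than one. For the Research set, every expert holds a relevant PhD degree.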

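The Olympiad questions were designed by 42 former international olympiad medalists or national team coaches in the relevant domains, who together hold 109 olympiad medals. The Research questions were written by 45 qualified scientists and domain experts, all of them doctoral candidates, postdoctoral researchers, or professors. Their expertise spans an array of specialized and important scientific disciplines, including quantum electrodynamics, synthetic organic chemistry, and evolutionary biology.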

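The task creation process for both sets included some selection against internal OpenAI models (for example, discarding tasks that models already answered correctly), so we expect the evaluation to be somewhat biased against these models relative to others. We have open-sourced the Olympiad gold set of 100 questions and the Research gold set of 60 questions, and we are holding out the remaining questions to track contamination.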

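Flowchart of the four stages of the task development pipeline (creation, review, resolution, and revision), with a brief description of each step and an indication of whether the task meets the "factually accurate," "gradable," "objective," and "difficult" criteria.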

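Each task goes through four stages: creation, review, resolution, and revision. Independent experts review each other's tasks to verify that they meet the criteria.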

How we evaluate model performance

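The Olympiad set is graded by short answer: a number, an expression, or a fuzzy string match, which makes correctness easy to verify. That ease of verification, however, often comes at the cost of the expressivity and open-endedness of the problem. For the Research set, we introduce a rubric-based architecture for grading more open-ended tasks. Each question has a scoring rubric made up of multiple independent, objectively assessable items totaling 10 points. The rubric assesses not only the accuracy of the final answer but also the correctness of intermediate reasoning steps, allowing a finer-grained analysis of model performance and failures. A solution is considered "correct" if it is awarded at least 7 of the 10 rubric points.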
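To make the short-answer formats concrete, here is a minimal Python sketch of a numeric check and a fuzzy string check. It is an illustration only: the benchmark's responses are judged by a model-based grader, and the function names, the 1% relative tolerance, and the 0.9 similarity threshold are assumptions rather than values taken from FrontierScience. Expression equivalence is left out because it would need a symbolic algebra library.

```python
import math
from difflib import SequenceMatcher


def numeric_match(predicted: str, expected: float, rel_tol: float = 0.01) -> bool:
    """Accept a numeric answer within a relative tolerance (assumed: 1%)."""
    try:
        value = float(predicted)
    except ValueError:
        return False  # the response was not a plain number
    return math.isclose(value, expected, rel_tol=rel_tol)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def fuzzy_match(predicted: str, expected: str, threshold: float = 0.9) -> bool:
    """Accept a string answer whose similarity ratio meets the threshold (assumed: 0.9)."""
    ratio = SequenceMatcher(None, normalize(predicted), normalize(expected)).ratio()
    return ratio >= threshold
```

A tight threshold keeps near-miss names from passing, which reflects the trade-off described above: answers that are easy to verify have to be constrained enough that a simple comparison is meaningful.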

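Every response is judged by a model-based grader (GPT‑5) against the short answer or the rubric. Ideally an expert would grade every response, but that approach does not scale, so the rubrics were designed to be checkable by a model-based grader. We also developed a verification pipeline to ensure that questions and rubrics are calibrated for difficulty and correctness and that standards are applied consistently.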

Sample chemistry research problem

Phthalocyanines are important macrocyclic compounds widely utilized in material and life sciences due to their outstanding physicochemical properties, including intense absorption in the visible-near-IR region and high electron-accepting ability. However, unlike porphyrins, the structural diversity of phthalocyanines has been limited by traditional synthetic methods that rely on one-step condensation of isoindoline units. The development of methodologies that allow controlled modification of the phthalocyanine framework, particularly at the meso-positions, represents a significant advancement in expanding the structural and electronic diversity of these compounds.

Analyze how the modification of meso-nitrogen atoms in nickel(II) phthalocyanine affects the electronic structure, aromaticity, spectroscopic properties, and reactivity of the resulting macrocycles. Explain the synthetic methodology that enables these modifications, the relationship between π-electron count and observed properties, and how these systems compare to traditional phthalocyanines and related porphyrinoids.

Sample response

1. Introduction: Why Modify the meso-Nitrogens of Phthalocyanines?

Phthalocyanines (Pc) are a class of macrocyclic dyes known for their remarkable chemical, electronic, and optical properties. Their widespread use in materials science, catalysis, and biology stems from strong absorptions in the visible/near-IR regions, exceptional redox characteristics (i.e., high electron affinity), and rigid π-conjugation. Traditional synthetic methods, however, typically form the macrocycle in a single step from phthalonitrile or isoindoline precursors, restricting the ability to introduce substitutions selectively at key ring positions, especially the "meso" (bridging) nitrogens. Recently, methodologies allowing targeted modification of the meso-nitrogen atoms in Ni(II) phthalocyanines have emerged as a powerful means to expand the structural and electronic diversity of these macrocycles.

2. Effect on Electronic Structure and Aromaticity

  1. π-Electron Count and Ring Currents

    Phthalocyanines are classical 18-electron aromatic systems, conferring high stability and planarity. Altering the meso-nitrogens through substitution or ring expansion can shift the effective π-electron count. For example, introducing additional unsaturated moieties or heteroatoms may expand the cyclic conjugation to 20 or 22 electrons, moving the molecule between Hückel aromaticity and other regimes that affect ring currents and magnetic properties.

  2. Charge Distribution and Frontier Orbitals

    Meso-substitution can tune the electron density distribution around the macrocycle. Electron-donating groups can push electron density toward the ring center, lowering the energy of the lowest unoccupied molecular orbital (LUMO). Electron-withdrawing groups, conversely, stabilize the highest occupied molecular orbital (HOMO) and shift redox potentials toward more positive values, altering both the electrochemical profile and the Q- and B-band positions in the UV-vis spectrum.

3. Spectroscopic Consequences

  1. UV-Vis Absorption (Q and B Bands)

    The principal absorption features of phthalocyanines lie in the visible (Q-band, typically 600-700 nm) and near-UV (B-band, typically 300-400 nm).

    Substitution that expands the ring conjugation or introduces strong electron-donating/withdrawing groups can:

    • Shift the Q-band to longer wavelengths (bathochromic shift), reaching into the near-IR, which is highly desirable for optoelectronic and photodynamic applications.
    • Alter relative intensities of these bands and merge or split them, reflecting changes in orbital symmetries and energies.
  2. NMR Spectroscopy and Aromatic Ring Currents

    Modifications to the π-electron count and distribution are directly observed in 1H and 13C NMR chemical shifts.

    More highly conjugated (or expanded) aromatic rings exhibit distinct downfield shifts for protons located within induced ring currents, while any partial loss of aromaticity or incorporation of antiaromatic segments can cause atypical shielding/deshielding patterns.

4. Reactivity and Coordination Chemistry

Because phthalocyanines are often used as redox catalysts or sensors, the meso-nitrogen modifications can significantly influence reactivity:

  • Electron-rich meso substituents facilitate nucleophilic or electrophilic attacks at the ring periphery, enabling site-selective functionalizations that are otherwise difficult.

(... shortened for the purposes of this figure)

Sample grading rubric

Analysis of Traditional Phthalocyanine Synthesis Limitations (1 point)

[Passed] 1.0 point: Correctly explains that traditional phthalocyanine synthesis involves one-step condensation with simultaneous formation of all meso-nitrogen bridges, providing limited control over substitution patterns at these positions.

0.5 point: Mentions limitations of traditional methods but without specific focus on meso-position control challenges.

0.0 point: Fails to identify key limitations of traditional synthetic approaches or provides incorrect analysis.

Thiolate-Mediated Tetramerization Process (1 point)

1.0 point: Correctly describes the thiolate-mediated reductive tetramerization and explains how counter cation size (K+ or Cs+ vs. Na+) affects selectivity between tetramer formation and direct macrocyclization.

0.5 point: Mentions thiolate-mediated tetramerization but without explaining factors controlling selectivity.

[Not passed] 0.0 point: Incorrectly describes the oligomerization process or omits critical details about selectivity control.

Analysis of NMR Spectroscopic Features (1 point)

1.0 point: Correctly explains that upfield shifts in the 16π system indicate paratropic ring current (antiaromaticity), contrasts this with the broad signals in 17π systems due to paramagnetism, and connects these observations to the underlying electronic structures.

[Passed] 0.5 point: Identifies basic NMR patterns but without clear connection to ring currents or electronic structure.

0.0 point: Incorrectly interprets NMR data or fails to connect spectral features to electronic properties.

Electrochemical Property Analysis (1 point)

1.0 point: Correctly explains that the 16π system shows two reversible reductions reflecting conversion to 17π radical and 18π aromatic states, while 17π systems show narrow redox gaps due to facile interconversion between 16π, 17π, and 18π states, and relates these patterns to the underlying electronic structures.

[Passed] 0.5 point: Describes redox patterns without clearly connecting them to specific electronic state changes.

0.0 point: Incorrectly interprets electrochemical data or fails to connect redox behavior to electronic properties.

Analysis of Absorption Spectroscopy (1 point)

1.0 point: Correctly explains that the 16π system shows weak/broad absorption due to symmetry-forbidden HOMO-LUMO transitions in antiaromatic systems, while 17π systems show Q-like bands plus NIR-II absorptions characteristic of radical species, and contrasts these with typical phthalocyanine spectral features.

[Passed] 0.5 point: Describes absorption features but provides limited connection to underlying electronic structures.

0.0 point: Incorrectly interprets absorption data or fails to relate spectral features to electronic properties.

Reactivity Analysis of Antiaromatic System (1 point)

1.0 point: Correctly explains the high reactivity of the 16π system toward nucleophiles, details specific reactions with hydroxide (ring opening) and hydrazine (ring expansion), and explains how these transformations relieve antiaromatic destabilization.

0.5 point: Mentions reactivity but provides limited analysis of specific transformations or the driving forces behind them.

[Not passed] 0.0 point: Incorrectly analyzes reactivity patterns or fails to connect them to the antiaromatic character of the 16π system.

(... and more)

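Each task in the Research set is graded with a rubric totaling 10 points, which can be applied by an expert or a model-based grader. To evaluate models efficiently and at scale, we use another model to grade the responses.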
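The rubric arithmetic can be sketched in a few lines of Python. This is a hedged illustration, not the benchmark's grading code: it assumes every rubric item is worth 1 point with the 0.0 / 0.5 / 1.0 tiers seen in the sample rubric, and it applies the rule stated earlier that a solution counts as correct when it earns at least 7 of 10 points. The `ItemScore` class and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable

ALLOWED_TIERS = (0.0, 0.5, 1.0)  # tiers shown in the sample rubric
PASS_THRESHOLD = 7.0             # "correct" means at least 7 of 10 points


@dataclass(frozen=True)
class ItemScore:
    """Points a grader awarded for one rubric item (hypothetical structure)."""
    name: str
    awarded: float
    max_points: float = 1.0  # assumption: each item is worth 1 point


def total_score(items: Iterable[ItemScore]) -> float:
    """Sum the awarded points after checking each one is a valid tier."""
    total = 0.0
    for item in items:
        if item.awarded not in ALLOWED_TIERS or item.awarded > item.max_points:
            raise ValueError(f"invalid score for {item.name!r}: {item.awarded}")
        total += item.awarded
    return total


def is_correct(items: Iterable[ItemScore]) -> bool:
    """Apply the pass rule: total rubric score of at least 7 points."""
    return total_score(items) >= PASS_THRESHOLD
```

Because the pass rule is a threshold on the sum, a response can miss several items entirely and still count as correct, while a response that earns partial credit everywhere (for example 0.5 on all ten items) does not.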

Model performance

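We evaluated several frontier models on FrontierScience-Olympiad and FrontierScience-Research: GPT‑5.2, Claude Opus 4.5, Gemini 3 Pro, GPT‑4o, OpenAI o4-mini, and OpenAI o3. All reasoning models were evaluated at "high" reasoning effort, except GPT‑5.2, which was evaluated at "xhigh". In our initial evaluations, GPT‑5.2 is the top-performing model on both FrontierScience-Olympiad (77%) and FrontierScience-Research (25%), ahead of other frontier models. Gemini 3 Pro is comparable to GPT‑5.2 on the Olympiad set, scoring 76%.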

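We have seen substantial progress on solving expert-level problems, but there is still headroom, especially on the more open-ended research-style tasks, where the best score is 25%. Analysis of failure transcripts shows that frontier models sometimes make reasoning, logic, or calculation errors, fail to understand niche scientific concepts, and occasionally state facts inaccurately.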

We compare the accuracy of several frontier models; GPT‑5.2 is the top-performing model on both the FrontierScience-Research and Olympiad sets.

We compare accuracy across reasoning efforts for GPT‑5.2 and o3. Longer thinking time leads to higher accuracy.

Limitations and what's next

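While FrontierScience is a step forward in the difficulty of scientific benchmarks, it still has many limitations. FrontierScience is composed of constrained problem statements and focuses on evaluating either the correctness of the final answer (Olympiad) or the reasoning needed to complete a research task (Research). In addition, grading longer, more complex tasks with multi-item rubrics may be less objective than checking a final answer directly.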

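FrontierScience gives a clearer picture of how models reason on difficult, expert-written problems, but it does not capture the full picture of how science is done in practice. In particular, it does not assess a significant part of scientific research: how models generate genuinely novel hypotheses, or how they interact with multiple modalities, including video data and physical experimental systems in the real world.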

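Looking ahead, we expect progress in scientific reasoning to come both from more capable general-purpose reasoning systems and from targeted work on scientific capabilities. FrontierScience is one tool among many; as models improve, we plan to iterate on this benchmark, expand it to new domains, and pair it with more real-world evaluations to understand what these systems actually enable scientists to do. Benchmarks like FrontierScience help us understand where today's AI systems fall short, so we can focus our work on making models reliable partners in scientific research.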