OpenAI

December 16, 2025


Evaluating AI's ability to perform scientific research tasks

We introduce FrontierScience, a new benchmark that evaluates AI's capacity for expert-level scientific reasoning across physics, chemistry, and biology.


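Reasoning is at the core of scientific research. Beyond recalling established knowledge, scientists generate hypotheses, test and refine them, and integrate ideas across fields. As models become more capable, the key question is how deeply they can reason in order to make real contributions to scientific research.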

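Over the past year, our models have reached several major milestones, including gold-medal-level performance at the International Mathematical Olympiad and the International Olympiad in Informatics. At the same time, we are starting to see our most capable models, such as GPT‑5, meaningfully accelerate real-world scientific workflows. Researchers have applied these systems to tasks such as literature search across disciplines and languages, and working through complex mathematical proofs. In many cases, work that might have taken days or even weeks can now be completed in hours. This progress is documented in our paper "Early science acceleration experiments with GPT‑5," published in November 2025, which presents early evidence that GPT‑5 can measurably accelerate scientific workflows.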

Introducing FrontierScience

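Accelerating scientific progress is one of the most promising ways AI can benefit humanity. With that in mind, we are continuing to improve our models on difficult math and science tasks while building tools that help scientists get the most out of them.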

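When GPQA, a science benchmark written by PhD-level experts and described as "Google-proof," was released in November 2023, GPT‑4 scored 39%, below the expert baseline of 70%. Two years later, GPT‑5.2 scores 92%. As models' reasoning and knowledge capabilities continue to scale, more challenging benchmarks will be important tools for measuring and forecasting models' ability to accelerate scientific research. Prior science benchmarks have largely relied on multiple-choice questions, have become saturated, or are not centered on scientific reasoning.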

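To fill this gap, we are introducing FrontierScience: a new benchmark built to measure expert-level scientific capabilities. FrontierScience questions are written and verified by experts in physics, chemistry, and biology, and the benchmark comprises hundreds of questions designed to be difficult, original, and close to actual research. FrontierScience includes two sets: Olympiad, which measures olympiad-style scientific reasoning, and Research, which evaluates scientific research ability in real-world settings. Deeper insight into models' scientific capabilities helps us track progress and advance AI-accelerated science.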

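In our initial evaluations, GPT‑5.2 is the top-performing model on both FrontierScience-Olympiad (scoring 77%) and Research (scoring 25%), ahead of other frontier models. We have made substantial progress on solving expert-level questions, with room left to improve, especially on open-ended, research-style tasks. For scientists, this means current models can already support the parts of research that involve structured reasoning, while still having significant room to improve at open-ended thinking. These results match how scientists use models today: models help accelerate research workflows while relying on human judgment for problem framing and validation, and they are increasingly used to explore ideas and connections that would otherwise take much longer to uncover. In some cases, models even contribute new insights that experts then evaluate and test.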

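Ultimately, the most important measure of AI's scientific capabilities is the novel discoveries it helps produce; those are what truly matter to science and society. FrontierScience can lay groundwork toward that goal. It provides a clear reference point for expert-level scientific reasoning, letting us test models on a standardized set of questions, observe where they succeed and fail, and identify where they need to improve. FrontierScience itself is limited in several important ways, for example in focusing on constrained, expert-written problems that cannot capture everything scientists do in their day-to-day work. Still, the field urgently needs more difficult, original, and meaningful science benchmarks, and FrontierScience is a step in that direction.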

What FrontierScience measures and how it was built

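The full FrontierScience evaluation spans over 700 text-based questions (160 of them in the gold set) covering subfields across physics, chemistry, and biology. The benchmark is made up of two sets: Olympiad and Research. FrontierScience-Olympiad contains 100 questions designed by international olympiad medalists, in a constrained short-answer format, to assess scientific reasoning. The Olympiad set was designed to contain theoretical questions at least as difficult as those in international olympiad competitions. FrontierScience-Research consists of 60 original research subtasks designed by PhD-level scientists (doctoral candidates, professors, or postdoctoral researchers) and graded using a 10-point rubric. The Research set was designed to represent tasks that a PhD scientist might encounter in the course of actual research. Each question is self-contained, with the full background and necessary information included, requires no external material, and calls for multiple sequential steps of reasoning and judgment, so that its overall difficulty is comparable to a real research setting.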

Sample problem

B1 reacts with aqueous bromine (Br2) to form B2. B2 reacts with potassium nitrite (KNO2) to form B3. B3 is nitrated in nitric acid (HNO3) and sulfuric acid (H2SO4) to form B4.

  • B1 contains a monosubstituted aromatic 5-membered heterocycle and has a molar mass of 96.08 g/mol. It may be produced by dehydrating 5-carbon sugars (e.g. xylose) in an acid catalyst.
  • B2 has the molecular formula C4H2Br2O3 and contains a tetrasubstituted alkene with 2 substituents being bromines cis to each other.
  • B3 is a dipotassium salt with a molar mass of 269.27 g/mol. It contains 1 hydrogen.
  • B4 is an achiral pseudohalogen dimer with 2 carbons, no hydrogens and a molar mass of 300. g/mol.

When B4 decomposes in solution, it forms an intermediate B5 and 1 equivalent of dinitrogen tetroxide (N2O4) as a side product. Intermediate B5 can be trapped and detected as a Diels-Alder adduct.

Provide the structures of B1, B2, B3, B4, and B5 in the following format, "B1: X; B2: X; B3: X; B4: X; B5: X".

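Each FrontierScience task is written and verified by an expert in physics, chemistry, or biology. For the Olympiad set, all authoring experts have earned a medal in at least one (and often more than one) international olympiad competition. For the Research set, all authoring experts hold a PhD in a relevant field.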

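The Olympiad questions were created in collaboration with 42 former international medalists or national team coaches in the relevant fields, who together hold 109 olympiad medals. The Research questions were created in collaboration with 45 qualified scientists and domain experts. All participating scientists are doctoral candidates, postdoctoral researchers, or professors. Their expertise spans a range of highly specialized and important scientific fields, including quantum electrodynamics, synthetic organic chemistry, and evolutionary biology.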

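In building tasks for both sets, we deliberately selected against OpenAI's internal models (for example, by discarding tasks that the models already answered correctly), so these models may be at a slight disadvantage in the evaluation relative to other models. We are open-sourcing the 100 gold-set questions from the Olympiad set and the 60 gold-set questions from the Research set, and holding out the remaining questions to track contamination.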
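The selection step described above, discarding candidate tasks that a reference model already answers correctly, can be pictured as a simple filter. The following is only a minimal sketch under stated assumptions: the task IDs, the `solved` set, and the function name `drop_solved_tasks` are hypothetical and are not taken from the FrontierScience pipeline.

```python
def drop_solved_tasks(task_ids, solved_by_reference_model):
    """Keep only the tasks that the reference model did NOT already solve.

    Both arguments are hypothetical stand-ins: `task_ids` is any iterable of
    task identifiers, and `solved_by_reference_model` is the set of IDs the
    reference model answered correctly in some earlier grading pass.
    """
    return [tid for tid in task_ids if tid not in solved_by_reference_model]


# Hypothetical candidate pool and grading outcomes, for illustration only.
candidates = ["phys-001", "chem-002", "bio-003", "chem-004"]
solved = {"chem-002", "chem-004"}

kept = drop_solved_tasks(candidates, solved)
print(kept)  # ['phys-001', 'bio-003']
```

Because the filter is conditioned on one model family's results, the retained tasks are by construction the ones that family found hard, which is why the text notes a possible disadvantage for those models.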

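Figure: flowchart of the four stages of the task development pipeline (creation, review, resolution, and revision), with a brief description of each step and the review criteria "factuality," "gradability," "objectivity," and "difficulty."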

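Each task goes through four stages in order: creation, review, resolution, and revision. Independent experts review each other's tasks to verify that they meet the established criteria.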

How we grade model performance

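The Olympiad set is graded in short-answer form: an answer can be a number, a mathematical expression, or a fuzzy string match, which makes correctness straightforward to verify. This kind of verification, however, often comes at the cost of expressiveness and open-endedness in the problems. The Research set instead uses a rubric-based framework for grading more open-ended tasks. Each question includes a rubric made up of multiple independent, objectively assessable items totaling 10 points. The rubric assesses not only whether the final answer is correct but also whether the intermediate reasoning steps are sound, allowing finer-grained analysis of model performance and failures. A solution is considered "correct" if it earns at least 7 of the 10 rubric points.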
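To make the short-answer idea concrete, here is a minimal Python sketch of comparing a response to a reference answer, using a numeric tolerance for numbers and a fuzzy string ratio for everything else. It is an illustration only: responses in FrontierScience are graded by a model-based grader, and the function name, the 1% relative tolerance, and the 0.9 similarity threshold are all assumptions made for this example. Mathematical expressions are not handled here.

```python
import math
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def short_answer_matches(predicted: str, reference: str,
                         rel_tol: float = 1e-2,
                         fuzzy_threshold: float = 0.9) -> bool:
    """Return True if a short answer agrees with the reference.

    If both strings parse as numbers, they are compared with a relative
    tolerance. Otherwise the comparison falls back to a fuzzy string
    similarity ratio. Both thresholds are illustrative assumptions.
    """
    try:
        return math.isclose(float(predicted), float(reference), rel_tol=rel_tol)
    except ValueError:
        pass  # at least one side is not a plain number; compare as text
    ratio = SequenceMatcher(None, normalize(predicted), normalize(reference)).ratio()
    return ratio >= fuzzy_threshold


print(short_answer_matches("3.14", "3.1416"))     # True (within 1%)
print(short_answer_matches("Benzene ", "benzene"))  # True (identical after normalizing)
print(short_answer_matches("2.0", "3.0"))          # False
```

A fixed string threshold is a blunt instrument for chemical names or structures, which is one reason a model-based grader is a more flexible choice for this kind of answer.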

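All responses are graded by a model-based grader (GPT‑5) against either the short answer or the rubric. Ideally, an expert would grade every response by hand, but that does not scale, so the rubrics are designed to be checkable by a model. We also built a verification pipeline to confirm that questions and rubrics are well calibrated for difficulty and correctness.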

Sample chemistry research problem

Phthalocyanines are important macrocyclic compounds widely utilized in material and life sciences due to their outstanding physicochemical properties, including intense absorption in the visible-near-IR region and high electron-accepting ability. However, unlike porphyrins, the structural diversity of phthalocyanines has been limited by traditional synthetic methods that rely on one-step condensation of isoindoline units. The development of methodologies that allow controlled modification of the phthalocyanine framework, particularly at the meso-positions, represents a significant advancement in expanding the structural and electronic diversity of these compounds.

Analyze how the modification of meso-nitrogen atoms in nickel(II) phthalocyanine affects the electronic structure, aromaticity, spectroscopic properties, and reactivity of the resulting macrocycles. Explain the synthetic methodology that enables these modifications, the relationship between π-electron count and observed properties, and how these systems compare to traditional phthalocyanines and related porphyrinoids.

Sample response

1. Introduction: Why Modify the meso-Nitrogens of Phthalocyanines?

Phthalocyanines (Pc) are a class of macrocyclic dyes known for their remarkable chemical, electronic, and optical properties. Their widespread use in materials science, catalysis, and biology stems from strong absorptions in the visible/near-IR regions, exceptional redox characteristics (i.e., high electron affinity), and rigid π-conjugation. Traditional synthetic methods, however, typically form the macrocycle in a single step from phthalonitrile or isoindoline precursors, restricting the ability to introduce substitutions selectively at key ring positions-especially the "meso" (bridging) nitrogens. Recently, methodologies allowing targeted modification of the meso-nitrogen atoms in Ni(II) phthalocyanines have emerged as a powerful means to expand the structural and electronic diversity of these macrocycles.

2. Effect on Electronic Structure and Aromaticity

  1. π-Electron Count and Ring Currents

    Phthalocyanines are classical 18-electron aromatic systems, conferring high stability and planarity. Altering the meso-nitrogens through substitution or ring expansion can shift the effective π-electron count. For example, introducing additional unsaturated moieties or heteroatoms may expand the cyclic conjugation to 20 or 22 electrons, moving the molecule between Hückel aromaticity and other regimes that affect ring currents and magnetic properties.

  2. Charge Distribution and Frontier Orbitals

    Meso-substitution can tune the electron density distribution around the macrocycle. Electron-donating groups can push electron density toward the ring center, lowering the energy of the lowest unoccupied molecular orbital (LUMO). Electron-withdrawing groups, conversely, stabilize the highest occupied molecular orbital (HOMO) and shift redox potentials toward more positive values, altering both the electrochemical profile and the Q- and B-band positions in the UV-vis spectrum.

3. Spectroscopic Consequences

  1. UV-Vis Absorption (Q and B Bands)

    The principal absorption features of phthalocyanines lie in the visible (Q-band, typically 600-700 nm) and near-UV (B-band, typically 300-400 nm).

    Substitution that expands the ring conjugation or introduces strong electron-donating/withdrawing groups can:

    • Shift the Q-band to longer wavelengths (bathochromic shift), reaching into the near-IR, which is highly desirable for optoelectronic and photodynamic applications.
    • Alter relative intensities of these bands and merge or split them, reflecting changes in orbital symmetries and energies.
  2. NMR Spectroscopy and Aromatic Ring Currents

    Modifications to the π-electron count and distribution are directly observed in 1H and 13C NMR chemical shifts.

    More highly conjugated (or expanded) aromatic rings exhibit distinct downfield shifts for protons located within induced ring currents, while any partial loss of aromaticity or incorporation of antiaromatic segments can cause atypical shielding/deshielding patterns.

4. Reactivity and Coordination Chemistry

Because phthalocyanines are often used as redox catalysts or sensors, the meso-nitrogen modifications can significantly influence reactivity:

  • Electron-rich meso substituents facilitate nucleophilic or electrophilic attacks at the ring periphery, enabling site-selective functionalizations that are otherwise difficult.

(... shortened for the purposes of this figure)

Sample grading rubric

Analysis of Traditional Phthalocyanine Synthesis Limitations (1 point)

(Pass) 1.0 point: Correctly explains that traditional phthalocyanine synthesis involves one-step condensation with simultaneous formation of all meso-nitrogen bridges, providing limited control over substitution patterns at these positions.

0.5 point: Mentions limitations of traditional methods but without specific focus on meso-position control challenges.

0.0 point: Fails to identify key limitations of traditional synthetic approaches or provides incorrect analysis.

Thiolate-Mediated Tetramerization Process (1 point)

1.0 point: Correctly describes the thiolate-mediated reductive tetramerization and explains how counter cation size (K+ or Cs+ vs. Na+) affects selectivity between tetramer formation and direct macrocyclization.

0.5 point: Mentions thiolate-mediated tetramerization but without explaining factors controlling selectivity.

(Fail) 0.0 point: Incorrectly describes the oligomerization process or omits critical details about selectivity control.

Analysis of NMR Spectroscopic Features (1 point)

1.0 point: Correctly explains that upfield shifts in the 16π system indicate paratropic ring current (antiaromaticity), contrasts this with the broad signals in 17π systems due to paramagnetism, and connects these observations to the underlying electronic structures.

(Pass) 0.5 point: Identifies basic NMR patterns but without clear connection to ring currents or electronic structure.

0.0 point: Incorrectly interprets NMR data or fails to connect spectral features to electronic properties.

Electrochemical Property Analysis (1 point)

1.0 point: Correctly explains that the 16π system shows two reversible reductions reflecting conversion to 17π radical and 18π aromatic states, while 17π systems show narrow redox gaps due to facile interconversion between 16π, 17π, and 18π states, and relates these patterns to the underlying electronic structures.

(Pass) 0.5 point: Describes redox patterns without clearly connecting them to specific electronic state changes.

0.0 point: Incorrectly interprets electrochemical data or fails to connect redox behavior to electronic properties.

Analysis of Absorption Spectroscopy (1 point)

1.0 point: Correctly explains that the 16π system shows weak/broad absorption due to symmetry-forbidden HOMO-LUMO transitions in antiaromatic systems, while 17π systems show Q-like bands plus NIR-II absorptions characteristic of radical species, and contrasts these with typical phthalocyanine spectral features.

(Pass) 0.5 point: Describes absorption features but provides limited connection to underlying electronic structures.

0.0 point: Incorrectly interprets absorption data or fails to relate spectral features to electronic properties.

Reactivity Analysis of Antiaromatic System (1 point)

1.0 point: Correctly explains the high reactivity of the 16π system toward nucleophiles, details specific reactions with hydroxide (ring opening) and hydrazine (ring expansion), and explains how these transformations relieve antiaromatic destabilization.

0.5 point: Mentions reactivity but provides limited analysis of specific transformations or the driving forces behind them.

(Fail) 0.0 point: Incorrectly analyzes reactivity patterns or fails to connect them to the antiaromatic character of the 16π system.

(... and more)

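In the Research set, each task has a rubric totaling 10 points that can be applied by either an expert or a model. To scale our ability to evaluate models, we use another model to grade responses.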
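The rubric arithmetic can be sketched in a few lines of Python: sum the points awarded per item and compare the total against the 7-of-10 threshold stated in the grading section. This is a minimal sketch, not the actual grader. The `RubricItem` class, the item names, and the awarded values below are hypothetical; in practice the awarded points come from an expert or a model-based grader.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 7.0  # from the text: at least 7 of 10 rubric points counts as "correct"


@dataclass
class RubricItem:
    name: str
    max_points: float
    awarded: float  # assigned by a grader (expert or model); not computed here


def total_score(items: list[RubricItem]) -> float:
    """Sum awarded points, rejecting awards outside each item's allowed range."""
    for item in items:
        if not 0.0 <= item.awarded <= item.max_points:
            raise ValueError(f"awarded points out of range for {item.name!r}")
    return sum(item.awarded for item in items)


def is_correct(items: list[RubricItem], threshold: float = PASS_THRESHOLD) -> bool:
    """A response counts as correct when its rubric total meets the threshold."""
    return total_score(items) >= threshold


# Hypothetical grading of one response against ten 1-point items.
awards = [1.0, 1.0, 1.0, 0.5, 1.0, 0.5, 1.0, 0.0, 1.0, 0.5]
rubric = [RubricItem(f"item-{i + 1}", 1.0, a) for i, a in enumerate(awards)]

print(total_score(rubric))  # 7.5
print(is_correct(rubric))   # True
```

Keeping the per-item awards rather than only the total is what allows the finer-grained failure analysis that a rubric is meant to provide.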

Model performance

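We evaluated several frontier models on FrontierScience-Olympiad and FrontierScience-Research, including GPT‑5.2, Claude Opus 4.5, Gemini 3 Pro, GPT‑4o, OpenAI o4-mini, and OpenAI o3. All reasoning models were evaluated at "high" reasoning effort, except GPT‑5.2, which was evaluated at "xhigh." In our initial evaluations, GPT‑5.2 is the top-performing model on both FrontierScience-Olympiad (scoring 77%) and Research (scoring 25%), ahead of other frontier models. Gemini 3 Pro is comparable to GPT‑5.2 on the Olympiad set (scoring 76%).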

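Overall, we have seen substantial progress on solving expert-level questions, with room left to grow, especially on open-ended, research-style tasks. Analysis of failure cases shows that frontier models sometimes make reasoning, logic, and calculation errors, lack understanding of more specialized scientific concepts, and occasionally produce factual inaccuracies.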

We compare accuracy across several frontier models. GPT‑5.2 is our top-performing model on both the FrontierScience-Research and Olympiad sets.

We compare the accuracy of GPT‑5.2 and o3 at different reasoning efforts. Longer thinking time leads to higher accuracy.

Limitations and what's next

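While FrontierScience is a step forward in the difficulty of science benchmarks, it still has many limitations. FrontierScience consists of constrained problems that focus on evaluating either the final answer (Olympiad) or the reasoning needed to complete a research task (Research). In addition, grading longer tasks against a rubric with multiple items is less objective than checking a final answer directly.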

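FrontierScience gives a higher-resolution picture of how models reason on difficult, expert-written questions, but it does not capture the full picture of how science is done in practice. In particular, it does not cover an important part of scientific research: how models generate genuinely novel hypotheses, or how they interact with multiple modalities, including video data and real experimental systems in the physical world.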

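Looking ahead, we expect progress in scientific reasoning to come both from more mature general-purpose reasoning systems and from focused work on scientific capabilities. FrontierScience is only one tool among many. As models continue to improve, we plan to keep iterating on this benchmark, expand it to new domains, and pair it with more real-world evaluations that look at what these systems actually enable scientists to do. Benchmarks like FrontierScience help us see where today's AI systems fall short and focus our efforts accordingly, so that models can become trustworthy partners in scientific discovery.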