December 16, 2025

Evaluating AI's ability to perform scientific research tasks

We introduce FrontierScience, a new benchmark that evaluates AI capabilities for expert-level scientific reasoning across physics, chemistry, and biology.

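Reasoning is at the core of scientific work. Scientists do more than recall facts: they form hypotheses, test and refine them, and synthesize ideas across fields. As models become more capable, the central question is how deeply they can reason and how substantively they can contribute to scientific research.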

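Over the past year, our models have reached major milestones, including gold-medal-level performance at the International Mathematical Olympiad and the International Olympiad in Informatics. In parallel, we are starting to see our most capable models, such as GPT‑5, meaningfully accelerate real scientific workflows. Researchers use these systems for tasks such as literature search across disciplines and languages, and for checking complex mathematical proofs. In many cases, the model cuts work that would take days or weeks down to hours. This progress is documented in the paper Early science acceleration experiments with GPT‑5, published in November 2025, which presents early evidence that GPT‑5 can measurably accelerate scientific workflows.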

Introducing FrontierScience

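Accelerating scientific progress is one of the most promising opportunities AI offers humanity, so we are improving our models on difficult math and science tasks and building tools that help scientists get the most out of them.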

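When GPQA, a "Google-Proof" science benchmark of questions written by PhD experts, was released in November 2023, GPT‑4 scored 39%, below the expert baseline of 70%. Two years later, GPT‑5.2 scored 92%. As models' reasoning and knowledge capabilities continue to scale, harder benchmarks become important for measuring and forecasting their ability to accelerate scientific research. Existing science benchmarks skew heavily toward multiple-choice questions, and most are either already saturated or not centered on science itself.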

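To close this gap, we are introducing FrontierScience, a new benchmark designed to measure expert-level scientific capabilities. FrontierScience is written and verified by experts in physics, chemistry, and biology, and consists of several hundred questions designed to be difficult, original, and meaningful. It has two question tracks: the Olympiad track measures olympiad-style scientific reasoning, and the Research track measures real-world scientific research abilities. A deeper understanding of models' scientific capabilities helps us track progress and advance AI-accelerated science.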

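In initial evaluations, GPT‑5.2 is the top-performing model, scoring 77% on FrontierScience-Olympiad and 25% on FrontierScience-Research, ahead of other frontier models. Progress on expert-level problems has been substantial, but there is still headroom, particularly on open-ended research-style tasks. For scientists, this suggests that current models can already support the parts of research that involve structured reasoning, while significant work remains to improve their open-ended thinking. These results match how scientists already use today's models: to accelerate research workflows while relying on human judgment for problem framing and validation, and to explore ideas and connections that would otherwise take much longer to find. In some cases, the models also contribute new insights that experts then evaluate and test.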

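Ultimately, the most important benchmark of AI's scientific capabilities is the novel discoveries it helps generate; those are what matter for science and society. FrontierScience sits upstream of that. It serves as a guide for expert-level scientific reasoning, letting us test models on a standardized set of questions, see where they succeed or fail, and identify where to improve them. FrontierScience is narrow in scope and limited in important ways (for example, it focuses on constrained, expert-written problems), and it does not capture everything scientists do in their daily work. But the field needs more difficult, original, and meaningful science benchmarks, and FrontierScience is a step in that direction.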

What FrontierScience measures and how it was built

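The full FrontierScience evaluation consists of more than 700 text-based questions (160 of them in the gold set) spanning subfields of physics, chemistry, and biology. The benchmark is split into an Olympiad track and a Research track. FrontierScience-Olympiad contains 100 questions written by international olympiad medalists and designed to assess scientific reasoning in a constrained, short-answer format. The Olympiad set is designed to contain theoretical questions at least as difficult as problems at international olympiad competitions. FrontierScience-Research consists of 60 original research subtasks written by PhD-level scientists (doctoral candidates, professors, or postdoctoral researchers) and graded with a 10-point rubric. The Research set is built to contain self-contained, multi-step subtasks at the level of difficulty that a PhD scientist might encounter during their research.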

Sample question

B1 reacts with aqueous bromine (Br2) to form B2. B2 reacts with potassium nitrite (KNO2) to form B3. B3 is nitrated in nitric acid (HNO3) and sulfuric acid (H2SO4) to form B4.

B1 contains a monosubstituted aromatic 5-membered heterocycle and has a molar mass of 96.08 g/mol. It may be produced by dehydrating 5-carbon sugars (e.g. xylose) in an acid catalyst.
B2 has the molecular formula C4H2Br2O3 and contains a tetrasubstituted alkene with 2 substituents being bromines cis to each other.
B3 is a dipotassium salt with a molar mass of 269.27 g/mol. It contains 1 hydrogen.
B4 is an achiral pseudohalogen dimer with 2 carbons, no hydrogens and a molar mass of 300. g/mol.

When B4 decomposes in solution, it forms an intermediate B5 and 1 equivalent of dinitrogen tetroxide (N2O4) as a side product. Intermediate B5 can be trapped and detected as a Diels-Alder adduct.

Provide the structures of B1, B2, B3, B4, and B5 in the following format, "B1: X; B2: X; B3: X; B4: X; B5: X".
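Several constraints in this sample question are stated as molar masses, so a quick arithmetic check can be useful when reading it. The sketch below computes the molar mass implied by a flat molecular formula, using the formula given for B2 (C4H2Br2O3) and the side product N2O4 as inputs. The helper name `molar_mass` and the rounded atomic masses are illustrative assumptions, not part of the benchmark.

```python
import re

# Rounded standard atomic masses in g/mol (assumed values; only the
# elements that appear in the sample question are listed).
ATOMIC_MASS = {
    "H": 1.008,
    "C": 12.011,
    "N": 14.007,
    "O": 15.999,
    "K": 39.098,
    "Br": 79.904,
}


def molar_mass(formula: str) -> float:
    """Return the molar mass of a flat formula such as 'C4H2Br2O3'.

    Parentheses, charges, and hydrates are not supported in this sketch.
    """
    tokens = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    # Reject input that the simple tokenizer could not fully consume.
    if "".join(symbol + count for symbol, count in tokens) != formula:
        raise ValueError(f"unsupported formula: {formula!r}")
    total = 0.0
    for symbol, count in tokens:
        if symbol not in ATOMIC_MASS:
            raise ValueError(f"unknown element: {symbol}")
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


if __name__ == "__main__":
    for formula in ("C4H2Br2O3", "N2O4"):
        print(formula, f"{molar_mass(formula):.3f} g/mol")
```

A check like this only confirms that a candidate formula is consistent with a stated mass; it says nothing about structure, which is what the question actually asks for.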

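Each FrontierScience task is written and verified by a domain expert in physics, chemistry, or biology. For the Olympiad set, every expert has won a medal in at least one (and often several) international olympiad competitions. For the Research set, every expert holds a PhD in a relevant field.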

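The Olympiad questions were created in collaboration with 42 former international medalists and national team coaches in the relevant fields, who together hold 109 olympiad medals. The Research questions were created in collaboration with 45 qualified scientists and domain experts, all of them doctoral candidates, postdoctoral researchers, or professors. Their areas of expertise spanned a broad range of important scientific fields, from quantum electrodynamics to synthetic organic chemistry to evolutionary biology.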

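The task creation process for both sets also included selection against OpenAI internal models (for example, tasks that a model answered correctly were discarded, so this evaluation is likely somewhat harder for internal models than for others). We have open-sourced the gold sets of 100 Olympiad questions and 60 Research questions, and we are holding out the remaining questions to preserve the reliability of the evaluation.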
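The selection step described above can be pictured as a simple filter. The following is a minimal sketch, assuming a hypothetical `model_solves` callback that reports whether a reference model answered a task correctly; the actual pipeline and its interfaces are not described in this post.

```python
from typing import Callable, Iterable


def filter_unsolved(
    tasks: Iterable[str],
    model_solves: Callable[[str], bool],
) -> list[str]:
    """Keep only the tasks that the reference model fails to solve.

    `model_solves` is a hypothetical stand-in for running a model on a
    task and checking its answer.
    """
    return [task for task in tasks if not model_solves(task)]


if __name__ == "__main__":
    # Toy stand-in: pretend the model solves every task labelled "easy".
    candidates = ["easy-1", "hard-1", "easy-2", "hard-2"]
    kept = filter_unsolved(candidates, lambda task: task.startswith("easy"))
    print(kept)  # ['hard-1', 'hard-2']
```

Because the surviving tasks are, by construction, ones the reference model got wrong, scores for that model family on the final set are biased downward relative to models that were not used in the filter.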

Figure: flowchart of the four stages of the task development pipeline (Creation, Review, Resolution, Revision), with a short description of each step and checkmarks against the criteria Factual, Gradable, Objective, and Difficult.

Each task goes through four stages: Creation, Review, Resolution, and Revision. Independent experts review one another's tasks to confirm that they meet the criteria.

How we grade model performance

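The Olympiad set is gradable in short-answer form: an answer can be checked as a number, an expression, or a fuzzy string match, which makes correctness easy to verify. However, this verifiability often comes at the cost of expressiveness and open-endedness in the problems. For the Research set, we introduce a rubric-based evaluation framework for grading more open-ended tasks. Each question comes with a grading rubric of multiple independent, objectively assessable items that total 10 points. The rubric assesses not only the accuracy of the final answer but also the correctness of intermediate reasoning steps, which allows finer-grained analysis of model performance and of where failures arise. A solution is considered "correct" if it is awarded at least 7 of the 10 rubric points.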
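The two grading modes can be sketched in a few lines. This is an illustrative outline only: the 10-point total and the 7-point threshold come from the text above, while the class and function names, the 1% numeric tolerance, and the 0.9 string-similarity cutoff are assumptions chosen for the example. Expression matching is omitted.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class RubricItem:
    """One independently gradable rubric item (illustrative structure)."""
    description: str
    max_points: float
    awarded: float  # points assigned by a grader


def rubric_score(items: list[RubricItem]) -> float:
    """Sum awarded points after checking each award is within range."""
    for item in items:
        if not 0.0 <= item.awarded <= item.max_points:
            raise ValueError(f"award out of range: {item.description}")
    return sum(item.awarded for item in items)


def is_research_correct(
    items: list[RubricItem],
    threshold: float = 7.0,  # from the text: at least 7 of 10 points
    total: float = 10.0,     # from the text: rubrics total 10 points
) -> bool:
    """Apply the pass threshold to a fully graded rubric."""
    if not math.isclose(sum(item.max_points for item in items), total):
        raise ValueError("rubric does not add up to the expected total")
    return rubric_score(items) >= threshold


def numeric_match(answer: float, reference: float, rel_tol: float = 1e-2) -> bool:
    """Short-answer check for numbers; the tolerance is an assumption."""
    return math.isclose(answer, reference, rel_tol=rel_tol)


def fuzzy_match(answer: str, reference: str, min_ratio: float = 0.9) -> bool:
    """Short-answer check for strings; the cutoff is an assumption."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    ratio = SequenceMatcher(None, normalize(answer), normalize(reference)).ratio()
    return ratio >= min_ratio
```

With ten one-point items, a response awarded full credit on seven items passes, while one awarded 6.5 points does not. Half-point awards, as in the sample rubric, fit this structure without changes.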

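Responses are evaluated by a model-based grader (GPT‑5) against either the short answer or the rubric criteria. Ideally an expert would grade every response, but that does not scale, so we designed the rubrics to be checkable by a model-based grader. We also built a verification pipeline to confirm that rubrics and questions are well calibrated for difficulty and correctness.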

Sample chemistry research problem

Phthalocyanines are important macrocyclic compounds widely utilized in material and life sciences due to their outstanding physicochemical properties, including intense absorption in the visible-near-IR region and high electron-accepting ability. However, unlike porphyrins, the structural diversity of phthalocyanines has been limited by traditional synthetic methods that rely on one-step condensation of isoindoline units. The development of methodologies that allow controlled modification of the phthalocyanine framework, particularly at the meso-positions, represents a significant advancement in expanding the structural and electronic diversity of these compounds.

Analyze how the modification of meso-nitrogen atoms in nickel(II) phthalocyanine affects the electronic structure, aromaticity, spectroscopic properties, and reactivity of the resulting macrocycles. Explain the synthetic methodology that enables these modifications, the relationship between π-electron count and observed properties, and how these systems compare to traditional phthalocyanines and related porphyrinoids.

Sample response

1. Introduction: Why Modify the meso-Nitrogens of Phthalocyanines?

Phthalocyanines (Pc) are a class of macrocyclic dyes known for their remarkable chemical, electronic, and optical properties. Their widespread use in materials science, catalysis, and biology stems from strong absorptions in the visible/near-IR regions, exceptional redox characteristics (i.e., high electron affinity), and rigid π-conjugation. Traditional synthetic methods, however, typically form the macrocycle in a single step from phthalonitrile or isoindoline precursors, restricting the ability to introduce substitutions selectively at key ring positions, especially the "meso" (bridging) nitrogens. Recently, methodologies allowing targeted modification of the meso-nitrogen atoms in Ni(II) phthalocyanines have emerged as a powerful means to expand the structural and electronic diversity of these macrocycles.

2. Effect on Electronic Structure and Aromaticity

π-Electron Count and Ring Currents

Phthalocyanines are classical 18-electron aromatic systems, conferring high stability and planarity. Altering the meso-nitrogens through substitution or ring expansion can shift the effective π-electron count. For example, introducing additional unsaturated moieties or heteroatoms may expand the cyclic conjugation to 20 or 22 electrons, moving the molecule between Hückel aromaticity and other regimes that affect ring currents and magnetic properties.

Charge Distribution and Frontier Orbitals

Meso-substitution can tune the electron density distribution around the macrocycle. Electron-donating groups can push electron density toward the ring center, lowering the energy of the lowest unoccupied molecular orbital (LUMO). Electron-withdrawing groups, conversely, stabilize the highest occupied molecular orbital (HOMO) and shift redox potentials toward more positive values, altering both the electrochemical profile and the Q- and B-band positions in the UV-vis spectrum.

3. Spectroscopic Consequences

UV-Vis Absorption (Q and B Bands)

The principal absorption features of phthalocyanines lie in the visible (Q-band, typically 600-700 nm) and near-UV (B-band, typically 300-400 nm).

Substitution that expands the ring conjugation or introduces strong electron-donating/withdrawing groups can:
- Shift the Q-band to longer wavelengths (bathochromic shift), reaching into the near-IR, which is highly desirable for optoelectronic and photodynamic applications.
- Alter relative intensities of these bands and merge or split them, reflecting changes in orbital symmetries and energies.

NMR Spectroscopy and Aromatic Ring Currents

Modifications to the π-electron count and distribution are directly observed in 1H and 13C NMR chemical shifts.

More highly conjugated (or expanded) aromatic rings exhibit distinct downfield shifts for protons located within induced ring currents, while any partial loss of aromaticity or incorporation of antiaromatic segments can cause atypical shielding/deshielding patterns.

4. Reactivity and Coordination Chemistry

Because phthalocyanines are often used as redox catalysts or sensors, the meso-nitrogen modifications can significantly influence reactivity:

Electron-rich meso substituents facilitate nucleophilic or electrophilic attacks at the ring periphery, enabling site-selective functionalizations that are otherwise difficult.

(... shortened for the purposes of this figure)

Sample grading rubric

Analysis of Traditional Phthalocyanine Synthesis Limitations (1 point)

1.0 point: Correctly explains that traditional phthalocyanine synthesis involves one-step condensation with simultaneous formation of all meso-nitrogen bridges, providing limited control over substitution patterns at these positions.

0.5 point: Mentions limitations of traditional methods but without specific focus on meso-position control challenges.

0.0 point: Fails to identify key limitations of traditional synthetic approaches or provides incorrect analysis.

Thiolate-Mediated Tetramerization Process (1 point)

1.0 point: Correctly describes the thiolate-mediated reductive tetramerization and explains how counter cation size (K+ or Cs+ vs. Na+) affects selectivity between tetramer formation and direct macrocyclization.

0.5 point: Mentions thiolate-mediated tetramerization but without explaining factors controlling selectivity.

0.0 point: Incorrectly describes the oligomerization process or omits critical details about selectivity control.

Analysis of NMR Spectroscopic Features (1 point)

1.0 point: Correctly explains that upfield shifts in the 16π system indicate paratropic ring current (antiaromaticity), contrasts this with the broad signals in 17π systems due to paramagnetism, and connects these observations to the underlying electronic structures.

0.5 point: Identifies basic NMR patterns but without clear connection to ring currents or electronic structure.

0.0 point: Incorrectly interprets NMR data or fails to connect spectral features to electronic properties.

Electrochemical Property Analysis (1 point)

1.0 point: Correctly explains that the 16π system shows two reversible reductions reflecting conversion to 17π radical and 18π aromatic states, while 17π systems show narrow redox gaps due to facile interconversion between 16π, 17π, and 18π states, and relates these patterns to the underlying electronic structures.

0.5 point: Describes redox patterns without clearly connecting them to specific electronic state changes.

0.0 point: Incorrectly interprets electrochemical data or fails to connect redox behavior to electronic properties.

Analysis of Absorption Spectroscopy (1 point)

1.0 point: Correctly explains that the 16π system shows weak/broad absorption due to symmetry-forbidden HOMO-LUMO transitions in antiaromatic systems, while 17π systems show Q-like bands plus NIR-II absorptions characteristic of radical species, and contrasts these with typical phthalocyanine spectral features.

0.5 point: Describes absorption features but provides limited connection to underlying electronic structures.

0.0 point: Incorrectly interprets absorption data or fails to relate spectral features to electronic properties.

Reactivity Analysis of Antiaromatic System (1 point)

1.0 point: Correctly explains the high reactivity of the 16π system toward nucleophiles, details specific reactions with hydroxide (ring opening) and hydrazine (ring expansion), and explains how these transformations relieve antiaromatic destabilization.

0.5 point: Mentions reactivity but provides limited analysis of specific transformations or the driving forces behind them.

0.0 point: Incorrectly analyzes reactivity patterns or fails to connect them to the antiaromatic character of the 16π system.

(... and more)

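Each task in the Research set is graded with a rubric totaling 10 points, which can be used by either an expert or a model-based grader. To scale our ability to evaluate models, we use another model to grade responses.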

Model performance

We evaluated several frontier models on FrontierScience-Olympiad and FrontierScience-Research: GPT‑5.2, Claude Opus 4.5, Gemini 3 Pro, GPT‑4o, OpenAI o4-mini, and OpenAI o3. All reasoning models were evaluated at "high" reasoning effort, except GPT‑5.2, which was evaluated at "very high". In initial evaluations, GPT‑5.2 is the top-performing model, scoring 77% on FrontierScience-Olympiad and 25% on FrontierScience-Research, ahead of other frontier models. Gemini 3 Pro performs comparably to GPT‑5.2 on the Olympiad set, scoring 76%.

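Progress on expert-level problems has been substantial, but headroom remains, especially on open-ended research-style tasks. Analysis of transcripts shows that frontier models sometimes make errors in reasoning, logic, or calculation, fail to understand niche scientific concepts, or give answers containing factual inaccuracies.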

Accuracy compared across several frontier models. GPT‑5.2 is the top-performing model on both FrontierScience-Research and the Olympiad set.

Accuracy compared across reasoning effort levels for GPT‑5.2 and o3. Longer thinking time leads to higher accuracy.

Limitations and what's next

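FrontierScience is a step forward in the difficulty of science benchmarks, but it still has many limitations. It consists of questions with constrained problem statements, and it focuses either on evaluating the final answer (the Olympiad track) or on evaluating the reasoning used to complete a research task (the Research track). In addition, applying a multi-item rubric to a longer task is less objective than checking a final answer.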

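FrontierScience offers a higher-resolution snapshot of model reasoning on difficult, expert-written problems, but not a complete picture of how science is actually done. In particular, it leaves important aspects of scientific research unevaluated: how models generate genuinely novel hypotheses, and how they interact with multiple modalities, such as video data and experimental systems in the real physical world.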

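Looking ahead, we expect progress in scientific reasoning to come from both better general-purpose reasoning systems and focused efforts aimed specifically at scientific capabilities. FrontierScience is only one tool among many. As models improve, we plan to refine this benchmark, extend it to new domains, and pair it with more real-world evaluations that examine what these systems actually enable scientists to do. Benchmarks like FrontierScience help us understand the weaknesses of current AI systems, so that we can focus our work on making models reliable partners in scientific discovery.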

Author

OpenAI
