December 16, 2025

Evaluating AI's ability to perform scientific research

FrontierScience is a new benchmark for evaluating expert-level scientific reasoning across physics, chemistry, and biology.

Image: part of a large title beginning with "Fro…" on a soft green-and-yellow gradient background, with a text block on the right listing the criteria "Factual", "Gradable", "Objective", and "Difficult".

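Reasoning is at the core of scientific research. Scientists do more than recall existing information: they form hypotheses, test and refine them, and synthesize ideas across fields. As models become more capable, a central question is how deeply these systems can reason in order to contribute to scientific research.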

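Over the past year, OpenAI's models reached important milestones, achieving gold-medal-level performance at the International Mathematical Olympiad and the International Olympiad in Informatics. At the same time, frontier models such as GPT‑5 are beginning to meaningfully accelerate real scientific research workflows. Researchers use AI systems for tasks such as literature search across disciplines and languages and the analysis of complex mathematical proofs, and work that once took days or weeks can now often be done in hours. This progress was reported in the paper Early science acceleration experiments with GPT‑5, published in November 2025, which presents early evidence that GPT‑5 can measurably speed up scientific research.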

Introducing FrontierScience

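Accelerating scientific progress is one of the most promising ways AI can benefit humanity. To that end, OpenAI is improving model performance on difficult math and science tasks while also building tools that help scientists get the most out of these models.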

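When GPQA, a benchmark of "Google-proof" science questions written by PhD-level experts, was released in November 2023, GPT‑4 scored 39%, well below the expert baseline of 70%. Two years later, GPT‑5.2 scores 92%. As reasoning and knowledge capabilities keep expanding, harder benchmarks are needed to measure and forecast how far models can accelerate scientific research. Existing science benchmarks are often skewed toward multiple-choice questions, already saturated, or not representative of scientific research as a whole.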

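FrontierScience is a new benchmark designed to fill this gap by measuring expert-level scientific capability. Experts across physics, chemistry, and biology wrote and verified its questions, which number in the hundreds and are difficult, original, and meaningful. The benchmark measures models on two tracks: Olympiad, which evaluates olympiad-style scientific reasoning, and Research, which evaluates the ability to carry out real scientific research tasks. A better understanding of models' scientific capabilities provides a basis for tracking progress and, in turn, helps advance AI-assisted scientific research.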

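In initial evaluations, GPT‑5.2 scored 77% on Olympiad and 25% on Research, the highest among the frontier models tested. The results show clear progress on expert-level problem solving but leave substantial headroom on open-ended research tasks. This suggests that current models can already support the parts of research that call for structured reasoning, while open-ended reasoning still needs considerable improvement. These results are consistent with how scientists use AI models today: they speed up the research process with models while keeping human judgment in charge of problem framing and validation. Scientists are also gradually extending model use to exploring ideas and connecting concepts, work that used to take far longer, and in some cases human experts now evaluate and verify new insights that a model has proposed.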

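Ultimately, the most important measure of AI's scientific capability is the new discoveries it produces, which matter greatly for science and for society. FrontierScience measures capabilities needed on the way to that goal. As a reference point for expert-level scientific reasoning, it uses a standardized question set to expose where models succeed and fail and to identify where improvement is needed. It is, however, limited in scope and has several important limitations. Most notably, it focuses on constrained problems written by experts and so does not cover everything scientists do in day-to-day research. Even so, the field clearly needs harder, more original, and more meaningful science benchmarks, and FrontierScience is one step in that direction.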

What FrontierScience evaluates and how it was built

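The full FrontierScience evaluation contains more than 700 text-only questions spanning subfields of physics, chemistry, and biology, 160 of which make up the gold set. The benchmark has two parts, Olympiad and Research. FrontierScience-Olympiad consists of 100 questions designed by international olympiad medalists and evaluates scientific reasoning in a constrained short-answer format. The theory-focused Olympiad questions were designed to be at least as difficult as problems from international science olympiads. FrontierScience-Research consists of 60 original research tasks designed by PhD-level scientists (doctoral candidates, professors, and postdoctoral researchers), and each task is graded against a 10-point rubric. The Research tasks were designed to be self-contained and multi-step, reflecting the difficulty a PhD-level researcher might encounter in actual research.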

Sample question

B1 reacts with aqueous bromine (Br2) to form B2. B2 reacts with potassium nitrite (KNO2) to form B3. B3 is nitrated in nitric acid (HNO3) and sulfuric acid (H2SO4) to form B4.

B1 contains a monosubstituted aromatic 5-membered heterocycle and has a molar mass of 96.08 g/mol. It may be produced by dehydrating 5-carbon sugars (e.g. xylose) in an acid catalyst.
B2 has the molecular formula C4H2Br2O3 and contains a tetrasubstituted alkene with 2 substituents being bromines cis to each other.
B3 is a dipotassium salt with a molar mass of 269.27 g/mol. It contains 1 hydrogen.
B4 is an achiral pseudohalogen dimer with 2 carbons, no hydrogens and a molar mass of 300. g/mol.

When B4 decomposes in solution, it forms an intermediate B5 and 1 equivalent of dinitrogen tetroxide (N2O4) as a side product. Intermediate B5 can be trapped and detected as a Diels-Alder adduct.

Provide the structures of B1, B2, B3, B4, and B5 in the following format, "B1: X; B2: X; B3: X; B4: X; B5: X".

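Every FrontierScience question was written and verified by experts in physics, chemistry, or biology. Every expert who contributed to the Olympiad set has won a medal at an international olympiad at least once, and most have won several. Every expert who contributed to the Research set holds a PhD in a relevant field.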

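The Olympiad questions were created in collaboration with 42 international olympiad medalists or former national team coaches in the relevant fields, who together hold 109 olympiad medals. The Research questions were created with 45 qualified scientists and domain experts, all PhD-level scientists, including doctoral candidates, postdoctoral researchers, and professors. Their specialties span a range of core scientific fields, such as quantum electrodynamics, synthetic organic chemistry, and evolutionary biology.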

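Question creation for both sets also included a selection step run against OpenAI's internal models. Questions that the models already answered correctly were discarded, so the evaluation may be somewhat biased against OpenAI models relative to other AI models. The Olympiad gold set of 100 questions and the Research gold set of 60 questions are being open-sourced, while the remaining questions are held out so that contamination can be tracked.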
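The selection step described above can be pictured as a simple filter. The following is only a minimal sketch: the article does not describe how the step was implemented, and the function name `drop_solved`, the question IDs, and the set of solved questions are all hypothetical.

```python
from typing import Callable, Iterable, List, TypeVar

Q = TypeVar("Q")


def drop_solved(questions: Iterable[Q], model_solves: Callable[[Q], bool]) -> List[Q]:
    """Keep only the candidate questions the reference model did not solve."""
    return [q for q in questions if not model_solves(q)]


# Hypothetical data: made-up question IDs and a made-up record of which
# ones an internal model answered correctly.
candidates = ["q1", "q2", "q3", "q4"]
solved_internally = {"q2", "q4"}

kept = drop_solved(candidates, lambda q: q in solved_internally)
print(kept)
```

Because the filter is defined relative to one family of models, the surviving questions are by construction the ones that family found hardest, which is the source of the bias the article mentions.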

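Flowchart of question development in four stages, shown in order: Creation, Review, Resolution, and Revision. Each stage has a brief description and check marks indicating that the factuality, gradability, objectivity, and difficulty criteria were met.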

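Question development passes through four stages: creation, review, resolution, and revision. During development, independent experts review one another's questions to confirm that they meet the evaluation criteria.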

How model performance is graded

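Because the Olympiad set consists of short-answer questions, answers can be verified fairly unambiguously by matching a number, an expression, or a fuzzy string. The trade-off is that this format limits how expressive and open-ended the problems can be. The Research set uses rubric-based grading to evaluate more open-ended questions. Each question comes with a rubric made up of several independent, objectively assessable items that total 10 points. The rubric evaluates the validity of intermediate reasoning steps as well as the correctness of the final answer, which allows finer-grained analysis of model performance and failure modes. A solution that earns at least 7 of the 10 points is counted as correct.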
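To make the short-answer matching concrete, here is a minimal sketch of numeric and fuzzy-string comparison using only the Python standard library. The article does not state the actual matching rules, so the 1% relative tolerance, the 0.9 similarity threshold, and all function names are assumptions; expression matching is not covered.

```python
import math
from difflib import SequenceMatcher


def numbers_match(predicted: str, reference: str, rel_tol: float = 1e-2) -> bool:
    """Compare two numeric answers within a relative tolerance (assumed 1%)."""
    try:
        return math.isclose(float(predicted), float(reference), rel_tol=rel_tol)
    except ValueError:
        # The predicted answer is not a number at all.
        return False


def strings_match(predicted: str, reference: str, threshold: float = 0.9) -> bool:
    """Fuzzy string comparison after normalising case and whitespace."""
    a = " ".join(predicted.lower().split())
    b = " ".join(reference.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold


def short_answer_correct(predicted: str, reference: str) -> bool:
    """Use numeric comparison when the reference is a number, else fuzzy text."""
    try:
        float(reference)
    except ValueError:
        return strings_match(predicted, reference)
    return numbers_match(predicted, reference)
```

For example, `short_answer_correct("96.1", "96.08")` passes under the assumed tolerance, while `short_answer_correct("12", "96.08")` does not. A check like this is cheap and deterministic, which is the advantage of the short-answer format that the paragraph above describes.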

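Responses are evaluated by a model-based grader, GPT‑5, against the short-answer reference or the rubric. Ideally an expert would grade every response, but that does not scale, so the rubrics were designed to be checkable by a model grader. A separate verification pipeline was also built to check that rubrics and questions are well calibrated for difficulty and correctness.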

Sample chemistry research problem

Phthalocyanines are important macrocyclic compounds widely utilized in material and life sciences due to their outstanding physicochemical properties, including intense absorption in the visible-near-IR region and high electron-accepting ability. However, unlike porphyrins, the structural diversity of phthalocyanines has been limited by traditional synthetic methods that rely on one-step condensation of isoindoline units. The development of methodologies that allow controlled modification of the phthalocyanine framework, particularly at the meso-positions, represents a significant advancement in expanding the structural and electronic diversity of these compounds.

Analyze how the modification of meso-nitrogen atoms in nickel(II) phthalocyanine affects the electronic structure, aromaticity, spectroscopic properties, and reactivity of the resulting macrocycles. Explain the synthetic methodology that enables these modifications, the relationship between π-electron count and observed properties, and how these systems compare to traditional phthalocyanines and related porphyrinoids.

Sample response

1. Introduction: Why Modify the meso-Nitrogens of Phthalocyanines?

Phthalocyanines (Pc) are a class of macrocyclic dyes known for their remarkable chemical, electronic, and optical properties. Their widespread use in materials science, catalysis, and biology stems from strong absorptions in the visible/near-IR regions, exceptional redox characteristics (i.e., high electron affinity), and rigid π-conjugation. Traditional synthetic methods, however, typically form the macrocycle in a single step from phthalonitrile or isoindoline precursors, restricting the ability to introduce substitutions selectively at key ring positions, especially the "meso" (bridging) nitrogens. Recently, methodologies allowing targeted modification of the meso-nitrogen atoms in Ni(II) phthalocyanines have emerged as a powerful means to expand the structural and electronic diversity of these macrocycles.

2. Effect on Electronic Structure and Aromaticity

π-Electron Count and Ring Currents

Phthalocyanines are classical 18-electron aromatic systems, conferring high stability and planarity. Altering the meso-nitrogens through substitution or ring expansion can shift the effective π-electron count. For example, introducing additional unsaturated moieties or heteroatoms may expand the cyclic conjugation to 20 or 22 electrons, moving the molecule between Hückel aromaticity and other regimes that affect ring currents and magnetic properties.

Charge Distribution and Frontier Orbitals

Meso-substitution can tune the electron density distribution around the macrocycle. Electron-donating groups can push electron density toward the ring center, lowering the energy of the lowest unoccupied molecular orbital (LUMO). Electron-withdrawing groups, conversely, stabilize the highest occupied molecular orbital (HOMO) and shift redox potentials toward more positive values, altering both the electrochemical profile and the Q- and B-band positions in the UV-vis spectrum.

3. Spectroscopic Consequences

UV-Vis Absorption (Q and B Bands)

The principal absorption features of phthalocyanines lie in the visible (Q-band, typically 600-700 nm) and near-UV (B-band, typically 300-400 nm).

Substitution that expands the ring conjugation or introduces strong electron-donating/withdrawing groups can:
- Shift the Q-band to longer wavelengths (bathochromic shift), reaching into the near-IR, which is highly desirable for optoelectronic and photodynamic applications.
- Alter relative intensities of these bands and merge or split them, reflecting changes in orbital symmetries and energies.

NMR Spectroscopy and Aromatic Ring Currents

Modifications to the π-electron count and distribution are directly observed in 1H and 13C NMR chemical shifts.

More highly conjugated (or expanded) aromatic rings exhibit distinct downfield shifts for protons located within induced ring currents, while any partial loss of aromaticity or incorporation of antiaromatic segments can cause atypical shielding/deshielding patterns.

4. Reactivity and Coordination Chemistry

Because phthalocyanines are often used as redox catalysts or sensors, the meso-nitrogen modifications can significantly influence reactivity:

Electron-rich meso substituents facilitate nucleophilic or electrophilic attacks at the ring periphery, enabling site-selective functionalizations that are otherwise difficult.

(... shortened for the purposes of this figure)

Sample grading rubric

Analysis of Traditional Phthalocyanine Synthesis Limitations (1 point)

1.0 point: Correctly explains that traditional phthalocyanine synthesis involves one-step condensation with simultaneous formation of all meso-nitrogen bridges, providing limited control over substitution patterns at these positions.

0.5 point: Mentions limitations of traditional methods but without specific focus on meso-position control challenges.

0.0 point: Fails to identify key limitations of traditional synthetic approaches or provides incorrect analysis.

Thiolate-Mediated Tetramerization Process (1 point)

1.0 point: Correctly describes the thiolate-mediated reductive tetramerization and explains how counter cation size (K+ or Cs+ vs. Na+) affects selectivity between tetramer formation and direct macrocyclization.

0.5 point: Mentions thiolate-mediated tetramerization but without explaining factors controlling selectivity.

0.0 point: Incorrectly describes the oligomerization process or omits critical details about selectivity control.

Analysis of NMR Spectroscopic Features (1 point)

1.0 point: Correctly explains that upfield shifts in the 16π system indicate paratropic ring current (antiaromaticity), contrasts this with the broad signals in 17π systems due to paramagnetism, and connects these observations to the underlying electronic structures.

0.5 point: Identifies basic NMR patterns but without clear connection to ring currents or electronic structure.

0.0 point: Incorrectly interprets NMR data or fails to connect spectral features to electronic properties.

Electrochemical Property Analysis (1 point)

1.0 point: Correctly explains that the 16π system shows two reversible reductions reflecting conversion to 17π radical and 18π aromatic states, while 17π systems show narrow redox gaps due to facile interconversion between 16π, 17π, and 18π states, and relates these patterns to the underlying electronic structures.

0.5 point: Describes redox patterns without clearly connecting them to specific electronic state changes.

0.0 point: Incorrectly interprets electrochemical data or fails to connect redox behavior to electronic properties.

Analysis of Absorption Spectroscopy (1 point)

1.0 point: Correctly explains that the 16π system shows weak/broad absorption due to symmetry-forbidden HOMO-LUMO transitions in antiaromatic systems, while 17π systems show Q-like bands plus NIR-II absorptions characteristic of radical species, and contrasts these with typical phthalocyanine spectral features.

0.5 point: Describes absorption features but provides limited connection to underlying electronic structures.

0.0 point: Incorrectly interprets absorption data or fails to relate spectral features to electronic properties.

Reactivity Analysis of Antiaromatic System (1 point)

1.0 point: Correctly explains the high reactivity of the 16π system toward nucleophiles, details specific reactions with hydroxide (ring opening) and hydrazine (ring expansion), and explains how these transformations relieve antiaromatic destabilization.

0.5 point: Mentions reactivity but provides limited analysis of specific transformations or the driving forces behind them.

0.0 point: Incorrectly analyzes reactivity patterns or fails to connect them to the antiaromatic character of the 16π system.

(... and more)

Each question in the Research set is evaluated against a 10-point rubric that either an expert or a model grader can use. To keep model evaluation scalable, a separate model is used to grade responses.
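The rubric arithmetic can be sketched as follows. The 10-point total and the 7-point pass mark come from the article; the class and function names, the awarded values, and the single 4-point entry standing in for the rubric items the sample elides are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

PASS_THRESHOLD = 7.0  # from the article: 7 or more out of 10 counts as correct


@dataclass
class RubricItem:
    name: str
    max_points: float
    awarded: float  # e.g. 1.0, 0.5, or 0.0, as in the sample rubric


def total_score(items: List[RubricItem]) -> float:
    """Sum the awarded points, rejecting awards outside an item's range."""
    for item in items:
        if not 0.0 <= item.awarded <= item.max_points:
            raise ValueError(f"award out of range for {item.name!r}")
    return sum(item.awarded for item in items)


def is_correct(items: List[RubricItem]) -> bool:
    """Apply the pass mark to the summed rubric score."""
    return total_score(items) >= PASS_THRESHOLD


# Hypothetical grades for one response. The last entry is a placeholder for
# the items the sample rubric omits; its 4-point weight is an assumption.
graded = [
    RubricItem("Traditional synthesis limitations", 1.0, 1.0),
    RubricItem("Thiolate-mediated tetramerization", 1.0, 0.5),
    RubricItem("NMR spectroscopic features", 1.0, 1.0),
    RubricItem("Electrochemical properties", 1.0, 0.5),
    RubricItem("Absorption spectroscopy", 1.0, 1.0),
    RubricItem("Reactivity of the antiaromatic system", 1.0, 0.5),
    RubricItem("Remaining items (not shown in the sample)", 4.0, 3.0),
]

print(total_score(graded), is_correct(graded))
```

Keeping each item independent means a grader, human or model, can award partial credit for sound intermediate reasoning even when the final conclusion is wrong.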

Model performance

The OpenAI team evaluated several frontier models on FrontierScience-Olympiad and FrontierScience-Research, including GPT‑5.2, Claude Opus 4.5, Gemini 3 Pro, GPT‑4o, OpenAI o4-mini, and OpenAI o3. All reasoning models were run at 'high' reasoning effort except GPT‑5.2, which was run at 'xhigh'. In initial evaluations, GPT‑5.2 scored 77% on Olympiad and 25% on Research, the highest among the frontier models tested. Gemini 3 Pro scored 76% on the Olympiad set, comparable to GPT‑5.2.

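The results show meaningful progress on expert-level problem solving, with room left to improve, especially on open-ended research tasks. Analysis of failure cases found that even frontier models make reasoning, logic, and calculation errors, fail to understand specialized scientific concepts, and sometimes generate factually incorrect content.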

A comparison of accuracy across several frontier models shows GPT‑5.2 performing best on both the FrontierScience-Research and Olympiad sets.

A comparison of accuracy by reasoning effort for GPT‑5.2 and OpenAI o3 shows that longer thinking time tends to yield higher accuracy.

Limitations and future work

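FrontierScience is a meaningful step up in difficulty for science benchmarks, but it still has several limitations. It is made up of questions with relatively constrained problem statements and focuses on evaluating the final answer (Olympiad) or the reasoning needed to complete a research task (Research). In addition, applying multi-item rubrics to long-form tasks can be less objective than checking only a final answer.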

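Expert-written, high-difficulty questions give a closer look at model reasoning, but FrontierScience does not capture the full process by which science is actually done. In particular, it does not assess important parts of scientific research such as how models generate genuinely novel hypotheses or how they interact with multiple modalities, including video data and real-world physical experimental systems.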

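Going forward, scientific reasoning is expected to advance through both better general-purpose reasoning systems and targeted work on science-specific capabilities. FrontierScience is only one of several tools in that effort. As models improve, OpenAI plans to iterate on this benchmark, extend it to new domains, and pair it with evaluations that reflect real research settings, to better understand how AI systems actually help scientists. Benchmarks like FrontierScience will help identify the weaknesses of today's AI systems and point toward what is needed to make models reliable partners in scientific discovery.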

Author

OpenAI
